Informed decisions about every aspect of life—career and job seeking, housing, public health, energy, transportation, food supplies, crime prevention, commerce, and any other area one can think of—rely on credible, accurate, objective, and relevant data. Historically, countries have responded to this need by assigning central governments the responsibility for producing statistics (Davies et al., 2019). These data provide the foundation for basic and applied social-, behavioral-, and economic-sciences research, which helps research, policy-analysis, and program-evaluation communities understand and make informed decisions regarding the economy and society. A country’s data infrastructure is analogous to the bridges and highways that make up the physical transportation infrastructure necessary for commerce. Like other types of infrastructure, these essential statistics can easily be taken for granted until they fail to meet the needs of individuals or society in some way.
This report is the first in a series intended to help build a vision for a new data infrastructure for the common good. This report describes the need for a new national data infrastructure, presents an initial vision, and describes the expected outcomes and key attributes of such an infrastructure. The report also discusses the implications of blending data from multiple sources as well as the organizational implications of cross-sector data access and use. The report concludes by identifying short- and medium-term activities to facilitate progress toward the full vision. Future reports will explore associated topics in greater depth, including case studies and implications of blending multiple sources, data equity, and other data infrastructure-related challenges and opportunities.
Several experts and organizations, including the U.S. Commission on Evidence-Based Policymaking (CEP), have recognized threats to the current data infrastructure, the necessity of strengthening that infrastructure, and the opportunities for doing so. In recent years, the Committee on National Statistics (CNSTAT), a standing board of the National Academies of Sciences, Engineering, and Medicine, has initiated and overseen work that could enhance federal statistics by blending and integrating a variety of administrative and other data sources (the National Academies of Sciences, Engineering, and Medicine, 2017a,b). These studies explored the growing concern that participation in federal sample-survey data collections, the archetypal method of measurement, has continually decreased over the past 15 or more years (e.g., U.S. Bureau of Labor Statistics, 2022). If participation is dominated by a particular subgroup of respondents, low-response-rate surveys have an increased risk of biased statistics (Czajka and Beyler, 2016; Groves and Peytcheva, 2008). Moreover, uneven spatial patterns of responses can give data from certain locations greater uncertainty, potentially resulting in the misallocation of critical federal resources.
The National Academies’ studies highlighted the nearly singular reliance on survey data for federal statistics, noting that “surveys and censuses are currently the principal means of collecting federal statistics. The Census Bureau alone conducts more than 130 economic and demographic surveys every year” (the National Academies, 2017b, p. 22). As survey response rates have decreased, alternative data sources (transactional, geospatial, scanner, and sensor) have become increasingly available to blend with other sources, and the National Academies’ reports also offered guidance to federal agencies regarding technical solutions for working with alternative data sources.
Similarly, CEP reported that “household survey data collection programs, including key U.S. Census Bureau programs, are finding it more difficult to obtain accurate income data from the survey population” (Commission on Evidence-Based Policymaking, 2017, p. 5). In the same report, CEP called for a space in which data from multiple sources could be blended, while protecting privacy and confidentiality:
The Congress and the President should enact legislation establishing the National Secure Data Service (NSDS) to facilitate data access for evidence building while ensuring transparency and privacy. The NSDS should model best practices for secure record linkage and drive the implementation of innovative privacy-enhancing technologies (Commission on Evidence-Based Policymaking, 2017, p. 5).
To build on these efforts, CNSTAT has been considering several issues, including how to advance the vision of NSDS, how the statistical system should adapt to an increasingly digitized world, and how to build a vision for a future data infrastructure. To complement the efforts of the Foundations for Evidence-Based Policymaking Act of 2018 to implement CEP’s recommendations, CNSTAT focused on data sources not currently within the scope of other efforts, such as state, tribal, territorial, and local government data, as well as private sector data. In 2019, CNSTAT formulated the following definition of data infrastructure to guide future work:1
The data infrastructure consists of data assets; the technologies used to discover, access, share, process, use, analyze, manage, store, preserve, protect, and secure those assets; the people, capacity, and expertise needed to manage, use, interpret, and understand data; the guidance, standards, policies, and rules that govern data access, use, and protection; the organizations and entities that manage, oversee, and govern the data infrastructure; and the communities and data subjects whose data is shared and used for statistical purposes and may be impacted by decisions that are made using those data assets.
This report is the first of three targeted consensus reports by separate panels exploring specific aspects of a 21st century national data infrastructure. Each consensus panel will convene a public workshop as its primary vehicle for external fact-gathering. Cumulatively, the three panels and their respective reports are intended to contribute to a vision for a 21st century national data infrastructure for federal statistics that will support social and economic research into the future.2 The project, overseen by CNSTAT, is funded by the National Science Foundation. The project’s statement of task can be seen in Box 1-1. The tasks of this panel—the Panel on the Scope, Components, and Key Characteristics of a 21st Century Data Infrastructure—are outlined in this report, Report 1.
The first paragraph in Box 1-1 bears particular relevance. The vision outlined in this and subsequent reports is for a new data infrastructure. Reports 2 and 3 will explore aspects of this vision including case studies and implications of blending multiple data sources; data equity; and other relevant data infrastructure issues, challenges, and opportunities that are identified during each panel’s deliberations. The existing data ecosystem is evolving rapidly, and the goal of Report 3 is to respond to these changes and focus on technical issues deemed most relevant to implementing and operationalizing various components of a potential new data infrastructure.
1 This definition was created for internal use and was presented at the 2020 CNSTAT Retreat based on a paper (entitled “A Suggested Framework for Discussion”) by Tom Mesenbourg.
2 For more information, see: https://www.nationalacademies.org/our-work/toward-a-vision-for-a-new-data-infrastructure-for-federal-statistics-and-social-and-economic-research-in-the-21st-century
INTERPRETATION OF THE CHARGE
The panel’s fundamental focus was improving federal statistics, with the recognition that increasing the availability and utility of all relevant data could not only improve said statistics but also support research and evidence-building for the public good. As a result, in formulating its vision, the panel looked beyond federally controlled and directed data assets as specified in the Evidence Act.
The panel’s vision for a new data infrastructure includes all relevant data assets held by federal, state, tribal, territorial, and local governments; the private sector; nonprofit and academic institutions; as well as crowdsourced and citizen-science organizations. Besides federal statistical and administrative data assets, process-related, sensor/monitoring, and other data assets could be included if they are consistent with data infrastructure purposes. While the scope of potential data assets is expansive, this report focuses on data assets that can improve social and economic statistics and research. Consequently, this report does not examine issues related to using sensor data. Nevertheless, the panel foresees the possible integration of sensor data into national statistics once techniques exist to accurately assess their fitness for use and to address important security and privacy issues (U.S. Government Accountability Office, 2020). Subsequent workshops may focus on sensor, monitor, process, and other alternative data assets.
The panel examined evidence to inform key components and characteristics of a new data infrastructure and determined that the challenges and opportunities in creating it include not only issues of data infrastructure governance but also other important considerations that warrant individual attention. These attributes (as the panel labeled them) include safeguards for data subjects and holders, legal and regulatory issues that directly impact the ability of infrastructure participants to share data, and transparency of infrastructure processes and practices. These attributes form part of the broader ecosystem in which a new data infrastructure would operate.
This report does not examine the logical, physical, or technical architecture of new infrastructure, nor does it describe specific technical capabilities related to data formats or metadata, encryption or security protocols, access controls, or organizational functions and responsibilities. The panel focused on a high-level vision, identifying the components and key characteristics of a 21st century national data infrastructure, including governance; the required capabilities, techniques, and methods; and the data assets that could be shared. This report describes how the United States can improve the statistical information so critical to shaping the nation’s future by mobilizing data assets and blending them with existing survey data. Future work should examine whether new data infrastructures should be designed in a federated manner or through a centralized data intermediary.
The composition of the panel reflects a focus on social and economic research as an integral part of a new data infrastructure. Note that the physical or natural sciences are not mentioned in the charge. While research techniques from computer science and engineering, for example, are commonly deployed in the social sciences, primary data from the physical and natural sciences are rarely used in the creation of national statistics describing the social and economic status of the United States.
EVIDENCE BASE FOR REPORT
In executing its charge, the panel sought evidence from wide and disparate sources, including groups of experts that previously identified obstacles to the blending of data to improve national statistics. In addition to CEP and the 2017 National Academies’ reports mentioned above, the panel closely followed the Advisory Committee on Data for Evidence Building (ACDEB), which was established on the basis of a recommendation in the CEP report (Commission on Evidence-Based Policymaking, 2017). ACDEB is tasked with “assisting the Director of the Office of Management and Budget on issues of access to data and providing recommendations on how to facilitate data sharing, data linkage, and privacy-enhancing techniques” (U.S. Department of Commerce, 2022). Like CEP, ACDEB is not actively researching the utility of blending private sector data with data assets from the federal statistical system.
The panel also examined peer-reviewed papers, white papers, and United States (as well as non-U.S.) government policies and planning documents, but found no individuals or groups pursuing a new national data infrastructure integrating multiple sources as envisioned by the panel.3 The panel aims to advise on the organization of these data at a national scale in order to improve national statistics. Combining these data will only be possible within a suitable legal, governance, and organizational structure that delineates services, responsibilities, and access procedures while improving current privacy and confidentiality protections. Thus, the panel describes current impediments and considers the appropriate attributes that will enable a successful transformation to a new data infrastructure.
3 Two earlier, successful government-led initiatives to integrate data are the Longitudinal Employer-Household Dynamics program (https://lehd.ces.census.gov/) and the Criminal Justice Administrative Records System (https://cjars.isr.umich.edu/). These initiatives fulfilled important sectoral goals rather than demonstrating a federal statistical system approach.
To address gaps in the literature and to provide a forum for public comment, the panel convened two public workshops in December 2021 (hereafter referred to as “the workshops”). The workshops were organized into five sessions:
- Data Infrastructure Initiatives—Description and Discussion;
- Private Sector Data Uses for National Statistical Purposes—International Perspectives;
- Federal Statistical Agencies’ Uses of Private Sector Transaction Data;
- Federal Statistical Agencies’ and Nonprofits’ Use of Private Sector Health Data; and
- Perspectives on Using Private Sector Data for Official Statistics and Research.
The utility of private sector data to improve national statistics was a focus of most sessions, as these data have not been included in other efforts (see Appendix B for complete workshop agendas). Importantly, the panel asked participants to describe the impediments they confronted while attempting to blend alternative (usually private sector) data sources to improve national statistics. Employees from statistical agencies in the United States and Europe, researchers, and private sector representatives shared lessons learned from these activities, including legal and technical constraints and contractual issues related to acquiring and working with private sector data. Their collective contributions were influential in the panel’s deliberations.
The absence of workshop sessions on state, tribal, territorial, and local data does not mean these data are unimportant; rather, in the panel’s opinion, previous reports and the work of ACDEB, which has a sizable number of members who routinely work with these data, can provide guidance on issues that the panel lacked either the time or the expertise to address.4 As noted throughout this report, state, tribal, territorial, and local data could vastly improve the country’s knowledge of itself, including better measurements within geographical locations or among subgroups of interest.
As noted in Box 1-1, this report offers conclusions on the scope, components, and key characteristics of a 21st century national data infrastructure and the vital role of the federal statistical system in such an infrastructure.
4 Advisory Committee on Data for Evidence Building (2021) identifies current obstacles to integrating state and local data into the production of national statistics.
Chapter 2 describes why the United States needs a new data infrastructure. It details current modernization efforts, opportunities afforded by the explosion of digital data produced inside and outside of government, and ongoing government-led initiatives to repair weaknesses in national statistics using alternative data sources. The chapter highlights reports that recommend the use of blended data, and it discusses recent congressional efforts, which are necessary but insufficient, to expand data access and use.
Chapter 3 describes a vision for a new data infrastructure, the expected outcomes, and the seven key attributes of that infrastructure. Each of the seven attributes is discussed in detail, with a description of the current challenges involved and the changes necessary to attain each attribute.
Chapter 4 describes the diverse data assets that can be combined for statistical purposes; the criteria that govern data acquisition, access, and use; the implications of blended data for the format of a new data infrastructure; and the associated privacy and ethical challenges. The chapter ends with a consideration of various organizational structures that may facilitate cross-sector data access and use.
Chapters 1–4 set the stage for the final chapter. In this time of innovation and change, many components of a new data infrastructure could be achieved in multiple ways. Rather than identifying sequential steps for implementing a new data infrastructure, the panel identifies short- and medium-term tasks that contribute to the implementation of the panel’s full vision. These tasks involve engaging with key stakeholders to inform appropriate next steps and to gain stakeholder support for a new data infrastructure.