Credible statistical information is foundational to the functioning of democratic societies. Just as bridges and highways facilitate the transportation necessary for commerce, statistical information informs decisions by governments, business enterprises, and individuals. The information emerges from a data infrastructure.
Historically, the U.S. national data infrastructure has relied on the operations of the federal statistical system and the data assets that it holds. Throughout the 20th century, federal statistical agencies aggregated survey responses from households and businesses to produce information about the nation and diverse subpopulations. The statistics created from such surveys provide most of what people know about the well-being of society, including health, education, employment, safety, housing, and food security. The surveys also contribute to the infrastructure for empirical social- and economic-sciences research. Research using survey-response data, with strict privacy protections, led to important discoveries about the causes and consequences of major societal challenges and also informed policymakers. As with other infrastructure, people can easily take these essential statistics for granted. Only when the statistics are threatened do people recognize the need to protect them.
Today, paradoxically, national statistics face both grave threats and historic opportunities. Declining survey participation poses a severe threat to the quality of statistical information. Yet, at the same time, the United States has never produced a higher volume of digital data about the activities of individuals and businesses. These data are held in federal, state, and local government agencies, the private sector, and other organizations.
To address these threats and explore the opportunities, the National Academies of Sciences, Engineering, and Medicine appointed a consensus panel to develop a vision for a new data infrastructure for national statistics and social and economic research in the 21st century. This report is the first of three reports funded by the National Science Foundation to explore the many issues surrounding a new data infrastructure. The panel convened a 1.5-day virtual public workshop to seek input from key stakeholders and external experts and to discuss issues surrounding the components and key characteristics of a 21st century national data infrastructure, including governance; the capabilities, techniques, and methods required; and the sharing of data assets (e.g., federal, state, and local government, institutional, and private sector data). This report describes how the country can improve the statistical information so critical to shaping the nation’s future, by mobilizing data assets and blending them with existing survey data.
These ideas are compatible with those put forward in 2017 by the U.S. Commission on Evidence-Based Policymaking (CEP). The report notes that only a subset of CEP’s recommendations has been incorporated into law. A new data infrastructure can take advantage of experiences in the five years following the Commission’s recommendations to further expand the value and uses of statistical data coordinated from multiple sectors. In the interest of advancing a national data infrastructure, in 2019 the National Academies’ Committee on National Statistics formulated the following definition:
The data infrastructure consists of data assets; the technologies used to discover, access, share, process, use, analyze, manage, store, preserve, protect, and secure those assets; the people, capacity, and expertise needed to manage, use, interpret, and understand data; the guidance, standards, policies, and rules that govern data access, use, and protection; the organizations and entities that manage, oversee, and govern the data infrastructure; and the communities and data subjects whose data is shared and used for statistical purposes and may be impacted by decisions that are made using those data assets.
A new data infrastructure can mobilize the nation’s relevant data assets by combining data across sectors, to improve existing statistical products, create new ones, and strengthen research capabilities. Modern technologies permitting secure access to multiple datasets along with new computational methods can allow the blending of data for more timely, granular, and accurate statistics. Further, this blending can involve enhanced privacy protections for data subjects and holders. New privacy-enhancing tools can minimize threats to individuals. However, the mere availability of new data assets and technologies to improve the nation’s statistical information and research base is not enough; the United States needs a vision for new partnerships across data holders to take advantage of a new data infrastructure.
The United States needs a new 21st century national data infrastructure that blends data from multiple sources to improve the quality, timeliness, granularity, and usefulness of national statistics, facilitates more rigorous social and economic research, and supports evidence-based policymaking and program evaluations. (Conclusion 2-1)
In the panel’s view, a new data infrastructure should allow statistical agencies and other approved users (federal, state, tribal, territory, and local government employees and researchers) to use the country’s data assets for purposes of the common good. These assets include data from federal statistical, program, and administrative agencies; state, tribal, territory, and local governments; private sector companies; nonprofits and academic institutions; and crowdsourcing and citizen science operations. Innovative pilot projects now offer convincing proof of the potential value to be gained from more effective use of the nation’s data resources.
This report is the first in a series intended to help build a vision for a new data infrastructure for the common good. This report describes the need for a new national data infrastructure, presents an initial vision, and describes expected outcomes and key attributes of a new national data infrastructure. The report also discusses the implications of blending data from multiple sources as well as the organizational implications of cross-sector data access and use. The report concludes by identifying short- and medium-term activities that facilitate progress toward the full vision. This report does not examine the logical, physical, or technical architecture for a new infrastructure or specific technical capabilities related to data formats or metadata, encryption or security protocols, access controls, or organizational functions and responsibilities. Future reports will explore associated topics in greater depth, including case studies and implications of blending multiple sources, data equity, and other relevant data infrastructure issues, challenges, and opportunities identified during each panel’s deliberations. The existing data ecosystem is evolving rapidly and the goal of each subsequent report is to respond to these changes and focus on the specific issues, opportunities, and challenges deemed most relevant to implementing and operationalizing the different components of a new data infrastructure.
ATTRIBUTES OF THE VISION
The panel identified seven key attributes of a new data infrastructure (see Box S-1). These attributes are detailed in the sections that follow, along with associated short-term actions that should be undertaken to begin building a 21st century national data infrastructure for social and economic data, and the research the infrastructure could facilitate.
Safeguards and Advanced Privacy-Enhancing Practices to Minimize Possible Individual Harm
The panel notes that the social benefits of statistical information need not come at the price of increased threats to individuals’ privacy and confidentiality; the interests and rights of data subjects must be respected. In the panel’s view, any harm to individuals from building and operating a new data infrastructure should be minimized. Novel technologies and strong regulations can be applied to strengthen safeguards for individuals. Laws, regulations, and practices should employ the most current and effective tools to protect the privacy of individuals. Furthermore, the blending of data involves little new data collection; instead, existing data will be used more efficiently. With harm minimized and benefits increased by improved statistical information, society will be better served.
It is ethically necessary and technically possible to preserve privacy and fulfill confidentiality pledges regarding data while simultaneously expanding the statistical uses of diverse data sources. (Conclusion 3-1)
- Establish mechanisms to engage stakeholders (including data subjects, data holders, and other responsible organizations) regarding data safeguard prerequisites for building trust.
- Develop a strategy to ensure key data safeguards are communicated effectively and transparently.
- Establish technical specifications of privacy-preserving and confidentiality-protecting designs.
Statistical Uses Only, for Common-Good Information, with Statistical Aggregates Freely Shared with All
In the panel’s vision, data-infrastructure resources are used only to produce nonidentifiable aggregates, estimates, and statistics, so that statistical aggregation creates useful information for society and decisionmakers without harming individuals. Data-infrastructure operations and decisions are consistent with professional principles and practices, meet ethical standards, are conducted by organizations free of political interference, and are managed to ensure privacy and security.1 Data cannot be used for the enforcement of laws or regulations affecting any individual data subject.
- Use pilots to promote a wider understanding of “statistical uses.”
- Convene stakeholders to determine how to best describe new statistical products and distinguish them from privacy-threatening initiatives.
- Monitor outcomes of the new Standard Application Process (SAP) for research use of shared data to demonstrate value.
- Launch a communication campaign about the value of research as a “statistical use” of data.
Mobilization of Relevant Digital Data Assets, Blended in Statistical Aggregates to Provide Benefits to Data Holders, with Societal Benefits Proportionate to Possible Costs and Risks
In the panel’s vision, a new data infrastructure should have access to relevant, existing, digital assets for the creation of essential aggregates. The infrastructure should mobilize and leverage data assets across various sectors. Each data asset has strengths and weaknesses; blending data sources so that the strengths of one offset the weaknesses of another results in improved information.
1 In this chapter and throughout the report, the term “professionalism” in compiling national statistics—either within the existing federal statistical system or the new data infrastructure—is based on authoritative information presented by the National Academies of Sciences, Engineering, and Medicine (2021).
Data from federal, state, tribal, territory, and local governments; the private sector; nonprofits and academic institutions; and crowdsourced and citizen-science data holders are crucial components of a 21st century national data infrastructure. (Conclusion 4-1)
In the panel’s vision, a new data infrastructure will include a wider variety of data holders, data subjects, data seekers, and data users than in the past. To achieve the support of all involved parties, demonstrating the benefits of expanded data sharing and blending is critical.
Data sharing is incentivized when all data holders enjoy tangible benefits valuable to their missions, and when societal benefits are proportionate to possible costs and risks. (Conclusion 3-2)
- Seek researcher input regarding SAP implementation as an access tool.
- Monitor activities of the Interagency Council on Statistical Policy working group on private sector data.
- Monitor “data-connecting” pilots collecting data at the data holder’s site.
- Publish criteria for prioritizing new data assets.
- Convene a group to evaluate methods for documenting and, possibly, quantifying benefits and costs.
- Identify blended statistics generated by statistical agencies; document and, possibly, quantify benefits and costs.
- Monitor pilot projects for blended federal/state/local data.
- Consider the feasibility and means of covering some data-holder costs associated with data sharing.
Reformed Legal Authorities Protecting All Parties’ Interests
Under the Foundations for Evidence-Based Policymaking Act of 2018 (hereafter, Evidence Act), federal statistical agencies have the right to use federal program data for statistical uses only, unless directly prohibited by law. However, many laws and regulations do prohibit federal statistical agencies from using existing data for statistical purposes. In the panel’s vision of a 21st century national data infrastructure, it is assumed that the legislative and regulatory recommendations stemming from the Evidence Act will be initiated, but more work is needed to bolster data safeguards and broaden data access.
Legal and regulatory changes are necessary to achieve the full promise of a 21st century national data infrastructure. (Conclusion 3-3)
- Legislation establishing the design, authorities, and funding for the National Secure Data Service (NSDS).2
- Implement Evidence Act regulations and rule-making.
- Identify legislation/regulatory priorities regarding CEP state-related recommendations.
- Develop a legislative strategy for a data synchronization bill.3
- Identify legal options that would incentivize data holders to share data.
Governance Framework and Standards Effectively Supporting Operations
In the panel’s opinion, legal reforms enabling a new data infrastructure must be accompanied by a set of practices and policies consistent with the spirit of the law. Such a data-governance framework includes guiding principles, authorities, structures, and directives for the infrastructure. Data governance involves active stakeholder engagement, oversight protocols, open and transparent communications, and accountability. Standards in data definitions and access protocols are critical to providing interoperability across partners essential to a new data infrastructure. In addition, professional staff throughout the infrastructure will require an environment that supports interoperability and provides them with modern skills and technology.
Effective data governance is critical and should be inclusive and accountable; governance policies and standards facilitating interoperability include key stakeholders and oversight bodies. (Conclusion 3-4)
2 The CHIPS and Science Act (P.L. 117-167) was signed into law on August 9th, 2022. It allocates funding for an unnamed demonstration project to inform establishment of NSDS.
3 Such legislation would revise Internal Revenue Service regulations to allow the U.S. Census Bureau to share limited business tax data with the Bureau of Labor Statistics and the Bureau of Economic Analysis.
- Convene potential data-holding organizations.
- Document current practices in data access.
- Catalog current data platforms of potential data-sharing organizations.
- Document current methods of data curation, protection, and preservation.
- Document existing metadata practices.
- Identify priorities for standards development.
- Draft data-sharing guidelines.
Transparency to the Public Regarding Analytical Operations Using the Infrastructure
In the panel’s opinion, at any time, the public, data holders, and data subjects should be able to understand how their data are used, by whom, for what purposes, and to what societal benefit. Transparency is a prerequisite for accountability, enabling the public to express concerns, seek redress, and oversee compliance with a new data infrastructure’s stated mission. Transparency is also a prerequisite for public trust. Trust and transparent procedures will enhance the credibility of the statistical information produced through a new data infrastructure.
Trust in a new data infrastructure requires transparency of operations and accountability of the operators, with ongoing engagement of stakeholders. (Conclusion 3-5)
- Identify communication priorities regarding transparency.
- Sponsor public discussion regarding alternative oversight structures to achieve transparency.
- Engage stakeholders to evaluate alternative approaches.
State-of-the-Art Practices for Access, Statistical, Coordination, and Computational Activities; Continuously Improved to Efficiently Create Increasingly Secure and Useful Information
The panel predicts that the technical aspects of a new data infrastructure will be highly dynamic. New developments in remote access, cybersecurity, cryptography, and computational approaches are constantly emerging. Thus, in the panel’s view, operations within a new data infrastructure must continually innovate and improve. On the computational and statistical side, the infrastructure must be able to blend data for more insightful research and statistical products. The acquisition, access, and use of diverse data assets held by multiple organizations in various sectors will likely involve new partners with divergent experiences and expertise. The dynamic nature of all these features demands continuous refreshing of the skill mix of the infrastructure’s operational staff.
The operations of a new data infrastructure would benefit from the inclusion of continually evolving practices, methods, technologies, and skills, to ethically leverage new technologies and advanced methods. (Conclusion 3-6)
- Exchange knowledge about needed staff skillsets to support new operations of infrastructure.
- Build communities of practice to catalyze the technical skills base.
- Develop a professional culture within pilot projects for data protection.
- Develop organizational procedures for continuous updating of tools and practices.
Many Options for Supporting a New Data Infrastructure
Several alternative ideas have been proposed for organizing statistical operations within a new data infrastructure. The key new entity (or set of entities) needed is not a data warehouse, but rather a computational resource for linking data files in diverse ways, to produce blended statistics. The panel foresees several potential organizational models for this new entity: within the federal government, outside the federal government, or as a new public-private partnership. To identify the best option, the panel suggests the initiation of a widespread dialogue involving the many stakeholders of a new data infrastructure.
- Monitor America’s DataHub Consortium capabilities for collaborative research partnerships and data sharing.
- Clarify data infrastructure roles and responsibilities.
- Identify NSDS-provided services and capabilities.
- Clarify federal statistical research data center services and capabilities.
- Sponsor bipartisan, multisector dialogue on how best to govern private sector data use for national statistical purposes.
- Expand voluntary private sector data sharing for statistical uses.
Building a New Data Infrastructure
This report comes at a time of unusual change. Numerous research and statistical agency initiatives are blending data from multiple sources, and these initiatives will undoubtedly inform future activities, including social and economic research. While some new laws and regulations have been enacted, obstacles remain, and more legislative work needs to be done. Some initial building blocks of a new data infrastructure are under construction but lack a coordinated vision. For example, CEP proposed NSDS to produce statistical information for evidence-building by temporarily accessing multiple federal program and statistical data assets and blending them as needed. The panel sees the value of NSDS for informing the public about the well-being of the economy and society, and for advising future data-blending entities.
There are many ways to achieve the vision of a 21st century national data infrastructure, and it is too early to identify each step necessary to achieve that vision. The panel suggests leveraging the many ongoing initiatives, both domestically and internationally, looking for early examples of success. This will require forging new partnerships with data holders, key data-infrastructure entities, and interested stakeholders. In addition to the short-term activities associated with each of the seven key attributes and potential organizational models mentioned above, the panel identified a set of medium-term activities (Table 5-1) that could be performed to discern the best ways for the United States to proceed toward the panel’s full vision.
In building a 21st century national data infrastructure, early success may come first from integrating data that are relatively easily available, demonstrating the utility of improved statistical information of national importance, and constructing effective partnerships for necessary legal change. (Conclusion 5-1)
The United States is capable of building a new national data infrastructure. If such an infrastructure is designed appropriately, the American public and decisionmakers could enjoy more timely, granular, and accurate information about the country’s employment, housing, income, education, health, safety, transportation, and food security.