Building a 21st Century National Data Infrastructure Requires Identifying Short- and Medium-Term Activities
While envisioning a coordinated 21st century national data infrastructure is a necessary first step, a vision alone is insufficient. The vision requires accessibility and the use of data for common-good purposes. The assets tapped should include data from the private sector, federal statistical agencies, federal program agencies, state and local government agencies, and other data holders. Such a vision will require trust, data safeguards, legislation, organizational entities, and partnerships that do not yet exist.
This report was written at a time of unusual change. As noted in earlier chapters, the 20th-century infrastructure that produced social and economic statistics was dependent on statistical sampling, self-report surveys and censuses, and the collection, storage, and security of data within secure facilities. Declining participation rates in surveys and censuses have resulted in higher costs and increased risk of flawed statistical estimates. Fortunately, a growing number of research and development activities are attempting to repair these weaknesses by combining survey and census data with other data sources. Further, for the first time in decades, the Foundations for Evidence-Based Policymaking Act of 2018 (hereafter, Evidence Act; U.S. Congress, 2019) has allowed the combining of administrative data of federal program agencies with statistical surveys and censuses. However, the Evidence Act incompletely implements the vision of the Commission on Evidence-Based Policymaking (CEP), whose report was the basis for the legislation.
At the time of this writing, some of the initial building blocks of a 21st century national data infrastructure are already being constructed. Some building blocks implement new laws and regulations, like the Evidence Act, while others involve an innovative blending of data to answer specific statistical questions. In this chapter, the logical next steps for capitalizing on these initiatives are reviewed. The panel recognizes the importance of being explicit about how its work might be understood in the context of these other ongoing activities.
First, CEP recommended that the protection of data subjects and data holders is a central feature of a new data infrastructure, and the panel fully espouses the same priority (see Chapter 3). However, in the panel’s view, merely taking a legal perspective on privacy is inadequate—a more comprehensive view of the ethical foundations of privacy protection is appropriate. Moreover, since eliminating privacy risks is impossible without foregoing uses of data that can greatly benefit the common good, an ethical approach to privacy requires weighing disclosure risks against social benefits. Further, like CEP, the panel holds the view that new technical developments can protect data at the same time that statistical uses can improve the well-being of the population. That is, responsible privacy protection and the use of privacy-enhancing technologies are compatible with expanded statistical uses of data.
Second, CEP recommended that state earnings data and state-collected data acquired by federal departments be shared for evidence-building purposes (Commission on Evidence-Based Policymaking, 2017). The panel shares that judgment because state data could enhance understanding of the current challenges and performances of the job and labor markets. In addition, following the recommendation of the Advisory Committee on Data for Evidence Building ([ACDEB] Advisory Committee on Data for Evidence Building, 2021), the panel advises that increased sharing of other state- and local-government administrative data could be useful for social and economic statistics (e.g., digital data on criminal incidents) and could benefit state and local governments in other ways. All the logic that supports the blending of federal administrative data with statistical survey data to construct better statistical information applies to state and local government data as well.
Third, CEP is silent on statistical uses of private sector data for the benefit of common-good statistics. Private sector data were not part of the scope of the evidence-building charge to the Commission. In contrast, as discussed in earlier chapters, the panel sees the merit of blending private sector data with other government sources of administrative and statistical data to produce more granular, timely, and relevant information about the economy and society.
Fourth, CEP recommended establishing a National Secure Data Service (NSDS) for the creation of statistical information for evidence building. The panel, too, sees the value of such a service. Beyond evidence building, the panel sees a further value of NSDS for facilitating the blending of diverse data sources from the private sector. Alternatively, as noted in Chapter 4, a separate facility could be built to facilitate new statistical information from the blending of private sector and government-sector data. Thus, in the panel’s view, there are multiple possible ways forward in terms of building a structure to support a new data infrastructure.
Fifth, the Evidence Act gives statistical agencies access to federal administrative data, unless specifically prohibited by law. While this is an important advance in improving statistical information, additional changes in laws and regulations are needed to permit the expanded use of federal, state, and local government administrative data for purely statistical uses. Engaging sovereign tribes and territories to access administrative data from their governments requires a separate approach.
The panel does not attempt to identify each of the sequential steps necessary to achieve a new data infrastructure—many of the methods to achieve the panel’s vision are feasible but dependent on building support among key stakeholders. Instead, the panel has identified short-term and medium-term activities that could be performed to discern the best ways for the United States to progress toward the panel’s full vision of a new data infrastructure. Later steps in achieving the vision will be dependent on:
- Further legislation to implement CEP recommendations;
- The technical and statistical outcomes of the many pilot projects now ongoing;
- How private sector stakeholders and other nongovernmental data holders evolve in their contributions to national statistics that promote the common good;
- Future refinement of the NSDS vision;
- How data sharing under NSDS evolves; and
- Concurrent changes in federal statistical agencies.
In considering its vision, the panel assumes that legislative changes will have important implications for private sector incentives for sharing data to improve statistical information. Legal reform will likely also inform organizational and governance features of a new data infrastructure. The panel similarly assumes that budgetary support for the infrastructure will be established.
This chapter is organized around the same seven attributes of a new data infrastructure described in Chapter 3. The panel offers some short-term and medium-term tasks associated with each key attribute and the organizations/partnerships of a new data infrastructure; these tasks are summarized in Table 5-1.
ATTRIBUTE 1: SAFEGUARDS AND ADVANCED PRIVACY-ENHANCING PRACTICES TO MINIMIZE POSSIBLE INDIVIDUAL HARM
All parts of society—the private sector, nonprofit organizations, academia, the government, and the U.S. public—have learned how disclosure of private information can result in harm. Harm is often more acute in vulnerable communities. The panel envisions a turnaround: the use of the same data for the common good of society. To achieve this vision, a new data infrastructure must, in the panel’s judgment, have high levels of data protection, trust, and equity designed into its core. In short, building a new data infrastructure that mitigates the risk of individual harm and maximizes widespread and equitable benefits is required for its legitimacy.
Notions of privacy are one example of concerns connected to more basic values (Beauchamp and Childress, 2001). The values that guide the actions of a new data infrastructure call for an orientation to the data subject. First, how are the digital data connected to data subjects affecting their lives? Second, there are underlying issues of “autonomy”—the ability of individuals to make their own decisions. In the panel’s view, a new data infrastructure must recognize the nature of informed consent by the data subject. Third, there is concern about “beneficence”—to what extent will the data be used to produce good outcomes for the data subject? Finally, there is a focus on human dignity—are the uses of a new data infrastructure conducted in a manner that is respectful of the data subjects? These underlying values are important for a new data infrastructure because they are related to the development of trust between those whose data resides in the infrastructure and the outcomes of that infrastructure.
As noted in Chapter 3, societal trust begins with legitimacy as sanctioned by credible institutions: formal assurance, through law and technology, that data will be safeguarded, secured, protected, and used responsibly and ethically for approved statistical purposes. But, in the panel’s judgment, this is not enough. Community-oriented legitimacy is also essential. Mandates preserve privacy and protect confidentiality. Transparent procedures allow data holders to understand how their data are being used, by whom, and for what societal benefit. Policies and principles inform day-to-day practices and underpin effective controls related to data access and use. Practices implement advanced protective cybersecurity measures, facilitate shared computing approaches that preserve privacy and protect confidentiality, and mandate transparency and stakeholder engagement. Effective monitoring, enforcement, evaluation, and accountability mitigate risk and strengthen trust.
In the panel’s view, all future visions of a new data infrastructure need a stronger set of safeguards to assure that data will not be used to harm any individual or data subject. The panel notes that the societal benefits of statistical information do not have to come at the price of an increased threat to privacy and confidentiality if this threat is effectively and proactively addressed. Statistical analyses of existing data resources can provide information about the well-being of those in society and can contribute to the common good. Attention to the broader ecosystem of data can also highlight data inequities that can unduly harm certain communities. The panel’s vision of a new data infrastructure proposes little new data collection but instead suggests the expanded and responsible use of existing data to produce better information about critical features of society.
Oversight functions with some authority to evaluate the performance of data protections have been introduced usefully in other countries (see Chapter 3). Some countries have demonstrated ways to be more transparent about their practices, operations, and activities. It is the panel’s opinion that trust in a new data infrastructure requires such transparency.
Currently, federal data resources held by statistical agencies, once collected, are protected by laws prohibiting their use for nonstatistical or inappropriate purposes. What is largely absent, however, are features that include active participation by those whose data are held by the agencies. In the panel’s judgment, active engagement of data holders, data subjects, and other stakeholders is needed in the development of data-infrastructure policies that affect them. With such a formal partnership, the building of trust might be more easily achieved.
As a short-term task, the panel suggests that the United States begin dialogues and convenings to discuss practical vehicles for building a privacy-protecting culture, as an integral part of a new data infrastructure. The dialogues should take the perspectives of both the data subject and the data holder. The values underlying the proposed safeguard practices of a new data infrastructure should be made clear. The panel supports CEP’s evocation of the principle of “humility”—the notion that data use must not be driven exclusively by the analysts but should involve the concerns of data subjects. In the short term, these dialogues with data subjects and data holders should envision and evaluate alternative structures and practices that would continue the value-based devotion to a data-subject orientation, while incorporating new privacy-protecting tools created over time. These convenings would establish mechanisms to engage all stakeholders regarding the data-safeguard prerequisites necessary to build trust. Convenings could also help develop a strategy for ensuring data safeguards are communicated effectively and transparently to data subjects, data holders, and other important stakeholders. Finally, the convenings could establish technical specifications of privacy-preserving and confidentiality-protecting designs.
The medium-term tasks could focus on the establishment of organizational mechanisms for oversight that safeguards privacy and confidentiality procedures. The oversight mechanisms and procedures could become features of existing advisory structures of statistical agencies. The development and sharing of relevant procedures, processes, and practices for protecting the rights of data subjects and data holders could also result.
ATTRIBUTE 2: STATISTICAL USES ONLY, FOR COMMON-GOOD INFORMATION, WITH STATISTICAL AGGREGATES FREELY SHARED WITH ALL
Data on individuals and economic units can be used for administrative procedures and/or statistical purposes (see Chapter 3). The administrative use of data regarding the attributes of an individual can affect that individual by granting or denying that person some benefit. The Evidence Act noted that administrative uses of data require access to identifiable data on an individual and have “administrative, regulatory, law enforcement, adjudicatory, or other purposes that affect the rights, privileges and benefits of a particular identifiable respondent…” (U.S. Congress, 2019, 44 USC 3561). Statistical uses may act on the same set of data but, instead of producing actions on individuals, produce aggregated, estimated, or modeled information. CEP focused attention on the use of statistical practices to evaluate government programs—evidence building.
Both CEP and the panel noted that data collected for administrative purposes can be valuable when blended with data collected for statistical purposes only. Note that federal statistical agencies have missions limited exclusively to statistical uses of data. The Evidence Act granted statistical agencies the right to acquire federal-agency administrative data for statistical purposes unless such use is prohibited by another law.
The panel expects that, under full implementation of the Evidence Act, statistical agencies will gain experience blending federal administrative data with their survey and census data. Data blending experiences could build a culture in which the distinction between administrative and statistical uses of data is better clarified by stakeholder groups. This distinction is important in building societal trust regarding the benefits of a new data infrastructure, relative to any risks that may exist. In the panel’s summation, an important goal is to clearly frame the distinction between “statistical uses” and “administrative uses” of data in public discourse.
Ongoing pilot projects at federal statistical agencies and among academic researchers can illustrate the production of improved statistical information by the blending of data from multiple sources. Statistical uses of data can produce a much more favorable benefit-threat balance than can administrative uses, though it is unclear whether the public understands this distinction. Public support for a new data infrastructure could be enhanced by a widespread understanding of this difference.
In the short term, it would be useful to make the value of new statistical information more publicly visible via the internet and other accessible media. Such communication might build a wider understanding of the value of a new data infrastructure for solely statistical uses. A description explaining the autonomy of a new data infrastructure, in terms of freedom from political interference, could also contribute to public acceptance.
Similarly, stakeholder convenings could be useful to mount a dialogue about the best way to describe a new data infrastructure’s statistical products, distinguishing statistical uses from uses that threaten abuse of individuals’ data. Another short-term task could involve monitoring the outcomes of the newly introduced Standard Application Process (SAP), under which blended data are used to produce new research products (Marten, 2022). By definition, those research products are statistical uses of data. A communications campaign that alerts the public to the value of such research could be useful.
In the medium term, work on this “statistical uses” attribute could guide reform efforts for the regulatory environment of a new data infrastructure. In the new legislation, the treatment of “statistical uses” in contrast to “administrative uses” is important to build the trust of data subjects and data holders in a new data infrastructure.
ATTRIBUTE 3: MOBILIZATION OF RELEVANT DIGITAL DATA ASSETS, BLENDED IN STATISTICAL AGGREGATES TO PROVIDE BENEFITS TO DATA HOLDERS, WITH SOCIETAL BENEFITS PROPORTIONATE TO POSSIBLE COSTS AND RISKS
The variety and volume of the potential data assets available to a new data infrastructure are large. While the 20th-century data infrastructure relied on the physical movement of data from one holder to another or relatively small amounts of digital data, the size of current and future datasets useful to society dictates that they must be accessed digitally from the data holder or owner—these datasets are too big to move from place to place. Although it is a significant change from existing practices, this shift—from collecting data, moving them to the statistical agency’s facility, and processing them there, to data access and processing at the data owner’s facility—offers many potential benefits. This “distributed data-connecting” model is now being piloted (see Chapter 4). Monitoring and learning from such pilots are short-term tasks that can provide valuable insights into the capabilities and challenges of this new approach.
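As an illustration only, the data flow of such a model can be sketched in a few lines of Python. The sketch assumes a deliberately simple case in which each data holder computes a count and a sum on site and only those aggregates are combined centrally. The class names, function names, and figures are hypothetical and are not drawn from the pilots described in Chapter 4. Returning plain aggregates does not by itself guarantee confidentiality, and an operational system would presumably add privacy-enhancing technologies, access controls, and disclosure review.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Aggregate:
    """Summary statistics that leave a data holder's site (no records)."""
    count: int
    total: float


class DataHolder:
    """Hypothetical data holder; record-level data never leave this object."""

    def __init__(self, name, earnings):
        self.name = name
        self._earnings = list(earnings)  # illustrative records kept on site

    def run_aggregate_query(self):
        # Processing happens where the data reside; only a count and a sum
        # are returned to the requester.
        return Aggregate(count=len(self._earnings), total=sum(self._earnings))


def blended_mean(holders):
    """Combine per-site aggregates into a single statistical estimate."""
    parts = [holder.run_aggregate_query() for holder in holders]
    n = sum(part.count for part in parts)
    if n == 0:
        raise ValueError("no records available at any site")
    return sum(part.total for part in parts) / n


# Made-up figures for two hypothetical holders.
holders = [
    DataHolder("state_agency", [41000.0, 52000.0, 38000.0]),
    DataHolder("private_firm", [60000.0, 45000.0]),
]
estimate = blended_mean(holders)
```

In this sketch the central function sees five records' worth of information only as two (count, sum) pairs, which is the essential contrast with moving the underlying files to a statistical agency's facility.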
How should data be accessed in a new data infrastructure? First, the panel notes the recent implementation of the Evidence Act’s SAP by the Interagency Council on Statistical Policy (ICSP) and the Office of Management and Budget’s (OMB’s) Statistical and Science Policy Branch (Marten, 2022). The SAP allows qualified researchers and others to apply, become qualified, and obtain approval to access existing federal statistical agency data, and it appears to be the primary mechanism for such access. Under the SAP, researcher access will generally, but not always, be through federal statistical research data centers (FSRDCs), a network of 31 sites across the United States that have expanded the insights obtained from data through original analyses that must be made publicly available. FSRDCs are university-statistical agency collaborations that also offer researchers support in understanding and accessing data, as well as both physical and virtual access to data. While CEP Recommendation 2-8 focused on researchers (Commission on Evidence-Based Policymaking, 2017), the Evidence Act and ICSP propose that federal agencies, the Congressional Budget Office, state, local, and tribal governments, researchers, and other individuals use the SAP to apply for access to confidential statistical data assets for purposes of developing evidence (Federal Register, 2022). Under these proposals, other executive-branch agencies or units could also utilize the SAP to accept applications for access to their confidential data assets. Monitoring the implementation and effectiveness of the SAP as an access tool, distinct from the value of the resulting research (Attribute 2, short-term task), could provide important insights into the SAP’s ability to serve a broader user base than just researchers. Medium-term tasks include monitoring whether the SAP is used as a data-access tool for the National Science Foundation’s (NSF’s) America’s DataHub Consortium (ADC).1
As mentioned previously, the Evidence Act has provided broader statutory authority for combining data from federal administrative and statistical agencies, unless prohibited by law. The Evidence Act did not, however, implement CEP recommendations related to providing access to important state administrative data. ACDEB did recommend legislation broadening access to state data (Advisory Committee on Data for Evidence Building, 2021) and, more recently, the ACDEB Subcommittee on Governance, Accountability, and Transparency recommended use cases that access, link, and analyze federal, state, and local government data assets (Cutshall and Lane, 2022). Use cases focused on education and workforce data, health statistics, and labor-market activity. In the panel’s opinion, these pilots are a necessary step forward in demonstrating the value of a new data infrastructure, as well as identifying barriers and useful data-governance frameworks.
CEP’s recommendations, the Evidence Act, and ACDEB’s proposals lead the panel to conclude that the first data assets to be acquired should be those of federal program agencies, followed by those of federally funded state programs for which access is legally permitted. During this period, dialogue with private sector and other nongovernmental data holders can sharpen mutual understanding of the value of data sharing for national statistical purposes. In addition, in the short term, the activities of the ICSP working group on the use of private sector data (see Chapter 2) should be monitored.
1 See: https://beta.nsf.gov/science-matters/americas-datahub-consortium-seeing-and-understanding-entire-elephant?utm_medium=email&utm_source=govdelivery
In the panel’s judgment, additional decision criteria are needed to decide the order by which data assets should be added to a new data infrastructure. To establish data-asset priorities, it is useful to consider criteria that can be used to rate various types of data assets. To stimulate discussion, the panel suggests the criteria in Box 5-1.
In the panel’s vision of a new data infrastructure, NSDS or its demonstration pilots,2 like America’s DataHub Consortium (ADC), will have access to federal agency program data that are not explicitly prohibited for such statistical uses. In the panel’s opinion, those data should be the priority for expanded statistical uses. Among those are data assets that offer full coverage of important target populations (e.g., Medicare, Supplemental Nutrition Assistance Program) and those with standardized and stable data documentation. Broadening access to data assets that are already being used by statistical agencies may be easier than attempting to use the data for different purposes, which requires the re-negotiation of existing agreements. First, newly acquired data assets should have immediate utility, in terms of improving statistical products and research productivity. Next, the current data holder should be able to quickly acquire the technical skills necessary to permit data access by NSDS. Ideally, the first newly acquired data assets, when blended with other data, would produce statistical information and research products that would provide new insights into the functioning of the economy or society at large.
2 The CHIPS and Science Act (PL 117-167) was signed into law on August 9, 2022. It allocates funding for an unnamed demonstration project to inform establishment of NSDS.
In short, in the panel’s judgment, the first new data assets to be added to those currently used for statistical purposes should be easily acquired and demonstrate the value of blending data from diverse sources for an increased understanding of national issues.
In building a 21st century data infrastructure, early success may come first from integrating data that are relatively easily available, demonstrating the utility of improved statistical information of national importance, and constructing effective partnerships for necessary legal changes. (Conclusion 5-1)
In the panel’s opinion, evaluations of existing efforts to access state government data for statistical purposes should begin as soon as feasible, as in some of the use cases proposed by ACDEB (e.g., Cutshall and Lane, 2022). These data could build upon existing federal survey and administrative data. The panel expects, as did CEP, that state and local government program data from federally funded programs will be the first to be accessed when permitted by law. However, the panel acknowledges that the multiple jurisdictions involved may pose complications greater than those presented by the relatively small number of federal agencies collecting societally relevant data.
A key feature of a new data infrastructure is the reciprocation principle: data holders that share their data will benefit from new statistical information useful to their operations. While the panel expects that many jurisdictions will learn from comparing their statistics to those of other areas, some jurisdictions will suggest new statistical products that require development. In the panel’s view, in the short term, attention should be paid to actively learning about how jurisdictions could benefit from sharing their data for statistical purposes. While short-term efforts will pay large dividends, the panel expects that accessing state- and local-government data resources will require more time.
In the short term, statistical agencies will continue to acquire and use private sector and other nongovernmental data for statistical purposes. In the panel’s view, these initiatives should be closely monitored by agency decisionmakers, and lessons learned should be shared across the statistical system. Early “data-connecting” pilots that will access and process data at data holders’ sites are precursors of a future access strategy, and a key, specified feature of NSDS. In the panel’s view, understanding the capabilities, expertise, and challenges associated with this approach is necessary and will be informative.
Finally, not only data holders will benefit from data sharing. In the panel’s judgment, societal benefits should be proportionate to the possible costs and risks of acquiring and using a data asset. In the short term, ongoing initiatives must be examined to identify associated benefits and costs. In the panel’s judgment, it would be useful to convene a group to evaluate various methods for documenting and possibly quantifying the benefits and costs of acquiring and using data. In some cases, to incentivize state and local data holders, a new data infrastructure may have to help cover the costs incurred by data holders.
Medium-term tasks include accessing and using federal administrative data as well as state and local data, implementing “data-connecting” prototypes into a true statistical production system, and evaluating ADC’s use of the SAP as a possible access tool. Learning from these existing efforts will help clarify and refine data governance, access, and use policies, rules, and procedures.
ATTRIBUTE 4: REFORMED LEGAL AUTHORITIES PROTECTING ALL PARTIES’ INTERESTS
In the panel’s view, to create new statistical information valuable to the country, data with the fewest regulatory or logistical impediments could be accessed first by a new data infrastructure. Next, data that have potential value but have technical or logistical impediments should be acquired. Finally, changes in regulations or laws must occur before certain data can become part of a new data infrastructure. Figure 5-1 illustrates the basic steps of this transition.
A new data infrastructure will utilize new technologies to access data without possessing them. This is a specified feature of NSDS which, in the panel’s view, is likely to be a key feature of the success of a new data infrastructure—assuming that the operational requisites to access federal program agency data occur.
There are logical sequences to further regulatory reform. One of the most important CEP recommendations that were not included in the Evidence Act was the establishment of NSDS. The Advisory Committee on Data for Evidence Building (2021) affirmed the need for NSDS, and the former co-chairs of the Evidence Commission encouraged the U.S. Congress to include the NSDS Act in the conferenced version of the U.S. Innovation and Competition Act,3 but at the time of this writing, no legislation has been enacted. Clarity regarding the legislative prospects for establishing NSDS should be a short-term priority, in the panel’s opinion.4
3 See the co-chairs’ letter to Congress: http://www.datacoalition.org/wp-content/uploads/2021/11/CEP-Co-Chair-Letter-re-NSDS-11-30-2021.pdf
The Evidence Act authorized the use of federal program administrative data by statistical agencies and units, but OMB rules and regulations must be issued to guide these agencies. The panel’s vision assumes that, over the short term, the blending of federal program administrative data with survey and census data will take place. CEP also recommended access to quarterly earnings data held by states, and state-collected data acquired by federal departments. Setting legislative and regulatory priorities regarding these CEP recommendations should be a short-term task, in the panel’s judgment.
In the panel’s view, a first step could be to catalog all the state regulatory features that affect data sharing, especially those that might affect blending with private sector data. Any regulatory reform activities to permit such sharing of state and local government data solely for statistical purposes are also important short-term activities. These activities would naturally be part of any legislative or regulatory action that follows the existing Evidence Act prescriptions for data sharing. In addition, other legislation, such as the proposals for data synchronization,5 would logically receive action, permitting the sharing of specified IRS business data among the Bureau of Economic Analysis, the Bureau of Labor Statistics, and the U.S. Census Bureau for statistical purposes. In the short term, it would be useful for an expert group to consider legislative proposals that could incentivize data holders to share their data with a new data infrastructure. Such proposals for incentives could include liability protection against legal actions directly related to the act of data sharing, or possible tax incentives.
4 As noted in Chapter 2 and earlier in this chapter, the CHIPS and Science Act (PL 117-167) was signed into law on August 9, 2022. While this legislation calls for a demonstration project for NSDS, it did not formally establish a fully functioning NSDS.
Medium-term activities, according to the panel’s vision, should concentrate on reforms that involve private sector data. These might be the development of regional or sector-based hubs of shared data, permitting access to statistical information by NSDS. A data hub might constitute a new institution—a new private-public partnership. Any of these options will require careful, trust-building activities among the various sectors whose data will form part of a new infrastructure. During these activities, drafting of the legislative language to underlie the new entity facilitating sharing of private sector data should occur.
ATTRIBUTE 5: GOVERNANCE FRAMEWORK AND STANDARDS EFFECTIVELY SUPPORTING OPERATIONS
Much of the governance framework and the definition of standards necessary for a new data infrastructure (see Chapter 4) will necessarily follow the reform of the regulatory environment described above. The components of a data-governance framework involve a set of formal processes and procedures that implement the underlying principles of the infrastructure (Box 5-2).
The components in Box 5-2 have a logical sequencing, which could guide the short-term activities to implement a new data infrastructure. The choice of which type of governance body, with which authorities, best fits the United States naturally precedes the identification of standards, policies, and procedures. In the panel’s vision, statistical agencies play a prominent role in these discussions.
In the short term, the panel recommends that potential data-sharing organizations be convened to foster a partnership informed by the concerns and standard practices of those organizations. Documentation of current procedures for accessing data within data-sharing organizations could begin at these convenings, including documentation of decisionmaking procedures for granting access, both internal and external to the organization. The convenings could catalog the variety of software platforms used by potential data-sharing organizations and could assemble information on the metadata practices of various organizations. The convenings could also identify priorities for standards development. Finally, the convenings could generate reports suggesting standards that could be incorporated as part of a new data infrastructure. In a parallel set of activities, drafts of data-governance guidelines could be developed for review by diverse stakeholder groups.
5 In 2021, the U.S. Department of the Treasury proposed changes to allow data synchronization: https://home.treasury.gov/system/files/131/General-Explanations-FY2022.pdf, pp. 101–102. In a letter to Secretary Yellen, the American Economic Association endorsed this approach: https://www.aeaweb.org/content/file?id=14973
In the medium term, guidelines could be integrated into drafts of key legislative changes, so that the governance procedures and practices would have the force of statutes.
In the panel’s judgment, relevant stakeholders should be convened to begin developing standards in response to the identified data-infrastructure priorities. A group should be charged with establishing governance roles and responsibilities.
ATTRIBUTE 6: TRANSPARENCY TO THE PUBLIC REGARDING ANALYTICAL OPERATIONS USING THE DATA INFRASTRUCTURE
One of the critical design decisions for a new data infrastructure is choosing which transparency-building approaches best fit U.S. society, with its diverse interest groups. Extreme transparency would permit anyone at any given time to answer several questions:
- Which data are currently being accessed in statistical operations?
- Which data are being blended?
- What are the informational goals of blending?
- What purposes will be served and what benefits will be realized by the statistical products produced?
- How will the statistical products be distributed?
Wide dissemination of answers to these questions could create a level of transparency that might help to alleviate fears of data misuse and its associated harm to certain subpopulations. Dissemination could also raise awareness about the uses of data for the common good, which could bolster support of a new data infrastructure.
Chapter 3 reviewed various approaches for building transparency into the operations of a new data infrastructure, including a variety of structural features that could provide insight into the operations of the infrastructure. Other countries have employed alternative formal mechanisms:
- An ombudsperson to mediate the public’s, data subjects’, or data holders’ concerns with the organizations using the infrastructure;
- An information commissioner;
- A multi-person commission or other institution; and
- A Review Council that regulates data sharing.
In the panel’s opinion, all of these mechanisms gain their influence when they reveal to society how data are being used. The key stakeholders in transparency efforts are data subjects (who are described by records in the infrastructure), data holders (who are giving access to their datasets), and the general public (whose interests should be served by the statistical products of the new infrastructure).
Formal transparency-building structures could act on concerns about the failure to achieve a new data infrastructure’s mission to serve the common good. The various roles and structures implemented by other countries comprise ways to create a forum for the expression of those concerns. Legislative and regulatory reform initiatives are likely to incorporate the chosen definitions of roles or bodies created to act on data-infrastructure concerns or failures.
In the short term, the panel recommends increased public discussion of the types of oversight likely to enhance the credibility and trustworthiness of a new data infrastructure. CEP held a series of public meetings across the country to seek such input.6 In the panel’s opinion, more such gatherings might inform alternative structures and practices that could build meaningful transparency into a new data infrastructure.
Transparency also involves taking the perspective of the stakeholders seeking to understand the operations of the infrastructure. Discussions with stakeholders could identify communication priorities and evaluate alternative bodies and roles (e.g., ombuds, oversight bodies) whose purpose is informing society about the infrastructure’s current state or well-being. In the short term, these discussions could also produce a digest of the various ways data are currently curated, protected, and preserved.
In the medium term, the panel recommends that a communication strategy be implemented to respond to stakeholders’ priorities. Community oversight of a new data infrastructure needs to be one of the infrastructure’s key features. The ideas generated in the short-term stakeholder outreach must become part of the regulatory reform deliberations.
ATTRIBUTE 7: STATE-OF-THE-ART PRACTICES FOR ACCESS, STATISTICAL, COORDINATION, AND COMPUTATIONAL ACTIVITIES; CONTINUOUSLY IMPROVED TO EFFICIENTLY CREATE INCREASINGLY SECURE AND USEFUL INFORMATION
Earlier chapters in this report noted a large number of ongoing pilot projects, each of which is combining datasets not originally designed to be combined. All of these pilots are seeking more timely, accurate, and granular statistical information that can inform decisionmakers and the public. The pilot projects are addressing the challenges of diverse data structures, metadata standards, and regulatory restrictions. In addition, the pilot projects are necessarily innovating in terms of the technical aspects of data access and aggregation, and statistical estimation issues.
Certain high-velocity technological developments are relevant to a new data infrastructure. Cybersecurity approaches are undergoing rapid development. The role of encryption in data sharing is changing rapidly. Datasets have grown so large that moving them from one site to another has become infeasible. Hence, software approaches that allow remote users to access data where those data reside are increasingly being developed. The design of NSDS, as conceived by CEP, assumes no data warehousing, but real-time access, blending, and construction of statistics from multiple datasets simultaneously.7 Such approaches should profit from continuous improvements in technologies supporting multiserver and multiple-cloud use.
6 For a description of CEP’s public engagement, see: https://bipartisanpolicy.org/wp-content/uploads/2019/03/CEP-FAQs.pdf
Many of the innovations in cybersecurity and multisite computing are taking place in private sector information firms (e.g., Microsoft Azure, Amazon EC2). The level of investment in these firms greatly exceeds that of organizations actively producing statistical products from social and economic data. Concerns about the ability of federal statistical agencies to acquire cutting-edge technical talent to support the role of the federal government in a new data infrastructure were noted in Chapter 3. It was also noted that new partnerships between the private sector, the academic sector, and the federal government might explore new approaches to this challenge. Pilot projects currently underway use e-commerce data to measure price and quantity for retail trade statistics, and they place highly secure aggregation software within the protected cloud of the firm. In the panel’s vision, a similar approach could involve pre-vetted software “behind the firewall” of NSDS or the federal statistical agency producing aggregate statistical products for dissemination to the public. New partnership models seem important for the success of a new data infrastructure.
In the short term, if these pilots continue, agencies will develop new approaches to data access, matching, merging, and computation. Building a community of practice for such data blending could catalyze progress on the technical-skill base. The panel suggests targeted, periodic meetings in which tools, techniques, and skill sets are described and evaluated. Professional associations (e.g., the American Statistical Association) often serve such purposes. The Federal Committee on Statistical Methodology holds periodic conferences where such work could be showcased.
Also in the short term, the panel advises wider discussions about the need to educate staff on new procedures necessary for a new data infrastructure. Data access, transmission, curation, processing, computation, and statistical forms will all undergo continuous change over the coming years. Federal statistical agencies, other public-sector entities, and infrastructure-participating entities need a new generation of technical staff that can function across these various procedures. Staying current with new developments will require continuous updating of skills. Institutionalizing ongoing learning as a norm will necessitate additional training in the short term. Academic institutions and other key stakeholders can participate in these dialogues.
7 Note that this design is also compatible with the notion of data minimization, that only the data necessary for a given purpose are acquired to fulfill that purpose, as another tool to reduce risk to data subjects and data holders.
Over the medium term, in the panel’s view, new partnerships need to be formed between existing statistical operations and organizations with the skills needed to create and maintain a new data infrastructure. Currently, private sector internet-information enterprises are investing in the development of new tools for cybersecurity, data access, and privacy protection. In the panel’s opinion, these investments will create new tools and practices that can benefit a new data infrastructure. Hence, collaboration across the public and private sectors will be an important vehicle for the evolution of the infrastructure.
NEW PARTNERSHIPS MUST BE FORMED
Alternative organizational models for a new data infrastructure were reviewed in Chapter 4. This section assumes that NSDS will be established and will provide access to federal, state, and local government data for federal statistical and research purposes. In the panel’s opinion, the next component to design, build, and operate should be an organizational form to facilitate access to private sector data.
Three choices arise in accessing private sector data: (1) whether access to private sector data for blending should be voluntary or mandated;8 (2) whether NSDS adds private sector data to its purview or whether private sector data reside in one or more separate entities; and (3) whether the federal statistical system alone governs the entity/entities accessing private sector data, or whether governance involves a new public-private partnership. If NSDS becomes the direct portal for access to private sector data for national statistical purposes, fewer steps are necessary. If one or more new entities are established for accessing or blending private sector data, they must be designed, evaluated, and built. For ease of exposition in the discussion that follows, the new entity or entities are simply referred to as the “entity,” without assuming whether there will be one or more entities.
As reviewed in Chapter 4, the new feature common to all organizational options for an entity is a technical staff with the skills needed to develop and maintain the entity. Staff would need to be highly skilled in the curation of data for statistical purposes, as well as in safeguarding and managing data access and use. Staff would need skills specific to building software for remote access to curated data and for interacting with oversight bodies. In the panel’s view, short-term work involves designing the specific technical services to be provided by NSDS and other important data-infrastructure entities, like the FSRDC network.
8 The panel assumes that access will be voluntary.
Dialogue between governmental and private sector data holders is important in the early days of creating an entity. In the federal government, both executive- and legislative-branch involvement are important because the entity will either be an integral part of NSDS or be in frequent interaction with NSDS. With such options inevitably come unique legal and regulatory implications.
Over the short term, the lessons learned from experience with the NSF-sponsored ADC initiative will become clearer. This model of sector- and region-based data sharing for research purposes may inform the features of a new entity involving the sharing of private sector data.
Early in the development of an entity, the panel assumes that other features of NSDS will become fully formed through legislation and regulation. Finally, all the necessary data agreements between the new entity and NSDS will be drafted and vetted by leaders of private sector entities as well as government officials.
In the short term, it will also be useful, in the panel’s opinion, to clarify the roles and responsibilities of the entity, including the services and capabilities of the FSRDC network. A bipartisan, multi-sector dialogue about how best to manage and govern private sector data for national statistical purposes could have important implications for the organization of a new data infrastructure.
In the panel’s vision, medium-term tasks would involve piloting the operations of the new entity and interacting with initial private sector enterprises involved with those pilots. New pilots could usefully build on the experience, lessons learned, and challenges highlighted in the initial pilots. Scaling up the operations of an entity to handle multiple enterprises in various sectors, to provide access to diverse kinds of data for different statistical program types, will highlight additional challenges and issues. However, this scaling up will also identify benefits for private sector data holders that can be used to incentivize broader data sharing. Scaling up data-blending pilot projects to test data-service capabilities and responsiveness will provide important insights and identify issues that may need to be addressed. If a new entity other than NSDS is given responsibility for private sector data blending, a new or refined governance framework may be needed.
The panel anticipates that the sustainability of a new entity will be enhanced by forging a cooperative relationship between that entity, data holders, and key stakeholders. The reciprocal nature of the relationship could be key. Figure 5-2 shows two paths for the entity’s relationship with a data holder—a data-protection and data-sharing path, and an information-enrichment path. Participating data holders enjoy the benefits of state-of-the-art privacy protection and new information products that can help their businesses. Enhanced privacy-protection expertise could easily migrate from data accessed for statistical uses to a data holder’s entire enterprise.
In that way, a new entity hardens the country’s private sector data against cybersecurity breaches and inadvertent re-identification of data. Dialogue between the entity’s leadership and data holders would guide either path. Design and piloting work building a strong privacy-protecting environment could benefit data holders and the U.S. public. In the panel’s vision, all new information products would be publicly available and a large overlap in information needs is expected among data sharers.
There is much to do. A proposed new data infrastructure will build a coordinated ecosystem of data from all parts of society, for the benefit of the whole society. Table 5-1 presents a terse overview of the short- and medium-term tasks discussed above, for building a 21st century national data infrastructure for social and economic data, and the research said infrastructure could facilitate.
This report began with evidence that available data are insufficient to provide the United States with statistical information on critical societal features. The tools of the 20th century are not well suited to the challenges of the 21st century. At the same time, society is awash in digital data that could be used for larger societal benefits.
The panel presented a vision that moves the country toward a 21st century national data infrastructure, by mobilizing information for the common good. This vision of a new data infrastructure assumes statistical agencies and approved researchers can access and blend data from multiple sources to improve the quality, timeliness, granularity, and usefulness of statistics; to facilitate more rigorous social and economic research; and to support evidence-based policymaking and program evaluation. In this vision, effective and strengthened data safeguards will secure data, preserve privacy, and protect confidentiality while minimizing individual harm. Safeguard mechanisms and measures will be communicated and understood. The public and data holders will see how their data are used, by whom, for what purposes, and to what societal benefit, instilling confidence that their data will be used responsibly and ethically and only for approved statistical purposes. The public, data holders, data subjects, and other important constituencies will be engaged in standards development, data governance, and other decisions that affect them, strengthening trust in a new data infrastructure.
TABLE 5-1 Short- and Medium-Term Tasks for a 21st Century National Data Infrastructure
|Attribute of New Data Infrastructure||Short-Term Tasks||Medium-Term Tasks|
|Safeguards and advanced privacy-enhancing practices||
|Statistical uses only, for common-good information||
|Mobilization of all relevant national digital data assets||
|Reformed legal authorities||
|Governance framework and standards||
|Transparency to the public||
a The CHIPS and Science Act (PL 117-167) was signed into law on August 9, 2022. It allocates funding for an unnamed demonstration project to inform the establishment of NSDS.
b Such legislation would revise Internal Revenue Service regulations to allow the U.S. Census Bureau to share limited business tax data with the Bureau of Labor Statistics and the Bureau of Economic Analysis.
In the panel’s opinion, a new data infrastructure should not only provide tangible benefits for the common good, but also ensure societal benefits proportionate to the possible costs and risks of acquiring and using a data asset. The panel’s vision of a new data infrastructure supports the two-way flow of information from data holders to statistical agencies and back again. In the panel’s view, statistical agencies should provide useful information and services back to data holders that inform data holders’ decisions, operations, and activities. In turn, the public, data holders, and key stakeholders should support legislation and other changes that facilitate and support expanded data access and use.