The National Academies of Sciences, Engineering, and Medicine | Toward a 21st Century National Data Infrastructure: An Interactive Overview of CNSTAT’s Visioning Series

The beneficial outcomes of a new national data infrastructure are possible provided that the infrastructure is guided by overarching principles or attributes. The seven key attributes the panel identified are described below, each followed by the short-term and medium-term actions that the panel believes should be conducted to help achieve its vision.

In building a 21st century data infrastructure, early success may come first from integrating data that are readily available, demonstrating the utility of improved statistical information of national importance, and constructing effective partnerships for necessary legal change. (Conclusion 5-1)

The United States is capable of building this new national data infrastructure. With appropriate design of operations using the data, the American public and decision-makers would enjoy more timely, granular, and accurate information and a more robust research infrastructure.

The social benefits of statistical information need not come at the price of increased threats to individuals’ privacy and confidentiality. Any harm to individuals from building and operating this infrastructure should be minimized and inadvertent. New technologies and strong regulations can strengthen safeguards for individuals.

It is ethically necessary and technically possible to preserve privacy and fulfill confidentiality pledges regarding data while simultaneously expanding the statistical uses of diverse data sources. (Conclusion 3-1)

Short-term actions to achieve this:

Establish mechanisms to engage stakeholders (including data subjects and data holders) regarding data safeguard prerequisites for building trust
Develop strategy for ensuring key data safeguards are communicated effectively and transparently
Establish technical specifications of privacy-preserving and confidentiality-protecting designs

Medium-term actions to achieve this:

Propose community and data-holder council to ensure data subjects’ interests are respected
Publish safeguard procedures and mechanisms
Publish privacy/confidentiality procedures
Establish external council for safeguard oversight

Data infrastructure resources produce non-identifiable aggregates, estimates, and statistics to create useful information for society and decision-makers without harming individuals. Data infrastructure operations and decisions are consistent with professional principles and practices, ethical standards, conducted by organizations free of political interference, and managed to ensure privacy and security. Confidential data cannot be used for enforcement of any laws or regulations affecting any individual data subject.

Short-term actions to achieve this:

Use pilots to promote wider understanding of “statistical uses”
Convene stakeholders to determine how best to describe new statistical products and distinguish them from privacy-threatening initiatives
Monitor outcomes of the new Standard Access Process (SAP) for research uses of shared data to demonstrate value
Launch communication campaign about the value of research as a “statistical use” of data

Medium-term actions to achieve this:

Improve legislative language describing value of statistical uses versus administrative uses to build data-subject and data-holder trust

A new data infrastructure should have access to relevant existing national digital assets for the creation of essential aggregates. The infrastructure should mobilize and leverage data assets across different sectors.

Data from federal, state, tribal, territory, and local governments; the private sector; nonprofits and academic institutions; and crowdsourced and citizen-science data holders are crucial components of the 21st century data infrastructure. (Conclusion 4-1)

This infrastructure includes a wider variety of data holders, data subjects, data seekers, and data users than in the past. Thus, the need to demonstrate the benefits of expanded data sharing becomes even more important and a prerequisite for support.

Data sharing is incentivized when all data holders enjoy tangible benefits valuable to their missions, and when societal benefits are proportionate to possible costs and risks. (Conclusion 3-2)

Short-term actions to achieve this:

Seek researcher input regarding SAP implementation as an access tool
Monitor activities of the Interagency Council on Statistical Policy (ICSP) working group on private-sector data
Monitor “data-connecting” pilots collecting data at the data holder’s site
Convene a group to evaluate methods for documenting and possibly quantifying benefits and costs
Identify blended statistics generated by statistical agencies, document and possibly quantify benefits and costs
Monitor pilot projects for blended federal/state/local data
Publish criteria for prioritizing new data assets
Consider feasibility and means of covering some data-holder costs associated with data sharing

Medium-term actions to achieve this:

Monitor Data Hub’s use of SAP as an application, approval, and access tool
Access federal program/administrative data for statistical purposes, document benefits and costs
Access state, territory, tribal and local data for statistical purposes, document benefits and costs
Implement “data-connecting” learnings and technologies into statistical program production
Clarify data-governance access and use policies, rules, procedures incorporating learnings from short-term activities

Federal statistical agencies have the right, under the Evidence Act, to use federal program data for statistical uses only, unless directly prohibited by law. However, there are many laws and regulations that do prohibit federal statistical agencies from utilizing existing data for statistical purposes. The panel assumes the legislative and regulatory recommendations stemming from the Evidence Act will be initiated, but more needs to be done to bolster data safeguards and broaden data access.

Legal and regulatory changes are necessary to achieve the full promise of the 21st century national data infrastructure. (Conclusion 3-3)

Short-term actions to achieve this:

Legislation establishing the design, authorities, and funding for the National Secure Data Service (NSDS)
Implement Evidence Act regulations and rule making
Identify legislation/regulatory priorities regarding U.S. Commission on Evidence-Based Policymaking (CEP) state-related recommendations
Develop data synchronization bill legislative strategy
Identify legal options that would incentivize data holders to share data

Medium-term actions to achieve this:

Enact legal authorities for all necessary data-sharing entities
Adopt legal protections for private-sector data sharing
Introduce legislative strategies/priorities

The “data governance” framework includes guiding principles, authorities, structures, and directives. Data governance involves active stakeholder engagement, oversight protocols, open and transparent communications, and accountability. Standards in data definitions and access protocols are critical to provide interoperability across partners.

Effective data governance is critical and should be inclusive and accountable; governance policies and standards facilitating interoperability include key stakeholders and oversight bodies. (Conclusion 3-4)

Short-term actions to achieve this:

Convene potential data-sharing organizations
Document current practices in data access
Catalog current data platforms of potential data-sharing organizations
Document current ways data are curated, protected, and preserved
Document existing metadata practices
Identify priorities for standards development
Draft data sharing guidelines

Medium-term actions to achieve this:

Produce legislative language for governance procedures
Draft regulatory guidelines for practices
Convene relevant stakeholders to begin developing standards responding to infrastructure priorities
Establish governance roles and responsibilities

At any time, the public, data holders, and data subjects should be able to know how their data are used, by whom, for what purposes, and to what societal benefit.

Trust in a new data infrastructure requires transparency of operations and accountability of the operators, with ongoing engagement of stakeholders. (Conclusion 3-5)

Transparency enables the public to express concerns, seek redress and oversee compliance with the stated mission of the infrastructure. Transparency is also a prerequisite for public trust in the infrastructure and associated statistical products.

Short-term actions to achieve this:

Identify communication priorities regarding transparency
Sponsor public discussion regarding alternative oversight structures to achieve transparency
Engage stakeholders to evaluate alternative approaches

Medium-term actions to achieve this:

Implement communication strategy that responds to stakeholders’ priorities
Draft legislative language describing oversight vehicles to achieve transparency

New developments in remote access, cybersecurity, cryptography, and computational approaches are constantly emerging. Thus, the operations inside a data infrastructure must continually innovate and improve. Similarly, an infrastructure must have the talent to blend data together for more insightful research and statistical products. The acquisition, access and use of diverse data assets held by different organizations in different sectors will involve new partners with divergent experiences and expertise. This dynamism demands continuous refreshing of the data infrastructure staff skill mix.

The operations of a new data infrastructure would benefit from the inclusion of continually evolving practices, methods, technologies, and skills, to ethically leverage new technologies and advanced methods. (Conclusion 3-6)

Short-term actions to achieve this:

Exchange knowledge about needed staff skillsets to support new operations of infrastructure
Build communities of practice to catalyze the technical skills base
Develop professional culture within pilot projects for data protection
Develop organizational procedures for continuous updating of tools and practices

Medium-term actions to achieve this:

Develop new partnerships across sectors to provide technical skills for all organizations involved
Continuously update procedures and practices to achieve goals of infrastructure

Multiple Organizational Structures Can Support a New Data Infrastructure

Ideas have been promoted for how to organize statistical operations within a new data infrastructure. The key new entity (or set of entities) needed is not a data warehouse, but rather a computational resource for linking data files in diverse ways to produce blended statistics. Several organizational models for this new entity were identified: inside the federal government, outside the federal government, or in a new public-private partnership. To identify the best option for the United States, the panel suggests beginning a widespread dialogue involving the many stakeholders of a data infrastructure. The short-term and medium-term actions for organizing statistical operations within a new data infrastructure are listed below.

Short-term actions to achieve this:

Monitor America's DataHub Consortium capabilities regarding regional and sectoral data sharing
Clarify data infrastructure roles and responsibilities
Identify NSDS-provided services and capabilities
Clarify Federal Statistical Research Data Center (FSRDC) services and capabilities
Sponsor bipartisan, multisector dialogue on how best to govern private-sector data use for national statistical purposes
Expand voluntary private-sector data sharing for statistical uses

Medium-term actions to achieve this:

(If NSDS is a portal) Begin pilots for accessing private-sector data by connecting to data at data holder’s site
(If new organization is developed for private-sector data) Begin building the necessary governance framework and support
Scale up data-blending pilots to test the responsiveness of data service organizational capabilities
Identify challenges to be addressed and document benefits accruing to data holders

Implications of Utilizing Blended Data* in a New Data Infrastructure

The Committee on National Statistics has completed two consensus reports addressing specific issues identified in this vision of a new national data infrastructure. In addition, areas for further work have been identified.

*Blended data occur when at least two different data assets are combined to produce statistical information.
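
As a rough illustration of this definition, the sketch below blends two hypothetical data assets, a small survey file and an administrative file, by linking on a shared key and then producing an aggregate statistic. The field names, record values, and the `blend` helper are all assumptions made for illustration; none of them come from the reports.

```python
from statistics import mean

# Hypothetical survey records: respondent id and county of residence.
survey = [
    {"id": 1, "county": "A"},
    {"id": 2, "county": "A"},
    {"id": 3, "county": "B"},
]

# Hypothetical administrative records: id and program benefit amount.
admin = [
    {"id": 1, "benefit": 100.0},
    {"id": 2, "benefit": 300.0},
    {"id": 3, "benefit": 250.0},
]


def blend(survey_rows, admin_rows):
    """Link the two assets on `id` and return the mean benefit per county."""
    benefit_by_id = {row["id"]: row["benefit"] for row in admin_rows}
    by_county = {}
    for row in survey_rows:
        if row["id"] in benefit_by_id:  # keep only records found in both assets
            by_county.setdefault(row["county"], []).append(benefit_by_id[row["id"]])
    # Only aggregates leave this function, not individual records.
    return {county: mean(values) for county, values in by_county.items()}


county_means = blend(survey, admin)
print(county_means)
```

Real blending typically involves imperfect identifiers and far larger files; the exact-key join here is the simplest possible case.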

Enhancing Survey Programs by Using Multiple Data Sources

Using multiple data sources can improve national statistics, provide new resources for social and economic research, and promote data equity by:

Providing information to improve the quality of data sources
Giving additional information about survey respondents
Producing statistics for small populations
Creating data products directly from administrative data

Improving Data Quality

Combining information across data sources must be done carefully, with an understanding of the properties of each component dataset and the statistics resulting from their combination.

Relying on multiple sources can take advantage of the strengths of each source while compensating for its weaknesses.
A new framework of quality standards and guidelines is needed for evaluating the fitness for use of statistics produced from multiple data sources.

Enhancing Data Equity

The use of multiple data sources can benefit data equity—promoting the collection and use of data in which all populations, and especially those that have been historically underrepresented or misrepresented in the data record, are visible and accurately portrayed.

Multiple data sources can be used to assess and improve the coverage of underrepresented groups, and to enable the production of disaggregated statistics.
Linkage procedures may introduce biases because linkage errors can disproportionately affect members of some population subgroups. It is important to assess data-equity implications of record-linkage methods.
Development of standards for data equity would enhance efforts to improve data equity across the federal statistical system.
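
One simple way to begin assessing the data-equity implications of record linkage is to compare linkage (match) rates across population subgroups. The sketch below is a minimal example under stated assumptions: the `group` and `linked` fields, the sample records, and the `match_rates` helper are hypothetical, and a gap in match rates is only a first diagnostic, not a full assessment of linkage bias.

```python
from collections import defaultdict


def match_rates(records):
    """Return the share of records successfully linked, per subgroup."""
    linked = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["group"]] += 1
        if rec["linked"]:
            linked[rec["group"]] += 1
    return {group: linked[group] / total[group] for group in total}


# Hypothetical linkage outcomes for two subgroups, X and Y.
records = [
    {"group": "X", "linked": True},
    {"group": "X", "linked": True},
    {"group": "X", "linked": True},
    {"group": "X", "linked": True},
    {"group": "Y", "linked": True},
    {"group": "Y", "linked": False},
    {"group": "Y", "linked": True},
    {"group": "Y", "linked": False},
]

rates = match_rates(records)
# A large gap suggests linkage errors fall unevenly across subgroups.
gap = max(rates.values()) - min(rates.values())
print(rates, gap)
```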

Important Considerations

Transparency and documentation of datasets and methods used to combine them are essential for producing trust in information created from multiple data sources, particularly as new types of data are used.

A new data infrastructure requires investment not only in data sources but also in the people who can work with those data. Beyond the technical challenges of developing new statistical methods, there are challenges for promoting data equity and public trust in integrated data. It will be important for statistical agencies to invest in personnel, training, and cyberinfrastructure.

To learn more about how using multiple data sources can enhance survey programs, see the second consensus study report in the series, Toward a 21st Century National Data Infrastructure: Enhancing Survey Programs by Using Multiple Data Sources.

Managing Privacy and Confidentiality Risks in Blended Data

Protecting privacy and ensuring confidentiality in data is a critical component of modernizing our national data infrastructure. The use of blended data (combinations of previously collected data sources) presents new considerations for responsible data stewardship.

Agencies, policymakers, data users, and data subjects need to recognize that any blended (or nonblended) data release that offers nontrivial usefulness introduces disclosure risks; it is not productive or correct to think of disclosure risks as a “yes or no” feature.
Data-release strategies need to balance disclosure risks with data usefulness. When usefulness is high, stakeholders may be willing to accept greater risks to realize the benefits. Agencies can use various disclosure-protection methods for differing data-analysis objectives, such as tiered access approaches.
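
To make the tiered-access idea concrete, the sketch below releases the same table of counts differently depending on the access tier: exact counts for a restricted tier, and small cells suppressed for a public tier. The tier names, the `min_cell` threshold of 10, and the `release_table` helper are assumptions chosen for illustration, not methods prescribed by the report. Small-cell suppression is only one of many disclosure-protection methods, and on its own it reduces rather than eliminates disclosure risk.

```python
def release_table(counts, tier, min_cell=10):
    """Return a copy of `counts` prepared for the given access tier.

    'restricted': exact counts, for vetted users in a controlled setting.
    'public': cells smaller than `min_cell` are suppressed (set to None).
    """
    if tier == "restricted":
        return dict(counts)
    if tier == "public":
        return {key: (n if n >= min_cell else None) for key, n in counts.items()}
    raise ValueError(f"unknown tier: {tier!r}")


# Hypothetical counts by area.
counts = {"Area A": 25, "Area B": 3}

public_view = release_table(counts, "public")
restricted_view = release_table(counts, "restricted")
print(public_view)
print(restricted_view)
```

The public tier gives up some usefulness (the count for Area B) in exchange for lower disclosure risk, which is the kind of trade-off the points above describe.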

The report also provides a framework for managing disclosure risks that accounts for the unique attributes of blended data and poses a series of questions to guide considered decision-making.

A Framework for Managing Disclosure Risks in Blended Data

What are the anticipated final products of data blending?
What are potential downstream uses of blended data?
What are potential considerations for disclosure risks and harms, and data usefulness?

What data sources are available to accomplish blending, and what are the interests of data holders?
What steps can be taken to reduce disclosure risks and enhance usefulness when compiling ingredient files?

What are the disclosure risks associated with procuring ingredient data?
What are the disclosure risk/usefulness trade-offs in the plan for accessing ingredient files?

When blending requires linking records from ingredient files, what linkage strategies can be used?
Are resultant blended data sufficiently useful to meet the blending objective?

What are the best-available scientific methods for disclosure limitation to accomplish the blended data objective, and are sufficient resources available to implement those methods?
How can stakeholders be engaged in the decision-making process?
What is the mitigation plan for confidentiality breaches?

How will agencies track data provenance and update files when beneficial?
What is the decision-making process for continuing access to or sunsetting the blended data product, and how do participating agencies contribute to those decisions?
How will agencies communicate decisions about disclosure management policies with stakeholders?

SOURCE: Panel generated.

Informed Consent

Public concerns often exist regarding the use of personal information. Informed consent is intended to ensure data subject autonomy, but, in reality, it may not always provide satisfactory agency for individuals or organizations. Communicating risks and benefits can be difficult and complex.

Informed consent issues are amplified in blended data. Customary informed consent language may insufficiently describe disclosure risks or potential usefulness inherent in blended data. Disclosure risk/usefulness trade-offs are difficult to communicate to respondents at the time of collection (particularly collection of ingredient data) because both attributes can change over time. Challenges are compounded by differing policies among federal statistical agencies regarding informed consent.

Communicating the intended uses of data and determining subsequent acceptable disclosure risks need to take into account the needs and concerns of respondents, while also permitting a practicable approach to the management of blended data. Future work is needed in the area of informed consent to improve communication about intended use, disclosure risk/usefulness trade-offs, and potential harm. Relevant topics could include (a) ways to communicate (future) data use to data subjects (including blending of private-sector data), (b) processes by which persons or establishments can decide which data to share for a (future) purpose, (c) the effects of such decisions on management of disclosure risk/usefulness trade-offs for blended data, (d) the effects of release of personal data on confidentiality of data collected from the data subject’s community, and (e) ways to account for differing privacy preferences.

Research Computing and Data Workforce

Researchers in statistical agencies, government, academia, and beyond increasingly depend on the professional skills of the research computing and data (RCD) workforce to facilitate the use of vast and ever-evolving technical resources. RCD professionals work at the intersection of cyberinfrastructure, research, and data. Data blending, and especially privacy and confidentiality protections as part of the blending lifecycle, clearly depends on an adequate RCD workforce.

To fully meet current and future RCD needs in a new data infrastructure, organizations engaged in data blending need to address issues hampering the full development of a stable, competent RCD profession and workforce. Meeting these RCD needs is complex. On the one hand, agencies need to understand the role of RCD and recognize the growing need for RCD professionals given the volume of data, the rapid evolution of computing resources, and researchers’ general lack of experience or skills necessary to make full use of emerging tools and techniques. As examples of issues that agencies need to address, the RCD profession lacks standardized job titles, has poorly defined job descriptions, and typically disperses work across multiple units within resource organizations. Traditional information technology may not naturally accommodate RCD roles and responsibilities, which can make communicating emerging program staffing needs to human resources departments difficult. Additionally, recruitment, retention, and development of RCD professionals are challenging, in part because clear career paths are not evident and also due to a lack of certificate and degree programs and scalable.

On the other hand, agencies producing government data face several challenges. There is strong competition between the governmental and the private sectors for skilled staff, and starting salary disparities are significant. It is important to identify ways to make government employers more competitive when hiring RCD specialists. In the panel’s view, cultivating an RCD workforce within government agencies is of major importance for statistical agencies and the contractors who support them. This area deserves extensive, dedicated study. New research on privacy-enhancing technologies makes evident the need for developing specialized techniques, tools, education, workforce training, free software libraries, and applications. Work in this direction has begun. Nonetheless, identifying ways to improve the competitiveness of government employers to attract and retain this workforce is also critical.

Credible statistical information is foundational to the functioning of democratic societies. Just as bridges and highways facilitate the transportation necessary for commerce, the national data infrastructure informs decisions by governments, business enterprises, and individuals.

Outcomes of a New Data Infrastructure

Strengthens national information resources

More timely and useful critical information

Issues of national importance are highlighted

Evidence-based policy informs decisions

Incentives for data holders

Improved legal protections allow the use of existing data resources for the common good

Transparent, high-trust environment
