Credible statistical information is foundational to the functioning of democratic societies. Just as bridges and highways facilitate the transportation necessary for commerce, the national data infrastructure informs decisions by governments, business enterprises, and individuals.

Historically, the United States national data infrastructure has relied on the operations of the federal statistical system and its data assets. These statistics, created from surveys, were essential to what we know about the well-being of the society and economy. They also created an infrastructure for vital empirical social and economic sciences research. As with other infrastructure, people can easily take these essential statistics for granted. Only when they are threatened do we recognize the need to protect them.

Declining survey participation poses a severe threat to the quality of statistical information. Yet, at the same time, the country has never produced a higher volume of digital data about the activities of individuals and businesses, which presents an opportunity.

Outcomes of a New Data Infrastructure

A new data infrastructure would strengthen, improve, and transform the ways the U.S. uses and benefits from richer informational resources, providing new capabilities and much-needed capacity-building.

Strengthens national information resources

The nation’s information resources are strengthened by blending data from multiple data sources and employing new methods, designs, capabilities, technology, and tools.

More timely and useful critical information

Critical information for decision makers is made more timely, granular, and useful by expanding access to data from a broader set of data holders.

Issues of national importance are highlighted

Researchers illuminate issues of national importance by accessing existing national data assets.

Evidence-based policy informs decisions

Enhanced evidence-based policy analysis informs federal, state, tribal, territory, and local governments.

Incentives for data holders

Data holders are incentivized to share data for statistical purposes through tangible benefits that inform and improve their operations and activities.

Improved legal protections allow use of existing data resources for the common good

A reformed legal and regulatory framework undergirds protections for both participants and authorities, permitting increased use of existing data resources for common-good statistical information.

Transparent, high-trust environment

The national data infrastructure operates in a high-trust environment, characterized by transparency, that balances expanded data use with strengthened privacy preservation and confidentiality protection, data security, legal compliance, and responsible and ethical data use.

7 Attributes of the 21st Century National Data Infrastructure Vision and a Roadmap to Build It

The beneficial outcomes of the new national data infrastructure described above are possible provided the infrastructure is guided by overarching principles, or attributes. The seven key attributes the panel identified are described below, each followed by the short-term and medium-term actions that the panel believes should be taken to help achieve its vision.

In building a 21st century data infrastructure, early success may come first from integrating data that are readily available, demonstrating the utility of improved statistical information of national importance, and constructing effective partnerships for necessary legal change. (Conclusion 5-1)

The United States is capable of building this new national data infrastructure. With appropriate design of operations using the data, the American public and decision-makers would enjoy more timely, granular, and accurate information and a more robust research infrastructure.

The social benefits of statistical information need not come at the price of increased threats to individuals’ privacy and confidentiality. Any harm to individuals from building and operating this infrastructure should be minimized and inadvertent. New technologies and strong regulations can strengthen safeguards for individuals.

It is ethically necessary and technically possible to preserve privacy and fulfill confidentiality pledges regarding data while simultaneously expanding the statistical uses of diverse data sources. (Conclusion 3-1)

Short-term actions to achieve this:

  1. Establish mechanisms to engage stakeholders (including data subjects and data holders) regarding data safeguard prerequisites for building trust
  2. Develop strategy for ensuring key data safeguards are communicated effectively and transparently
  3. Establish technical specifications of privacy-preserving and confidentiality-protecting designs

Medium-term actions to achieve this:

  1. Propose community and data-holder council to ensure data subjects’ interests are respected
  2. Publish safeguard procedures and mechanisms
  3. Publish privacy/confidentiality procedures
  4. Establish external council for safeguard oversight

Data infrastructure resources produce non-identifiable aggregates, estimates, and statistics to create useful information for society and decision-makers without harming individuals. Data infrastructure operations and decisions are consistent with professional principles, practices, and ethical standards; conducted by organizations free of political interference; and managed to ensure privacy and security. Confidential data cannot be used for enforcement of any laws or regulations affecting any individual data subject.

Short-term actions to achieve this:

  1. Use pilots to promote wider understanding of “statistical uses”
  2. Convene stakeholders to determine how best to describe new statistical products and distinguish them from privacy-threatening initiatives
  3. Monitor outcomes of the new Standard Access Process (SAP) for research uses of shared data to demonstrate value
  4. Launch communication campaign about the value of research as a “statistical use” of data

Medium-term actions to achieve this:

  1. Improve legislative language describing value of statistical uses versus administrative uses to build data-subject and data-holder trust

A new data infrastructure should have access to relevant existing national digital assets for the creation of essential aggregates. The infrastructure should mobilize and leverage data assets across different sectors.

Data from federal, state, tribal, territory, and local governments; the private sector; nonprofits and academic institutions; and crowdsourced and citizen-science data holders are crucial components of the 21st century data infrastructure. (Conclusion 4-1)

This infrastructure includes a wider variety of data holders, data subjects, data seekers, and data users than in the past. Thus, the need to demonstrate the benefits of expanded data sharing becomes even more important and a prerequisite for support.

Data sharing is incentivized when all data holders enjoy tangible benefits valuable to their missions, and when societal benefits are proportionate to possible costs and risks. (Conclusion 3-2)

Short-term actions to achieve this:

  1. Seek researcher input regarding SAP implementation as an access tool
  2. Monitor activities of ICSP working group on private-sector data
  3. Monitor “data-connecting” pilots collecting data at the data holder’s site
  4. Convene a group to evaluate methods for documenting and possibly quantifying benefits and costs
  5. Identify blended statistics generated by statistical agencies, and document and possibly quantify their benefits and costs
  6. Monitor pilot projects for blended federal/state/local data
  7. Publish criteria for prioritizing new data assets
  8. Consider feasibility and means of covering some data-holder costs associated with data sharing

Medium-term actions to achieve this:

  1. Monitor Data Hub’s use of SAP as an application, approval, and access tool
  2. Access federal program/administrative data for statistical purposes, and document benefits and costs
  3. Access state, territory, tribal, and local data for statistical purposes, and document benefits and costs
  4. Implement “data-connecting” learnings and technologies into statistical program production
  5. Clarify data-governance access and use policies, rules, and procedures, incorporating learnings from short-term activities

Federal statistical agencies have the right, under the Evidence Act, to use federal program data for statistical uses only, unless directly prohibited by law. However, there are many laws and regulations that do prohibit federal statistical agencies from utilizing existing data for statistical purposes. The panel assumes the legislative and regulatory recommendations stemming from the Evidence Act will be initiated, but more needs to be done to bolster data safeguards and broaden data access.

Legal and regulatory changes are necessary to achieve the full promise of the 21st century national data infrastructure. (Conclusion 3-3)

Short-term actions to achieve this:

  1. Legislation establishing the design, authorities, and funding for the NSDS
  2. Implement Evidence Act regulations and rule making
  3. Identify legislation/regulatory priorities regarding U.S. Commission on Evidence-Based Policymaking (CEP) state-related recommendations
  4. Develop data synchronization bill legislative strategy
  5. Identify legal options that would incentivize data holders to share data

Medium-term actions to achieve this:

  1. Enact legal authorities for all necessary data-sharing entities
  2. Adopt legal protections for private-sector data sharing
  3. Introduce legislative strategies/priorities

The “data governance” framework includes guiding principles, authorities, structures, and directives. Data governance involves active stakeholder engagement, oversight protocols, open and transparent communications, and accountability. Standards in data definitions and access protocols are critical to provide interoperability across partners.

Effective data governance is critical and should be inclusive and accountable; governance policies and standards facilitating interoperability include key stakeholders and oversight bodies. (Conclusion 3-4)

Short-term actions to achieve this:

  1. Convene potential data-sharing organizations
  2. Document current practices in data access
  3. Catalog current data platforms of potential data-sharing organizations
  4. Document current ways data are curated, protected, and preserved
  5. Document existing metadata practices
  6. Identify priorities for standards development
  7. Draft data sharing guidelines

Medium-term actions to achieve this:

  1. Produce legislative language for governance procedures
  2. Draft regulatory guidelines for practices
  3. Convene relevant stakeholders to begin developing standards responding to infrastructure priorities
  4. Establish governance roles and responsibilities

At any time, the public, data holders, and data subjects should be able to know how their data are used, by whom, for what purposes, and to what societal benefit.

Trust in a new data infrastructure requires transparency of operations and accountability of the operators, with ongoing engagement of stakeholders. (Conclusion 3-5)

Transparency enables the public to express concerns, seek redress and oversee compliance with the stated mission of the infrastructure. Transparency is also a prerequisite for public trust in the infrastructure and associated statistical products.

Short-term actions to achieve this:

  1. Identify communication priorities regarding transparency
  2. Sponsor public discussion regarding alternative oversight structures to achieve transparency
  3. Engage stakeholders to evaluate alternative approaches

Medium-term actions to achieve this:

  1. Implement communication strategy that responds to stakeholders’ priorities
  2. Draft legislative language describing oversight vehicles to achieve transparency

New developments in remote access, cybersecurity, cryptography, and computational approaches are constantly emerging. Thus, the operations inside a data infrastructure must continually innovate and improve. Similarly, an infrastructure must have the talent to blend data together for more insightful research and statistical products. The acquisition, access and use of diverse data assets held by different organizations in different sectors will involve new partners with divergent experiences and expertise. This dynamism demands continuous refreshing of the data infrastructure staff skill mix.

The operations of a new data infrastructure would benefit from the inclusion of continually evolving practices, methods, technologies, and skills, to ethically leverage new technologies and advanced methods. (Conclusion 3-6)

Short-term actions to achieve this:

  1. Exchange knowledge about needed staff skillsets to support new operations of infrastructure
  2. Build communities of practice to catalyze the technical skills base
  3. Develop professional culture within pilot projects for data protection
  4. Develop organizational procedures for continuous updating of tools and practices

Medium-term actions to achieve this:

  1. Develop new partnerships across sectors to provide technical skills for all organizations involved
  2. Continuously update procedures and practices to achieve goals of infrastructure

Multiple Organizational Structures Can Support a New Data Infrastructure

Ideas have been promoted for how to organize statistical operations within a new data infrastructure. The key new entity (or set of entities) needed is not a data warehouse, but rather a computational resource for linking data files in diverse ways to produce blended statistics. Several organizational models for this new entity were identified: inside the federal government, outside the federal government, or in a new public-private partnership. To identify the best option for the United States, the panel suggests beginning a widespread dialogue involving the many stakeholders of a data infrastructure. Short-term and medium-term actions for organizing statistical operations within a new data infrastructure are listed below.

  1. Monitor America's DataHub Consortium capabilities regarding regional and sectoral data sharing
  2. Clarify data infrastructure roles and responsibilities
  3. Identify NSDS-provided services and capabilities
  4. Clarify Federal Statistical Research Data Center (FSRDC) services and capabilities
  5. Sponsor bipartisan, multisector dialogue on how best to govern private-sector data use for national statistical purposes
  6. Expand voluntary private-sector data sharing for statistical uses

  • (If NSDS is a portal) Begin pilots for accessing private-sector data by connecting to data at data holder’s site
  • (If new organization is developed for private-sector data) Begin building the necessary governance framework and support
  • Scale up data-blending pilots to test the responsiveness of data service organizational capabilities
  • Identify challenges to be addressed and document benefits accruing to data holders

Implications of Utilizing Blended Data* in a New Data Infrastructure

*Blended data occur when at least two different data assets are combined to produce statistical information.

Data Infrastructure Capabilities

Features of the 20th-century data infrastructure must change to achieve the panel’s vision for a new data infrastructure. As a growing number of initiatives combine data from multiple independent sources, further desirable capabilities of a new data infrastructure are being articulated. The box below lists work by the United Nations Economic Commission for Europe’s High-Level Data Group for Modernization of Statistical Production and Services related to a Common Statistical Data Architecture (CSDA), an initiative aimed at consistently describing the data aspects of statistical production. The group identified high-level capabilities required by a new data infrastructure to realize the promise of blending multiple data sources. Capabilities require the interaction of organizations, people, processes, and technology and generally describe the “what and why” of statistical production, not the “how and who.”

The new data infrastructure will require enhanced capabilities, including:
  1. Data design, definition, and description of data not originally built for statistical analysis
  2. Data logistics, managing supply chains between data holders and data users
  3. Data sharing support, accessing data from and returning statistical information to partners
  4. Data transformation, the ability to transform data to make them suitable for specific uses and purposes
  5. Data integration, the ability to combine, link, relate, and/or align data assets from multiple sources
  6. Data governance, the ability to manage data assets by defining and enforcing established policies, processes, and rules in accordance with strategic objectives
  7. Security and data assurance, protecting and maintaining the data assets, at rest and in-transit
  8. Provenance and lineage, tracking the edition and source of a given version of a data asset
  9. Knowledge management, documenting the meaning of individual measurements on data assets
Source: Adapted from United Nations Economic Commission for Europe on behalf of the international statistical community: https://statswiki.unece.org/display/DA/CSDA+2.0. Reproduced under Creative Commons Attribution 3.0 International License (https://creativecommons.org/licenses/by/3.0/).

A new data infrastructure will require enhanced capabilities. While there is much existent talent for documenting data designed for statistical uses (e.g., surveys and censuses), there is less expertise available for documenting those features of administrative and process data that were never intended to be used in statistical operations. Similarly, the box above notes the need to define and track supply chains of data, to access data in diverse locations simultaneously, and to work with a set of partners deserving ongoing support. Data integration is well exercised in some organizations, but not all, especially for data sets that were not originally designed to be used in tandem. While data governance is well documented in federal statistical agencies, it was originally designed for data assets that would be fully acquired and stored behind the agencies’ firewalls, not for a world in which data assets are too large to move from organization to organization. Finally, knowledge management—the ability to understand differences among measures found in multiple data sets—is critical for the statistical operations needed to blend data into more informative estimates.
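As a rough illustration of the data integration capability described above, the following Python sketch links two toy data assets on a shared key and publishes only an aggregate. Everything in it is an assumption made for illustration: the function name `blend_mean_by_region`, the field names `id`, `region`, and `income`, and the `min_cell` threshold for withholding small groups do not come from the panel's report, and a real system would rely on far more careful linkage and disclosure-limitation methods.

```python
from collections import defaultdict


def blend_mean_by_region(survey_rows, admin_rows, min_cell=3):
    """Link two hypothetical data assets by a shared id and return mean
    income per region, withholding regions with fewer than min_cell
    linked records (an illustrative safeguard, not a standard)."""
    # Index the administrative records by the shared key.
    admin_by_id = {row["id"]: row for row in admin_rows}

    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in survey_rows:
        match = admin_by_id.get(row["id"])
        if match is None:
            continue  # unlinked survey records do not contribute
        sums[row["region"]] += match["income"]
        counts[row["region"]] += 1

    # Only aggregates leave this function; record-level data stay inside.
    return {
        region: sums[region] / counts[region]
        for region in counts
        if counts[region] >= min_cell
    }


# Toy inputs: a survey file carrying region, an administrative file carrying income.
survey = [
    {"id": 1, "region": "A"},
    {"id": 2, "region": "A"},
    {"id": 3, "region": "A"},
    {"id": 4, "region": "B"},
]
admin = [
    {"id": 1, "income": 10.0},
    {"id": 2, "income": 20.0},
    {"id": 3, "income": 30.0},
    {"id": 4, "income": 40.0},
]
blended = blend_mean_by_region(survey, admin)
```

In this toy run, region B has a single linked record and is therefore withheld, so the output reports a mean only for region A.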

Privacy and Ethical Implications

A 21st-century national data infrastructure cannot succeed without:

  • ethical exchange of data
  • trust in institutions involved in data exchange
  • privacy-preserving techniques
  • technical, organizational, and legal mechanisms supporting responsible data practices

Ethical treatment of data subjects requires adherence to four key values:

  • the actions of a new infrastructure should be guided by attention to how use of a subject’s data will affect that subject’s life.
  • there are underlying issues of autonomy, that is, the ability of individuals to make their own decisions. A new data infrastructure must recognize the nature of informed consent by the data subject.
  • there is a concern about beneficence, that is, to what extent will the data be used to produce good outcomes for the data subject?
  • there is a focus on human dignity, that is, are the activities of a new infrastructure conducted in a manner that is respectful of data subjects?

Collectively, in the panel’s opinion, these values must underlie both policy and practice.

In the 20th century, lawmakers passed numerous bills restricting government data collection and use. While federal data holders must only concern themselves with federal laws, multinational corporations must grapple with privacy laws in numerous countries. While some companies may be required to share data (e.g., pollution levels from a manufacturer), others operate under voluntary agreements between users and data holders (e.g., credit-reporting data), and still others are legally prohibited from sharing certain kinds of data without explicit consent (e.g., healthcare, movie rentals). The legal procedures covering privacy are both complex and incomplete. Most importantly, the limitations of privacy laws infuriate data subjects, data holders, and data users for wholly distinct reasons.

Advances in computing have increased privacy-related risks while also enabling the development of privacy-enhancing technologies.

Privacy laws and technologies can help strengthen data protections. Tracking these technical mechanisms and integrating them into practice will increase data holders’ confidence in sharing data.

Even when data are de-identified, linking sources increases risk of re-identification. Merely connecting the site of an individualized record to the same site of another can inadvertently reveal personally identifiable information that was obscured prior to blending. Methodologies to balance privacy tradeoffs, such as geomasking, can address the need to protect individuals while still enabling individual-level data to be utilized or analyzed in a way that does not significantly affect statistical results.
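To make the geomasking idea concrete, here is a minimal Python sketch of one common variant, often called donut geomasking, in which each location is moved a random distance in a random direction. It is not drawn from the panel's report: the function names, the 100 m and 500 m radii, and the sample coordinates are all assumptions chosen for illustration, and the flat-earth approximation used is only reasonable for small displacements away from the poles.

```python
import math
import random

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters


def donut_geomask(lat, lon, min_m, max_m, rng):
    """Return (lat, lon) displaced by a random distance in [min_m, max_m]
    meters along a random bearing. Names and defaults are hypothetical."""
    if not 0.0 <= min_m <= max_m:
        raise ValueError("require 0 <= min_m <= max_m")
    bearing = rng.uniform(0.0, 2.0 * math.pi)
    # Sample the squared radius so points are uniform over the ring's area.
    dist = math.sqrt(rng.uniform(min_m ** 2, max_m ** 2))
    # Small-displacement approximation: convert meters to angular offsets.
    dlat = dist * math.cos(bearing) / EARTH_RADIUS_M
    dlon = dist * math.sin(bearing) / (EARTH_RADIUS_M * math.cos(math.radians(lat)))
    return lat + math.degrees(dlat), lon + math.degrees(dlon)


def approx_distance_m(lat1, lon1, lat2, lon2):
    """Equirectangular distance in meters, adequate for short distances."""
    mean_lat = math.radians((lat1 + lat2) / 2.0)
    dx = math.radians(lon2 - lon1) * math.cos(mean_lat) * EARTH_RADIUS_M
    dy = math.radians(lat2 - lat1) * EARTH_RADIUS_M
    return math.hypot(dx, dy)


# Mask one illustrative point with a seeded generator for repeatability.
rng = random.Random(0)
masked_lat, masked_lon = donut_geomask(38.9, -77.0, 100.0, 500.0, rng)
shift_m = approx_distance_m(38.9, -77.0, masked_lat, masked_lon)
```

The minimum radius keeps a masked point from landing on or very near the true location, while the maximum radius bounds how far area-level statistics can drift. Choosing those radii is exactly the privacy-versus-accuracy tradeoff the paragraph above describes.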

A new data infrastructure must be attentive to, and in conversation with, the range of stakeholders engaging on topics such as trustworthy AI, data ethics, and data equity. While useful, privacy laws and technologies alone will not serve as an effective response to threats that could challenge the legitimacy of a new data infrastructure. Rather, all who are involved (including data subjects, data holders, and data users) must collectively negotiate best practices, governance mechanisms, and normative expectations about data exchange. This requires creating and sustaining a governing body (or set of bodies) tasked with building processes and practices, sustaining relationships with stakeholders, and ensuring that trade-offs are collectively negotiated.

RESOURCES

What is CNSTAT?

The Committee on National Statistics is a unit at the National Academies of Sciences, Engineering, and Medicine whose mission is to provide advice to the federal government and the nation grounded in the current best scientific knowledge and practice that will lead to improved statistical methods and information upon which to base public policy. CNSTAT seeks to advance the quality of statistical information, contribute to the statistical policies and coordinating activities of the federal government, and help provide a forward-looking vision for the federal statistical system and national statistics more broadly in service of the public good.

Committee Process

The National Academies of Sciences, Engineering, and Medicine appointed a consensus panel to produce three complementary reports on topics to guide the development of a vision for a new data infrastructure for federal statistics and social and economic research in the 21st century.

More about the committee members, the committee process, and the activities supporting this project.

Sponsor
National Science Foundation