Historically, the United States national data infrastructure has relied on the operations of the federal statistical system and its data assets. These statistics, created from surveys, have been essential to what we know about the well-being of society and the economy. They also created an infrastructure for vital empirical social and economic sciences research. Like other infrastructure, these essential statistics are easily taken for granted. Only when they are threatened do we recognize the need to protect them.
Declining survey participation poses a severe threat to the quality of statistical information. Yet, at the same time, the country has never produced a higher volume of digital data about the activities of individuals and businesses, which presents an opportunity.
A new data infrastructure would strengthen, improve, and transform the ways the U.S. uses and benefits from richer informational resources, providing new capabilities and much-needed capacity-building.
The nation’s information resources are strengthened by blending data from multiple data sources and employing new methods, designs, capabilities, technology, and tools.
Critical information for decision makers is made more timely, granular, and useful by expanding access to data from a broader set of data holders.
Researchers illuminate issues of national importance by accessing existing national data assets.
Enhanced evidence-based policy analysis informs federal, state, tribal, territory, and local governments.
Data holders are incentivized to share data for statistical purposes, by providing them with tangible benefits that inform and improve their operations and activities.
A reformed legal and regulatory framework undergirds protections for both participants and authorities, permitting increased use of existing data resources for common-good statistical information.
The national data infrastructure operates in a high-trust environment, characterized by transparency, that balances expanded data use with strengthened privacy preservation and confidentiality protection, data security, legal compliance, and responsible and ethical data use.
The beneficial outcomes of the new national data infrastructure (mentioned above) are possible provided the infrastructure is guided by overarching principles or attributes. The seven key attributes the panel identified are described below. For each attribute, the panel also identified short-term and medium-term actions that it believes should be conducted to help achieve its vision.
In building a 21st century data infrastructure, early success may come first from integrating data that are readily available, demonstrating the utility of improved statistical information of national importance, and constructing effective partnerships for necessary legal change. (Conclusion 5-1)
The United States is capable of building this new national data infrastructure. With appropriate design of operations using the data, the American public and decision-makers would enjoy more timely, granular, and accurate information and a more robust research infrastructure.
The social benefits of statistical information need not come at the price of increased threats to individuals’ privacy and confidentiality. Any harm to individuals from building and operating this infrastructure should be minimized and inadvertent. New technologies and strong regulations can strengthen safeguards for individuals.
It is ethically necessary and technically possible to preserve privacy and fulfill confidentiality pledges regarding data while simultaneously expanding the statistical uses of diverse data sources. (Conclusion 3-1)
Data infrastructure resources produce non-identifiable aggregates, estimates, and statistics to create useful information for society and decision-makers without harming individuals. Data infrastructure operations and decisions are consistent with professional principles, practices, and ethical standards; are conducted by organizations free of political interference; and are managed to ensure privacy and security. Confidential data cannot be used for enforcement of any laws or regulations affecting any individual data subject.
A new data infrastructure should have access to relevant existing national digital assets for the creation of essential aggregates. The infrastructure should mobilize and leverage data assets across different sectors.
Data from federal, state, tribal, territory, and local governments; the private sector; nonprofits and academic institutions; and crowdsourced and citizen-science data holders are crucial components of the 21st century data infrastructure. (Conclusion 4-1)
This infrastructure includes a wider variety of data holders, data subjects, data seekers, and data users than in the past. Thus, the need to demonstrate the benefits of expanded data sharing becomes even more important and a prerequisite for support.
Data sharing is incentivized when all data holders enjoy tangible benefits valuable to their missions, and when societal benefits are proportionate to possible costs and risks. (Conclusion 3-2)
Federal statistical agencies have the right, under the Evidence Act, to use federal program data for statistical uses only, unless directly prohibited by law. However, there are many laws and regulations that do prohibit federal statistical agencies from utilizing existing data for statistical purposes. The panel assumes the legislative and regulatory recommendations stemming from the Evidence Act will be initiated, but more needs to be done to bolster data safeguards and broaden data access.
Legal and regulatory changes are necessary to achieve the full promise of the 21st century national data infrastructure. (Conclusion 3-3)
The “data governance” framework includes guiding principles, authorities, structures, and directives. Data governance involves active stakeholder engagement, oversight protocols, open and transparent communications, and accountability. Standards in data definitions and access protocols are critical to provide interoperability across partners.
Effective data governance is critical and should be inclusive and accountable; governance policies and standards facilitating interoperability include key stakeholders and oversight bodies. (Conclusion 3-4)
At any time, the public, data holders, and data subjects should be able to know how their data are used, by whom, for what purposes, and to what societal benefit.
Trust in a new data infrastructure requires transparency of operations and accountability of the operators, with ongoing engagement of stakeholders. (Conclusion 3-5)
Transparency enables the public to express concerns, seek redress, and oversee compliance with the stated mission of the infrastructure. Transparency is also a prerequisite for public trust in the infrastructure and associated statistical products.
New developments in remote access, cybersecurity, cryptography, and computational approaches are constantly emerging. Thus, the operations inside a data infrastructure must continually innovate and improve. Similarly, an infrastructure must have the talent to blend data together for more insightful research and statistical products. The acquisition, access and use of diverse data assets held by different organizations in different sectors will involve new partners with divergent experiences and expertise. This dynamism demands continuous refreshing of the data infrastructure staff skill mix.
The operations of a new data infrastructure would benefit from the inclusion of continually evolving practices, methods, technologies, and skills, to ethically leverage new technologies and advanced methods. (Conclusion 3-6)
Several ideas have been proposed for how to organize statistical operations within a new data infrastructure. The key new entity (or set of entities) needed is not a data warehouse, but rather a computational resource for linking data files in diverse ways to produce blended statistics. Several organizational models for this new entity were identified: inside the federal government, outside the federal government, or in a new public-private partnership. To identify the best option for the United States, the panel suggests beginning a widespread dialogue involving the many stakeholders of a data infrastructure.
The Committee on National Statistics has completed two consensus reports addressing specific issues identified in this vision of a new national data infrastructure. In addition, areas for further work have been identified.
*Blended data occur when at least two different data assets are combined to produce statistical information.
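As a rough illustration of this definition, the minimal Python sketch below links two small data assets on a shared identifier and publishes only aggregate statistics. Everything in it is an assumption made for illustration: the records, the field names, the `blend` and `county_estimates` helpers, and the minimum cell size of 3 are hypothetical and are not drawn from the panel's reports.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical survey responses: (person_id, county, reported_income)
survey = [
    (1, "A", 40000),
    (2, "A", 52000),
    (3, "A", 61000),
    (4, "B", 45000),
]

# Hypothetical administrative records: person_id -> program participation
admin = {1: True, 2: False, 3: True, 4: True}

MIN_CELL_SIZE = 3  # illustrative threshold only, not a rule from the reports


def blend(survey_rows, admin_rows):
    """Link the two sources on person_id, keeping only matched records."""
    return [
        {"county": county, "income": income, "in_program": admin_rows[pid]}
        for pid, county, income in survey_rows
        if pid in admin_rows
    ]


def county_estimates(blended, min_cell_size=MIN_CELL_SIZE):
    """Return per-county aggregates, suppressing cells with too few records."""
    groups = defaultdict(list)
    for row in blended:
        groups[row["county"]].append(row)

    estimates = {}
    for county, rows in groups.items():
        if len(rows) < min_cell_size:
            estimates[county] = None  # suppressed: too few records to publish
        else:
            estimates[county] = {
                "n": len(rows),
                "mean_income": mean(r["income"] for r in rows),
                "program_share": sum(r["in_program"] for r in rows) / len(rows),
            }
    return estimates


estimates = county_estimates(blend(survey, admin))
```

In this sketch, county "A" yields an aggregate estimate while county "B", with a single record, is suppressed. Real disclosure-limitation methods are considerably more involved than a simple cell-size cutoff.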
Using multiple data sources can improve national statistics, provide new resources for social and economic research, and promote data equity by:
Combining information across data sources must be done carefully, with an understanding of the properties of each component dataset and the statistics resulting from their combination.
The use of multiple data sources can benefit data equity—promoting the collection and use of data in which all populations, and especially those that have been historically underrepresented or misrepresented in the data record, are visible and accurately portrayed.
Transparency and documentation of datasets and methods used to combine them are essential for producing trust in information created from multiple data sources, particularly as new types of data are used.
A new data infrastructure requires investment not only in data sources but also in the people who can work with those data. Beyond the technical challenges of developing new statistical methods, there are challenges for promoting data equity and public trust in integrated data. It will be important for statistical agencies to invest in personnel, training, and cyberinfrastructure.
To learn more about how using multiple data sources can enhance survey programs, see the second consensus study report in the series, Toward a 21st Century National Data Infrastructure: Enhancing Survey Programs by Using Multiple Data Sources.
Protecting privacy and ensuring confidentiality in data is a critical component of modernizing our national data infrastructure. The use of blended data, which combines previously collected data sources, presents new considerations for responsible data stewardship.
The report also provides a framework for managing disclosure risks that accounts for the unique attributes of blended data and poses a series of questions to guide considered decision-making.
Public concerns often exist regarding the use of personal information. Informed consent is intended to ensure data subject autonomy, but, in reality, it may not always provide satisfactory agency for individuals or organizations. Communicating risks and benefits can be difficult and complex.
Informed consent issues are amplified in blended data. Customary informed consent language may insufficiently describe disclosure risks or potential usefulness inherent in blended data. Disclosure risk/usefulness trade-offs are difficult to communicate to respondents at the time of collection (particularly collection of ingredient data) because both attributes can change over time. Challenges are compounded by differing policies among federal statistical agencies regarding informed consent.
Communicating the intended uses of data and determining subsequent acceptable disclosure risks needs to consider the needs and concerns of respondents, while also permitting a practicable approach to the management of blended data. Future work is needed in the area of informed consent to improve communication about intended use, disclosure risk/usefulness trade-offs, and potential harm. Relevant topics could include (a) ways to communicate (future) data use to data subjects (including blending of private-sector data), (b) processes by which persons or establishments can decide which data to share for a (future) purpose, (c) the effects of such decisions on management of disclosure risk/usefulness trade-offs for blended data, (d) the effects of release of personal data on confidentiality of data collected from the data subject’s community, and (e) ways to account for differing privacy preferences.
Researchers in statistical agencies, government, academia, and beyond increasingly depend on the professional skills of the research computing and data (RCD) workforce to facilitate the use of vast and ever-evolving technical resources. RCD professionals work at the intersection of cyberinfrastructure, research, and data. Data blending, and especially the privacy and confidentiality protections that are part of the blending lifecycle, clearly depend on an adequate RCD workforce.
To fully meet current and future RCD needs in a new data infrastructure, organizations engaged in data blending need to address issues hampering the full development of a stable, competent RCD profession and workforce. Meeting these RCD needs is complex. On the one hand, agencies need to understand the role of RCD and recognize the growing need for RCD professionals given the volume of data, the rapid evolution of computing resources, and researchers’ general lack of the experience or skills necessary to make full use of emerging tools and techniques. As examples of issues that agencies need to address, the RCD profession lacks standardized job titles, has poorly defined job descriptions, and typically disperses work across multiple units within resource organizations. Traditional information technology may not naturally accommodate RCD roles and responsibilities, which can make it difficult to communicate emerging program staffing needs to human resources departments. Additionally, recruitment, retention, and development of RCD professionals are challenging, in part because clear career paths are not evident and also due to a lack of certificate and degree programs and scalable.
On the other hand, agencies producing government data face several challenges. There is strong competition between the governmental and the private sectors for skilled staff, and starting salary disparities are significant. It is important to identify ways to make government employers more competitive when hiring RCD specialists. In the panel’s view, cultivating an RCD workforce within government agencies is of major importance for statistical agencies and the contractors who support them. This area deserves extensive, dedicated study. New research on privacy-enhancing technologies makes evident the need for developing specialized techniques, tools, education, workforce training, free software libraries, and applications. Work in this direction has begun. Nonetheless, identifying ways to improve the competitiveness of government employers to attract and retain this workforce is also critical.
Get the first report in this series:
Toward a 21st Century National Data Infrastructure: Mobilizing Data for the Common Good
The Committee on National Statistics is a unit at the National Academies of Sciences, Engineering, and Medicine whose mission is to provide advice to the federal government and the nation grounded in the current best scientific knowledge and practice that will lead to improved statistical methods and information upon which to base public policy. CNSTAT seeks to advance the quality of statistical information, contribute to the statistical policies and coordinating activities of the federal government, and help provide a forward-looking vision for the federal statistical system and national statistics more broadly in service of the public good.
Committee Process
The National Academies of Sciences, Engineering, and Medicine appointed a consensus panel to produce three complementary reports on topics to guide the development of a vision for a new data infrastructure for federal statistics and social and economic research in the 21st century.
More about the committee members, the committee process, and the activities supporting this project.
Sponsor
National Science Foundation