A Vision for a New National Data Infrastructure
Statistical agencies are largely siloed: they rely heavily on surveys as their primary source of data and independently identify and negotiate access to new data assets. Blending survey data with other data sources is the exception, not the rule, and data products that incorporate blended data are rare. Laws and regulations remain major obstacles to accessing and using federal statistical, program, and administrative data, as well as state, tribal, territory, and local government data. Most data holders have no incentive to contribute or share their data for the common good.
Statistical agencies depend primarily on survey data, making limited use of administrative data and of state and local government data, and their efforts to use private sector data are largely uncoordinated. Information flows are unidirectional (from data holders to statistical agencies), with only limited data sharing among statistical agencies. Currently, there is little cross-agency access to data and minimal, if any, access to federal and state administrative data for evidence building. While there are 31 federal statistical research data centers (FSRDCs)—highly secure sites for approved in-person and virtual research using federal statistical agency data—located across the United States, statistical agency participation is not universal. These FSRDCs, importantly, facilitate collaborative agreements among more than a hundred universities, the entire Federal Reserve System, and four principal statistical agencies. However, no statistical organization has overall responsibility for coordinating data access and use among data holders, statistical agencies, and data users. In the panel’s view, the mismatch between the vision of a 21st century national data infrastructure and its current state is stark, and the gap will continue to grow unless attention is focused on identifying and implementing important attributes for a new data infrastructure.
This chapter begins by describing a vision for a new data infrastructure, its expected outcomes, and seven key data-infrastructure attributes. Just as highways and bridges have little value unless they facilitate commerce and social interaction, a data infrastructure gains its value through use, enabling the organizations, entities, and functions that use it and its products to build better information.
VISION FOR A 21ST CENTURY NATIONAL DATA INFRASTRUCTURE
The panel’s vision for a new data infrastructure assumes that statistical agencies and other approved users (federal, state, tribal, territory, and local government employees and researchers) will access and use data assets relevant to the nation’s information and research needs for the common good. These include data assets of federal statistical, program, and administrative agencies; state, tribal, territory, and local governments; private sector companies; nonprofit, tax-exempt, and academic institutions; and crowdsourced and citizen-science data. Operations using a new infrastructure would blend multiple data sources to improve the quality, timeliness, granularity, and usefulness of national statistics; facilitate more rigorous social and economic research; and support evidence-based policymaking and program evaluation. Access and use for solely statistical purposes will require approval, will comply with existing laws and regulations, and will be governed by established policies, rules, and procedures.
In the panel’s vision, explicit values will guide the operations of a new data infrastructure and decisions relating to its use. Primary among these values is respecting and protecting data subjects and data holders. Other values are articulated in Principles and Practices for a Federal Statistical Agency (National Academies of Sciences, Engineering, and Medicine, 2021) and include producing information relevant to societal issues, credibility among users, public trust, independence from political influence, and continual innovation. To achieve these values, strengthened data safeguards will secure data, preserve privacy, and protect the confidentiality of data subjects. These safeguards will minimize harm to any individual or data subject while maximizing the infrastructure’s benefits to society.
Safeguard mechanisms and measures will be communicated widely—in the panel’s vision, transparency is central to the success of a new data infrastructure. The public, data holders, and data subjects will be able to see how their data are used, by whom, for what purposes, and to what societal benefit. This will generate confidence that data are used responsibly, ethically, and only for approved statistical purposes.
Data infrastructure operations and decisions must, in the panel’s judgment, be consistent with professional principles and practices and with ethical standards, and must be autonomous and free of political interference.1 The data infrastructure must be inclusive—the public, data holders, data subjects, and other important constituencies must be engaged in standards development, data governance, and other pertinent decisions, strengthening trust in the data infrastructure. The new data infrastructure must not only provide tangible benefits for the common good, but also ensure societal benefits are proportionate to the possible costs and risks of acquiring and using data assets.
In the panel’s vision, a new data infrastructure must also support two-way information flows: from data holders to statistical agencies and from statistical agencies to data holders. Statistical agencies must return useful information and services to data holders to inform the data holders’ decisions, operations, and activities. In turn, the public, data holders, and key stakeholders must support legislation and other changes that facilitate and support expanded data access and use.
Outcomes of a New Data Infrastructure
According to the panel’s vision, a new data infrastructure would strengthen, improve, and transform the ways the United States uses and benefits from richer informational resources, providing new capabilities and much-needed capacity building (see Box 3-1).
Key Attributes of a New Data Infrastructure
The stark differences between the state of the current data infrastructure and the panel’s vision for a new data infrastructure prevent the United States from effectively utilizing vast amounts of available data that could provide more timely, granular, and useful information to support more rigorous research. These data could also support evidence-based evaluations and analysis of federal, tribal, state, and local governments’ policies and programs. This mismatch between the current reality and the panel’s vision will continue to grow, in the panel’s opinion, unless attention is focused on establishing a new data infrastructure with a set of key attributes.
Box 3-2 presents seven key attributes of a 21st century national data infrastructure, as envisioned by the panel. The remainder of the chapter
1 In this chapter and throughout the report, when the panel discusses “professional principles and practices” in compiling national statistics—either within the existing federal statistical system or a new data infrastructure—it considers the National Academies (2021) as an authoritative source of such information.
describes changes necessary to produce the required attributes supporting a new data infrastructure.
ATTRIBUTES OF A NEW DATA INFRASTRUCTURE
Attribute 1: Safeguards and Advanced Privacy-Enhancing Practices to Minimize Possible Individual Harm
In the panel’s vision, a new data infrastructure needs strong, uniform privacy protections for data and must account for the presence, interests, and rights of data subjects. The current infrastructure does neither, in the panel’s judgment. For example, the autonomy of individuals and data subjects to oversee which of their data are accessed and how they are used is protected very differently in the private and government sectors, and privacy protections vary widely across states. Federal program agencies, which use their data resources mainly for administrative purposes, have scattered, diverse, and occasionally arcane regulations. It is not surprising, therefore, that United States residents have little understanding of the data protections that apply under a given set of circumstances or to a specific organization.
In the panel’s opinion, a new data infrastructure should be oriented around the data subjects and the ways digital data connected to those subjects are affecting their lives (Beauchamp and Childress, 2001). First, the infrastructure must avoid causing personal harm. Individual data should only be combined to produce aggregate descriptions of groups or to help model estimates. Access to individual records must not be allowed.
Second, the panel argues that a new data infrastructure must address underlying issues of autonomy—the ability of data subjects to make their own decisions. As a new data infrastructure is developed, how will individuals and the community at large be involved with the creation of consent procedures relevant to blending multiple data sources to produce aggregate statistics? The infrastructure must recognize that data subjects may be unaware of informed consent requirements, have differing privacy stances, and may grapple with the ethical dimensions of data usage. Just because data can be accessed does not mean that they should be used.
Third, the panel believes that the public should have a beneficent view of a new data infrastructure. Data subjects need to understand how data will be used to produce good outcomes for them. The goal of a new data infrastructure must be to provide information on critical societal and economic topics; thus, the ability to produce individual benefits derived from improved information about the country’s health, education, income, and safety will be the cornerstone of the infrastructure’s appeal. To engender trust in a new data infrastructure, it is imperative that the government
actively communicate with the public about the importance of the infrastructure and its activities.
Fourth, in the panel’s vision, a new data infrastructure must reinforce human dignity. That is, the uses of the new infrastructure should be respectful of the data subjects. The infrastructure’s key values must place the data subject at the center, even as the infrastructure focuses on the societal benefits that can be derived from using the subject’s data to produce aggregate statistical information for the good of society.
In short, accounting for data subjects and their rights, minimizing possible harm, and broadly engaging communities are ethical necessities that will contribute to the legitimacy and trust-building required for a successful new data infrastructure. Privacy and security are key features of a trusted system: data are safeguarded and secured, while privacy is preserved and confidentiality protected. Thus, a new data infrastructure will implement a strong set of safeguards to ensure that data are not used to harm any individual or data subject. A combination of modern cybersecurity, encryption, and secure access protocols can provide greatly enhanced data security, while new privacy-enhancing technologies are essential for protecting the confidentiality of subjects’ data. If strong protections are in place, the panel notes, the societal benefits of statistical information critical to society’s welfare need not come at the price of increased threats to privacy and confidentiality. Minimal collection of new data is envisioned for a new data infrastructure; instead, such an infrastructure would involve expanded and responsible use of existing data to produce high-quality information about crucial societal features.
Most existing legal protections of personal and organizational data lie at the federal government level. The Confidential Information Protection and Statistical Efficiency Act (CIPSEA) was first enacted as Title V of the E-Government Act of 2002 (U.S. Congress, 2002a). For all data furnished by individuals or organizations to an agency under a pledge of confidentiality, Subtitle A provides that the data will be used only for statistical purposes and will not be disclosed in identifiable form to anyone not authorized by the title. CIPSEA makes the knowing, willful disclosure of confidential statistical data a class E felony, with fines of up to $250,000 and imprisonment for up to five years. Data held by the U.S. Census Bureau are protected by Title 13, which specifies similar protections (U.S. Census Bureau, n.d.). Appendix 3A, taken from Principles and Practices for a Federal Statistical Agency (National Academies, 2021), reviews the various laws affecting the operations of the federal agencies.
Existing laws for federal statistical agencies imply that protecting data can offer a data subject complete certainty of nondisclosure. This contradicts the modern understanding that the risk of disclosure can never be driven to zero. Modern privacy-protecting procedures offer ways to
measure and limit the risk of disclosing or reconstructing an individual’s data. Further, cybersecurity methods continue to evolve to offer greater protections. Finally, sustainable methods have evolved for using data for research and solely statistical purposes without violating confidentiality pledges given by federal statistical agencies (for example, the FSRDC network).2 A new data infrastructure can take advantage of these advances in building its privacy-protecting framework.
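One widely used example of a procedure that measures and limits disclosure risk is differential privacy, which bounds how much any one individual’s data can affect a released statistic. The following is a minimal illustrative sketch of the Laplace mechanism for a simple count; the count, the privacy parameter, and the function name are hypothetical and do not represent any agency’s implementation.

```python
import random

def dp_count(true_count, epsilon):
    """Return an epsilon-differentially-private version of a count.

    A count has sensitivity 1 (adding or removing one data subject
    changes it by at most 1), so adding Laplace noise with scale
    1/epsilon satisfies epsilon-differential privacy.
    """
    scale = 1.0 / epsilon
    # Laplace(0, scale) noise, drawn as the difference of two
    # independent exponential variates with mean `scale`.
    noise = random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)
    return true_count + noise

random.seed(0)
# The exact count (here, a hypothetical 1000) is never released;
# only the noisy value is. The noise is unbiased, so averages of
# many releases would still center on the true value.
noisy = dp_count(1000, epsilon=0.5)
average = sum(dp_count(1000, 0.5) for _ in range(20000)) / 20000
```

Smaller values of epsilon add more noise and therefore give a stronger privacy guarantee at some cost in accuracy; the released aggregate remains useful while any single subject’s contribution is masked.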
The panel also observed large variations in current data protection. Some data resources relevant to social and economic statistics are well protected, while others are not. As the construction of a new data infrastructure proceeds, stronger, more uniform data protections will necessarily be built, which will improve the privacy of existing data resources. Thus, a new data infrastructure could be a net gain to the privacy protections of the United States population and enterprises. The privacy and ethical implications of blended data are explored in Chapter 4, as this will be a critical feature of a new data infrastructure.
It is ethically necessary and technically possible to preserve privacy and fulfill confidentiality pledges regarding data while simultaneously expanding the statistical uses of diverse data sources. (Conclusion 3-1)
Attribute 2: Statistical Uses Only, for Common-Good Information, with Statistical Aggregates Freely Shared with All
In the panel’s opinion, the sustainability of a new national data infrastructure rests on its use for statistical purposes only. In essence, a new data infrastructure has the sole aim of serving the common good of society; its operations and decisions are consistent with professional principles and practices, conform to ethical standards, and are autonomous and free of political interference. In the panel’s vision of a new infrastructure, only statistical aggregates, estimates, or synthetic microdata would be released, describing the state of health, income, education, employment, safety, transportation, business, food security, environment, and housing. The data assets would be available for advanced research to guide improvements in each of these sectors. The statistical information produced would be freely shared with all parts of society and would help shape the national discourse, informing the practical actions of decisionmakers in all sectors and guiding institutional policies at all levels.
No use of the data should allow identification of an individual, household, business, or specific data subject, in the panel’s opinion. No individual records used for statistical purposes should be accessible to any
2 For more information, see: https://www.census.gov/about/adrm/fsrdc.html
governmental or law enforcement agency. Confidential data should not be used as part of any intervention or for enforcement of any laws or regulations affecting a data subject. In the panel’s view, creation of statistical information that is useful for society and decisionmakers must be the only goal of a new data infrastructure. The Foundations for Evidence-Based Policymaking Act (hereafter, Evidence Act) defines statistical purpose to include “description, estimation, or analysis of the characteristics of groups, without identifying the individuals or organizations that comprise such groups; and includes the development, implementation, or maintenance of methods, technical or administrative procedures, or information resources that support the [statistical] purpose” (U.S. Congress, 2019, Section 3561 (12)).
In contrast, the Evidence Act defines a nonstatistical purpose as using data in identifiable form for:
…administrative, regulatory, law enforcement, adjudicatory, or other purposes that affects the rights, privileges, and benefits of a particular identifiable respondent and includes the disclosure under section 552 of title 5 (popularly known as the Freedom of Information Act) of data that are acquired for exclusively statistical purposes under a pledge of confidentiality (U.S. Congress, 2019, Section 3561 (8)).
The panel stands by both definitions in arguing for statistical uses only. To clarify, statistical uses of collected or acquired income data, for example, would include calculating aggregate statistics, computing income distributions, or showing the number of households below the poverty line. Such uses would include constructing models to produce estimates or to construct synthetic populations. The statistical-only purposes of a new data infrastructure would not prevent a federal administrative agency from collecting the same income data and using it to determine program eligibility—to grant or deny benefits to individuals based on income. However, that agency could not obtain such records directly from a new data infrastructure.
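The panel’s income example can be made concrete with a short illustrative sketch; the income figures and poverty threshold below are hypothetical, not official data. A statistical use releases only a group-level aggregate, while the underlying household records remain inaccessible.

```python
# Hypothetical household incomes; in a real infrastructure these
# records would stay behind access controls and never be released.
incomes = [18_500, 52_000, 9_900, 31_000, 12_400, 76_000]
POVERTY_LINE = 14_000  # hypothetical threshold, not an official figure

# Statistical use: an aggregate describing the group, not any household.
households_below = sum(1 for x in incomes if x < POVERTY_LINE)
poverty_rate = households_below / len(incomes)
```

A nonstatistical use, by contrast, would look up a specific household’s income to grant or deny a benefit, a use the panel’s vision of a new data infrastructure would not permit.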
Determinations of statistical purposes for federal statistical agencies have, generally, been relatively straightforward, but the Evidence Act has added a wrinkle relevant to a new national data infrastructure. The Evidence Act provides access to data assets “for purposes of developing evidence,” where “evidence” is “information produced as a result of statistical activities conducted for a statistical purpose” (U.S. Congress, 2019, Section 3561(6)).
As mentioned in Chapter 2, the U.S. Office of Management and Budget (OMB) memorandum M-19-23 (U.S. Office of Management and Budget, 2019) provides federal agencies with additional guidance regarding Evidence Act implementation and defines “evidence” broadly. OMB guidance suggests
the following legitimate statistical purposes: statistical production and dissemination (aggregate statistics), approved research, policy analysis, program evaluation, and performance measurement of key data infrastructure operations and activities. The panel notes that all of these are uses of statistical data appropriate for a new data infrastructure.
Federal statistical agencies have legally sanctioned missions limited to statistical uses of data. The panel sees the statistical agencies identified in Box 3-3 playing an important role in a new data infrastructure, but they should not be the only participants, in the panel’s estimation. Data holders, data users, researchers, and entities involved in data infrastructure
coordination, collaboration, governance, and accountability will all have important roles in realizing the promise of a new data infrastructure. New actors will only be permitted data access if the data are to be used for approved statistical purposes. Moreover, data infrastructure operations and decisions will be consistent with professional principles and practices, as previously explained.
Attribute 3: Mobilization of Relevant National Digital Data Assets, Blended in Statistical Aggregates to Provide Benefits to Data Holders, with Societal Benefits Proportionate to Possible Costs and Risks
To achieve the panel’s vision, a new national data infrastructure would provide access to a variety of currently collected digital assets when those data are relevant to the nation’s information and research needs. Thus, in the panel’s vision, a new data infrastructure will mobilize and leverage the strategic value of national data resources in a coordinated way (Box 3-4). A new data infrastructure will gather information from all relevant sectors to assist the development of society as a whole. These data-holder assets are fully described in Chapter 4, along with their strengths, weaknesses, and any statutory limitations on their access and use.
As the blending of data becomes a key component of a new data infrastructure, the infrastructure will include a wider variety of data holders, data subjects, data seekers, and data users than in the past, necessitating new relationships, partnerships, and collaborations. Thus, the need to demonstrate the benefits of expanded data sharing to diverse data holders and important stakeholders becomes a prerequisite for the success of a new data infrastructure. State, tribal, territory, and local governments, along
with private sector companies and other data holders, will be more likely to share their data assets for approved statistical purposes if they understand the tangible benefits that expanded sharing provides both to themselves and to society.
At the National Academies’ December 2021 workshops on The Scope, Components, and Key Characteristics of a 21st Century Data Infrastructure, the panel was presented with potential benefits of a new data infrastructure, including the promise of more timely, better quality, and more granular statistics that could answer questions of national interest, support more rigorous research, and facilitate evidence-based policymaking. However, in the panel’s opinion, benefits should go beyond improved statistics to include reciprocal information sharing, in which tailored insights extracted from data assets and analysis flow back to data holders, informing their activities and operations.
Direct Benefits to Data Holders of Sharing Data for National Statistical Purposes
The panel reviewed a set of benefits that could sustainably be offered to data holders in return for access to their data for national statistical needs. First, it is of great interest to many private sector firms and administrative units that data sharing be consistent with user agreements pledged to clients whose data they hold. Nathan Persily, workshop participant, noted that the proposed Platform Transparency and Accountability Act3 provides qualified platforms with limited legal liability if they comply with the act’s specified privacy and cybersecurity provisions. A similar liability provision could incentivize companies and other data holders to share their data by protecting them against possible legal threats related to data sharing.
Second, all organizations are facing the continuous threat of intruders into their data systems. Small- and medium-sized private sector firms might profit from a set of continuously updated cybersecurity services provided by a new data infrastructure that would improve protection of the firms’ core business information, concomitant with the sharing of their data. Such a benefit might serve two purposes: the strengthening of privacy protections for data accessed by the infrastructure, and a tangible incentive to private sector firms that assuages concerns about breaches that might result from a data-sharing agreement.
Third, when businesses can compare their performance to other businesses in their sectors (e.g., using standard product definitions, industrial classification definitions, definitions of urban areas), clear value results.
3 The proposed law has not been enacted as of June 2022. See Wright (2021) for more information.
If statistical information were available on all organizations in a sector, business officials could benchmark their activities and easily compare the performance of their enterprise to that of similar enterprises. Standard classification systems have long informed organizational decisionmaking, and, to the extent that common definitions, standards, and classifications can evolve through a new data infrastructure, participating organizations could interoperate with one another and compare data discoveries. Such curation benefits are discussed later in this chapter.
Fourth, according to the panel, financial incentives must be considered. Federal government requests for information impose a burden on respondents. This burden is the foundation of the Paperwork Reduction Act requirements stating that federal agencies must publish the justification for, and the estimated burden related to, their information requests (U.S. Congress, 1995). Currently, the ongoing efforts of blending private sector data with other data resources generally involve payment of funds from the data-seeking organization to the data holder. For some data-holding organizations, the fees simply cover transaction costs. However, other data holders regard these fees as a revenue stream that is part of their business model. Tax incentives to reduce total costs incurred by data sharing are another financial incentive that could be considered for data holders in a new data infrastructure.
Fifth, private sector organizations are constantly searching for useful information about the future of their enterprises and changes in their customer bases. Participation in a new data infrastructure could spur the production of more timely statistical products than those currently available. Given the common-good purposes of a new data infrastructure as envisioned by the panel, new statistical products would be shared publicly rather than provided only to those who contribute data, though the infrastructure should look for ways to formally acknowledge data sharing. Even so, a new data infrastructure could usher in a new era of information flowing back to businesses, nonprofits, and government organizations. Timely, granular statistical information to support modern forecasting analyses could broadly benefit the economy and society. Such a new era would begin with the promise that a new data infrastructure is designed to benefit data subjects, data holders, data users, and society as a whole.
Another way to incentivize data holders is to ensure that the societal benefits are proportionate to the possible costs and risks of sharing their data assets. Such costs and risks are of major concern for data holders. For example, Statistics Canada’s proposal to use private sector financial information for statistical purposes, though perfectly legal, resulted in such a public outcry that the project was abandoned, and Statistics Canada added “sensitivity” to its “necessity and proportionality” framework (Bowlby,
2021). Ensuring that the benefits of data sharing are proportionate to the associated costs and risks will require a new data infrastructure to identify and quantify both the benefits and the costs incurred by the data holder and the data-seeking organization. Identification of the potential benefits of expanded data access and use will also be important for building public support for these activities. To build public support effectively, the panel recommends that operations within a new data infrastructure be transparent regarding the tangible benefits and the costs and risks incurred.
Data sharing is incentivized when all data holders enjoy tangible benefits valuable to their missions, and when societal benefits are proportionate to possible costs and risks. (Conclusion 3-2)
Attribute 4: Reformed Legal Authorities Protecting All Parties’ Interests
In the panel’s judgment, a new data infrastructure needs to rest on a legal and regulatory framework that clearly defines which data assets can be shared, with whom, and for what purposes. The current infrastructure is far from the ideal described earlier. The U.S. statistical system is decentralized and governed by multiple statutes. There is no single legal framework controlling access and use of government data assets. For data-acquiring federal agencies, agency authorities determine the requirements and protections that govern the use of a given data asset. In some cases, statutes explicitly address access and use while, in other cases, agencies interpret their statutory authority to implicitly permit data sharing or access by external actors.4 Clearly, legislative and regulatory changes are needed to support expanded data access with strengthened privacy protections.
In terms of access to federal data, the Evidence Act represents a major advance. First, the act made confidentiality requirements more consistent across statistical agencies and units and strengthened confidentiality requirements for many others. The act gave the director of OMB authority to designate agencies or organizational units as statistical agencies (U.S. Congress, 2019, Section 3562). Second, the act directed that federal-program and administrative-agency data assets be shared with statistical agencies unless explicitly prohibited by law. This expanded statistical agencies’ access to the data of some federal administrative agencies, for statistical purposes (U.S. Congress, 2019, Section 3581), and permitted federal statistical agencies to share their own statistical data with each other. The act also expanded secure
4 IRS Title 26, 6103(j) is an example of an explicit statute limiting access to tax data with explicit exceptions, while the U.S. Department of Agriculture has interpreted its statutory authority to implicitly permit the sharing of Supplemental Nutrition Assistance Program data for statistical purposes. See: https://uscode.house.gov/quicksearch/get.plx?title=26&section=6103
access to CIPSEA data for statistical purposes, including evidence building, to the extent practicable and unless prohibited by law (U.S. Congress, 2019, Section 3582). Unfortunately, at this writing, the director of OMB has not issued related rules and regulations to advance data sharing among statistical agencies.
However, there are many valuable government data assets whose use for statistical purposes is limited. For example, the Evidence Act’s CIPSEA 2018 amendment (Evidence Act, Part B) did not revise Internal Revenue Service regulation 6103(j) to permit the U.S. Census Bureau to share limited business tax data with the Bureau of Labor Statistics (BLS) and the Bureau of Economic Analysis. State-collected and maintained high-value administrative data assets, including those associated with federally funded programs, also have statutory restrictions on access and use.
The Evidence Act addressed only federal data. The web of legal and regulatory limitations on state and local government data assets is even more complex than the federal framework. State laws may impose restrictions and obligations on businesses and institutions relating to the collection, use, and disclosure of information about their residents. Recently, some states (e.g., California, Virginia, and Colorado) have enacted comprehensive consumer privacy and protection laws.5 New York has enacted the Stop Hacks and Improve Electronic Data Security (SHIELD) Act, which amended the existing data-breach notification law and imposed additional data-security requirements on companies.6 According to the International Comparative Legal Guides, some 20 other states are considering comprehensive privacy laws.7 As the privacy landscape continues to evolve, the panel advises that a new data infrastructure be responsive to these changes.
Even in the absence of prohibitive state laws, gaining access to state administrative data can be a time-consuming and daunting task that may not provide information quickly enough to prevent collective harm following natural disasters or other shocks. For example, the creation of the Longitudinal Employer-Household Dynamics program,8 which combines administrative data on business establishments and workers with household and business survey data, took more than a decade to implement and still relies on scores of individual memoranda of understanding (MOUs) that must be regularly renegotiated with individual states.
Federal data-privacy laws also limit which federal data can be shared.
5 See: California Consumer Privacy Act, https://leginfo.legislature.ca.gov/faces/codes_displayText.xhtml?division=3.&part=4.&lawCode=CIV&title=1.81.5; Virginia Consumer Data Protection Act: https://lis.virginia.gov/cgibin/legp604.exe?211+sum+SB1392; Colorado Privacy Act: https://leg.colorado.gov/sites/default/files/2021a_190_signed.pdf
6 See: https://ag.ny.gov/internet/data-breach
7 See: https://iclg.com/practice-areas/data-protection-laws-and-regulations/usa
The Privacy Act of 1974 covers federal agencies and the records they control (U.S. Congress, 1974). For private sector data assets, laws govern data privacy and protection as well as the collection of information online. Several sector-specific federal laws govern data privacy and use (Osano, 2020). For example, the Health Insurance Portability and Accountability Act (HIPAA) governs the collection of health information. The Gramm-Leach-Bliley Act governs personal information collected by banks and financial institutions. The Fair Credit Reporting Act regulates the collection and use of credit information. The Family Educational Rights and Privacy Act (FERPA) protects the confidentiality of student education records. Most of these laws permit limited research and statistical uses. Standardizing procedures for research and statistical use of data would aid the functioning of a new data infrastructure.
By contrast, private sector data brokers collect, buy, aggregate, and sell data on individuals and companies for profit, with few legal restrictions. The largest data brokers include Acxiom LLC, Epsilon Data Management LLC, Equifax, Experian, and CoreLogic (Privacy Bee, 2021). Consumers (the data subjects) are often unaware of data brokers’ existence or practices. Consumers generally do not provide express permission or consent for use of their data by data brokers. Data brokers are almost entirely unregulated; there is no federal law regulating businesses that buy and sell personal information, and businesses face few penalties for causing harm to data subjects. However, two states, Vermont and California, have enacted data-broker laws.9
The growth of crowdsourced and citizen-science data has also raised concerns about data privacy and data protection (Eticas Foundation, 2020). When the public is enlisted to collect and/or share data, there is substantial variation in the level and quality of training they receive about such issues as harm mitigation, confidentiality, or ethical uses of data.
Inadequate protection for the autonomy of statistical agencies is another concern. Citro et al. (2022) provide three main findings regarding the autonomy of principal statistical agencies:
- The challenges faced by statistical agencies arise largely as a consequence of insufficient autonomy.
- There is remarkable variation in autonomy protections and a surprising lack of statutory protections for many agencies for many of the proposed measures. Only four statistical agencies have agency-specific autonomy protections. The remaining nine agencies are protected in varying degrees by blanket provisions, MOUs, or other defenses.
9 See: https://www.classlawgroup.com/consumer-protection/privacy/data-brokers/
- Many existing autonomy rules and guidelines are weakened by unclear or unactionable language (Citro et al., 2022, pp. 2–3).
Finally, the panel is not aware of any laws or regulations that prohibit companies or individuals who are not sworn agents of statistical agencies from profiting from information that they provide to statistical agencies. For example, in the panel’s opinion, it will be important for data-sharing arrangements to prohibit parties from financial trading in advance of statistical releases, based on private information those parties provided to statistical agencies. Similarly, arrangements must also prohibit intentional manipulation of shared information to bias official statistics for any purpose.
In short, it is the panel’s opinion that the current legal and regulatory framework that limits which data assets can be shared, with whom, and for what purposes does not satisfy the demands of a modern data infrastructure. The current framework prohibits beneficial sharing and lacks consistent requirements to preserve privacy, protect confidentiality, assure autonomy, and prevent abuse of data-sharing arrangements. Thus, legislative reform is needed.
Legislative and Regulatory Reform Is Required for a 21st Century Data Infrastructure
The Evidence Act is a major advance in promoting data access and use, but the full promise of the Commission on Evidence-based Policymaking (CEP) is yet to be achieved. In its deliberations about the implications of the Evidence Act, the Advisory Committee on Data for Evidence Building (ACDEB) suggested legislative actions be taken as soon as possible to promote expanded data access and use (Box 3-5). The panel assumes that, in its vision of a new data infrastructure, such steps will be taken successfully so that additional access to private sector data can occur.
Legal and regulatory changes are necessary to achieve the full promise of a 21st century national data infrastructure. (Conclusion 3-3)
Attribute 5: Governance Framework and Standards Effectively Supporting Operations
In the panel’s view, the legal reform needed to underlie a new data infrastructure must be accompanied by a set of practices and policies consistent with the spirit of those new laws. “Data governance” refers to a framework of protocols that guide such practices—it is much more than a
process. Some frameworks define data governance as “the ability to manage the life cycle of data through the implementation of policies, processes and rules in accordance with the organisation’s strategic objectives.”10
For discussion purposes, the panel defines the data-governance framework as including the authorities; structures; roles and responsibilities; policies, rules, and directives; guiding principles; and resources needed to support a new data infrastructure. Key data infrastructure capabilities include acquiring, accessing, using, managing, and protecting data assets. Data governance thus involves organizations, people, processes, policies, and technologies. It is ideally characterized by active stakeholder engagement; open and transparent communication; clear rules, procedures, and mechanisms; accountability; and oversight (British Academy and the Royal Society, 2017).
In the panel’s judgment, data governance is crucial and must address the blending of multiple data sources, the need to respect the interests and
10 See Common Statistical Data Architecture Capabilities: https://statswiki.unece.org/download/attachments/129177312/HLG-MOS%20Reference%20Data%20Architecture%20v1.0.docx?version=1&modificationDate=1516727545541&api=v2
rights of data subjects, and changing notions of privacy and consent. Blending survey data with new data sources, like private sector data, may raise additional ethical issues that warrant study. Governance must achieve more equitable data use in the face of a widening data divide, and it must recognize that risks and benefits associated with data use vary depending on the context and the purpose for which data are being used. Finally, governance must manage the burden imposed on data subjects and data holders, and be alert to the ever-present risk of a public data-related controversy (British Academy and the Royal Society, 2017).
The panel examined the evolving discussion surrounding data governance in the U.S. and governance protocols emerging from other countries. The principles underlying the governance framework necessary for a new data infrastructure are precisely those articulated in the panel’s vision—deep devotion to privacy-protecting mechanisms, respect for data holders’ and data subjects’ interests and rights, and the provision of tangible benefits to those who share data for blending to produce improved information on critical societal features.
Components of data governance guide decisions for acquiring, sharing, using, managing, safeguarding, and stewarding data. The governance framework shapes important data-governance components, as shown in Box 3-6.11 Many of these data-governance components require the active engagement of the diverse stakeholders (data subjects, data holders, and responsible organizations) within a data infrastructure. Such involvement is increasingly prevalent (British Academy and the Royal Society, 2017) and is a central feature of both the U.K. Data Strategy (U.K. Department for Digital, Culture, Media & Sport, 2019) and the European Data Strategy (European Commission, n.d.).
While the National Academies’ December 2021 workshops on The Scope, Components, and Key Characteristics of a 21st Century Data Infrastructure touched on data governance, they did not focus on specific data-governance requirements. Faced with an evolving data ecosystem across the nation and the world, the panel advises that the governance framework for a new data infrastructure be adaptable, grounded in fundamental principles, and able to recognize and address possible tensions between increasing data use and the possible risks of that use.
At this stage of its evolution, much of the governance framework of a new data infrastructure must remain unspecified. The nature of the
11 See: https://unece.org/sites/default/files/2021-11/HLG2021_D1_ProjectProposal%20Statistical%20Data%20Governance%20Framework.pdf
governance framework will depend upon both the passage of necessary legal reforms and the organizational structures chosen to implement the governance procedures. However, certain key questions can be identified now. The panel recommends consideration of the following questions to aid development of an effective, comprehensive, and responsive data-governance framework for a new data infrastructure:
- What data-governance authorities are needed? Authority for overall governance of a new data infrastructure needs to be clarified, along with possible authority needed by any oversight body.
- Who makes the decision to permit statistical uses of multiple data sources?
- Who can impose sanctions for violations of data-use agreements?
- Who ensures that data are properly documented?
- What form of governance oversight is needed over the full data infrastructure? The National Academies recommends a board of directors (the National Academies, 2017a); CEP recommends a steering committee that includes representatives of the public, federal departments, state agencies, and academia (Commission on Evidence-Based Policymaking, 2017). Questions remain:
- Is the board or steering committee simply advisory, or does it have specific authority?
- How does the board or steering committee composition reflect the appropriate multistakeholder diversity?
- What accountability mechanisms are needed to ensure compliance with laws, policies, and standards and to ensure ethical practices related to data management, data use, and data users? The Federal Data Ethics Framework tenet states:
Accountability requires that anyone acquiring, managing, or using data be aware of stakeholders and be responsible to them, as appropriate. Remaining accountable includes the responsible handling of classified and controlled information, upholding data use agreements made with data providers [data holders], minimizing data collection, informing individuals and organizations of the potential uses of their data, and allowing for public access, amendment, and contestability to data and findings, where appropriate (Federal Data Strategy, n.d., p. 4).
- What accountability mechanisms need to be established?
- How will the system provide remedy and redress?
Standards: An Important Component of Data Governance
Just as data governance implements the spirit of legal reform permitting a new data infrastructure, “standards” are the logical implementations for some features of data governance and are important building blocks for a new data infrastructure. Standards facilitate the acquisition, collection, and organization of data, but also support other important data-infrastructure capabilities, such as blending of data.
Useful data sharing requires interoperability. Data have to be exchanged across users and should be comparable over time, space, and subpopulations. For example, interoperability in healthcare means separate systems, devices, organizations, and entities can exchange and appropriately use health-related data. Comparability means the unemployment rate in Los Angeles should have the same meaning as that in rural West Virginia. The unemployment rate in March 2022 should have the same interpretation as the unemployment rate in January 2018. Unemployment among those 18–25 years old should mean the same as unemployment among those 45–60 years old. Interoperability and comparability rest on the consistent application of standards for data elements, classification schemes, documentation, time periods, and more.
The interoperable exchange of multiple data sources starts with data documentation, commonly referred to as metadata. Metadata require the use of standards, especially for data documentation. Statistical surveys have relied on standard, consistent questions and definitions to collect information from individuals, households, establishments, or enterprises. The panel
recognizes that survey participants may not recall or retain the requested information, business records may not align with the concept a survey is attempting to measure, or survey participants may not read the instructions or understand the questions. Yet, survey questions developed by statistical agencies or their contractors have been tested, the recordkeeping practices of businesses are periodically assessed,12 and statistical data assets commonly contain associated metadata. The situation for other data-holder data assets may be quite different, but standards are critical if data are to be exchanged and used for statistical purposes. Lack of standards is one of the biggest obstacles to using health data, for example (Moyer, 2021).
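To make the role of standardized metadata concrete, the sketch below shows a minimal, hypothetical machine-readable metadata record for a single data element, together with a completeness check a receiving agency might run before accepting shared data. The field names and the `validate_record` function are illustrative assumptions for discussion, not the schema of any actual metadata standard (real standards define far richer structures).

```python
# Hypothetical minimal metadata record for one data element.
# Field names are illustrative; actual standards are far richer.
METADATA_FIELDS = {"name", "label", "universe", "reference_period", "valid_values"}

unemployment_status = {
    "name": "emp_status",
    "label": "Employment status in the reference week",
    "universe": "Civilian noninstitutional population, age 16 and older",
    "reference_period": "2022-03",
    "valid_values": {1: "Employed", 2: "Unemployed", 3: "Not in labor force"},
}

def validate_record(record: dict) -> bool:
    """Check that a metadata record documents every required field."""
    return METADATA_FIELDS.issubset(record) and all(record[f] for f in METADATA_FIELDS)

print(validate_record(unemployment_status))  # True: all required fields documented
```

A data holder whose records pass such a check has documented what each element means, for whom, and for when, which is precisely the documentation that interoperable exchange requires.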
Central governments often encourage the use of statistical standards to ensure the usefulness of statistical information. Statistical standards result in the production and publication of consistent, comparable information. In the United States, for example, there are standard definitions of metropolitan areas, of occupations, of industries and products, and basic units of measurement (e.g., length, width, weight). These standards, when accepted by populations, become an integral part of how businesses compare one another, how institutions assess their size, and how households assess their welfare. As Katherine Wallman, workshop participant, noted, data standards also can be an important gift to data holders and can incentivize them to share their data. Standards permit the coordination of actors on shared documentation, and, in return, government statistical agencies can use standardized data to report back to stakeholders, creating a virtuous cycle of standards and information useful to society.
In the panel’s judgment, a new data infrastructure needs to adopt existing data standards, when appropriate, and promote the creation of new standards. Standardizing the definitions of the top 25 health data elements should be a priority, for example, as noted by workshop participant Niall Brennan. Shared standards are key to measuring data quality. Coverage, for example, depends on a shared understanding of the total data universe. Given that different data sources represent different segments of a universe (e.g., only businesses with paid employees) or even different universes (e.g., households vs. businesses), understanding coverage is an essential part of ensuring that a new data infrastructure benefits all.
No single data standard exists or is possible. Appendix 3B contains a partial listing of many alternative data-exchange and metadata standards, including global consortia of organizations seeking to increase the interoperability of data. The partial list demonstrates the existence of standards across diverse sectors of data. There are many standards-developing organizations and, consequently, many distinct standards. At the federal level, “adopting standards is a means by which the separate U.S. federal statistical
12 See: https://play.google.com/books/reader?id=QzODHZLU9Z0C&pg=GBS.PA18&hl=en
agencies can achieve some uniformity and interoperability among their data and metadata systems in terms of both data stores and services” (the National Academies, 2021, p. 116). It is important to note that the most effective standards are created with ongoing input from stakeholders—stakeholders are the implementers of the standards and thus must realize benefits from them.
Matthew Gee discussed the use-cases that were primary drivers of a public-private effort to create uniform interoperable standards for jobs and employment records kept by employers (Gee, 2021). Ivan Deloatch, Federal Geographic Data Committee, shared the experiences of the multiagency geospatial community in defining standards and interoperability over a complex landscape of data collection and use.13 In short, a modern data infrastructure resting on more data sharing would be greatly advanced by the adoption of common standards across the partners.
Effective data governance is critical and should be inclusive and accountable; governance policies and standards facilitating interoperability include key stakeholders and oversight bodies. (Conclusion 3-4)
Attribute 6: Transparency to the Public Regarding Analytical Operations Using the Infrastructure
In addition to the attributes described above, the panel believes that transparency is critical to building the trust essential to engendering widespread support for a new data infrastructure.14 A new data infrastructure, in the panel’s view, must be viewed as legitimate by the participating data holders, data subjects, and society at large. A new infrastructure will include more sources of data from more data holders on more data subjects than did the data infrastructure of the 20th century. To be useful indicators of the country’s welfare, the statistical information derived from a new data infrastructure must be credible and trusted. Transparency is critical, and “those engaged in generating and using data and evidence should operate transparently, providing meaningful channels for public input and comment and ensuring that evidence produced is made publicly available” (Commission on Evidence-Based Policymaking, 2017, p. 17).
13 See: https://www.fgdc.gov/standards/fgdc-standards-program-overview
14 For a more in-depth discussion of the importance of transparency in official statistics, see the National Academies (2022a, pp. 35–38).
Therefore, in the panel’s opinion, transparency must be a stated requisite in the legal basis of a new data infrastructure, as well as part of that infrastructure’s data-governance framework. Formal governance roles can be designed to enhance transparency. Transparency is also a prerequisite for accountability, which enables the public to express concerns, seek redress, and oversee compliance with the infrastructure’s stated mission. For example, Organisation for Economic Co-operation and Development (OECD) member countries have passed legislation identifying individuals or institutions responsible for overseeing access to and dissemination of data in their respective countries:
- An ombudsman or mediator (e.g., in New Zealand, Norway, and Sweden);
- An information commissioner (e.g., in Germany, Hungary, Scotland, Slovenia, and the United Kingdom [U.K.]);
- A commission or institution (e.g., in Chile, France, Italy, Mexico, and Portugal);
- Another body responsible for monitoring this right, such as the Right to Information Assessment Review Council and the ombudsman in Turkey, both of which ensure the observance of all relevant laws (OECD, 2019, p. 15).
In the United Kingdom, establishment of the Statistics Authority was an attempt to achieve such visible oversight. As stated on the Authority’s website:
The Authority is an independent statutory body. It operates at arm’s length from government as a non-ministerial department and reports directly to the U.K. Parliament, the Scottish Parliament, the National Assembly for Wales and the Northern Ireland Assembly. The work of the Authority is further defined under secondary legislation made under the Act by the U.K. Parliament or the devolved legislatures.
The Authority has a statutory objective of promoting and safeguarding the production and publication of official statistics that ‘serve the public good.’ The public good includes:
- informing the public about social and economic matters; and
- assisting in the development and evaluation of public policy; and
- regulating quality and publicly challenging the misuse of statistics (U.K. Statistics Authority, 2022a).
The panel notes that the current United States legal and governance framework does not supply the level of transparency that these formal entities provide.
The panel is cognizant of how tools of transparency have historically been manipulated to undermine trust in institutions, data, and science
(Pozen, 2018). Transparency alone cannot ensure that the public trusts a new data infrastructure, but a new data infrastructure cannot be trusted without transparency. Therefore, in the panel’s opinion, transparency cannot be an end goal, but must be a commitment to iteratively engage with stakeholders to ensure an effective flow of information.
In sum, the panel advises that a 21st century national data infrastructure should seek the active engagement of diverse communities of interest, advocates of privacy protection, proponents of the benefits of broader data sharing, those concerned about data equity, and those promoting new national statistics. Multistakeholder participation is needed, and data holders must be represented and given a voice in making decisions that affect them, as well as in developing standards and establishing policies. At any time, the public, data holders, and data subjects should be able to know how their data are being used, by whom, for what purposes, and to what societal benefit. Transparent communication with the public, data holders, data subjects, and all relevant constituencies about how data are used and protected and how they are benefiting society can help instill confidence in a new data infrastructure and eventually result in societal trust in and “ownership” of that infrastructure.
Trust in a new data infrastructure requires transparency of operations and accountability of the operators, with ongoing engagement of stakeholders. (Conclusion 3-5)
Attribute 7: State-of-the-Art Practices for Access, Statistical, Coordination, and Computational Activities; Continuously Improved to Efficiently Create Increasingly Secure and Useful Information
The panel notes that the skills required to support research and statistical operations on infrastructure data blended from diverse sources (see Chapter 4) include a combination of knowledge about network features, cybersecurity, secure multiparty computing, encryption, and other fields. These are new skills for many social and economic scientists who have, for decades, developed tools for measurement, data collection, data curation, data storage, and statistical disclosure analysis.
In the panel’s judgment, the access and use of diverse data assets held by distinct data holders in various sectors will involve new partners who have divergent experiences with digital data. Data-seeking organizations will require expertise in working with data holders to understand the basic processes generating their data. Data-seeking organizations will need to discern which metadata standards are best suited to the data-holder organization, and they will need to understand and adapt to the data standards of the data holder. Data-seeking organizations will need to develop new
frameworks for evaluating and communicating uncertainty and error in data. They may need to use stronger cybersecurity and privacy-protecting data curation. They must address the legal and procedural questions that the data holder will ask when initially considering sharing data assets. The panel expects that these skills need to evolve over time and that a new infrastructure should account for the dynamic nature of the digital society.
On the computational and statistical side, the panel believes that the data-seeker must have the talent to blend data together for more insightful research and statistical products. For example, pilot research projects have led to a new vision of how data might be shared between companies and statistical agencies in the 21st century (Haltiwanger et al., 2021). Because data holdings are so large that they cannot be transported from the data holder to the data-seeking statistical agency, active work is focusing on how data can be usefully and safely accessed and processed where they currently reside (i.e., at the data holder’s facility). Software residing in the data holder’s domain is being designed to fully comply with the data holder’s user agreements. Software is designed to act on the data holder’s existing data to produce aggregates that serve as the statistical building blocks that a federal statistical agency might use directly or blend with other survey or census data. Developing mechanisms to interoperate across distributed data sources will be key to a new data infrastructure, according to the panel’s vision.
In the panel’s opinion, such projects suggest that a new data infrastructure should have a novel, distributed design, with pre-vetted, safe software embedded behind the firewalls of both the data-seeking and data-holding organizations. Software, written collaboratively by both parties, would create pre-specified aggregates of data as described above. At a data holder’s site, the software would produce, for the data seeker, new statistical products or intermediate inputs into other products. Existing and new products would be provided simultaneously to both the data holder and to the public. These new products could include benchmarks for data holders as well as detailed analyses of the performance of an economic sector (e.g., the changes in product diversity over time in an industry) while not undermining competition within an industry or sector.
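As a sketch of how pre-vetted software at a data holder's site might produce only pre-specified aggregates, the code below tabulates a statistic by group and suppresses any cell built from fewer than a minimum number of records, so that only safe aggregates leave the data holder's premises. The function name, record layout, and threshold of 3 are assumptions for illustration only, not the disclosure-avoidance method of any agency or project.

```python
from collections import defaultdict

# Illustrative aggregation code that would run behind the data holder's
# firewall. Only aggregate cells meeting the minimum-count threshold are
# released; individual records never leave the premises.
MIN_CELL_COUNT = 3  # assumed disclosure-avoidance threshold, for illustration

def aggregate_payroll(records, group_key="industry"):
    """Return record counts and total payroll by group, suppressing small cells."""
    cells = defaultdict(lambda: {"n": 0, "payroll": 0.0})
    for rec in records:
        cell = cells[rec[group_key]]
        cell["n"] += 1
        cell["payroll"] += rec["payroll"]
    # Release only cells that meet the threshold.
    return {k: v for k, v in cells.items() if v["n"] >= MIN_CELL_COUNT}

records = [
    {"industry": "retail", "payroll": 100.0},
    {"industry": "retail", "payroll": 120.0},
    {"industry": "retail", "payroll": 90.0},
    {"industry": "mining", "payroll": 500.0},  # only one record: suppressed
]
print(aggregate_payroll(records))
# {'retail': {'n': 3, 'payroll': 310.0}}
```

The design choice this illustrates is that the code, not the microdata, crosses the organizational boundary: both parties can review and vet the aggregation logic before it runs, and the output is limited to the pre-specified statistical building blocks.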
As mentioned in Chapter 2, such pilot projects are now ongoing in federal statistical agencies, and the panel expects that important lessons will be learned. Like the Committee for National Statistics’ report, Federal Statistics, Multiple Data Sources, and Privacy Protection: Next Steps (the National Academies, 2017a), the panel notes that computational approaches are likely to evolve over time, as innovation in computational and statistical methods for blended data continue. The sustainability of a new national data infrastructure requires practices that closely track these innovations and adapt to incorporate new technologies, methods, and capabilities.
While new technologies and advanced methods may offer significant benefits, they may also increase bias or inequity, so attention must be paid to ethical use of these innovations (Federal Data Strategy, n.d.). Smaller organizations and less-resourced data users may find running such pilots and adopting and maintaining new technologies to be difficult, which presents another practical challenge. Many likely lack the economies of scale and scope to employ highly skilled technologists and to purchase advanced software and hardware. Sharing services with a host organization may not be a solution if the organizations do not have similar needs, and sharing may be risky if it diminishes the autonomy of a statistical agency.
The operations of a new data infrastructure would benefit from the inclusion of continually evolving practices, methods, technologies, and skills, to ethically leverage new technologies and advanced methods. (Conclusion 3-6)
This chapter has described a vision for a 21st century national data infrastructure along with seven key attributes that are listed in Box 3-2. Achieving the vision of a new data infrastructure with these attributes will not be easy, but it can be done. This chapter has identified important changes that are needed to achieve the requisite attributes, to make the vision of a 21st century national data infrastructure a reality. In the panel’s judgment, commitment and action are required to fully realize the promise of this new infrastructure.
LAWS AND OFFICE OF MANAGEMENT AND BUDGET GUIDANCE ON CONFIDENTIALITY AND PRIVACY PROTECTION15
Protecting the confidentiality of individual information collected under a confidentiality pledge—whether from individuals, households, businesses, or other organizations—is a bedrock principle of federal statistics. Federal statistical agencies also strive to respect the privacy of individual respondents through such means as limiting the collection of information to that which is necessary for an agency’s mission. Respect for privacy has a history in federal legislation and regulation that extends back many decades; so, too, does protection of confidentiality, except that not all federal agencies were covered.16 With the original passage of CIPSEA in 2002 (see below), a firm legislative foundation was established for confidentiality protection of statistical data governmentwide.
Privacy Act of 1974
The Privacy Act of 1974 (P.L. 93-579, as amended; codified at 5 USC 552a) is a landmark piece of legislation that grew out of concerns about the implications of computers, credit bureaus, proposals for national databanks, and the like on personal privacy. The act states in part (5 USC 552a(b)):
No agency shall disclose any record which is contained in a system of records by any means of communication to any person, or to another agency, except pursuant to a written request by, or with the prior written consent of, the individual to whom the record pertains, unless disclosure of the record [is subject to one or more of 12 listed conditions].
The defined conditions for disclosure of personal records without prior consent include use for statistical purposes by the Census Bureau, for statistical research or reporting when the records are to be transferred in a form that is not individually identifiable, for routine uses within a U.S. government agency, for preservation by the National Archives and Records Administration “as a record which has sufficient historical or other value to warrant its continued preservation by the United States Government,” for
15 Excerpted from the National Academies (2021), pp. 159–168. Citations in this section are from the original text and are not included in this report’s Reference list. See original text at https://nap.nationalacademies.org/catalog/24810 for complete details.
16 For example, Title 13 of the U.S. Code, providing for confidentiality protection for economic and population data collected by the U.S. Census Bureau, dates back to 1929; in contrast, the Bureau of Labor Statistics had no legal authority for its policies and practices of confidentiality protection until the passage of CIPSEA in 2002 (see NRC, 2003, pp. 119–121).
law enforcement purposes, for congressional investigations, and for other administrative purposes.
The Privacy Act mandates that every federal agency have in place an administrative and physical security system to prevent the unauthorized release of personal records; it also mandates that every agency publish in the Federal Register one or more system of records notices (SORNs) for newly created and revised systems of records that contain personally identifiable information as directed by OMB.17 SORNs are to describe not only the records and their uses by the agency, but also procedures for storing, retrieving, accessing, retaining, and disposing of records in the system.18
Federal Policy for the Protection of Human Subjects, 45 Code of Federal Regulations (CFR) 46, Subpart A (“Common Rule”), as Revised in 2017
The 1991 Common Rule regulations, promulgated by the U.S. Department of Health and Human Services (DHHS)19 and signed onto by nine other cabinet departments and seven independent agencies (in their own regulations), represent the culmination of a series of DHHS regulations dating back to the 1960s (see Practice 7 and NRC, 2003, Ch. 3). The regulations are designed to protect individuals whom researchers wish to recruit for research studies funded by the federal government, which include surveys and other kinds of statistical data collection.20
These regulations require that researchers obtain informed consent from prospective participants, minimize risks to participants, balance risks and benefits appropriately, select participants equitably, monitor data collection to ensure participant safety (where appropriate), and protect participant privacy and maintain data confidentiality (where appropriate). Institutional Review Boards (IRBs) at universities and other organizations and agencies registered with DHHS review research protocols to determine whether they qualify for exemption from or are subject to IRB review and, if the latter, whether the protocol satisfactorily adheres to the regulations. Some federal statistical agencies are required to submit data-collection protocols to an IRB for approval; other agencies maintain exemption from IRB review but follow the principles and spirit of the regulations.
17 See Office of Management and Budget (2016).
18 For an example of SORNs for a statistical agency, see https://www.census.gov/about/policies/privacy/sorn.html [February 2021]
19 See: https://www.hhs.gov/ohrp/regulations-and-policy/regulations/common-rule/ (February 2021). In addition to Subpart A of 45 CFR 46, DHHS and some other departments and agencies have signed onto Subparts B, C, and D, which pertain to pregnant women, human fetuses, and neonates; prisoners; and children, respectively.
20 Of those departments with statistical units, all signed onto the Common Rule with the exception of the Departments of Labor and the Treasury.
An Advance Notice of Proposed Rulemaking, issued in 2011, proposed changes to the Common Rule, including revisions to the provisions for confidentiality protection.21 A Notice of Proposed Rulemaking, which indicated responses to the extensive comments on that advance notice, was issued in 2015; it too included a comment period.22 A final rule was published January 19th, 2017,23 which took effect on January 19th, 2018 (for cooperative research involving more than one institution, the effective date was January 20th, 2020). Some of the changes from the 1991 version of the Common Rule are these:
- The U.S. Department of Labor became a signatory to the Common Rule; consequently, only one department that houses a federal statistical agency (U.S. Department of the Treasury) is not a signatory.
- Provisions to exempt research with human participants from IRB review were modified and enlarged and, where appropriate, IRB review is to be focused on the adequacy of confidentiality protection.
- To assist IRBs in determining the adequacy of confidentiality protection, the Secretary of DHHS, after consultation with OMB and other federal signatories, is to issue guidance on what provisions are adequate to protect the privacy of subjects and to maintain the confidentiality of data.
- Provisions are added for “broad” consent for storage, maintenance, and secondary research use of identifiable private information or biospecimens.
1997 Order Providing for the Confidentiality of Statistical Information
OMB issued this order in 1997 to bolster the confidentiality protections afforded by statistical agencies or units (as listed in the order), some of which lacked legal authority to back up their confidentiality protection.24 CIPSEA (see next section) placed confidentiality protection for statistical information on a strong legal footing across the entire federal government.
21 See 76 Federal Register 44512 (July 26th, 2011). Available: https://www.federalregister.gov/d/2011-18792. See also NRC (2014).
22 See 80 Federal Register 53933 (September 8th, 2015). Available: https://www.federalregister.gov/d/2015-21756
23 See 82 Federal Register 7149 (January 19th, 2017). Available: https://www.federalregister.gov/d/2017-01058
24 See 62 Federal Register 35044 (June 27th, 1997). Available: https://www.federalregister.gov/d/97-16934
Confidential Information Protection and Statistical Efficiency Act
The Confidential Information Protection and Statistical Efficiency Act (CIPSEA) was first enacted as Title V of the E-Government Act of 2002 (P.L. 107-347) and was recodified as part of the Evidence-Based Policymaking Act of 2018 (see above). CIPSEA provides a strong statutory basis for the statistical system with regard to confidentiality protection and data sharing. CIPSEA has four parts: the two original parts cover confidentiality (Part B) and data sharing (Part C; efficiency), while Part A contains definitions and Statistical Policy Directive No. 1 and Part D covers access to data for evidence (see Evidence Act above).
Part B, Confidential Information Protection
Part B of CIPSEA strengthens and extends statutory confidentiality protection for all statistical data collections of the U.S. government. Prior to CIPSEA, such protection was governed by a patchwork of laws applicable to specific agencies, judicial opinions, and agencies’ practices. For all data furnished by individuals or organizations to an agency under a pledge of confidentiality for exclusively statistical purposes, Part B provides that the data will be used only for statistical purposes and will not be disclosed in identifiable form to anyone not authorized by the title. It makes the knowing and willful disclosure of confidential statistical data a class E felony, with fines up to $250,000 and imprisonment for up to five years.
Part B pertains not only to surveys, but also to collections by a federal agency for statistical purposes from nonpublic administrative records (e.g., confidential state government agency records). Data covered under Part B are not subject to release under a Freedom of Information Act request.
Part C, Statistical Efficiency
Part C of CIPSEA permits the Bureau of Economic Analysis (BEA), the Bureau of Labor Statistics (BLS), and the Census Bureau to share individually identifiable business data for statistical purposes. This part has three main purposes: (1) to reduce respondent burden on businesses; (2) to improve the comparability and accuracy of federal economic statistics by permitting these three agencies to reconcile differences among sampling frames, business classifications, and business reporting; and (3) to increase understanding of the U.S. economy and improve the accuracy of key national indicators, such as the National Income and Product Accounts.
However, this part does not authorize any new sharing among BEA, BLS, and the Census Bureau of any individually identifiable tax return data that originate from the Internal Revenue Service (IRS). This limitation
currently blocks some kinds of business-data sharing, such as those for sole proprietorships, which are important for improving the efficiency and quality of business-data collection by statistical agencies. For tax return information, data sharing is limited to a small number of items for specialized uses by a small number of specific agencies (under Title 26, Section 6103 of the U.S. Code, and associated Treasury Department regulations, as modified in the 1976 Tax Reform Act). The governing statute would have to be modified to extend sharing of tax return items to agencies not specified in the 1976 legislation. Although proposals for legislation to expand access to IRS information for limited statistical purposes have been developed through interagency discussions, they have not received the necessary congressional approval.
CIPSEA Implementation Guidance
OMB originally released implementation guidance for CIPSEA in 2007 (U.S. Office of Management and Budget, 2007). The guidance covered such topics as the steps that agencies must take to protect confidential information; wording of confidentiality pledges in materials that are provided to respondents; steps that agencies must take to distinguish any data or information they collect for nonstatistical purposes and to provide proper notice to the public of such data; and ways in which agents (e.g., contractors, researchers) may be designated to use individually identifiable information for analysis and other statistical purposes and be held legally responsible for protecting the confidentiality of that information. Under the Evidence Act, OMB is charged with promulgating guidance for implementation of a process to designate statistical agencies and units.25 A total of 16 agencies and units are currently so recognized (see Appendix B).
Privacy Impact Assessments Required Under the E-Government Act of 2002, Section 208
Section 208 of the E-Government Act of 2002 (P.L. 107-347) requires federal agencies to conduct a privacy impact assessment whenever an agency develops or obtains information technology that handles individually identifiable information or whenever the agency initiates a new collection of individually identifiable information.26 The assessment is to be made publicly available and cover topics such as what information is being collected and why, with whom the information will be shared,
25 44 USC 3562(a).
26 Section 208 also mandates that OMB lead interagency efforts to improve federal information technology and use of the Internet for government services.
what provisions will be made for informed consent regarding data sharing, and how the information will be secured. Typically, privacy impact assessments cover not only privacy issues, but also confidentiality, integrity, and availability issues.27 OMB was required to issue guidance for development of the assessments, which was done in a September 26th, 2003, memorandum (M-03-22) from the OMB director to the heads of executive agencies and departments.28
Section 208, together with Title III, FISMA (see below), and Title V, CIPSEA (see above), of the 2002 E-Government Act are the latest in a series of laws, beginning with the Privacy Act of 1974, that govern access to individual records maintained by the federal government (see also Federal Cybersecurity Enhancement Act of 2015, below).
Federal Information Security Management Act of 2002
FISMA was enacted in 2002 as Title III of the E-Government Act of 2002 (U.S. Congress, 2002a) to bolster computer and network security in the federal government and affiliated parties (such as government contractors) by mandating yearly audits.
FISMA imposes a mandatory set of processes that must be followed for all information systems used or operated by a federal agency or by a contractor or other organization on behalf of a federal agency. These processes must follow a combination of Federal Information Processing Standards documents, the special publications issued by the National Institute of Standards and Technology (SP-800 series), and other legislation pertinent to federal information systems, such as the Privacy Act of 1974 and the Health Insurance Portability and Accountability Act of 1996.
The first step is to determine what constitutes the “information system” in question. There is no direct mapping of computers to an information system; rather, an information system can be a collection of individual computers put to a common purpose and managed by the same system owner. The next step is to determine the types of information in the system and categorize each according to the magnitude of harm that would result if the system suffered a compromise of confidentiality, integrity, or availability. Succeeding steps are to develop complete system documentation, conduct a risk assessment, put appropriate controls in place to minimize risk, and arrange for an assessment and certification of the adequacy of the controls.
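The categorization step described above follows the “high water mark” rule from FIPS 199: a system’s overall impact level for each security objective (confidentiality, integrity, availability) is the maximum impact level across all information types it holds. The sketch below illustrates that rule in Python; the information types and impact assignments are invented for illustration, not official NIST SP 800-60 designations.

```python
# Sketch of a FIPS 199-style security categorization ("high water mark" rule).
# Information types and their impact levels below are illustrative assumptions,
# not official NIST SP 800-60 assignments.

LEVELS = {"low": 1, "moderate": 2, "high": 3}

def categorize_system(info_types):
    """Overall category per objective = max impact across all information types."""
    overall = {"confidentiality": "low", "integrity": "low", "availability": "low"}
    for impacts in info_types.values():
        for objective, level in impacts.items():
            if LEVELS[level] > LEVELS[overall[objective]]:
                overall[objective] = level
    return overall

# Hypothetical system holding two information types with different impact profiles
system = {
    "survey_microdata": {
        "confidentiality": "high", "integrity": "moderate", "availability": "low"},
    "public_tabulations": {
        "confidentiality": "low", "integrity": "moderate", "availability": "moderate"},
}

print(categorize_system(system))
# {'confidentiality': 'high', 'integrity': 'moderate', 'availability': 'moderate'}
```

The high water mark means one sensitive information type (here, the confidential survey microdata) drives the confidentiality category for the whole system, which is why defining the boundaries of the “information system” in the first step matters so much.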
27 See, for example, the available privacy impact assessments prepared by the Census Bureau at https://www.census.gov/about/policies/privacy/pia.html
28 See: https://www.whitehouse.gov/wp-content/uploads/2017/11/203-M-03-22-OMBGuidance-for-Implementing-the-Privacy-Provisions-of-the-E-Government-Act-of-2002-1.pdf
FISMA affects federal statistical agencies directly in that each agency must follow the FISMA procedures for its own information systems. In addition, some departments are taking the position that all information systems in a department constitute a single information system for the purposes of FISMA: those departments are taking steps to require that statistical agencies’ information systems and personnel be incorporated into a centralized, department-wide system.
Federal Information Technology Acquisition Reform Act of 2014
The Federal Information Technology Acquisition Reform Act (FITARA) was enacted on December 19, 2014, to respond to such federal information technology (IT) challenges as duplicate IT spending among and within agencies, difficulty in understanding the cost and performance of IT investments, and inability to benchmark IT spending between federal and private-sector counterparts. FITARA has four major objectives: (1) strengthening the authority over and accountability for IT costs, performance, and security of agency chief information officers (CIOs); (2) aligning IT resources with agency missions and requirements; (3) enabling more effective planning for and execution of IT resources; and (4) providing transparency about IT resources across agencies and programs. It requires agencies (defined as cabinet departments and independent agencies) to pursue a strategy of consolidation of agency data centers, charges agency CIOs with the responsibility for implementing FITARA, and charges the U.S. Government Accountability Office with producing quarterly scorecards to assess how well agencies are meeting the FITARA objectives.
The director of OMB issued implementation guidance for FITARA, M-15-14, Management and Oversight of Federal Information Technology, on June 20th, 2015.29 This memorandum explicitly stated that agencies must implement the FITARA guidance to ensure that information acquired under a pledge of confidentiality solely for statistical purposes is used exclusively for those purposes. It also provided a “Common Baseline for IT Management,” which lays out FITARA responsibilities of CIOs and other agency officials, such as the chief financial officer and program officials. On May 4th, 2016, the federal CIO and the administrator of OIRA, both in OMB, jointly issued Supplemental Guidance on the Implementation of M-15-14 “Management and Oversight of Federal Information Technology”—Applying FITARA Common Baseline to Statistical Agencies and Units (U.S. Office of Management and Budget, 2016). This supplemental guidance poses questions for CIOs and other officials, including
29 See: https://obamawhitehouse.archives.gov/sites/default/files/omb/memoranda/2015/m-15-14.pdf
statistical agency heads, to address when implementing FITARA for statistical agency programs. The questions refer to the fundamental responsibilities of federal statistical agencies outlined in Statistical Policy Directive No. 1 (see above), which include confidentiality protection and meeting deadlines for key statistics.
Federal Cybersecurity Enhancement Act of 2015
The Federal Cybersecurity Enhancement Act of 2015 is Title II, Subpart B, of the Cybersecurity Act of 2015, which was attached as a rider to the Consolidated Appropriations Act of 2016, and so became law (P.L. 114-113) when the appropriations bill was signed on December 18, 2015. The impetus for Title II, Subpart B, was the efforts of the U.S. Department of Homeland Security (DHS), dating back to 2003, to deploy systems for detection and prevention of intrusions (“hacking”) into federal government information networks (see Latham and Watkins, 2016, p. 3). As of the end of 2015, this technology, known as EINSTEIN, covered only 45 percent of federal network access points. The act requires DHS to “make [EINSTEIN] available” to all federal agencies within one year, and thereafter requires all agencies to “apply and continue to utilize the capabilities” across their networks.
The technology, currently in version E3A, has been welcomed by federal statistical agencies, but agencies initially were concerned about a DHS interpretation of the act that would allow DHS staff to monitor traffic on agency networks and follow up on actual or likely intrusions. Such surveillance by DHS staff could lead to violations of agencies’ pledges to protect the confidentiality of information provided by individual respondents for statistical purposes, which state that only statistical agency employees or sworn agents can see such information. Ultimately, DHS retained its surveillance authority, and statistical agencies modified their confidentiality pledges. As described in a Federal Register notice from the U.S. Census Bureau (other statistical agencies have issued similar notices):30
DHS and Federal statistical agencies, in cooperation with their parent departments, have developed a Memorandum of Agreement for the installation of Einstein 3A cybersecurity protection technology to monitor their Internet traffic and have incorporated an associated Addendum on Highly Sensitive Agency Information that provides additional protection and enhanced security handling of confidential statistical data. However, many current Title 13, U.S.C. and similar statistical confidentiality pledges
30 Agency Information Collection Activities; Request for Comments; Revision of the Confidentiality Pledge Under Title 13 United States Code, Section 9, 81 Federal Register 94321 (December 23rd, 2016). Available: https://www.federalregister.gov/d/2016-30959
promise that respondents’ data will be seen only by statistical agency personnel or their sworn agents. Since it is possible that DHS personnel could see some portion of those confidential data in the course of examining the suspicious Internet packets identified by Einstein 3A sensors, statistical agencies need to revise their confidentiality pledges to reflect this process change.
The BLS led an interagency research program to test revised wording with samples of respondents, and agencies revised their pledges accordingly. As an example, the Census Bureau’s revised pledge, provided in 81 Federal Register 94321 (December 23rd, 2016; see footnote 30), states:
The U.S. Census Bureau is required by law to protect your information. The Census Bureau is not permitted to publicly release your responses in a way that could identify you. Per the Federal Cybersecurity Enhancement Act of 2015, your data are protected from cybersecurity risks through screening of the systems that transmit your data.
EXAMPLES OF STANDARDS THAT WOULD BE USEFUL TO ANY NEW DATA INFRASTRUCTURE
- Statistical Data and Metadata eXchange (SDMX): Formatting multidimensional data and metadata into a framework for automated data exchange among organizations.31
- The United Nations Economic Commission for Europe family of standards: Generic Statistical Business Process Model (GSBPM; defines the set of business processes needed to produce official statistics), Generic Statistical Information Model (GSIM; a reference framework of internationally agreed-upon definitions, attributes, and relationships that describe the information objects used in the production of official statistics), Common Statistical Production Architecture (CSPA; a reference architecture for the statistics industry covering GSBPM processes and providing a link between GSIM and GSBPM), and Common Statistical Data Architecture (CSDA; provides a data-centric view of a statistical institute’s architecture, putting a focus on data, metadata, and data capabilities needed to treat data as an asset).32
- The Data Documentation Initiative (DDI): An international standard that can document and manage specific stages in the research data lifecycle, such as conceptualization, collection, processing, distribution, discovery, and archiving.33
- The National Information Exchange Model (NIEM): A common vocabulary that enables efficient information exchange across diverse public and private organizations. The NIEM defines agreed-upon terms, definitions, relationships, and formats—independent of how information is stored in individual systems—for data being exchanged.34
- The International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC) 11179: Provides a conceptual model for managing classification schemes. There are many structures used to organize classification schemes, and there are many subject matter areas that classification schemes describe. So, ISO/IEC 11179 also provides a two-faceted classification for classification schemes themselves.35
31 See: https://sdmx.org/
32 See: https://unece.org/statistics/standards-and-metadata
33 See: https://ddialliance.org/
34 See: https://www.niem.gov
- International and domestic standards for electronic data interchanges: The predominant electronic data interchange standard in the U.S. is ANSI X12. The Securities and Exchange Commission uses XBRL for company financial reporting in EDGAR.36
- The Geospatial Data Act of 2018: This Act established the Federal Geographic Data Committee (FGDC) as the lead entity in the federal government for the development, implementation, and review of policies, practices, and standards relating to geospatial data. The FGDC has years of experience working with federal statistical, program, and administrative agencies to devise data standards related to collection, sharing, use, dissemination, and mitigation of risk.37
36 See: https://x12.org/
37 See: https://www.fgdc.gov/standards and also standards for metadata and interoperability: https://www.fgdc.gov/metadata; https://www.fgdc.gov/what-we-do/develop-geospatial-sharedservices/interoperability/gira
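To illustrate the kind of interoperability these standards aim for, the sketch below maps two agencies’ differently named fields onto a shared vocabulary before records are exchanged. All field names and the mini-vocabulary are invented for illustration; real standards such as NIEM and SDMX define formally governed schemas and message formats, not ad hoc dictionaries like this one.

```python
# Illustrative sketch of standards-based exchange: each agency translates its
# local field names into a shared vocabulary before sending records.
# The vocabulary, field names, and mappings here are hypothetical examples.

COMMON_VOCABULARY = {"PersonBirthDate", "LocationStateCode", "EmployerLegalName"}

AGENCY_A_MAPPING = {"dob": "PersonBirthDate", "state": "LocationStateCode"}
AGENCY_B_MAPPING = {"birth_date": "PersonBirthDate", "st_fips": "LocationStateCode"}

def to_common(record, mapping):
    """Translate a local record into common-vocabulary terms, dropping unmapped fields."""
    out = {}
    for field, value in record.items():
        term = mapping.get(field)
        if term in COMMON_VOCABULARY:
            out[term] = value
    return out

# The same person reported by two agencies under different local schemas
a = to_common({"dob": "1980-05-01", "state": "24", "internal_id": "x1"}, AGENCY_A_MAPPING)
b = to_common({"birth_date": "1980-05-01", "st_fips": "24"}, AGENCY_B_MAPPING)
print(a == b)  # records from both agencies now compare field for field
```

The design point the standards share is that the agreed-upon terms are defined once, independently of any agency’s internal storage, so new exchange partners only need a mapping to the common vocabulary rather than pairwise translations with every other data holder.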