The United States Needs a New National Data Infrastructure
An informed citizenry is foundational to a modern democracy. Information about the welfare of the population—its health and safety, educational achievement, occupational skill distribution, employment status, wealth, housing status, and hundreds of other attributes—guides the assessment of a country’s well-being. Information about the status of the economy similarly prompts judgment: Are firms growing? Are they investing in the future? Are they planning new ventures? Are new startups prevalent? What is the state and trend of economic inequality? Are federal resources allocated to subpopulations fairly and according to societal priorities, and are they distributed to locations where they are most needed? To assess current conditions as well as the performance of elected officials and the policies they pursue, citizens of democracies require authoritative and trustworthy statistics. The absence of such facts can leave citizens vulnerable to misinformation and disinformation—a threat to democracy itself.
In all democracies throughout the world, central governments have the responsibility of collecting data and making statistical information widely available to the populace. Credibility is a key attribute of information distributed from the government to the public. To achieve credibility, a set of essential principles, policies, and procedures insulates the collection of such statistical information from political interference. In some countries, a central bureau of statistics is protected by laws that grant legal independence to the methods and inquiries of statistical bureaus (e.g., The Statistics
Act passed by the Parliament of Canada in 1918).1 Such laws distinguish between administrative uses of data, in which individual-level data can be used for purposes such as determining program eligibility, and statistical uses of data, which produce aggregate estimates about populations. In the United States, the Confidential Information Protection and Statistical Efficiency Act of 2002 (U.S. Congress, 2002b) and other statutes ensure that data provided by respondents to federal statistical agencies for statistical purposes are confidential.
Government statistical agencies often articulate a set of principles, practices, and standards to ensure the trustworthiness and credibility of government statistics. For example, the seventh edition of Principles and Practices for a Federal Statistical Agency identifies “credibility among data users and stakeholders” and “independence from political and other undue external influence” as two of five well-established and fundamental principles (National Academies of Sciences, Engineering, and Medicine, 2021, pp. 5–6).
To enhance trust, laws prohibit statistical agencies from using data for enforcement purposes or intervening in the activities of individuals or businesses. Instead, statistical uses of data typically describe large populations and are constructed using aggregates of individual data. Statistical information should be relevant to policymakers but not espouse specific policy recommendations. Information should be timely and useful to decisionmakers, while also providing an accurate description of the country’s full population. Much statistical information tracks change over time for some feature or attribute of the country; hence, information should be consistent over time and location. Measurements upon which statistical information is based should accurately assess the concepts in question, that is, measurements should be fit for use. Together, these practices ensure the credibility of government statistics.
Historically, statistical agencies of central governments designed their own de novo measurement systems (e.g., for determining employment status). Registers listing each member of a given target population identify the persons or organizations eligible to be measured. Statistical agencies carefully construct these sampling frames to include all eligible members of a given target population—a hallmark of modern government statistics. Statistical sampling techniques identify a subset of the eligible members that represent the full population. Data collection typically involves self-administered questionnaires or interviews in every sampled unit. Statistical theories prescribe that unbiased statistics are produced when each sampled member of the population is measured. Consequently, statistical agencies expend great effort to contact, gain cooperation, and measure sampled units.
1 See: https://laws-lois.justice.gc.ca/eng/acts/S-19/FullText.html
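The sampling-and-estimation workflow just described (a frame, a probability sample, and design weights) can be sketched in a few lines. This is a hypothetical illustration: the frame size, sample size, and employment rate are invented, not drawn from any agency program.

```python
import random

random.seed(42)

# Hypothetical sampling frame: one record per eligible unit in the target
# population (the 60% employment rate is invented for illustration).
frame = [{"id": i, "employed": random.random() < 0.6} for i in range(100_000)]

# Simple random sample without replacement, as prescribed by sampling theory.
n = 2_000
sample = random.sample(frame, n)

# With equal selection probabilities, each sampled unit carries a design
# weight of N/n, and the weighted total is an unbiased population estimate.
weight = len(frame) / n
est_employed = sum(u["employed"] for u in sample) * weight
true_employed = sum(u["employed"] for u in frame)

print(f"estimated employed: {est_employed:,.0f}; frame truth: {true_employed:,}")
```

Because every frame unit has a known, nonzero chance of selection, the weighted estimate is unbiased; as the text notes, that guarantee assumes every sampled unit is actually measured.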
United States government statistical agencies produce more than just statistical information for monitoring the status of the country’s society and economy. The data assets of these agencies also create a research infrastructure for the empirical social and economic sciences and are used by government, academic, and nonacademic research organizations. The research uses of data, with the continuous exercise of strict privacy protections, have generated important discoveries that have directly influenced society’s awareness and understanding of key issues and informed policymakers’ proposals and programs. These include new insights about job creation and destruction, social mobility, job-to-job flows, housing affordability, the gig economy, health and education outcomes, morbidity and mortality, welfare and the well-being of children, crime and crime victimization, employer and household dynamics, and much more. Research findings also help the public separate fact from fiction. This research infrastructure, supported by statistical-agency data assets, generates essential shared knowledge about the country.
The U.S. federal statistical system has been the primary producer of statistics and an important research facilitator, but the nation lacks the data infrastructure needed to meet the demands of the 21st century.
Recent innovations in computer science and data analytics, combined with an explosion of available digital data, have set the stage for a reexamination of the infrastructure that produces the nation’s statistics and supports vital social and economic research. The country’s emerging data assets, growing expertise in accessing high-dimensional data, and the pressing need to address evolving societal threats (e.g., pandemics, social injustice, and climate change) call for envisioning a new data infrastructure that produces more timely, granular, and relevant statistical information.
A new data infrastructure will mobilize the nation’s relevant data assets by accessing data across sectors, to improve existing statistical products and create new ones—all in scientifically sound ways that incorporate enhanced privacy protections for data subjects and holders. Over time, statistical agencies have improved individual products by expanding their collections, and, occasionally, using administrative records and private sector data. Unfortunately, these remain exceptions, not the rule. The availability of new data assets and technologies is not enough—in the panel’s opinion, the United States should use new data assets and technologies within a new, coordinated national data infrastructure to meet the information and research needs of the 21st century.
In the panel’s judgment, the time is right to develop such a reformed, enriched national data infrastructure. Moreover, the federal statistical
system faces myriad and worsening challenges that demand building a new data infrastructure now, not later. The following sections will describe the state of the existing federal statistical system, the challenges it faces, and the impetus for change.
Producing National Statistics: Declining Response Rates and Increased Costs
In the panel’s opinion, U.S. statistical agencies’ reliance on sample-survey data and census data is unsustainable. The statistical theories that underlie these methods show that measuring a well-designed statistical sample of a population can produce high-quality descriptions of the full population, all while protecting confidentiality. In the 20th century, this approach served the world well (e.g., Rao and Prasad, 1986; Biemer, 2010).
However, inferences based on sample surveys require full measurement of the sample drawn. If some sample units are not measured, the theories require further assumptions that contemporary surveys cannot satisfy. This basic requirement of survey sampling has led to expensive efforts by government statistical agencies to increase participation and response rates. Unfortunately, most of these efforts have failed, regardless of investment, leading to incomplete measurement and increasing the risk of inaccuracies in statistical information. Table 2-1 shows the declining response rates for several important household surveys. Notably, a low overall response rate does not by itself imply nonresponse bias, the distinct problem that arises when the sample members who participate are not representative of the full sample (e.g., Czajka and Beyler, 2016). In a telling example, the pandemic exacerbated these declines in participation and led the U.S. Census Bureau to suspend the release of the 2020 one-year American Community Survey (ACS) estimates (U.S. Census Bureau, 2021a) and, for the first time, delay the release of the five-year ACS products (Bahrampour, 2021). Experimental estimates for 2020 were released on November 30, 2021, but they are not comparable with prior one-year estimates (U.S. Census Bureau, 2021b, 2021c).
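The distinction between a low response rate and nonresponse bias can be made concrete with a small simulation. The population size, employment rate, and response propensities below are invented for illustration; the point is only that bias arises when participation correlates with the attribute being measured, not from a low response rate per se.

```python
import random

random.seed(0)

# Hypothetical population with a 60% employment rate (invented numbers).
population = [{"employed": random.random() < 0.6} for _ in range(50_000)]
true_rate = sum(u["employed"] for u in population) / len(population)

# Differential response propensities: employed members respond less often.
def responds(unit):
    return random.random() < (0.4 if unit["employed"] else 0.7)

sample = random.sample(population, 5_000)
respondents = [u for u in sample if responds(u)]

response_rate = len(respondents) / len(sample)
naive_rate = sum(u["employed"] for u in respondents) / len(respondents)

print(f"response rate:        {response_rate:.1%}")
print(f"true employment rate: {true_rate:.1%}")
print(f"respondent-only rate: {naive_rate:.1%}")  # biased downward
```

Here a response rate of roughly one-half yields a badly biased estimate because employed members respond less often; had both groups responded at the same (even very low) rate, the respondent-only estimate would have remained approximately unbiased.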
The increasing costs of obtaining participation and flat or declining budgets have led to the elimination, or threat of elimination, of multiple important programs and surveys (Box 2-1). For example, in 1996, the National Vital Statistics System, part of the National Center for Health Statistics, suspended the collection of detailed national records-based data on marriages and divorces (Centers for Disease Control and Prevention, 2022). In 2008, after publishing fourth-quarter 2007 estimates, the U.S. Census Bureau terminated its quarterly survey measuring residential alterations, improvements, and repairs (U.S. Census Bureau, 2007). In the absence of official
TABLE 2-1 Selected Household Survey Response Rates
| Survey Date | CPS^a | CPI Housing^b | CE Interview^c | MEPS HC^d | ACS-Annual |
|---|---|---|---|---|---|
| Jan 2012 | 90.4 | 66.2 | 71.3 | 61.3 (overall) | 97.3 (weighted) |
SOURCE: Response rates were found on the websites of the U.S. Bureau of Labor Statistics (for CPS, CPI Housing, and CE Interview columns, see https://www.bls.gov/osmr/responserates/household-survey-response-rates.htm), the Agency for Healthcare Research and Quality (for MEPS HC column, see https://meps.ahrq.gov/mepsweb/survey_comp/hc_response_rate.jsp), and U.S. Census Bureau (for ACS-Annual column, see https://www.census.gov/acs/www/methodology/sample-size-and-data-quality/response-rates/).
a Current Population Survey, U.S. Census Bureau.
b Consumer Price Index Housing Survey, U.S. Bureau of Labor Statistics.
c Consumer Expenditure Survey, U.S. Bureau of Labor Statistics.
d Medical Expenditure Panel Survey, Household Component, Agency for Healthcare Research and Quality.
statistics, private sector estimates of the size of the home-improvement marketplace vary widely. For 2020, private sector estimates ranged from $150 billion (Statista, 2022) to $325–333 billion (Joint Center for Housing Studies, 2020). The elimination of the U.S. Bureau of Labor Statistics (BLS) Mass Layoff Statistics program, a BLS-state cooperative effort, resulted in the loss of a standardized approach across states to identify, describe, and track the effects of major job losses (U.S. Bureau of Labor Statistics, n.d.). With the loss of the Information & Communication Technology Survey, there are no longer official annual estimates of information, communication, and technology equipment or software purchases—a huge and growing market (Market Research Store, 2021). According to a report commissioned by the Census Project, a nonpartisan advocacy group, the future of the ACS is threatened (Hoeksema et al., 2022). Experts argued that the ACS, a survey central to the nation’s data infrastructure, needs an additional $100–300 million in funding to address current limitations and introduce much-needed enhancements. Other programs have fared somewhat better: the Survey of Income and Program Participation, after periodic funding shortfalls beginning in the late 1980s, lost and then regained funding, was later redesigned, and continues to this day (U.S. Census Bureau, 2021d).
Declining response rates,2 increasing collection costs, program reductions, and government continuing resolutions that freeze funding at the prior-year level and increase agency uncertainty, combined with the inability of federal statistical agency budgets to satisfy the growing demand for more timely and granular information, have generated a vicious circle. In the panel’s judgment, national statistics that depend solely on sample surveys are unsustainable. There is little hope of maintaining information flow to the American public and decisionmakers without a fundamental change in the way statistics are produced.
The Digital Data Revolution Presents Opportunities and Challenges
While surveys and censuses are experiencing increasing risk of error and spiraling costs, other data are being produced at unprecedented rates by a variety of data holders. Some of these data arise from the operations of federal, state, and local government agencies—the "administrative records" that identify individuals or businesses, and the information they report to these programs or agencies (e.g., taxing authorities and benefit-payment providers). For example, linked administrative and survey data could provide insights about eligibility and access to the Supplemental Nutrition Assistance Program (Bhaskar et al., 2021), or the impact of social security cutoffs on youth engagement with the criminal justice system could be
2 See Czajka and Beyler (2016, p. 11) for discussion of reasons for declining response rates.
examined using a data infrastructure that integrates U.S. Census Bureau surveys and federal administrative data with state and local criminal justice administrative records (Deshpande and Mueller-Smith, 2022).3 As another example, the effects of climate change could be measured by linking survey, census, and administrative data (Voorheis, 2021).
Administrative data from federal, state, and local governments can be an important source for statistical uses and evidence building (the National Academies, 2022b), and international statistical organizations are leveraging new data sources (Commission on Evidence-Based Policymaking, 2017; the National Academies, 2017b). In addition, local and city governments are already using data to make smarter, more informed decisions.4 Yet, most of these federal, state, and local administrative-data assets remain untapped by the federal statistical system, and statutes often prohibit their use for statistical purposes or research.
Further, a much larger set of data is being produced in the private sector each minute, including vast amounts of transaction data—credit card transaction data, point-of-purchase information, customer loyalty programs, consumer purchasing histories, as well as troves of credit-monitoring data. The transaction data of e-commerce businesses include product descriptions, prices, and quantity data, as well as information about the seller and purchaser, the transaction date, shipping location, and mode of transport. The housing market has also seen an explosion of data. Consumers can view satellite images of their homes and see their estimated values. They can shop, buy, sell, and arrange financing online. CoreLogic,5 a data broker, holds historical information on 5.5 billion property records, over one billion of which are updated every year. Online real estate brokers, like Redfin, use their data resources to produce housing-related statistics (Lambert, 2022). Health-related industries are also generating voluminous data. Electronic health records (EHRs) have grown exponentially and are shared among providers, clinicians, pharmacies, and patients. Acquiring and using EHRs in statistical surveys has been difficult (DeFrances and Lau, n.d.), but EHRs are available for other purposes. For example, the National Institutes of Health’s All of Us6 research program invited more than 1 million Americans to share their EHRs for research. The program recently released the first genomic dataset for 100,000 highly diverse whole-genome sequences (National Institutes of Health, 2022). Finally, mobile geolocation services embedded in nearly every app are another source of comprehensive
3 The Criminal Justice Administrative Records System data infrastructure is described in Finlay et al., 2022.
4 For examples, see: https://datasmart.ash.harvard.edu
5 For more information, see: https://www.corelogic.com/why-corelogic/
6 For more information, see: https://allofus.nih.gov/about/program-overview
individual data. These services can track individual mobility and provide data that can reveal intricate socio-behavioral phenomena (Valentino-DeVries et al., 2018). For example, retail scanner data have been used to determine household obesity status (Page et al., 2021).
The above examples illustrate the broad data revolution that is occurring and highlight current opportunities to enrich the data resources available for producing national statistics. Some new data sources contain data that households and businesses are asked to report in agency sample surveys, but they can also include more expansive and more timely data that could enrich existing statistical programs. Generally, these private sector data have been available to federal statistical agencies only through negotiated purchases or other bespoke contract mechanisms. Unlike federal and state administrative data, the use of private sector data is generally not limited by statute. These data have desirable properties that complement the attributes of surveys and censuses—almost all generate data in a more timely fashion than surveys. Further, some private sector data provide records of behavior, in contrast to survey responses, in which accuracy may be affected by respondents’ imperfect recall or lack of records (Grotpeter, 2007; Biemer et al., 2013; Snijkers et al., 2013). Private sector data may also include information about segments of the population that are poorly represented in sample surveys.
However, these nonofficial data have weaknesses not shared by censuses and surveys produced by government agencies. Official censuses and surveys are designed to achieve the measurement needs of the agencies, covering well-defined populations. Administrative data are often designed for a specific purpose, such as tax administration, and may only cover a subset of the population (Liao et al., 2020). Furthermore, statutes generally protect privacy by limiting data access and uses to specified purposes. In contrast, data from the private sector are often generated for operational reasons (e.g., e-commerce transactions). These organizations aim to serve their customers; they do not attempt to collect similar data from an entire household population, from all producers, or for all products. For example, cash sales are often excluded from digitally available retail transaction data. Data from private sector firms typically only include data from their customers. Further, data from private sector companies and data holders can change over time in response to business needs, in ways inconsistent with a statistical program’s needs for temporal consistency and comparability. Private sector data may disappear completely if business processes change, if a firm is sold or goes out of business, or if a firm simply decides to stop selling or sharing its data. Generally, commercial data are also limited in the number of attributes measured—they are often less descriptive than the multivariate richness provided by surveys and censuses. Additionally, private sector data may have quality issues or lack quality measures or adequate documentation.
Consequently, these two forms of data—responses (to surveys and censuses) and transactional records (from daily processes)—offer distinct strengths and weaknesses. Statistical surveys offer strong coverage of an entire population and measure many attributes on the sample units, but they are expensive, slow to produce information, and suffer from nonparticipation. Administrative data produced as part of federal or state programs are designed for a specific, nonstatistical purpose and, even if they achieve full participation, they might not represent the population. Local government and private sector organizations may offer more timely information, but only about those engaged in specific transactions or covered activities; these data may contain few attributes describing their customers. Similarly, data collected by nonprofit organizations and academic institutions or through crowdsourcing and citizen science have their particular strengths and weaknesses.
How can a data infrastructure take advantage of the strengths of these data assets and compensate for their weaknesses? Blended data combine information from at least two separate data assets. Careful blending of data from multiple, complementary sources, such as statistical surveys and censuses, administrative agencies, and the private sector, offers a way to generate more detailed, timely, and useful statistical information than is currently available (e.g., Tam et al., 2020).
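In its simplest form, blending is record linkage: units from two sources are joined on a shared identifier and their attributes are combined. The sketch below is hypothetical; the identifiers, field names, and the rule of preferring the administrative value are invented for illustration and are not taken from any agency's practice.

```python
# Blend a survey extract with an administrative source on a shared key,
# preferring the administrative value where one exists (a common design
# choice when the administrative record is the more accurate measurement).
survey = {
    101: {"household_size": 3, "reported_income": 41_000},
    102: {"household_size": 2, "reported_income": 58_000},
    103: {"household_size": 5, "reported_income": None},  # item nonresponse
}
admin_tax = {
    101: {"agi": 43_250},
    103: {"agi": 77_800},
    # unit 102 absent: administrative sources may not cover every unit
}

blended = {}
for uid, rec in survey.items():
    merged = dict(rec)
    # Fall back to the survey response when no administrative record exists.
    merged["income"] = admin_tax.get(uid, {}).get("agi", rec["reported_income"])
    blended[uid] = merged

for uid, rec in sorted(blended.items()):
    print(uid, rec["income"])
```

Real blending must also contend with inexact identifiers, differential coverage, and confidentiality protections, which is why one-off linkages of this kind do not scale without shared infrastructure.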
Current Efforts to Use Digital Data to Repair Weaknesses in National Statistics Demonstrate the Possibilities and Limitations of Alternative Data Sources
The past several years have seen multiple attempts by researchers at statistical agencies to blend diverse data sources with existing survey and census data. Many of these collaborations are one-time agreements between holders of new digital resources and individual researchers.
Ron Jarmin (2019) of the U.S. Census Bureau noted the important role that researchers have played in embracing alternative data sources:
Unsurprisingly, researchers have been faster than the statistical agencies to adapt alternative, and especially government administrative data to various economic measurement tasks. There already has been a large increase in the utilization of administrative data for research (Chetty, 2012) and policy evaluation (Jarmin and O’Hara, 2016). Examples include analyses of trends in income inequality and the changing nature of business dynamics. Often these studies use administrative data to study patterns that simply are not available from existing survey-based data—moreover, would be prohibitively expensive to generate in a survey context. In the examples just mentioned, longitudinally linked microdata with universe coverage permit much more precise descriptions of the underlying dynamics than would be possible with survey data.
Importantly, these research efforts can and do lead to innovations in official statistical products. For example, early work on matched employer-employee data (Abowd, Haltiwanger, Lane, 2004) led to the development of the Quarterly Workforce Indicators which integrate many sources of information including administrative data from state unemployment insurance records and survey-based data from the American Community Survey7 (Jarmin, 2019, p. 168).
Federal statistical agencies have recognized the importance of blending survey data with federal and state administrative data assets. BLS’s Quarterly Census of Employment and Wages,8 for example, blends survey data from the Multiple Worksite Report and the Annual Refiling Survey with administrative data provided by state unemployment insurance agencies. In a further extension, the U.S. Census Bureau’s On the Map for Emergency Management,9 an innovative byproduct of the matched employer-employee data mentioned above, integrates administrative, survey, and disaster-related real-time data from the National Weather Service’s National Hurricane Center, the U.S. Department of the Interior, the U.S. Department of Agriculture, and the Federal Emergency Management Agency.
The panel’s December 2021 workshops on The Scope, Components, and Key Characteristics of a 21st Century Data Infrastructure (see Appendix B for agendas) provided multiple examples of private sector data use by federal statistical agencies and units. Of the 13 designated statistical agencies, all except the Social Security Administration’s Office of Research, Evaluation, and Statistics currently use private sector data assets (Reamer, 2021). The Bureau of Economic Analysis (BEA) uses some 142 private sector assets, and the Energy Information Administration (EIA) uses approximately 80 private sector data assets, while the U.S. Census Bureau uses approximately 20 private sector data sources. Extensive use of private sector data by BEA and EIA has long-standing historical roots—since the 1930s for BEA and the 1970s for EIA (Reamer, 2021). Statistical agencies reported that they use private sector data for multiple purposes, including "to supplement or combine with existing agency-held data" (82%), "to better understand other indicators of the economic environment" (71%), "for verification, quality control or quality assurance for existing data estimates" (53%), or to "continue current reporting capacity of agency practices" (53%; Reamer, 2021).
The growing practice of blending private sector data assets with administrative and survey data presents challenges for statistical agencies. These include costs, legal and procurement hurdles, inadequate documentation,
7 Note that references within this quote are from the cited source and are not included in the reference list for this report.
8 See: https://www.bls.gov/cew/
and data quality problems. In the workshop sessions, presenters highlighted the following challenges:
- Data acquisition, access, and use across the federal statistical system are fragmented, inefficient, sometimes redundant, and largely uncoordinated. Even within agencies, procurement processes can be time-consuming and complex.
- Using private sector data is currently difficult and often expensive; inadequate or poorly documented metadata and technical obstacles make linking and blending data challenging.
- Laws and regulations remain major obstacles to accessing and using federal statistical agency-restricted data assets, federal program and administrative datasets, and state and local government datasets.
- Some data holders share their data, but vast spheres of activities have not yet been explored; shared data assets may not be representative.
- Data holders generally demand payment for data shared; costs vary significantly and sometimes increase substantially over time.
- Data-use agreements are often single-use and have no inherent replicability or sustainability.
- Data storage is siloed, expensive, and inefficient.
- The use of blended data requires new methods, statistical designs, privacy-preserving and confidentiality-protecting methods and tools, new skills and expertise, and possibly new organizational models.
- Privacy preservation and security protocols are inconsistent and vary across sectors, data holders, and agencies (the National Academies, 2017b, Ch. 5; O’Connor, 2018).
To meet these challenges in blending private sector data with surveys and other data sources, statistical agencies are actively sharing best practices and lessons learned. Such recent efforts to blend survey data with other data sources are consistent with the recommendations of several recent reports, detailed in the next section.
REPORTS RECOMMEND THE USE OF BLENDED DATA
Many recent studies demonstrate the need for blended data to improve statistical information. The first report of the National Academies’ Panel on Improving Federal Statistics for Policy and Social Science Research Using Multiple Data Sources and State-of-the-Art Estimation Methods recommended combining data assets of federal, state, and local governments with private sector sources (the National Academies, 2017a, p. 44).
The report also recommended the creation of a new entity charged with facilitating access to and use of multiple data sources by statistical agencies and researchers (the National Academies, 2017a, p. 104). The panel’s second report, issued in the fall of 2017, assessed alternative approaches for creating an environment that would blend diverse data sources (the National Academies, 2017b). The report described statistical methods and models needed to combine data; examined statistical and computer-science approaches that foster privacy protections; evaluated frameworks for assessing the quality and utility of alternative data sources; and considered various options for implementing a new organization to facilitate data sharing.
A report from the Markle Foundation (2021) also supported the use of blended data to better meet the needs of policymakers, researchers, and the public. This report found that “reducing barriers and increasing capacity could significantly advance privacy-protected data sharing, help address disparity and inequality, and improve knowledge on how to increase economic mobility” (Markle Foundation, 2021, p. 4). During 2020, the Markle team conducted five expert working-group sessions and multiple one-on-one calls with experts to identify opportunities fostered by an improved data ecosystem:10
- Increasing equity in federal data, to better understand and address racial disparity and other inequities;
- Improving the accessibility and use of state data, to obtain insights into programs and benefits provided at the state level and to allow state and local policymakers to better understand and meet needs across geographies and populations;
- Increasing engagement with the public and community stakeholders on data collection, use, and reuse; and
- Leveraging new data (including private sector data), and creating new economic measures.
Active research and development projects are also helping to propel the use of blended data to improve government statistics and address issues of national importance. One example is the recently announced modernization of the U.S. Census Bureau’s residential construction statistics program, which was made possible only by blending multiple data sources (Darr, 2022). Rather than collecting residential housing permit data from some 9,000 permit-issuing organizations, the U.S. Census Bureau will
10 The Markle team used a slightly different terminology in its report than is used here. It focused on the government or federal “data ecosystem,” not a vision of a new national data infrastructure.
receive data from third-party sources and introduce a small cutoff sample to supplement the third-party data. Satellite imagery (using geolocation/georeferenced data dimensions), rather than data collected by telephone interviewers, will be used to identify the start of construction. This approach was adopted after collaborating with Statistics Canada. When fully implemented, the U.S. Census Bureau will publish more granular statistics, including construction permit statistics for every jurisdiction in the U.S., rather than just for states.
A second example illustrates that data sharing may advance understanding of current supply-chain bottlenecks. While existing statistical surveys and programs (including import and export statistics) collect data from important supply-chain participants, these statistics do not illuminate supply-chain logistics. A recent White House/industry partnership, the Freight Logistics Optimization Works project, is a data-sharing initiative that will pilot data exchanges between parts of the goods-movement supply chain, to produce a proof-of-concept by the end of summer 2022 (The White House, 2022a). The supply-chain pilot may provide insights about bottlenecks that cannot be gleaned from statistical surveys or programs, but additional benefits may be possible by linking pilot results with existing survey-based data and administrative data sources. Such linking could provide additional insights into the characteristics of key supply-chain participants.
Despite evidence that the blending of diverse data assets for statistical purposes is already a growing component of the federal statistical system and other parts of government,11 there is no cohesive, coordinated plan to build a new data infrastructure that would provide statistical agencies and the broader community with access to the diverse, relevant data sources needed to further this practice. The current state of the national data infrastructure prevents the United States from fully realizing the promise of blended data. A new national data infrastructure is needed that supports and facilitates the use of blended data to produce more timely, granular, and useful information.
RECENT CONGRESSIONAL AND DATA-RELATED INITIATIVES: NECESSARY BUT NOT SUFFICIENT
The important work of the Commission on Evidence-Based Policymaking (CEP) resulted in statutory change. CEP was established in 2016 to develop a strategy for increasing the availability and use of government data to build evidence about government programs while protecting privacy and confidentiality. The Commission's report, The Promise of Evidence-Based Policymaking, includes numerous recommendations for improving data access in a secure, privacy- and confidentiality-protected manner; modernizing privacy protections for evidence-building; implementing a National Secure Data Service (a service provider, not a data warehouse) within the U.S. Department of Commerce; and strengthening federal evidence-building capacity (Commission on Evidence-Based Policymaking, 2017). CEP used the term "evidence" to mean the use of statistical information for evaluating alternative policy decisions faced by the government's executive and legislative branches. More broadly, however, CEP recommendations also focus on improving access to and use of federal and state government-collected and federally controlled data for statistical purposes, including statistical production, research, and evidence-building activities. As a result of CEP's report, the president signed the Foundations for Evidence-Based Policymaking Act of 2018 (hereafter, the Evidence Act) on January 14th, 2019 (U.S. Congress, 2019). The Evidence Act implemented about half of CEP's recommendations and provides statistical agencies with a broader statutory basis for accessing and using data assets of federal nonstatistical agencies: "The head of an agency shall, to the extent practicable, make any data asset maintained by the agency available, upon request, to any statistical agency or unit for purposes of developing evidence" (U.S. Congress, 2019, Section 3581(a)).
11 The Federal Geographic Data Committee has been dealing with blending geospatial data for some time, as Ivan DeLoatch, workshop participant and former director of the Federal Geographic Data Committee, noted at The Scope, Components, and Key Characteristics of a 21st Century Data Infrastructure Workshop on December 9th, 2021.
The Evidence Act also expanded secure access to datasets covered by the Confidential Information Protection and Statistical Efficiency Act for approved statistical purposes, including evidence and evaluation uses, consistent with existing laws and regulations (U.S. Congress, 2002a, Section 3582). The Evidence Act encourages sharing, for statistical purposes and among designated federal agencies and entities, of data assets "created by, collected by, under the control or direction of, or maintained by the (federal) agency" (U.S. Congress, 2019, Section 3511(a)(1)), consistent with existing laws and regulations.
Title II of the Evidence Act establishes open data as the default for public data assets, unless restrictions or limitations exist, such as protecting confidentiality or national security. The Evidence Act requires agencies (statistical and nonstatistical) to develop and maintain a comprehensive inventory of all federally held and controlled data assets (including metadata) and access privileges. The Evidence Act also requires the U.S. Office of Management and Budget (OMB) to issue regulations on data classification by federal agencies, according to data sensitivity, and to provide access accordingly. Finally, the Evidence Act requires federal agencies to develop evidence-based policy and evaluation plans and to designate evaluation officers, statistical officials, and chief data officers to support and implement these new requirements.
The director of OMB was directed to establish a standard access process “through which agencies, the Congressional Budget Office, State, local, and Tribal governments, researchers, and other individuals, as appropriate, may apply to access the data assets accessed or acquired under this sub-chapter by a statistical agency or unit for purposes of developing evidence” (U.S. Congress, 2019, Section 3583).
According to the Evidence Act, providing access to data assets is “for purposes of developing evidence.” The Evidence Act defines “evidence” as “information produced as a result of statistical activities conducted for a statistical purpose” (U.S. Congress, 2019). OMB later provided federal agencies with additional guidance regarding Evidence Act implementation and broadly defined “evidence” to include:
- Foundational fact-finding—foundational research and analysis, such as aggregate indicators, exploratory studies, descriptive statistics, and basic research;
- Performance measurement—ongoing, systematic tracking of information relevant to policies, strategies, programs, projects, goals/objectives, and/or activities;
- Policy analysis—analysis of data, such as general-purpose surveys or program-specific data, to generate and inform policy (e.g., estimating regulatory impacts and other effects); and
- Program evaluation—a systematic analysis of a program, policy, organization, or components of these, to evaluate effectiveness and efficiency (OMB, 2019).

In short, any of these uses would meet the Evidence Act's definition of a legitimate "statistical purpose."
The Evidence Act also established the Advisory Committee on Data for Evidence Building (ACDEB), which currently has 26 members from federal agencies, state and local governments, academia, nonprofits, and the private sector. ACDEB is directed to “…evaluate and provide recommendations to the Director [of OMB] on how to facilitate data sharing, enable data linkage, and develop privacy enhancing techniques” (U.S. Congress, 2019). The first meeting of ACDEB was held in October 2020, and the Committee met monthly through November 2021 and bi-monthly thereafter. ACDEB is scheduled to terminate in October 2022.
On June 28th, 2021, the U.S. House of Representatives passed the bipartisan National Science Foundation for the Future Act (U.S. Congress, 2021),12 which included the National Secure Data Service (NSDS) Act. The NSDS Act directs the National Science Foundation (NSF) director, in consultation with the chief statistician, to establish a demonstration project within a year of enactment, to "develop, refine, and test models to inform the full implementation of the CEP recommendations for government-wide data linkage and access infrastructure for statistical activities conducted for statistical purposes."13 At a January 21st, 2022, ACDEB meeting, NSF announced that America's DataHub Consortium (ADC) would be the demonstration project and would be located in the National Center for Science and Engineering Statistics within NSF (Arora, 2022). At a meeting on May 20th, 2022, it was reiterated that ADC would be the pilot for the NSDS. On August 9th, 2022, the president signed the CHIPS and Science Act of 2022, which authorized and funded an unnamed NSDS demonstration project with language nearly identical to that used in the original NSDS Act. Section 10375 of the CHIPS and Science Act "establishes a National Secure Data Service demonstration project to test models and inform the full implementation of a government-wide data linkage and access infrastructure" (U.S. Congress, 2022a).
12 The NSDS Act was not acted on by the U.S. Senate and, thus, did not become law.
ACDEB issued a Year 1 Report on October 29th, 2021, including recommendations and a roadmap for Year 2 activities (Advisory Committee on Data for Evidence Building, 2021). At the ACDEB meeting on January 21st, 2022, OMB and the Interagency Council on Statistical Policy (ICSP) verbally responded to the Year 1 Report, announcing that they would engage with ACDEB iteratively, focusing on the Standard Access Process, ADC, and several other ICSP initiatives. It is unclear how this new iterative approach will influence the ACDEB Year 2 Roadmap included in the October 2021 Year 1 report. Consistent with the Evidence Act and CEP, ACDEB did not address possible improvements through the use of private sector data for national statistics.
ACDEB discussed the Biden Administration's priorities, including presidential memoranda and executive orders that impact the panel's work (The White House, 2021a,b). The Executive Order on Advancing Racial Equity established the Equitable Data Working Group, an interagency group co-chaired by the U.S. chief statistician and the U.S. chief technology officer, to identify "inadequacies in existing Federal data collection programs, policies, and infrastructure across agencies, and strategies for addressing any inadequacies identified" (The White House, 2021b). The Equitable Data Working Group14 released its recommendations in a report issued on April 22nd, 2022 (The White House, 2022b). The second report in this series, The Implications of Using Multiple Data Sources for Major Survey Programs, will discuss data equity issues.15
13 Excerpted from full bill text: https://www.congress.gov/bill/117th-congress/house-bill/3133/text
14 The Equitable Data Working Group's activities and recommendations were reviewed by the panel but did not contribute to this report.
While the federal statistical system is still heavily dependent on surveys and censuses, the survey-centric paradigm faces major challenges associated with declining response rates, rising costs, and the inability of budgets to keep pace with increasing data demands. Administrative data sources have been leveraged by some statistical programs, especially business statistics programs such as the Economic Census and BLS’s Quarterly Census of Employment and Wages program.
The examples of statistical agencies using blended data—both those described in this chapter and those discussed during the National Academies' December 2021 workshops on The Scope, Components, and Key Characteristics of a 21st Century Data Infrastructure—have shown promising results. However, broader use is often limited by statute, complicated negotiations and contract mechanisms, lack of methods or expertise, or unwillingness to share data. It is indisputable that the opportunities are prodigious. While the Evidence Act provided statistical agencies with a broader statutory basis for accessing and using data assets of federal nonstatistical agencies (U.S. Congress, 2019, Section 3581), more than two years after its enactment, no major advance has been achieved.
The United States needs a new 21st century national data infrastructure that blends data from multiple sources to improve the quality, timeliness, granularity, and usefulness of national statistics, facilitates more rigorous social and economic research, and supports evidence-based policymaking and program evaluations. (Conclusion 2-1)
State, local, territory, and tribal governments also have data assets that could benefit the federal statistical system. CEP recognized the importance of state administrative data, but its state-related recommendations were not enacted. Consequently, statistical agency access to many state and local data assets is limited by statute; in some cases, access is constrained by the lack of resources and expertise of state and local governments. ACDEB has discussed at length the importance of state and local government data assets for the federal data ecosystem. In the panel's opinion, state and local government data assets are an essential component of a new national data infrastructure.
15 For a video recording of the May 16th and May 18th workshops, see: https://www.nationalacademies.org/event/05-16-2022/the-implications-of-using-multiple-data-sources-for-major-survey-programs-workshop
A key portion of the National Academies' December 2021 workshops on The Scope, Components, and Key Characteristics of a 21st Century Data Infrastructure focused on statistical agencies' uses of private sector data for official statistics and research. CEP, the Evidence Act, and ACDEB deliberations and recommendations do not identify private sector data as a key component of a new national data infrastructure. Workshop participants noted that using private sector data for national purposes could greatly improve the quality, timeliness, and granularity of national statistics, as well as improve knowledge of groups that are not well represented in existing surveys. Private sector data can also support important scientific discoveries by facilitating rigorous research, and recent expert groups have recommended the inclusion of private sector data. In the panel's judgment, private sector data assets are an essential component of a new data infrastructure. Tapping private data sources poses challenges but also provides unique opportunities to improve national statistics by leveraging existing information and blending it with other data sources.
This chapter has illustrated that blending multiple data sources to produce new statistics is a growing practice of the federal statistical system, one that needs support to expand further. The data assets available for blending in a new data infrastructure include those held by the federal statistical, program, and administrative agencies; state, tribal, territory, and local governments; private sector companies; nonprofit and academic institutions; and crowdsourced and citizen-science data. At this time, however, the United States has no cohesive, coordinated plan to ensure that novel, blended data become an essential and growing source of public information and research. The current state of the national data infrastructure prevents full realization of the promise of blended data. Data acquisition, access, and use are siloed, inefficient, and largely uncoordinated. Laws and regulations remain major obstacles to accessing and using federal statistical, program, and administrative data as well as state, local, and tribal government data. Private sector data use is bespoke and often costly, with no inherent sustainability. Most data holders have no incentives to contribute or share their data for the common good. Privacy-protecting behaviors of data holders are highly variable and largely unregulated, and there is little transparency and accountability for private sector data use.
In the panel’s judgment, the current national data infrastructure is ill-equipped to meet the data needs of the 21st century. The reliance on the federal statistical system on statistical surveys as a primary data source is unsustainable. To meet the demands for credible, trustworthy, and timely statistical information, the United States needs a new data infrastructure that facilitates the increased blending of data from multiple sources.