National Academies Press: OpenBook

Toward a 21st Century National Data Infrastructure: Mobilizing Information for the Common Good (2023)

Chapter: 4 Blended Data: Implications for a New National Data Infrastructure and Its Organization

Suggested Citation:"4 Blended Data: Implications for a New National Data Infrastructure and Its Organization." National Academies of Sciences, Engineering, and Medicine. 2023. Toward a 21st Century National Data Infrastructure: Mobilizing Information for the Common Good. Washington, DC: The National Academies Press. doi: 10.17226/26688.

4

Blended Data: Implications for a New National Data Infrastructure and Its Organization

In the panel’s vision, a 21st century national data infrastructure should support blending data from multiple sources to provide accurate, timely, and relevant information. Blended data occur when at least two different data assets are combined to produce statistical information. This chapter describes diverse data assets that can be combined for statistical purposes; the criteria that govern data acquisition, access, and use; and the implications of blended data for the components and capabilities of a 21st century national data infrastructure, as well as the associated privacy and ethical challenges. The chapter ends with a consideration of various organizational structures that may facilitate cross-sector data access and use.

KEY DATA HOLDERS FOR A 21ST CENTURY NATIONAL DATA INFRASTRUCTURE

As described in Chapter 2, statistical agencies are already blending data from multiple sources, consistent with the recommendations of several expert reports. This section describes the scope of data assets that the panel recommends including in a new infrastructure, as well as the holders of those data assets. Data holders include federal statistical agencies; federal program and administrative agencies; state, local, tribal, and territory governments; private sector companies including data brokers; nonprofits and academic institutions; and crowdsourced and citizen-science data holders.

Before describing data holders, a few words on data subjects. Respecting the presence and rights of data subjects (the people, entities, or organizations described by the data) is essential to building widespread trust in a new data infrastructure. Distinct types of data subjects require distinct kinds of considerations. First, data assets may relate to, describe, or be associated with an identified or identifiable individual, consumer, or household. Second, data may relate to, describe, or be associated with an identifiable business, including corporations, partnerships, limited liability companies, and sole proprietorships. Third, data assets may describe a physical structure including residential (single-family and multifamily), nonresidential, private, nonprofit, or government-owned structures. Fourth, data may relate to a specific process, system, or application. Web survey paradata (i.e., data that document the measurement process) are an example of process-related data that could also potentially identify a data subject. For example, an agency’s paradata could relate to an individual, household, or business; to a statistical agency’s information about a portal, a survey, a web instrument, a question, a survey-specific item, or the device used by a respondent; or to any combination of the above. The concerns, interests, and special considerations needed to account for data subjects are covered in subsequent sections.

The data holders listed below have data assets relevant to the panel’s vision of a new data infrastructure. The panel acknowledges that, individually, all data assets are likely to have weaknesses, but a careful blending of data from multiple complementary sources, such as statistical surveys and censuses, administrative agencies, and private sector enterprises, can mitigate the weaknesses of any single data source. Blending these multiple data sources offers new opportunities to generate more timely, granular, and useful statistics for the common good.

Principal Federal Statistical Agencies and Units

In the panel’s vision of a new data infrastructure, the existing and future data assets of designated statistical agencies and units (as shown previously in Box 3-3) should be available for blending, subject to strong privacy protections and ethical considerations, with other data. Data assets would include identifiable and privacy-protected data files, metadata, and paradata “created by, collected by, under the control or direction of, or maintained” by the 13 principal statistical agencies and designated units (U.S. Congress, 2019).

The traditional data assets of statistical agencies have many desirable attributes. These data were designed to serve national statistical informational needs. They are derived from high-quality registers and sampling frames, have strong coverage properties, measure many attributes of the respondents, and use statistical methods to generate high-quality estimates. They have existing legal underpinnings vetted by the U.S. Congress and the executive branch. However, as noted in Chapter 2, statistical surveys and censuses are slow to produce information, are expensive, and suffer from declining participation.

Statistical agency data assets are also structured with a well-defined data model (i.e., the locations of attributes within data records are documented and described). For the purpose of the panel’s vision, associated metadata should describe the data source, format, variables, data elements, questions, processes, methods, limitations, and more. Statistical agency data assets also include semi-structured paradata files (that describe details of the measurement process) and possibly unstructured data, such as text descriptions of property attributes in land property descriptions. Access to and use of statistical agency microdata files is generally restricted by law or regulation; privacy and confidentiality are legislatively protected (U.S. Congress, 1974, 2019).

Section 3582 of the Foundations for Evidence-Based Policymaking Act of 2018 (hereafter, Evidence Act) directs statistical agencies to share restricted, secure data assets with other statistical and nonstatistical agencies for purposes of evidence building unless restricted by law (U.S. Congress, 2019). The ability to share restricted data assets among statistical agencies represents an important new opportunity to improve and transform statistical programs and operations. The Evidence Act’s sharing presumption of “yes, unless”1 is an advance over prior regulation. However, as noted by Katherine Wallman, workshop participant, the “unless” can still be a huge obstacle.

Statistical agency data assets, of course, are not without problems. A lack of complete records, ambiguous questions, or poor recall among data subjects can contribute to underlying issues. In addition, self-identification and changes over time in the meanings of race and/or ethnicity,2 gender, and industry or occupation can contribute to measurement challenges.

Federal Programs and Administrative Agencies

Program-based federal agencies (e.g., the U.S. Department of Agriculture’s Supplemental Nutrition Assistance Program) possess data assets that could complement and extend existing statistical programs, generate new products, and expand data assets available to researchers. These data are often termed administrative data. Blending administrative data with data collected by statistical agencies is an active area of innovation in federal statistical agencies and in the research community.

The U.S. Census Bureau has been using federal tax data in the quinquennial economic censuses program since the mid-1950s, and in building

___________________

1 If a statistical agency seeks a federal data asset for statistical purposes, the requested data must be provided unless specifically prohibited by law.

2 For discussions of measurement implications, see Prewitt (2013) and Alba (2020).

its business register since the early 1970s. The National Academies of Sciences, Engineering, and Medicine discussed in detail the benefits and challenges of using government administrative data for federal statistics, and described the use of administrative data in other countries (the National Academies of Sciences, Engineering, and Medicine, 2017b, Ch. 3). That report noted the impressive added value of blending survey data with government-program data.

Without imposing additional reporting burden, government administrative data have been used to update business and address frames and to provide universe statistics, such as nonemployer statistics. These data have been used for editing and imputation of survey responses or missing items, as a source of auxiliary information in statistical models, for survey evaluation, and to guide data-collection efforts in the conduct of surveys and censuses. The 2020 decennial United States census used administrative data to guide the number of nonresponse follow-up contacts, to inform proxy responses for nonrespondents, and to check data quality. However, administrative data also have limitations—lack of quality control or quality measures, coverage limitations, missing records, concepts or definitions that may differ between statistical surveys, lack of timeliness, and high processing costs (Liao et al., 2020). Linking survey and administrative data sources can help identify such problems and lead to improvements in administrative data sources.

By regulation, the U.S. Office of Management and Budget (OMB) requires agencies to look for alternative data sources before conducting a new survey.3 However, despite the spirit of the Evidence Act’s directive to expand use of existing data for statistical purposes, the Act does not override current statutory prohibitions regarding sharing. For example, the Internal Revenue Code 6103(j) regulations permit the U.S. Census Bureau to use tax data for a limited, specified set of purposes but do not permit the Bureau to share tax data or survey data commingled with tax information with the Bureau of Labor Statistics (BLS) or the Bureau of Economic Analysis (BEA; U.S. Congress, 2022b), even though the 2002 Confidential Information Protection and Statistical Efficiency Act (CIPSEA; U.S. Congress, 2002b) permitted the sharing of business data among the three agencies. Consequently, the U.S. Census Bureau and BLS maintain separate business registers that are not reconciled, which complicates the blending of data assets and products across the agencies.

Data synchronization legislation was drafted to revise the Internal Revenue Service (IRS) regulation so that the U.S. Census Bureau could share limited business tax data with BLS and BEA. This legislation, while

___________________

3 See 5 C.F.R.: 1320, https://www.govinfo.gov/app/details/CFR-2016-title5-vol3/CFR-2016-title5-vol3-part1320

proposed multiple times since 2002, has never been enacted,4 forgoing a major opportunity to improve economic statistics and cut costs (American Economic Association, 2021).

In 2017, the Commission on Evidence-Based Policymaking (CEP) recognized the importance of administrative data as an additional data source for evidence-building (Commission on Evidence-Based Policymaking, 2017). Box 4-1 provides a “cradle-to-grave” listing of selected CEP administrative data sources, with the Small Business Administration’s (SBA’s) Paycheck Protection Program5 added.

The Evidence Act addressed a major barrier to data access by providing statistical agencies with a broader statutory basis for accessing and using data assets of other federal agencies (U.S. Congress, 2019, Section 3581)

___________________

4 Legislation was pushed in 2014; see the Federal Register notice proposing a rule change to 6103(j)(1)(A): https://www.federalregister.gov/documents/2014/07/15/2014-16597/disclosures-of-return-information-reflected-on-returns-to-officers-and-employees-of-the-department

5 For more information on the Paycheck Protection Program, see: https://www.sba.gov/funding-programs/loans/covid-19-relief-options/paycheck-protection-program/ppp-data

unless prohibited by statute. Yet, most of these administrative data assets remain untapped for use beyond their home agencies, and they are often unavailable by statute for statistical uses or research. This contrasts starkly with Statistics Canada, which has adopted an “administrative data first” policy (Statistics Canada, 2015).

State, Tribal, Territory, and Local Governments

State, tribal, territory, and local governments possess data assets that could help produce blended national statistics by facilitating more granular, sub-national statistics, thus enriching localities’ understanding of their social and economic conditions. For example, local governments and cities are using data to make smarter, more informed decisions.6 In the panel’s vision, a new data infrastructure should include such state, tribal, territory, and local government data assets, creating blended statistics of greater value.

Provision of funding to states, tribal lands, local governments, and territories could incentivize such sharing by helping these data holders to use information, establishing two-way data sharing, and thus adding value for local decisionmaking (Moyer, 2021). Capacity building at the state and local levels, as suggested by the Advisory Committee on Data for Evidence Building (ACDEB), could make a significant impact on data quality, with benefits to both administration of state and local programs and the quality of national statistics (Advisory Committee on Data for Evidence Building, 2021).

For example, BLS funds states to clean and share their unemployment insurance employer records so that BLS can compile these administrative data into the Quarterly Census of Employment and Wages (QCEW). BLS uses this series to construct its Business Register, the universe frame for its current surveys of businesses. The QCEW is a by-product of the federal-state unemployment insurance partnership, supplemented by two BLS surveys—the Multiple Worksite Report and the Annual Refiling Survey.7 To take another example, statistics on the national prison population use state correctional data sent to the Bureau of Justice Statistics (BJS) to provide descriptive statistics on the national correctional population. BJS is currently engaged in a significant effort to improve crime reporting by transitioning from the Uniform Crime Reports to the National Incident-Based Reporting System,8 but this initiative requires cooperation and capacity from local criminal justice organizations. At the time of this writing, there has been insufficient support to develop the high-quality

___________________

6 See: https://datasmart.ash.harvard.edu

7 See: https://www.bls.gov/respondents/mwr/ and https://www.bls.gov/respondents/ars/faqs.htm

8 For more information, see: https://www.fbi.gov/services/cjis/ucr/nibrs

data product that BJS envisioned. State resource and data-sharing challenges have significantly limited the impact of this important program (Moyer, 2021).

However, many valuable state, tribal, local, and territory government data are not accessible for statistical purposes. Like federal administrative data assets, many state-collected and state-maintained high-value administrative data assets, including those associated with federally funded programs, have statutory restrictions on access and use, and their holders are not funded to curate the data adequately. For example, unemployment insurance wages and claims records are not available to BLS (ILR School, 2021).

Similarly, BLS and the U.S. Census Bureau are prohibited from accessing the National Directory of New Hires, which contains person-level wage records compiled from all 50 states and the District of Columbia.9 Even though federal statistical agencies routinely request administrative data from states and localities, states and localities are under no legal obligation to provide those data, even if the states’ data collections are federally funded, given the federalized design of governments. There is no default that state-collected data from programs funded by the federal government must be shared back with federal agencies. Statutory changes are necessary to achieve this outcome. The Commission on Evidence-Based Policymaking’s report, The Promise of Evidence-Based Policymaking: Report of the Commission on Evidence-Based Policymaking, recognized the value of blending data from state administrative agencies to create blended statistical products: “The Congress and the President should enact statutory or other changes to ensure state-collected administrative data on quarterly earnings…be available for statistical purposes and through a single federal source” (Commission on Evidence-Based Policymaking, 2017, Recommendation 2-6, pp. 44–45). Further, “The President should direct Federal departments that acquire state-collected administrative data to make them available for statistical purposes” (Commission on Evidence-Based Policymaking, 2017, Recommendation 2-7, p. 45).

For the following discussions of a new data infrastructure, the panel assumes that these CEP recommendations will be realized (see Chapter 5). This report outlines a more comprehensive vision of a new data infrastructure built upon the foundation of CEP’s proposals. The panel concludes, like CEP, the Markle Foundation, and earlier National Academies’ Committee on National Statistics reports, that a new data infrastructure should include state, tribal, territory, and local government data assets, creating blended statistics of greater value.

___________________

9 The directory is compiled by the Office of Child Support Enforcement in the U.S. Department of Health and Human Services. The data are used for enforcement purposes, as well as for specific program integrity, implementation, and research programs.

Private Sector Enterprises

Over the past few years, the growth of digital private sector data has vastly outpaced the growth of federal statistical agency data. In its December 2021 workshops on The Scope, Components, and Key Characteristics of a 21st Century Data Infrastructure, the panel considered opportunities, lessons learned, and challenges associated with using private sector data and blending private data with survey and administrative data. In an earlier report, the National Academies recommended that “Federal statistical agencies should systematically review their statistical portfolios and evaluate the potential benefits of using private sector data sources” (the National Academies, 2017a, p. 64). Similarly, the Markle Foundation report (see Chapter 2) recommends leveraging new data, including private sector data (Markle Foundation, 2021).

As discussed in Chapter 2, 12 out of the 13 designated federal statistical agencies are using private sector data, and these uses can be expected to increase; for example, BEA reported the use of 142 different private sector data assets (Reamer, 2021). BEA’s Health Satellite Account blends survey data from the Medical Expenditure Panel Survey with data from a private insurance company and Medicare claims data (Bohman, 2021). As another example, the U.S. National Survey of Early Care and Education conducted by the Administration for Children and Families measures the availability and use of childcare facilities. Real estate and property tax data from Zillow were used to enhance the quality of this traditional sample survey of households and providers (Datta et al., 2020).

As the panel learned in its December 2021 workshops, outside the United States, similar work is occurring at Statistics Netherlands, the U.K. Office for National Statistics, and Statistics Canada. Statistics Canada is requesting weekly store-level point-of-sale data from selected retail industries to improve consumer price indexes (Statistics Canada, 2021). These initiatives both improve economic statistics and provide the basis for statistics of value to private firms. In other cases, Statistics Canada, like U.S. statistical agencies, pays private companies for access to data. Statistics Netherlands’ Statistics Act includes broad powers requiring companies to share data with the Central Bureau of Statistics.10

At its 2022 conference, the American Economic Association Committee on Statistics (AEAStat) recognized the potential benefits of using high-frequency private sector data to “modernize official statistics.”11 AEAStat

___________________

10 For more information on Netherlands’ Central Bureau of Statistics, see: https://www.cbs.nl/en-gb/about-us/organisation

11 A video of the session can be found here: https://www.aeaweb.org/conference/2022/aea-session-recordings/player?meetingId=732&recordingId=1236&VideoSearch%5Bpage%5D=0

proposes that statistical agencies “connect” to data behind companies’ firewalls using software that accesses the companies’ data lakes and provides aggregated statistics, not microdata, to the statistical agency or a third-party data repository. AEAStat has proposed a demonstration project involving 5–10 large retailers, to test the idea’s feasibility.

On February 21, 2022, the European Commission published a call for evidence on the initiative “European Statistical System – making it fit for the future.”12 The document requests feedback regarding the proposal to make new data sources available for official statistics and statistical purposes. The proposal would extend the provisions of the recently proposed Data Act (European Commission, 2022). Among the provisions of the Data Act, if eventually enacted, is a requirement for compulsory business-to-government data sharing for official statistics.

For all the promise of commercial data, private sector data are not without limitations. Like administrative data, private sector data are collected for a purpose different from that of data for use in a national data infrastructure. Business interests often preclude companies from capturing data about everyone, which introduces notable biases and equity challenges in the data. The data items companies capture do not always enable easy linkage or follow the standards common among federal data users, and they may lack adequate documentation or metadata, all of which complicates replicability. Moreover, multinational corporations must abide by the laws of each involved nation, many of which prevent the provision of residents’ or citizens’ data to foreign governments without explicit consent. Another challenge of relying on data from private firms is that changes in firm strategy, management, or ownership can disrupt data sharing, either because of changes to the collected data or changes in the willingness of firms to share data. In the panel’s view, these technical, jurisdictional, organizational, and equity challenges must be evaluated and considered before acquiring private sector data assets. Chapter 2 described the limitations of private sector data highlighted during the panel’s December workshops. Still, private sector data can fill gaps in other data sources, and they remain an untapped asset.

As mentioned earlier, CEP and the Evidence Act are silent regarding accessing and using private sector data. Benefits from a blended approach would include a more comprehensive understanding of important national conditions, smoother trends, more reasonable estimates, and more data granularity. The panel concluded, as did the earlier evaluations mentioned above, that private sector data assets offer opportunities to improve national statistics and support more rigorous social and economic research.

___________________

12 See: https://ec.europa.eu/info/law/better-regulation/have-your-say/initiatives/13332-European-Statistical-System-making-it-fit-for-the-future_en

Data Brokers

Data brokers collect, buy, aggregate, and sell data on individuals and companies for profit. Data brokers collect information from a wide variety of public records, including arrest records, marriage licenses, property records, building permits, and digital sources like cookies, browser fingerprinting, web beacons, and IP address tracking (Melendez and Pasternack, 2019). Data brokers also purchase and resell customer data from other companies, notably during bankruptcy auctions.

Combined information is sold for a variety of purposes, including verifying identity, marketing products and services, building consumer profiles, and detecting fraud. Data brokers sell information to a variety of customers. Some data brokers, like CoreLogic, blend diverse data sources to develop innovative products. CoreLogic blends collected data from 5.5 billion property records; more than a billion visual records, including aerial photos, home tours, and interactive floor plans; and several hundred analytical models that extrapolate raw data into an entire portfolio of products that CoreLogic sells to companies and government agencies, including statistical agencies.13 Experian, TransUnion, and Equifax assemble data from consumers’ credit-related actions and provide reports to individuals, as well as to other businesses for advertising and marketing purposes (Irby, 2022).

Data brokers, however, rarely interact directly with consumers. Consumers generally do not provide their express permission or consent for data brokers to use their data, and consumers are often unaware of the existence or practices of such brokers. Data-broker data are notoriously rife with errors that consumers cannot correct. Data brokers are almost entirely unregulated, with no federal laws regulating businesses that buy and sell personal information. Only two states, Vermont and California, have enacted data-broker laws (Wilkie et al., n.d.).

While data brokers are private sector entities, they are discussed separately in this report because their business model and incentives for sharing data set them apart. The use of data brokers also introduces more complicated consent and ethics issues, quality challenges, and the possibility of future state and federal regulations.

Data from brokers rarely link easily with agency data, and these vendors frequently overpromise their products (Studds, 2021). One advantage of data from brokers is that brokers will have already aligned the data models from diverse public and private sources to accomplish their business purposes; however, the transformations required for alignment with statistical agencies’ data may degrade data quality and not reflect the

___________________

13 For more information, see: https://www.corelogic.com/why-corelogic/

priorities of those agencies. This can raise important issues of data equity, as people with few financial resources may disappear or be aggregated in misleading ways that make sense for an individual data user but that would be fundamentally at odds with the goals of a statistical agency. Requesting sample or proof-of-concept data from data brokers can help to overcome some risks, as can requesting raw data rather than derived data products. Studds (2021) emphasized the importance of statistical agencies maintaining good relationships with third-party data holders, to fulfill the legal obligations of the statistical agency.

Data-broker data assets and the many issues they raise warrant careful evaluation before inclusion in a new data infrastructure.

Nonprofit and Academic Institutions

The data resources produced and made available by U.S. universities and research institutions are key components of the current national data infrastructure. Academic institutions create and provide access to valuable data assets, such as the Panel Study of Income Dynamics,14 the General Social Survey,15 the Health and Retirement Study,16 the National Longitudinal Study of Adolescent to Adult Health,17 the Survey of Consumers,18 and the American National Election Studies.19 Some other nonprofit research groups, like the Pew Research Center, also conduct surveys that are placed in the public domain. Many of these surveys are national in scope and provide important aggregate indicators of key characteristics of the population. Each survey also provides microdata to the statistical and research communities, either directly or through the Inter-university Consortium for Political and Social Research (ICPSR)20 or the Roper Center.

Many of these university-based studies have been innovators in the use of blended data, through activities including linking to administrative data; collecting and integrating biological, streaming, audio, visual, and video data; and working with private data holders. In several cases, collaborations between university-based studies and statistical agencies have facilitated the creation of more detailed and timely data covering subjects not well served by the existing infrastructure of statistical agencies.

___________________

14 See: https://psidonline.isr.umich.edu/

15 See: https://gss.norc.org/

16 See: https://hrs.isr.umich.edu/welcome-health-and-retirement-study

17 See: https://addhealth.cpc.unc.edu/

18 See: https://data.sca.isr.umich.edu/

19 See: https://electionstudies.org/about-us/

20 ICPSR disseminates the Family Self-Sufficiency Program, Add Health, Survey of Consumers, and Monitoring the Future data, in some cases in addition to dissemination by the data producer. See: https://www.icpsr.umich.edu/

Some nonprofit organizations provide direct public access to their data, including the leadership and financial features of their organizations, as in Guidestar.org.21 The current national statistical infrastructure benefits from these collaborations, which can sometimes make use of the greater flexibility outside the federal government, tap into academic talent, and access alternate sources of funding.

The promise of these data could be enhanced through increased sharing with federal and state statistical and administrative resources when permitted. Both their value as research vehicles and their worth as national informational resources could be improved by blending them with the nation’s other data resources. Ongoing and strengthened collaborations with academic organizations also provide an important source of continued innovation (Jarmin, 2019).

Crowdsourced or Citizen-Science Data Holders

Crowdsourced or volunteered data, purposefully collected and assembled by the public to support information assets, have emerged as an increasingly significant source of data that can also be used to guide official decisionmaking. Powerful data-collection devices like cell phones, sensors, and other components of the “internet of things” are nearly ubiquitous in the modern landscape, enabling individuals, businesses, governments, and civil society to collaborate around data sourcing from the crowd. Similarly, the increasing occurrence of citizen-science projects like the COVID Tracking Project22 across a range of social, economic, and environmental applications offers not only new data sources but also data interpretation, validation, corroboration, and other actions that provide a rich new source for blending data. Coupled with new communications technologies and social media, these advances now reach even more people through internet connectivity, and open platforms exist not only for using data but also for creating it. Crowdsourced geospatial data are one prominent example, given the GPS features of cell phones. Public participation in spatial data creation through open mapping, such as the U.S. Geological Survey’s The National Map,23 has empowered citizens to provide knowledge and context to open-map resources (Goodchild, 2007). Crowdsourced data offer the opportunity to “collectively produce finer-grained and more expansive data sets over regional and global scales and collect data more frequently, covering long temporal extents” compared to alternative methods (Fischer et al., 2021, p. 2).

___________________

21 See: GuideStar nonprofit reports and Forms 990 for donors, grantmakers, and businesses: https://www.guidestar.org/

22 See: https://covidtracking.com/

23 See: https://www.usgs.gov/programs/national-geospatial-program/national-map

Statistical inquiries about blending crowdsourced data with other data are now ongoing. For example, Buil-Gil et al. (2020) review the use of small-area estimation techniques to improve crowdsourced data about public safety in London. Many of these crowdsourced data do not rely on standard statistical geographic designations, such as counties. Blending geospatial data with statistical survey data and other data sources is complex24 and warrants additional attention and research.

While these technological advances have grown the amount of available data immensely, concerns about crowdsourced data and information derived from citizen science parallel the limitations of commercial data. As with private sector data, nonstandard collection realities mean that the blending of crowdsourced data into official government databases systematically suffers from a lack of trust in data quality, uncertainty about adherence to standards, and frequent incompleteness. Crowdsourced data are not always produced with an awareness of the demands required to successfully implement evidence-based decisionmaking in government agencies. However, in the panel’s opinion, public engagement through purposeful crowdsourcing could represent an opportunity for greater awareness and appreciation of a new data infrastructure or the use of data as evidence. Moreover, since many citizen science and crowdsourcing initiatives stem from a desire to support the public interest, working with these communities to co-construct standards and measure limitations may be especially productive. The COVID Tracking Project is a good example of how volunteers could identify limitations and data lags in state-reported data and provide states with timely, useful feedback.

Box 4-2 lists the data holders whose data should be available for possible inclusion in a new data infrastructure.

The Evidence Act, once fully implemented, will make the federal statistical agency and federal program and administrative data assets available to the data infrastructure, when not prohibited by law. Nonprofit tax data are already available: the IRS requires all U.S. tax-exempt nonprofits to make public their three most recent annual IRS Form 990s. Access to academic institutions’ data assets of interest to the data infrastructure will occur contractually. Crowdsourced data assets are also publicly available and accessible. However, in the panel’s judgment, a new data infrastructure will not realize the promise of improved blended statistics until the data assets held by the state, tribal, territory, and local governments and the private sector are included.

___________________

24 See: United Nations Guide to Data Integration for Official Statistics: https://statswiki.unece.org/display/DI/Guide+to+Data+Integration+for+Official+Statistics

Data from federal, state, tribal, territory, and local governments; the private sector; nonprofits and academic institutions; and crowdsourced and citizen-science data holders are crucial components of a 21st century national data infrastructure. (Conclusion 4-1)

In the panel’s ideal vision, easily accessible, comprehensive catalogs of data assets would be a key feature of a new data infrastructure. Catalogs would be easily searchable so that the public, data subjects, data holders, data users, researchers, and key stakeholders would be informed of the extent of the infrastructure. The searchable catalogs or inventories would contain metadata describing the contents of the data assets, the provenance of the data, any known limitations to the data, and which data subjects are implicated. While some progress has been made, in the panel’s opinion much more needs to be done to make data assets discoverable, accessible, and usable.

Some data inventories or catalogs of diverse data holdings do exist at the federal level. The National Archives and Records Administration (NARA), in its role as the official archive for many federal statistics, has developed the National Archives Catalog for users to search and access its collection. The catalog “searches across multiple National Archives resources at once, including archival descriptions, digitized and electronic records, authority records, and web pages from Archives.gov and the Presidential Libraries” (National Archives, 2021). Without such a catalog and related query system, accessing NARA records could not be done efficiently. However, the catalog includes only those records formally sent to NARA by federal agencies and is, thus, incomplete. Similarly, ICPSR maintains an online, searchable catalog of justice-related data.25 In addition, most federal statistical agencies maintain online documentation of their data assets. Many of the restricted data assets of the federal statistical system are now described with standardized metadata in a single application portal.26

A comprehensive inventory of federal-government data assets is expected soon. The Evidence Act (U.S. Congress, 2019, Section 3511) requires every federal agency to develop and maintain comprehensive data inventories of all their data assets. The inventory will include the name, the metadata (including all variable names and definitions), the data owner, any restrictions on the use of the data asset, and criteria used in determining why a data asset is not publicly available. The Federal Chief Data Officer (CDO) Council established a Data Inventory Working Group in March 2021 “to improve the efficiency and effectiveness of federal data inventories” (Federal CDO Council, n.d., “Goals”). The Evidence Act also requires the Administrator of General Services to maintain a single public online interface, called the Federal Data Catalogue. At the time of this writing, some of these Evidence Act mandates have not yet been implemented.
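As an illustration only, the inventory fields listed above (name, variable-level metadata, data owner, use restrictions, and the reason an asset is not public) could be represented along the following lines. This is a minimal sketch, not the Evidence Act's actual schema: the class name `InventoryEntry`, the `search` helper, and all sample entries are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class InventoryEntry:
    """One data asset in a hypothetical agency inventory (illustrative fields)."""
    name: str
    owner: str
    variables: dict                      # variable name -> definition
    use_restrictions: list = field(default_factory=list)
    public: bool = True
    nonpublic_reason: Optional[str] = None

    def __post_init__(self) -> None:
        # A non-public asset must record the criteria for withholding it.
        if not self.public and not self.nonpublic_reason:
            raise ValueError(f"{self.name}: non-public asset needs a stated reason")


def search(inventory, term):
    """Return names of assets whose name, variable names, or definitions mention term."""
    needle = term.lower()
    hits = []
    for entry in inventory:
        haystack = [entry.name, *entry.variables.keys(), *entry.variables.values()]
        if any(needle in text.lower() for text in haystack):
            hits.append(entry.name)
    return hits


# Made-up entries for demonstration.
inventory = [
    InventoryEntry(
        name="Example Retail Survey",
        owner="Example Statistical Agency",
        variables={"sales_usd": "Monthly retail sales in U.S. dollars"},
    ),
    InventoryEntry(
        name="Example Program Records",
        owner="Example Program Agency",
        variables={"benefit_amt": "Monthly benefit amount paid"},
        use_restrictions=["statistical purposes only"],
        public=False,
        nonpublic_reason="contains personally identifiable information",
    ),
]

print(search(inventory, "retail"))
```

Requiring a stated reason for every non-public asset mirrors the inventory requirement described above; a production catalog would also need controlled vocabularies and access controls, which this sketch omits.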

Currently there is also no central inventory of nonfederal-government data assets. In the panel’s opinion, the absence of such a catalog will hamper data discovery and forgo opportunities for expanded data access and use.

WHICH DATA SHOULD BE INCLUDED?

Not all existing data, even if accessible, will serve the country’s important information needs. Some data contain information crucial to the future understanding of the welfare of society (e.g., health, employment); others may be less important (e.g., length of major league baseball games). Which of the key data-holding groups should be part of a new data infrastructure, and which of their held data should be prioritized in a new infrastructure? In this section, a set of criteria is proffered that the panel considers potentially useful for choosing which data to include in a new data infrastructure.

Fitness-for-Use to Produce Key Information for the Country

The priority might be data assets measuring social and economic attributes of widespread research and policymaking relevance. Some of these data are already collected but very imperfectly (e.g., prices and quantities of retail goods sold). Some of these data have been studied in one-time research projects but have never been systematically reported on a national basis (e.g., alternative measures of the gig economy, Abraham et al., 2021).

___________________

25 See: https://www.icpsr.umich.edu/web/pages/ICPSR/index.html

26 See: www.ResearchDataGov.org

Others reveal gaps between commercial and public-sector frames (e.g., real estate development data27 versus census geography frames).

Data assets should be uniquely suited and fit for the intended use. That is, in contrast to assembling all data of potential interest, pre-specified questions of critical importance must precede data access. Data useful to answer those questions should be given priority, in the panel’s opinion. Accessible tools, aligned incentives, and broad applicability of data provide conditions for data fit for use (Bohman, 2021). Necessity demands that the data asset generate statistical information and tangible benefits to the public, data users, and data holders.

Determining the fitness-for-use of a given data source requires clear articulation of quality standards for specific uses. In 2017, the Federal Committee on Statistical Methodology (FCSM) established a Working Group on Transparent Quality Reporting in the Integration of Multiple Data Sources, to identify best practices associated with data-quality measurement and reporting for blended data products. This work was motivated by the increasing use of alternative and blended data by statistical agencies and by the Committee on National Statistics’ Panel on Improving Federal Statistics for Policy and Social Science Research Using Multiple Data Sources and State-of-the-Art Estimation, described in Chapter 2.

A National Academies’ report reviewed the long-established quality frameworks for survey data (as in Groves and Lyberg, 2010) and recommended statistical agencies adopt a broader data-quality framework while concluding that “Commonly used existing metrics for reporting survey quality may fall short in providing sufficient information for evaluating survey quality” (the National Academies, 2017b, p. 114). The report also pointed out the importance of focusing more attention on the tradeoffs among various quality dimensions, such as trading precision for timeliness and granularity rather than focusing primarily on accuracy.

The FCSM Working Group, along with the Washington Statistical Society, responded to this suggestion by sponsoring several public workshops (Brown et al., 2018). The workshops were complemented by a report by Mathematica Policy Research, sponsored by the Statistics of Income Division at the Internal Revenue Service, that examined international quality frameworks. Mathematica found that countries nearly uniformly defined quality as fitness-for-use (Czajka and Stange, 2018). Also, internationally, data quality in each of the various quality frameworks is considered multidimensional (i.e., not just reflecting the quality of measurements but also the quality of representativeness of the population of interest).

The FCSM Working Group published A Framework for Data Quality in September 2020. The report provided a framework for identifying data

___________________

27 See: https://cherre.com/

quality for all data, recognizing the opportunities and challenges of new data sources, and noting the growing reliance on integrating data from multiple sources. The report defines quality as “the degree to which data captures the desired information using appropriate methodology in a manner that sustains public trust” (Federal Committee on Statistical Methodology, 2020, p. 6). The definition applies to all data, data products, and analytical products. It also applies to the entire data file as well as individual data elements. It applies to traditional methods as well as new and emerging methods such as artificial intelligence and machine learning. The definition also applies to diverse data sources as well as integrated data, also often referred to as blended, combined, or linked data.

The framework uses three broad domains and 11 data-quality dimensions as shown in Table 4-1.

This framework documents threats to data quality associated with each of the dimensions included in the framework. Identifying these threats is a necessary first step in “mitigation, managing trade-offs among them, and for reporting data quality” (Federal Committee on Statistical Methodology, 2020, p. 5). The report includes best practices to identify data-quality threats.

Data Minimized to Satisfy Pre-Specified Purposes

In the panel’s view, a new data infrastructure should not result in the unbridled harvesting of all digital data that exist in the country. Instead, the data-acquisition request (the records, data elements, data granularity, and frequency) should be limited to the information needed to satisfy the proposed statistical purpose. Statistics Canada has used such a framework for the intake of data according to necessity and proportionality criteria, where proportionality means that Statistics Canada takes no more than is needed and considers the sensitivity and confidentiality of the data (Bowlby, 2021).

A similar approach could prove prudent for the United States, in the panel’s opinion. A disciplined approach to a new data infrastructure implies that the information needs of the country (necessity) determine which data items are included. Besides focusing on minimizing the volume of data records and associated data elements acquired, the minimization principle implies a judgment regarding the level of detail and the frequency of access needed to satisfy the statistical purpose(s). A statistical purpose requiring the linking of statistical-agency data assets with data holders’ microdata will be more consequential than a statistical purpose that can be satisfied with aggregated data, such as the production of monthly retail sales. That said, conditions and needs can change over time. As such, flexible contracts that include multiple-year options and the potential for expanded coverage might prove logistically prudent during negotiations with private sector data holders (Bohman, 2021).
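The minimization principle can be sketched in code. The example below is illustrative only: the transaction records, field names, and helper functions are hypothetical, and it assumes the approved statistical purpose is monthly retail sales, so that only the date and amount of each transaction are needed and only monthly totals leave the data holder.

```python
from collections import defaultdict

# Hypothetical transaction records held by a private data holder.
transactions = [
    {"customer_id": "c1", "store_zip": "20001", "date": "2023-01-05", "amount": 20.0},
    {"customer_id": "c2", "store_zip": "20001", "date": "2023-01-17", "amount": 35.5},
    {"customer_id": "c1", "store_zip": "20002", "date": "2023-02-02", "amount": 10.0},
]

# Assumed purpose: monthly retail sales. Only these data elements are needed.
NEEDED_FIELDS = ("date", "amount")


def minimize(records, needed_fields):
    """Keep only the data elements required for the stated purpose."""
    return [{k: r[k] for k in needed_fields} for r in records]


def monthly_totals(records):
    """Aggregate to the coarsest level that still satisfies the purpose."""
    totals = defaultdict(float)
    for r in records:
        month = r["date"][:7]  # "YYYY-MM"
        totals[month] += r["amount"]
    return dict(totals)


shared = monthly_totals(minimize(transactions, NEEDED_FIELDS))
print(shared)  # {'2023-01': 55.5, '2023-02': 10.0}
```

A purpose that requires record linkage could not be served by this aggregate and would call for microdata, which is why the text treats such purposes as more consequential.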

TABLE 4-1 Dimensions of Data Quality

Domain | Dimension | Definition
UTILITY | Relevance | Relevance refers to whether the data product is targeted to meet current and prospective user needs.
UTILITY | Accessibility | Accessibility relates to the ease with which data users can obtain an agency’s products and documentation in forms and formats that are understandable to data users.
UTILITY | Timeliness | Timeliness is the length of time between the event or phenomenon the data describe and their availability.
UTILITY | Punctuality | Punctuality is measured as the time lag between the actual release of the data and the planned target date for data release.
UTILITY | Granularity | Granularity refers to the amount of disaggregation available for key data elements. Granularity can be expressed in units of time, level of geographic detail available, or the amount of detail available on any of many characteristics (e.g., demographic, socio-economic).
OBJECTIVITY | Accuracy and Reliability | Accuracy measures the closeness of an estimate from a data product to its true value. Reliability, a related concept, characterizes the consistency of results when the same phenomenon is measured or estimated more than once under similar conditions.
OBJECTIVITY | Coherence | Coherence is the ability of the data products to maintain common definitions, classifications, and methodological processes, to align with external statistical standards, and to maintain consistency and comparability with other relevant data.
INTEGRITY | Scientific Integrity | Scientific integrity refers to an environment that ensures adherence to scientific standards and use of established scientific methods to produce and disseminate objective data products and one that shields these products from inappropriate political influence.
INTEGRITY | Credibility | Credibility characterizes the confidence that users place in data products based simply on the qualifications and past performance of the data producer.
INTEGRITY | Computer and physical security | Computer and physical security of data refers to the protection of information throughout the collection, production, analysis, and development process from unauthorized access or revision to ensure that the information is not compromised through corruption or falsification.
INTEGRITY | Confidentiality | Confidentiality refers to a quality or condition of information as an obligation not to disclose that information to an unauthorized party.

SOURCE: Federal Committee on Statistical Methodology, 2020, Table ES1, p. 4.
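Two of the dimensions in Table 4-1, timeliness and punctuality, are defined as time lags and can be computed directly. The sketch below is one possible reading of those definitions; the function names, the choice of days as the unit, and the dates are assumptions made for illustration.

```python
from datetime import date


def timeliness_days(reference_period_end: date, released: date) -> int:
    """Days between the period the data describe and their availability."""
    return (released - reference_period_end).days


def punctuality_days(planned_release: date, released: date) -> int:
    """Days between planned and actual release (positive means late)."""
    return (released - planned_release).days


# Hypothetical monthly indicator covering January 2023.
period_end = date(2023, 1, 31)
planned = date(2023, 2, 15)
actual = date(2023, 2, 17)

print(timeliness_days(period_end, actual))  # 17
print(punctuality_days(planned, actual))    # 2
```

The two measures are independent: a release can be punctual (on the planned date) and still have poor timeliness if the planned date is long after the reference period.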

Data Access and Use Respect Data Holders’ and Data Subjects’ Interests and Privacy

In the panel’s vision, a new data infrastructure’s requests to acquire, access, and use data assets for statistical purposes must respect the reputation and interests of data holders and data subjects. Private companies, for example, have minimal incentives to share data with statistical agencies, and often think the perceived risks (e.g., high costs, intrusive practices, reputation risks) outweigh any possible benefits (e.g., money or personalized statistics).

Creating devices like schemas, application programming interfaces (APIs), and legal arrangements can help reduce friction and risks to data owners, as Matthew Shapiro, a workshop participant, noted. In the panel’s view, leveraging state-of-the-art privacy and security tools is essential. Respecting data-holder interests and the privacy of data subjects can be further supported by carefully considering the level of detail needed, the frequency requested, and the way data are acquired and potentially linked. In the panel’s vision of a new data infrastructure, procedures should be in place to ensure data use is responsible, ethical, and equitable. This may include working with data holders to effectively communicate with data subjects and evaluate consent-related issues. Alternatively, this may involve ensuring that the data received from data holders cannot be re-identified. A new data infrastructure should actively engage data holders to develop a range of possible approaches that could help ensure responsible data exchange.

Prioritize Easily Acquired Data That Provide Tangible Benefits

While the most important criteria for inclusion of data in a new data infrastructure involve utility to the country’s informational needs, some data access may pose unusually complicated logistical challenges. Thus, in the panel’s judgment, the costs and efforts incurred by both the data holder and the statistical agency (or acquiring entity) should be proportionate to the anticipated public benefits associated with the proposed statistical purpose. Statistical agencies should have procedures in place to quantify data-holder and statistical-agency costs as well as anticipated public benefits. The use of passively collected data, for example, can reduce the burden on data holders and provide more timely statistics, particularly in the healthcare sector (Moyer, 2021). In the panel’s view, data assets should be acquired and used only if the associated costs and effort are not disproportionate to their benefits.

Access to data requires an investment of time and resources. Easily accessed data should be given priority, but using trial or sample data can help to identify potential challenges and expedite the evaluation of benefit (Stevens, 2021). Private data, according to Sarah Henry, a workshop participant, are not like “gold dust” but rather like sand, abundant and requiring significant effort to make them useful. Early evaluations should consider potential choke points in data collection, the sustainability of data, the representativeness of the data, and ways to correct for bias.

Available, Usable Metadata Is Essential for Statistical Purposes

Aggregation over records containing variable entries is meaningless unless one knows what each variable means. In surveys and censuses, the meaning of entries is specified before data collection. This becomes the basis of information about the data, the “metadata” of the dataset. If a data holder uses data for a single purpose, metadata may not be as valued as in research and statistical organizations. To be used for approved statistical purposes, these data must be described by metadata. For potentially valuable data assets lacking usable metadata, the metadata need to be developed and available to possible data users. A data infrastructure entity may collaborate with the data holder to develop the necessary documentation.

In the panel’s judgment, metadata are critical for blended data uses. To be responsibly discovered, combined, shared, used, and reused, data must be described. Limitations of data must also be readily accessible to ensure that biases in individual data assets do not ripple through any analysis. Metadata, using standard reusable schemas, permit the automation of data analysis, data transfer, and aggregation. In the panel’s vision, a new data infrastructure requires a comprehensive metadata repository with a user interface to facilitate and automate data discovery, sharing, use, processing, and protection. As noted by Ivan Deloach, workshop participant, metadata may be helpful when addressing the benefits and costs related to data quality and representativeness.
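To illustrate how machine-readable metadata can support automated blending, the sketch below checks two assets' variable-level metadata before they are combined. It is a hypothetical example: the metadata layout (a `unit` and a `definition` per variable), the function name `blending_issues`, and the sample assets are assumptions, not a standard schema.

```python
# Hypothetical variable-level metadata for two data assets.
survey_meta = {
    "income": {"unit": "USD/year", "definition": "Total household income"},
    "county": {"unit": "FIPS code", "definition": "County of residence"},
}
vendor_meta = {
    "income": {"unit": "USD/month", "definition": "Estimated household income"},
    "county": {"unit": "FIPS code", "definition": "County of residence"},
    "score": {},  # undocumented field
}


def blending_issues(meta_a, meta_b):
    """List reasons the two assets should not yet be combined automatically."""
    issues = []
    # Every variable needs a definition before it can be interpreted.
    for label, meta in (("a", meta_a), ("b", meta_b)):
        for var, desc in sorted(meta.items()):
            if not desc.get("definition"):
                issues.append(f"{label}.{var}: no definition")
    # Shared variables must be measured in the same unit.
    for var in sorted(set(meta_a) & set(meta_b)):
        unit_a, unit_b = meta_a[var].get("unit"), meta_b[var].get("unit")
        if unit_a != unit_b:
            issues.append(f"{var}: unit mismatch ({unit_a} vs {unit_b})")
    return issues


for issue in blending_issues(survey_meta, vendor_meta):
    print(issue)
```

Here the check flags the undocumented `score` field and the annual-versus-monthly income units, two problems that would otherwise ripple silently into any blended estimate.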

While statistical agencies have attempted to establish metadata standards within their organizations, there is no single standard for the federal statistical system. A recently released report from the National Academies identifies three categories of metadata: descriptive (facilitates discovery and identification), structural (describes how compound objects are put together), and administrative (information to help manage a resource) (the National Academies, 2022). The machine-learning community has proposed using the concept of datasheets for datasets as an approach for standardizing metadata (Gebru et al., 2021).

The panel does not take a stand on the desirability of a unique metadata standard. Instead, it envisions an infrastructure that can adapt to a variety of metadata structures, as well as evolve over time. In the panel’s vision, the minimal threshold of acceptability of a metadata approach is that it informs the blending of data from multiple sources with sufficient understanding of the meaning of data items and the limitations of each data asset.

Box 4-3 summarizes the criteria of data that, according to the panel’s vision, should be included in a 21st century national data infrastructure.

BLENDED DATA REQUIRE NEW STATISTICAL METHODS

There is much to learn from past and future efforts regarding the blending of multiple data sources to improve the quality of statistical information. Organizations are now discovering when blending diverse data assets can improve existing statistics, satisfy emerging or unmet data needs, decrease reporting burden, and address issues of quality, bias, and data equity. Combining diverse data sources also provides the opportunity to produce timelier, more granular, and higher-frequency statistics, as needed. For example, Antonio Chessa, a workshop participant, noted that using transaction data in the Consumer Price Index—which focuses on business-consumer transactions and increased granularity in time and item—can improve price-index methods and statistics with greater temporal and spatial detail.

Yet, in pursuit of these benefits, statistical agencies and researchers face increasing challenges as they move from using a single data source to incorporating secondary sources, such as administrative datasets. The challenges grow further when attempting to incorporate additional data assets from the private sector; state, tribal, and local governments; or crowdsourced or citizen-science sources. Workshop presenters noted that private-sector data rarely come in the desired form even though they are often useful. Some presenters suggested that a whole set of new methods, including statistical design, would be needed to ingest and integrate blended data assets on a large scale.

Fortunately, much work has already been done, both identifying the challenges of combining multiple data sources and suggesting approaches for combining them appropriately. A National Academies’ report (the National Academies, 2017b) summarized statistical methods that can be used for combining data from multiple sources, highlighting the following:

  • Record linkage: The report provides an extensive discussion of record linkage, providing numerous examples of agencies and programs that are using record linkage for research and for producing statistics.28 The examples illustrate the benefits of record linkage, but the report points out that record linkage is not a panacea. Linkage rates vary across studies and for subpopulations within studies. Record linkage usually requires that data for individual entities be available from the data sources, along with sufficient identifying information to allow records to be linked. If these criteria are not met, other methods are needed.
  • Multiple frame methods: A multiple-frame survey draws samples from two or more sampling frames, to improve coverage of the population or to decrease costs. Typically, multiple-frame surveys are associated with less privacy intrusion than record linkage, but it is important to understand possible differences in data-collection methods and the completeness of each frame. Alternative data sources could be used to construct supplemental frames at a lower cost, albeit at some loss in coverage.
  • Imputation-based methods: Another way to conceptualize combining different data sources may be by using a missing data framework and imputing the missing data based on statistical models. The report discusses statistical matching methods and software for this purpose, but staff expertise and continuous training are needed, particularly in evolving technologies common to modern computer science, including database, cryptography, privacy-preserving, and privacy-enhancing technologies (the National Academies, 2017b). The opportunities and challenges are discussed in detail.
  • Modeling techniques: In cases in which record linkage, imputation, or the multiple-frame approach is inappropriate, other statistical modeling can be used to combine aggregated statistics, or to combine them with individual-record data, when the data sources measure different variables (the National Academies, 2017b). The report discusses small-area estimation methods but suggests that new models for combining data need empirical testing and substantive justification.

___________________

28 An important consideration for a new data infrastructure is the attitudes of data subjects regarding linkage of their data, as presented in Fobia et al. (2020).
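The record-linkage approach summarized in the list above can be illustrated with a minimal sketch. It assumes a simple deterministic rule (exact agreement on a normalized name and a birth date), which is only one of many possible linkage strategies and is not taken from the report. The record layouts, field names, and the functions `normalize_name` and `link_records` are hypothetical. The sketch also reports a linkage rate, since the report stresses that linkage rates vary across studies.

```python
import re
import unicodedata


def normalize_name(name: str) -> str:
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    letters_only = re.sub(r"[^a-z ]", " ", ascii_only.lower())
    return " ".join(letters_only.split())


def link_records(source_a, source_b):
    """Link records that agree on normalized name and birth date.

    Returns (pairs, linkage_rate), where linkage_rate is the share of
    source_a records that found a unique match in source_b.
    """
    # Index the second source by the linkage key.
    index_b = {}
    for rec in source_b:
        key = (normalize_name(rec["name"]), rec["birth_date"])
        index_b.setdefault(key, []).append(rec["id"])

    pairs = []
    for rec in source_a:
        key = (normalize_name(rec["name"]), rec["birth_date"])
        candidates = index_b.get(key, [])
        # Link only when the match is unique; ambiguous keys stay unlinked.
        if len(candidates) == 1:
            pairs.append((rec["id"], candidates[0]))

    rate = len(pairs) / len(source_a) if source_a else 0.0
    return pairs, rate


# Hypothetical survey and administrative records (illustrative data only).
survey = [
    {"id": "s1", "name": "María  O'Neil", "birth_date": "1980-02-14"},
    {"id": "s2", "name": "John Smith", "birth_date": "1975-07-01"},
    {"id": "s3", "name": "Ana Lee", "birth_date": "1990-11-30"},
]
admin = [
    {"id": "a1", "name": "maria o neil", "birth_date": "1980-02-14"},
    {"id": "a2", "name": "JOHN SMITH", "birth_date": "1975-07-01"},
    {"id": "a3", "name": "Ann Lee", "birth_date": "1990-11-30"},
]

pairs, rate = link_records(survey, admin)
print(pairs, round(rate, 2))
```

In this toy example the third survey record goes unlinked because of a one-letter spelling difference, which is the kind of case that probabilistic or approximate matching is usually meant to address.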

The report recommends the following steps for the federal statistical system:

  • Systematically coordinate federal agencies’ efforts to blend multiple data sources;
  • Ensure that statistical agencies have the appropriate skills and expertise (the National Academies, 2017b); and
  • Encourage federal agencies to develop partnerships with academia and encourage external research organizations to develop methods needed for design and analysis using multiple data sources.

BLENDED DATA REQUIRE NEW STATISTICAL DESIGNS

As a new data infrastructure evolves, in the panel’s view no single data-sourcing strategy will be optimal for all informational needs. Each data source (e.g., surveys, administrative records, or private sector data) has some weakness regarding the population coverage, relevance of measurement, timeliness, and granularity of potential statistical aggregates. Past reviews (e.g., the National Academies, 2017a,b) have speculated that, after initial blended data estimates are available, a new period of statistical design might usefully take place, to find approaches that make use of the strengths of multiple sources.

For example, data on the access and delivery of health services exist from government agencies (e.g., Medicare and Medicaid) but also from household health surveys, hospital samples, electronic record platforms, and a variety of other sources. In the panel’s opinion, after new blended statistics using multiple data sources are built, survey designs could likely be optimized, reducing original survey measurement in populations that are well measured and increasing survey measurement in populations not well covered by the various administrative record systems. In some cases, this may yield a reduction in statistical agency budgets allocated to original data collections and an increase in budgets devoted to accessing shared records.

BLENDED DATA REQUIRE NEW DATA INFRASTRUCTURE CAPABILITIES

In the 20th century, the data ecosystem of businesses and administrative government agencies rested on entities independently designing data for their particular uses. Since the data were used principally by those who designed them, data required little documentation and no preparation for access by others. Since most data were used for a specific purpose at a specific point in time, consistency over time was subordinate to utility for the specified use. Any data manipulation or transformation was singularly focused on immediate organizational needs. If multiple data sources were needed, integrating software would be “hard-coded” into the design.

Government statistical agencies designed their data programs to monitor important social and economic conditions for populations of interest. The statistical data were held closely and securely by those agencies, as part of their confidentiality pledges to data holders and subjects. The statistics produced from the data were released according to preset schedules, rarely more often than monthly.

In the panel’s vision, such data silos no longer serve the needs of modern society. Features of the 20th century data infrastructure must change to achieve the panel’s vision for a new data infrastructure. As more initiatives combine data from multiple independent sources, further desirable capabilities of a new data infrastructure are being articulated. Box 4-4 lists work by the United Nations Economic Commission for Europe’s High-Level Data Group for Modernization of Statistical Production and Services related to a Common Statistical Data Architecture (CSDA), an initiative aimed at consistently describing the data aspects of statistical production.29 The group identified high-level capabilities required by a new data infrastructure to realize the promise of blending multiple data sources. Capabilities require the interaction of organizations, people, processes, and technology and generally describe the “what and why” of statistical production, not the “how and who.”

A new data infrastructure will require enhanced capabilities. While there is much existent talent for documenting data designed for statistical uses (e.g., surveys and censuses), there is less expertise available for documenting those features of administrative and process data that were never intended to be used in statistical operations. Similarly, Box 4-4 notes the need to define and track supply chains of data, to access data in diverse locations simultaneously, and to work with a set of partners deserving ongoing support. Data integration is well exercised in some organizations, but not all, especially for datasets that were not originally designed to be used in tandem. While data governance is well documented in federal statistical agencies, it was originally designed for data assets that would be fully acquired and stored behind the agencies’ firewalls, not for a world in which data assets are too large to move from organization to organization. Finally, knowledge management—the ability to understand differences among measures found in multiple datasets—is critical for the statistical operations needed to blend data into more informative estimates.

___________________

29 See: https://statswiki.unece.org/display/DA/CSDA+2.0

In the panel’s estimation, these skills can be acquired by the organizations involved in a new infrastructure, but only with intentionality. For example, full engagement of the academic sector can provide critical capacities like analytical expertise, strengthening existing organizational skills, and educating the future workforce, so that agencies can operate nimbly and dynamically in a new data infrastructure.

BLENDED DATA POSE NEW PRIVACY AND ETHICAL CHALLENGES

A 21st century national data infrastructure cannot succeed without ensuring ethical exchange of data; trust in institutions involved in data exchange; privacy-preserving techniques; and technical, organizational, and legal mechanisms supporting responsible data practices. For the latter half of the 20th century, these concerns were collectively categorized as “privacy.” Privacy has multiple definitions in legal and technical contexts, but colloquially the concept of “privacy” is employed when people are concerned that they lack meaningful control over a social situation and the information flowing in that context (Marwick and boyd, 2014). In the 21st century, a wider range of concepts are deployed to define privacy. Data collections and use that respect privacy are thought of as ethical, trustworthy, and responsible.

In the panel’s view, ethical treatment of data subjects requires adherence to four key values (see Chapter 3). First, the actions of a new infrastructure should be guided by attention to how use of a subject’s data will affect that subject’s life. Second, there are underlying issues of autonomy—the ability of individuals to make their own decisions. A new data infrastructure must recognize the nature of informed consent by the data subject. Third, there is a concern about beneficence—that is, to what extent will data be used to produce good outcomes for the data subject? Finally, there is a focus on human dignity—that is, are the activities of a new infrastructure conducted in a manner that is respectful of data subjects? Collectively, in the panel’s opinion, these values must underlie both policy and practice. Only after these individual concerns are addressed can the societal benefits of improved statistical information be appreciated.

While these values must be fundamental to a data infrastructure, laws provide another framework for addressing the range of social, cultural, and reputational issues at play. In the 20th century, lawmakers passed numerous bills restricting government data collection and use. The laws that govern the private sector are more diverse. Moreover, while federal data holders must only concern themselves with federal laws, multinational corporations must grapple with privacy laws in numerous countries. Countries may have overlapping data regulations based on the data subject, location of the company’s employees, and location of the company’s data centers. While some companies may be required to share data (e.g., pollution levels from a manufacturer), others operate under voluntary agreements between users and data holders (e.g., credit-reporting data), and still others are legally prohibited from sharing certain kinds of data without explicit consent (e.g., healthcare, movie rentals). The legal procedures covering privacy are both complex and incomplete. Most importantly, the limitations of privacy laws infuriate data subjects, data holders, and data users for wholly distinct reasons.

Advances in computing have increased privacy-related risks while also enabling the development of privacy-enhancing technologies. Some new processes are specifically designed to support novel forms of statistical data sharing. For example, formal privacy protection provides a technical framework for balancing data accuracy and statistical confidentiality. The U.S. Census Bureau began releasing data using synthetic data generation in 2006, as part of its Longitudinal Employer-Household Dynamics program, and has a formal privacy analysis of this data-anonymization process.30 The National Institutes of Health (NIH), the National Institute of Standards and Technology, and the National Science Foundation (NSF) are all working to build standards around homomorphic encryption, to enable computation on encrypted data. Investments in developing privacy-enhancing technologies (including funding from NIH, NSF, and the Defense Advanced Research Projects Agency) and applying them to various data-user scenarios are ongoing. In the panel’s judgment, tracking these technical mechanisms and integrating them into practice will increase data holders’ confidence in sharing data.
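The trade-off between accuracy and confidentiality that formal privacy protection manages can be illustrated with differential privacy, one widely studied formal privacy framework. The sketch below adds Laplace noise to a count. It is illustrative only: it is not drawn from the report and does not describe the Census Bureau's production methods. The function name `noisy_count` and the parameter `epsilon` (the privacy-loss budget, where smaller values mean stronger protection and noisier output) are assumptions introduced here.

```python
import random


def noisy_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Release a count with Laplace noise of scale 1/epsilon.

    Adding or removing one person changes a count by at most 1, so the
    sensitivity is 1 and the Laplace scale is sensitivity / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rate = epsilon  # rate of each exponential = 1 / scale
    # The difference of two independent exponentials is Laplace-distributed.
    noise = rng.expovariate(rate) - rng.expovariate(rate)
    return true_count + noise


# Compare the typical error under a strict and a looser privacy budget.
rng = random.Random(2023)
true_count = 100
for epsilon in (0.1, 1.0):
    draws = [noisy_count(true_count, epsilon, rng) for _ in range(20_000)]
    mean_abs_error = sum(abs(d - true_count) for d in draws) / len(draws)
    print(f"epsilon={epsilon}: mean absolute error ~ {mean_abs_error:.2f}")
```

Running the loop shows the expected pattern: the stricter budget produces a much larger average error, which is the accuracy cost of stronger confidentiality protection.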

Privacy laws and technologies can help strengthen data protections. As analysts in the federal statistical system aim to remedy current deficiencies by integrating and blending multiple sources of data, they have come to realize that even when data are de-identified, linking sources increases the risk of re-identification (Sweeney et al., 2017). Merely connecting the site of an individualized record to the same site of another can inadvertently reveal personally identifiable information that was obscured before blending. Methodologies to balance privacy tradeoffs, such as geomasking, can address the need to protect individuals while still enabling individual-level data to be utilized or analyzed without significantly affecting statistical results (Kwan et al., 2004).
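As a rough illustration of geomasking, the sketch below displaces a point by a random distance and direction within a ring around its true location (sometimes called donut masking). This is one simple masking scheme chosen for illustration; it is not presented as the method evaluated by Kwan et al. (2004). The function names, the 100 m and 500 m radii, and the sample coordinates are all assumptions, and the displacement uses a small-distance approximation that is reasonable only for short offsets.

```python
import math
import random

EARTH_RADIUS_M = 6_371_000.0


def donut_mask(lat, lon, min_m, max_m, rng):
    """Move a point a random distance in [min_m, max_m] in a random direction."""
    if not 0 <= min_m <= max_m:
        raise ValueError("require 0 <= min_m <= max_m")
    bearing = rng.uniform(0.0, 2.0 * math.pi)
    # Sampling the squared distance uniformly spreads points evenly by area.
    distance = math.sqrt(rng.uniform(min_m ** 2, max_m ** 2))
    north = distance * math.cos(bearing)
    east = distance * math.sin(bearing)
    # Small-displacement approximation: convert meters to degrees.
    dlat = math.degrees(north / EARTH_RADIUS_M)
    dlon = math.degrees(east / (EARTH_RADIUS_M * math.cos(math.radians(lat))))
    return lat + dlat, lon + dlon


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


rng = random.Random(42)
true_lat, true_lon = 38.8977, -77.0365  # hypothetical record location
masked_lat, masked_lon = donut_mask(true_lat, true_lon, 100.0, 500.0, rng)
shift = haversine_m(true_lat, true_lon, masked_lat, masked_lon)
print(f"masked point is about {shift:.0f} m from the true location")
```

The minimum radius keeps the masked point from landing on the true location, while the maximum radius bounds the spatial error introduced into analyses. Choosing those radii is the privacy and accuracy trade-off in this scheme.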

However, prior attempts to centralize federal data assets have repeatedly been thwarted by pushback under the label of “privacy.” The Privacy Act of 1974, for example, was created in direct response to a 1965 effort to create a National Data Center.31 In response to this history, the Evidence Act put privacy front and center.

Meanwhile, however, the conversation has evolved. Federal government agencies and academia are speaking of trustworthy AI,32 data ethics,33 and data equity.34 Practitioners in the industry use similar language, cognizant of how “trust” is dependent on being seen as “responsible.” Data holders are engaged in robust conversations about “data governance,” while representatives of data subjects are asking to be included in governing mechanisms.

___________________

30 See: https://lehd.ces.census.gov/applications/help/onthemap.html#!what_is_onthemap

31 For a discussion of the National Data Center and a thorough history of data-related privacy concerns see Igo (2018, Ch 6).

32 See: https://www.ai.gov/strategic-pillars/advancing-trustworthy-ai/

33 See: https://resources.data.gov/assets/documents/fds-data-ethics-framework.pdf

34 See: https://covid19.census.gov/pages/data-equity

In the panel’s judgment, a new data infrastructure must be attentive to—and in conversation with—the range of stakeholders engaging on these topics. While useful, privacy laws and technologies alone will not serve as an effective response to threats that could challenge the legitimacy of a new data infrastructure. Rather, all who are involved—including data subjects, data holders, and data users—must collectively negotiate best practices, governance mechanisms, and normative expectations about data exchange. This requires creating and sustaining a governing body (or set of bodies) tasked with building processes and practices, sustaining relationships with stakeholders, and ensuring that trade-offs are collectively negotiated.

MULTIPLE ORGANIZATIONAL STRUCTURES CAN SUPPORT A NEW DATA INFRASTRUCTURE

In this section, the panel does not suggest a specific organizational model but instead identifies assumptions regarding the data assets, capabilities, attributes, and services of a new data infrastructure. In the panel’s opinion, CEP’s recommendations are consistent with the necessary attributes of a new data infrastructure but are insufficient to form the foundation of this infrastructure. CEP recommended broader access to federal administrative data for statistical purposes and sharing of statistical data resources among the federal statistical agencies, and also recommended that the National Secure Data Service (NSDS) be established, housed within the Department of Commerce, to blend multiple data sources for improved statistics and research. CEP further recommended sharing state-based earnings data for statistical purposes (Commission on Evidence-Based Policymaking, 2017).

The first legislation passed based on CEP’s report, the Evidence Act, did not itself create the Commission’s proposed NSDS. However, there have been multiple commentaries on alternative organizations that might house NSDS. Modernizing U.S. Data Infrastructure: Design Considerations for Implementing a National Secure Data Service to Improve Statistics and Evidence Building (Hart and Potok, 2020) made the case for a new data infrastructure and discussed the establishment, attributes, and organizational options associated with the creation of NSDS. Four approaches were considered:

  • Establishing a new agency in the U.S. Department of Commerce;
  • Re-tasking an existing agency in the U.S. Department of Commerce;
  • Creating a new federally funded research and development center (FFRDC) at NSF; and
  • Launching a public-private partnership in a university consortium.

Of these approaches, Hart and Potok recommended the establishment of NSDS as a new FFRDC at NSF, leveraging the existing legal authorities of the National Center for Science and Engineering Statistics, a principal statistical agency covered by CIPSEA (Hart and Potok, 2020).35

The Evidence Act did not mention NSDS but did establish ACDEB, to advise the OMB director and “to review, analyze, and make recommendations on how to promote the use of Federal data for evidence building” (U.S. Congress, 2019, Section 315). ACDEB’s Year 1 Report included a vision, framework, and resources associated with NSDS (Advisory Committee on Data for Evidence Building, 2021). At monthly public meetings in 2021, ACDEB discussed the idea of NSDS as an FFRDC located in NSF; however, the Year 1 Report did not comment on the organizational form or the exact organizational placement of NSDS. At ACDEB’s meeting in January 2022, NSF announced that the newly established America’s DataHub Consortium would serve as a demonstration project for NSDS—sponsored by the National Center for Science and Engineering Statistics, a statistical unit at NSF (Arora, 2022). According to NSF, the “consortium model benefits all levels of government and prioritizes innovation” (Arora, 2022, p. 13). The consortium structure appears to allow flexibility to bring together various organizations and individuals, within and outside of government.

While the organizational structure and location of NSDS remain uncertain, there is agreement regarding several important issues. First, a National Academies’ Committee on National Statistics panel (the National Academies, 2017a), CEP, and ACDEB rejected the idea of establishing a national clearinghouse or data warehouse, due to untenable privacy risks. Any new entity must provide a shared service that permits authorized users to conduct temporary data linkages for exclusively statistical purposes. Second, there is widespread agreement regarding the necessary attributes of NSDS, closely aligning with the eight attributes described by Hart and Potok (2020): transparency and trust, legal authority to protect privacy and confidentiality, independence, legal authority to collect data from agencies, scalable functionality, sustainability, oversight and accountability, and intergovernmental support.

CEP and ACDEB have focused on using federal, state, and local data for evidence building, proposing the establishment of NSDS to bring these data assets together. In the panel’s opinion, a comprehensive vision of a new data infrastructure is incomplete without addressing how the blending of private sector data with other data assets might improve the country’s understanding of its current situation and prospects. Thus, it is essential to consider the implications of the addition of private sector data assets on organizational options, organization type, and organization placement.

___________________

35 This discussion of the placement of NSDS was subsequently expanded to include five more-detailed organizational models (Potok and Hart, 2022).

In sum, the panel’s vision of a new data infrastructure involves adding important data assets beyond those currently legislatively endorsed in the Evidence Act (i.e., federal statistical and federal program data). These additions include relevant state, tribal, territory, and local government data assets; relevant private sector data assets; data assets from nonprofits and academic institutions; and crowdsourced or citizen-science data assets.

The panel assumes that, in a new data infrastructure, NSDS will be implemented and will process state, tribal, territory, and local government data as well as federal government data. At the time of this writing, however, the form NSDS will take is not fully evolved, and various organizational models are possible.

Organizational Models to Facilitate Cross-Sector Data Access and Use

In the panel’s vision, a new data infrastructure should tap assets, as necessary, from all sectors of society that produce digital data about the state of the country. Such an infrastructure was not anticipated by the organizational structure of the current federal statistical system. Without organizational change, blended data to improve the current data infrastructure will remain siloed and all too rare. In its vision for a new data infrastructure, the panel assumes that the federal government will eventually implement CEP’s recommendations related to an NSDS. Further, the panel assumes that state and local government data will be added to the mission of NSDS. This implies that NSDS will act as an access portal to federal, state, tribal, territory, and local government data assets. But a new data infrastructure also requires a sustainable organizational model for accessing relevant private sector data for common-good statistics, which raises additional challenges.

There are several alternative organizational options for a facility within a new data infrastructure that will produce blended statistics using federal, state, tribal, territory, local, private sector, nonprofit, and academic institution-held data as well as crowdsourced data assets. While the panel does not endorse any option, at least one new facility will be required to facilitate the blending of data to create new statistical products, and this entity (or network of entities) will be a key component of a new data infrastructure. For ease of exposition, this report will use the singular terms “facility” or “entity,” even though the panel recognizes that there may be multiple coordinated entities. In the panel’s judgment, when considering organizational options for the facility, the seven attributes articulated in the vision of a new data infrastructure must be met. These include the privacy-protecting practices and legal reforms necessary to support the infrastructure’s authorities, data governance and standards frameworks, and transparency of operations central to its success. Finally, each model facility must be able to access and accommodate federal government statistical and administrative data; state, tribal, territory, and local government data; nonprofits and academic institutions data; private sector data; and crowdsourced or citizen-science data.

In assessing organizational models for a new data infrastructure service, possible iterations of a suitable infrastructure should be considered. CEP and ACDEB have articulated their vision of NSDS, but what are the structural implications of blending state, tribal, territory, local government, and private sector data with data accessible through NSDS? The panel considered seven potential organizational models.

Option 1: NSDS Coordinates Access to All Data Sources

In this vision, NSDS (a combined, comprehensive new entity within the data infrastructure, established with the guidance of ACDEB, with rulemaking mandated by the Evidence Act) would have authority over data access from all sectors. This option assumes that NSDS has successfully evolved to provide access to all federal, state, tribal, territory, and local government program and administrative data to be blended with survey and census data for solely statistical purposes. In this model, accessing all data-holder data—including private sector data, data from academic institutions and nonprofits, and crowdsourced or citizen-science data—is included in NSDS’s authorities. This model is agnostic to whether the provision of additional data sources for statistical purposes is mandated or incentivized for participant organizations (though, in the panel’s opinion, “incentivized” is most realistic). NSDS would develop in partnership with nongovernmental organizations and government agencies to help set access and use protocols. All data-access and privacy-protection procedures would rest on CIPSEA and enhanced legal sanctions.

Option 2: The New Entity Is Placed at a Principal Statistical Agency or a Unit Within Such an Agency

CEP recommended that NSDS be located in the U.S. Department of Commerce as a new principal statistical agency leveraging data assets and expertise of the U.S. Census Bureau, BEA, National Institute of Standards and Technology, and the National Oceanic and Atmospheric Administration (Commission on Evidence-Based Policymaking, 2017, Rec. 2-1). CEP added that NSDS “should be situated in such a way as to provide independence sufficient to set strategic priorities distinct from any existing Commerce agency and to operate apart from policy and related offices” (Commission on Evidence-Based Policymaking, 2017, p. 41). In its exploration, the panel was agnostic to where NSDS is housed but, if this route is pursued, the panel recommends that it be housed within a structure with sufficient expertise, funding, and organizational infrastructure to establish and sustain the effort.

Option 3: Authority for the New Entity Is Placed at a New Federally Funded Research and Development Center (FFRDC)

As noted earlier, Hart and Potok (2020) recommended the establishment of NSDS as a new FFRDC at NSF, leveraging the existing legal authorities of the National Center for Science and Engineering Statistics, a principal statistical agency covered by CIPSEA. Other locations are possible if the FFRDC has a federal sponsor and is delegated the authority to provide all NSDS-like services to federal, state, tribal, territory, and local government data holders, as well as to private sector and other data holders. In the panel’s view, if this approach is pursued, the FFRDC should have all the rights and responsibilities of a federal statistical agency, coverage under CIPSEA, and access to all data holdings. Such an FFRDC (or FFRDCs) would be expected to develop access protocols in partnership with private-sector organizations, academic partners, and government agencies.

Option 4: A Public-Private Partnership in a University Consortium

Assuming NSDS is established (Option 1), the panel can imagine the creation of a new 501(c)(3) or other 501-section nonprofit organization with the sole purpose of facilitating the secure use of nongovernmental data (including private sector data) for blending with data resources controlled by NSDS. This nonprofit would be governed by a representative body composed from the organizations whose data are accessed by the entity. The panel recommends that such an approach contain a community advisory board to increase accountability to populations whose data are being used. The transparency features of this public-private partnership would include real-time documentation of analyses currently in operation and statistical products being produced. All statistical products from data supplied by the entity could be published for public consumption on the entity’s website.

Option 5: NSDS Along with Private Data Compiled Within Given Sectors

This option would rest on a group of sector-specific consortia overseen by NSDS. Each consortium would be a node for blending of sector-specific data with other data pertinent to the industry. Individual data-sharing companies within each sector would not have access to the shared data. However, sharing data with the consortia could be reciprocated by provision of a set of statistical products of value to the sector, with strict privacy protections for all. Each consortium would establish agreements for data sharing that would benefit national statistics. The negotiation of these agreements could be a responsibility of NSDS or the statistical agency whose work could be enhanced with a given sector’s data.

Option 6: NSDS Along with Data Gathered by Regional Affiliates

This option is similar to Option 5 but is organized in geographical sets. Such networks might be of interest to activities that affect one another spatially (e.g., healthcare delivery systems or workforce data). Such regional hubs exist and are sharing data (e.g., Cunningham et al., 2022). For example, the Michigan Education Data Center is a secure data clearinghouse that helps researchers use Michigan’s education data.36

Option 7: A Data Trust

A data trust is a structure whereby data are placed under the control of a board of trustees with a fiduciary responsibility to look after the interests of the beneficiaries. Using a trust offers data holders a greater say in how their data are collected, accessed, and used by others. A data trust goes further than limiting data collection and access to protect privacy; it promotes the beneficial use of data and ensures benefits are widely felt across society. The data trust form may be overlaid on some of the above structures, like regional or sector options. Trusts must determine who can collect data, who can make decisions about future data collection, who can access data and decide future data access, and who can decide future data use (Ruhaak, 2019). Several of the other options might also take the form of a data trust.

To be successful, any of the models described above will require active engagement with data holders and data users, diverse stakeholders, effective governance, and exemplary transparency. Each model also raises unique questions, including those related to ensuring privacy and security (Box 4-5).

SUMMARY

This chapter discussed the data assets of a 21st century national data infrastructure, including how those assets are sourced and evaluated. Statistics from blended data are central to the panel’s vision, and the blending of data has implications for statistical methods, statistical designs, and data-infrastructure capabilities. This chapter also described the privacy and ethical challenges associated with blended data and the organizational structures that could facilitate access to and use of multiple data sources.

___________________

36 See: https://medc.miedresearch.org/


Historically, the U.S. national data infrastructure has relied on the operations of the federal statistical system and the data assets that it holds. Throughout the 20th century, federal statistical agencies aggregated survey responses of households and businesses to produce information about the nation and diverse subpopulations. The statistics created from such surveys provide most of what people know about the well-being of society, including health, education, employment, safety, housing, and food security. The surveys also contribute to an infrastructure for empirical social- and economic-sciences research. Research using survey-response data, with strict privacy protections, led to important discoveries about the causes and consequences of important societal challenges and also informed policymakers. Like other infrastructure, people can easily take these essential statistics for granted. Only when they are threatened do people recognize the need to protect them.

Toward a 21st Century National Data Infrastructure: Mobilizing Information for the Common Good develops a vision for a new data infrastructure for national statistics and social and economic research in the 21st century. This report describes how the country can improve the statistical information so critical to shaping the nation's future, by mobilizing data assets and blending them with existing survey data.
