Suggested Citation: "4 Creating New Data Resources with Administrative Records." National Academies of Sciences, Engineering, and Medicine. 2023. Toward a 21st Century National Data Infrastructure: Enhancing Survey Programs by Using Multiple Data Sources. Washington, DC: The National Academies Press. doi: 10.17226/26804.

4

Creating New Data Resources with Administrative Records

The Foundations for Evidence-Based Policymaking Act of 2018 identified the potential for reducing burden on survey respondents by making use of administrative records already being collected for other purposes. The United States began using administrative data for statistical purposes even before the first decennial census in 1790; Box 4-1 lists some historical developments related to the use cases in this report. This chapter describes innovative projects that have used (or are planning to use) decennial census and administrative records to provide information that would otherwise be measured in a household survey, or not measured at all. The chapter relies in part on the workshop session Opportunities for Using Multiple Data Sources to Enhance Major Survey Programs.

Section 4.1 describes the potential for creating longitudinal datasets from administrative records, illustrating the concept with examples of three databases created by the U.S. Census Bureau, and outlines additional linkage challenges involved when using data from multiple time points. Section 4.2 discusses a proposed U.S. Census Bureau project to link together four databases (demographic, geographic, jobs, and businesses) to provide the basis for a more integrated research program. Section 4.3 describes how a culture of innovation in one of the oldest U.S. data-combination programs, the National Vital Statistics System, is producing new and improved data products. Section 4.4 provides examples of linkage at state and regional levels and identifies challenges involved in combining data collected using different standards and protocols. Section 4.5 concludes the chapter with a summary of common themes

BOX 4-1
Historical Uses of Administrative Records for Statistical Purposes: Selected Examples

Administrative records have been used to produce statistics in the United States since the nation’s founding. Many of the early milestones involved producing statistics from newly established administrative data collections. The last 50 years have seen an increase in the use of administrative records in combination with other data sources. Here are some selected highlights:

On July 31, 1789, the first U.S. Congress approved “An Act to regulate the Collection of the Duties imposed by law on the tonnage of ships or vessels, and on goods, wares and merchandises imported into the United States,” which directed customs collectors “to receive all reports, manifests and documents made or exhibited to him by the master or commander of any ship or vessel, … to make due entry and record in books to be kept for that purpose, all such manifests and the packages, marks and numbers contained therein.” (U.S. Congress, 1845, p. 36, emphasis added). Treasury secretary Alexander Hamilton transmitted the first statistical summaries of these data to Congress in January 1791, cross-classifying tonnage and duties by vessels’ nationality and the state receiving the goods (Hamilton, 1791). Cummings (1918) reviewed the proliferation of statistics from administrative data that followed these initial statistics on foreign commerce.
In 1880, the U.S. Census Office established a federal-state cooperative data system that still operates today: a national death-registration area consisting of states and cities providing death statistics deemed of sufficient quality to be tabulated. A national birth-registration area was established in 1915. In 1880, the death-registration system contained only two states (Massachusetts and New Jersey) and a few large cities, but by 1933, all 48 states and the District of Columbia had been admitted to the birth- and death-registration system and each was reporting at least 90 percent of deaths (Hetzel, 1997; National Research Council, 2009, Appendix B; Rothwell, Freedman, & Weed, 2014). In 1946, responsibility for statistics about births and deaths was transferred to the U.S. Public Health Service; today, the National Vital Statistics System is coordinated and guided by the U.S. National Center for Health Statistics (see Section 4.3).
Following the authorization of an income tax in the 16th Amendment of the U.S. Constitution, the Revenue Act of 1916 called for “the preparation and publication of statistics reasonably available with respect to the operation of the income tax law and containing classifications of taxpayers and of income, the amounts allowed as deductions and exemptions, and any other facts deemed pertinent and valuable” (U.S. Congress, 1917, p. 776). The first statistical report was issued in 1918 (Dalton, 2007), and the Statistics of Income Division in the Internal Revenue Service (IRS) continues to publish aggregate tax information.
The first volume of the Uniform Crime Reports, containing statistics voluntarily contributed by police departments using the uniform crime classifications specified by the International Association of Chiefs of Police (1929), was published in 1930 (see Chapter 7).
The 1935 Social Security Act set up a system of continuous data reporting by employers, presenting an opportunity to develop a longitudinal dataset to study work history. The Continuous Work History Sample, initiated in 1941 by the predecessor of the Social Security Administration (SSA), is the oldest major longitudinal dataset in the United States, with data extending back to 1937. It is a probability sample of administrative records, consisting of one percent of all Social Security Numbers (SSNs) ever issued, along with demographic and geographic information about the persons with those SSNs and annually updated information on earnings, benefits, and payroll-tax contributions (Perlman, 1951; Smith, 1989; Compson, 2022).
In the 1940s, the U.S. Census Bureau began expanding the use of administrative data for estimating U.S. and state population sizes in noncensus years (county population estimates were added in the 1960s; see U.S. Bureau of the Census, 1947a,b, 1967, 1968). Postcensal population estimates are calculated by adding births, subtracting deaths, and adding net migration to estimates from the most recent census. They currently rely on information about births and deaths from the administrative records in the National Vital Statistics System (see Section 4.3), and on information about domestic migration obtained from tax and Medicare records.^a
After World War II, the U.S. Census Bureau “started making extensive use of record files from the Internal Revenue Service and Social Security Administration to develop mailing lists for economic census and surveys, and, eventually, to provide aggregate data, as in the County Business Patterns program and for smaller establishments in the economic census” (Kilss & Alvey, 1984, p. 1). In the early 1970s, the Bureau constructed the Standard Statistical Establishment List (now called the Business Register) from economic census records and administrative data from the IRS and SSA; the register is continually updated using information from U.S. Census Bureau programs and administrative records (see Section 4.1; U.S. Bureau of the Census, 1979; Jarmin & Miranda, 2002). During the 1970s the U.S. Department of Agriculture used record-linkage techniques with a variety of data sources to improve list frames for agricultural surveys (Allen, 2008).
In 1973, one of the earliest large-scale survey linkage projects linked records from the Current Population Survey (CPS) with administrative records data from the IRS and SSA. This interagency collaboration allowed researchers to merge variables about earnings and benefits from the administrative data with variables obtained from survey respondents (see Section 2.2).
The first estimates from the Small Area Income and Poverty Estimates (SAIPE) program, which uses administrative data as input to statistical models for predicting poverty rates in areas with small survey sample sizes, were published in 1993 (see Box 2-2).
The 1994 Census Address Improvement Act (U.S. Congress, 1994) authorized the U.S. Postal Service to share its Delivery Sequence File, the list of all delivery point addresses served by postal carriers, with the U.S. Census Bureau. The Master Address File (MAF), the U.S. Census Bureau’s inventory of all known living quarters, was created in the late 1990s by merging the Delivery Sequence File with the inventory of living quarters enumerated in the 1990 Census (Uhl, 2011). After the 2000 Census, the MAF was integrated with the Topologically Integrated Geographic Encoding and Referencing (TIGER) system, a spatial database developed for the 1990 Census that captures geographic features such as streets, rivers, lakes, and railroads, as well as boundaries of political and census units (National Research Council, 2003). The continually updated MAF/TIGER system is used in decennial census operations and as a sampling frame for surveys.
The U.S. Census Bureau developed the Statistical Administrative Records System in the late 1990s, combining seven national administrative records datasets to test the feasibility of an administrative records census (Prevost & Leggieri, 1999; Judson, 2000). The Frames project (see Section 4.2) builds on this work.
In 2007, a consortium of agencies and research organizations published the first report from the Medicaid Undercount Project, established to study discrepancies in Medicaid enrollment counts between survey estimates and administrative records (SNACC, 2007).^b The U.S. Medicaid program was established in 1965 to provide health insurance for people with limited income, but a number of studies found that survey estimates of the number of persons receiving Medicaid were substantially lower than the number of persons known to be receiving Medicaid from state-level administrative records (see, e.g., Lewis, Ellwood, & Czajka, 1998). By linking records from the Current Population Survey Annual Social and Economic Supplement (CPS ASEC) with administrative records (the Medicaid Statistical Information System), the Medicaid Undercount Project team identified reporting errors on the CPS ASEC as the main source of the discrepancies.

__________________

^a https://www.census.gov/programs-surveys/popest/about.html. Population estimates are used for allocating federal funds, adjusting for survey nonresponse (since they provide independent population estimates for demographic groups), and as input to programs such as the SAIPE program.

^b The acronym SNACC comes from the first letters of the collaborating agencies: the University of Minnesota’s State Health Access Data Assistance Center, the National Center for Health Statistics, the Department of Health and Human Services Office for the Assistant Secretary for Planning and Evaluation, the Centers for Medicare & Medicaid Services, and the U.S. Census Bureau. See SNACC (2010); Davern et al. (2008, 2009); Noon, Fernandez, & Porter (2019); and Boudreaux et al. (2019) for summaries of the project’s findings.

and research needed to assess and document the quality dimensions of administrative records.

4.1 CREATING LONGITUDINAL DATABASES FROM EXISTING RECORDS

Surveys that interview new samples each year provide cross-sectional information: a snapshot of a particular moment in time. Repeated cross-sectional surveys can be used to estimate changes in aggregate statistics over time, such as year-to-year changes in the percentage of people who smoke cigarettes. Longitudinal studies, which measure the same set of persons or businesses at multiple time periods, can additionally provide information on individual trajectories and statistics such as the percentage of people who were smokers in 2020 but were nonsmokers in 2021. Longitudinal surveys collect such data directly, but they are often expensive, and low initial response rates as well as survey dropout over time can bias results.
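The distinction between cross-sectional and longitudinal statistics can be illustrated with a minimal sketch. The five-person panel and the function names below are invented for illustration and do not come from any dataset discussed in this chapter.

```python
# Hypothetical toy panel: smoking status for the same five people in two years.
PANEL = {
    "p1": {2020: True, 2021: True},
    "p2": {2020: True, 2021: False},
    "p3": {2020: False, 2021: False},
    "p4": {2020: True, 2021: False},
    "p5": {2020: False, 2021: True},
}


def prevalence(panel, year):
    """Cross-sectional statistic: share of people who smoke in one year."""
    return sum(status[year] for status in panel.values()) / len(panel)


def quit_share(panel, start, end):
    """Longitudinal statistic: share of all people who smoked in `start`
    but did not smoke in `end`. Requires the same people in both years."""
    quitters = sum(1 for s in panel.values() if s[start] and not s[end])
    return quitters / len(panel)


if __name__ == "__main__":
    print(prevalence(PANEL, 2020), prevalence(PANEL, 2021))
    print(quit_share(PANEL, 2020, 2021))
```

In this toy panel, prevalence falls from 0.6 to 0.4, a net change of 0.2, yet 0.4 of the panel quit and one person started smoking. Two independent cross-sectional samples could recover the net change but not the individual transitions.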

An alternative is to link records belonging to the same person or business across existing datasets. As discussed in Chapters 2 and 3, linking records requires sufficient identifying information to determine that two records belong to the same entity. Additional challenges arise with longitudinal linkage of administrative datasets because recordkeeping standards and identification variables can change over time. Measurement standards and practices can also change—for example, the income variable from an administrative data source in 1990 might measure a different concept than the variable used for income in 2020, or new treatment codes in health care claims data may replace or supplement previous codes.

This section describes three U.S. Census Bureau projects that create longitudinal databases from administrative records and decennial censuses. Wagner and Layne (2014) described the U.S. Census Bureau’s Person Identification Validation System, which relies on probabilistic linkage methods (see Box 2-1). When the U.S. Census Bureau acquires a dataset, the Bureau compares personally identifying information in that dataset with information in the Census Numident file.¹ Each record in the Numident file is assigned a unique, anonymous identifier called a Protected Identification Key (PIK). When a match is determined, the PIK from the Numident record is

___________________

¹ The Census Numident file is derived primarily from the Social Security Administration Numerical Identification (Numident) file, which contains all transactions recorded for each Social Security Number.

attached to the record in the dataset, allowing the new dataset to be linked with other U.S. Census Bureau data resources.²
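The general idea of attaching a PIK can be sketched as follows. This is a simplified illustration only, not the Person Identification Validation System: the reference records, field names, agreement weights, and acceptance threshold are all assumptions made for the example, and an actual probabilistic linkage would estimate its weights from data.

```python
# Hypothetical reference file standing in for the Census Numident.
# All names, values, and identifiers are invented.
REFERENCE = [
    {"pik": "PIK001", "first": "maria", "last": "lopez", "birth_year": 1980, "zip": "10001"},
    {"pik": "PIK002", "first": "mario", "last": "lopez", "birth_year": 1975, "zip": "10001"},
    {"pik": "PIK003", "first": "john", "last": "smith", "birth_year": 1980, "zip": "60601"},
]

# Assumed agreement weights and acceptance threshold (illustrative values).
WEIGHTS = {"first": 2.0, "last": 2.0, "birth_year": 3.0, "zip": 1.0}
THRESHOLD = 6.0


def score(record, candidate):
    """Sum the weights of the fields on which the two records agree."""
    return sum(w for field, w in WEIGHTS.items()
               if record.get(field) is not None
               and record.get(field) == candidate.get(field))


def assign_pik(record, reference=REFERENCE):
    """Return the PIK of the best-scoring reference record, or None when the
    best score is below the threshold or is tied with another candidate."""
    scored = sorted(((score(record, c), c["pik"]) for c in reference), reverse=True)
    best_score, best_pik = scored[0]
    tied = len(scored) > 1 and scored[1][0] == best_score
    if best_score >= THRESHOLD and not tied:
        return best_pik
    return None


def attach_piks(dataset):
    """Return copies of the incoming records with a 'pik' field added."""
    return [{**record, "pik": assign_pik(record)} for record in dataset]
```

Records that receive the same PIK in different datasets can then be joined on that key. Records that receive no PIK cannot be linked, which is one source of the coverage differences discussed later in this section.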

Longitudinal Business Database

The Longitudinal Business Database is built from Internal Revenue Service (IRS) corporate and self-employment tax records that are used to maintain the U.S. Census Bureau’s Business Register of nonfarm business establishments.³ The Business Register, a regularly updated census of U.S. business establishments and firms with paid employees, contains information including business name and address, industry classification, size, employment, payroll, and receipts. It is the primary source for the annual County Business Patterns reports,⁴ and it serves as a sampling frame for business surveys and censuses.

The Longitudinal Business Database includes linked Business Register records belonging to the same establishment across time, going back as far as 1976, and it also incorporates information from other sources such as economic censuses. This allows researchers to study year-to-year changes in private employment and the entrance or exit of establishments across industry types, locations, and size classifications.⁵

Chow et al. (2021) described how the U.S. Census Bureau addressed challenges involved in this longitudinal linkage. For example, an employer identification number present in one year but not the next might belong to a business that discontinued operation, but this situation might also arise because the business received a new identification number. The linkage

___________________

² See https://www.census.gov/about/policies/quality/standards/standardc4.html for a description of statistical quality standards and confidentiality protections for linking data. The U.S. Census Bureau’s administrative data inventory can be found at https://www2.census.gov/about/linkage/data-file-inventory.pdf

³ See https://www.census.gov/econ/overview/mu0600.html and https://www.census.gov/programs-surveys/ces/data/restricted-use-data/longitudinal-business-database.html for descriptions of the Business Register and Longitudinal Business Database. Jarmin & Miranda (2002) described the development of the Longitudinal Business Database along with the history of earlier longitudinally linked establishment datasets such as the Longitudinal Research Database, which linked plant-level data from the Census of Manufactures and the Annual Survey of Manufactures (Davis, Haltiwanger, & Schuh, 1996).

⁴ https://www.census.gov/programs-surveys/cbp.html

⁵ Because the datasets involve IRS tax records, microdata can be accessed only by qualified researchers for approved projects in secure Federal Statistical Research Data Centers. Examples of research studies conducted using these datasets include Benedetto et al. (2007); Akee, Mykerezi, & Todd (2020); Cunningham et al. (2021); Goetz & Stinson (2021); Handley, Kamal, & Ouyang (2021); and Mahajan (2021). Kinney et al. (2011) described the creation of the Synthetic Longitudinal Business Database, intended for exploratory studies by a wider user base, in which data are simulated from statistical models intended to reproduce the structure of the real data.

procedure thus also considered name and address to reconcile broken links. Industry classification systems and other definitions of data elements have changed over time, requiring harmonization of the many versions of Business Register data across more than 40 years.
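The reconciliation of broken links can be sketched in a few lines. The example below is a hypothetical illustration, not the procedure of Chow et al. (2021): it assumes each year's file maps an identification number to a name and address, and it treats an exact match on normalized name and address as sufficient evidence that a business received a new number.

```python
def normalize(text):
    """Lowercase and strip punctuation so trivially different spellings agree."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return " ".join(cleaned.split())


def link_years(year_a, year_b):
    """Link establishment records across two consecutive years.

    Each argument maps an identification number to a dict with 'name' and
    'address'. Returns (links, exits, entrants), where links maps a year_a
    identifier to its year_b identifier.
    """
    # Step 1: identifiers present in both years link directly.
    links = {ein: ein for ein in year_a if ein in year_b}

    # Step 2: index the still-unlinked year_b records by name and address.
    unmatched_b = {
        (normalize(r["name"]), normalize(r["address"])): ein
        for ein, r in year_b.items() if ein not in links
    }

    # Step 3: a year_a identifier that disappeared is treated as a continuing
    # business if its name and address reappear under a new identifier.
    exits = []
    for ein, r in year_a.items():
        if ein in links:
            continue
        key = (normalize(r["name"]), normalize(r["address"]))
        new_ein = unmatched_b.pop(key, None)
        if new_ein is not None:
            links[ein] = new_ein
        else:
            exits.append(ein)

    entrants = sorted(unmatched_b.values())
    return links, exits, entrants
```

Without the third step, a business that simply received a new identification number would be counted once as an exit and once as an entrant, overstating both.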

Data-equity considerations arise because less information is available about small, single-establishment firms than about large firms. An imputation model was developed for establishments with unknown beginning and end dates. But Chow et al. (2021, p. 30) noted that “the training data for the imputation model consist almost entirely of establishments born to large multi-unit firms whereas the set of establishments with missing data comes almost entirely from small multi-unit or single-unit firms” and advised researchers to “exercise caution” when using these imputed data.

Longitudinal Employer-Household Dynamics Database

The Longitudinal Employer-Household Dynamics (LEHD) program integrates data from federal censuses, surveys, and administrative records with administrative records collected by states to create a longitudinal database about employment. Each state collects quarterly earnings and employment data to manage its unemployment insurance program. Under the Local Employment Dynamics Partnership, states agree to share unemployment insurance system wage records and other administrative data with the U.S. Census Bureau.

Abowd et al. (2009) described the datasets that are linked in the LEHD and the procedures used to create the database. State data files contain information about economic activity, but little information about the individuals in the wage records. Each individual is assigned a PIK through the U.S. Census Bureau’s linkage procedure, which allows demographic information to be appended. Additional information is linked from surveys (including the Current Population Survey [CPS] and the American Community Survey [ACS]), the Business Register, and other sources. These data can also be linked with other datasets containing PIKs; all research must be conducted in a restricted environment such as a Federal Statistical Research Data Center.

The resulting dataset contains longitudinal information about employers and employees that could not be obtained from a single probability survey. A household survey would not have information about the business characteristics of household members’ employers, and a survey of businesses would not have detailed information about the employees, but the LEHD has both. This allows the LEHD to produce statistics about employment, earnings, job creation, and job-to-job flows (including characteristics of the origin and destination jobs and earnings changes resulting from job transitions) for detailed levels of geography and industry classification. Data can

also be disaggregated by worker characteristics such as sex, age, education, race, and ethnicity.⁶

As with any linked dataset, statistics are affected by the coverage of the data and linkage errors. LEHD data are limited to employees of businesses required to file unemployment insurance system wage reports in participating states, and thus do not present a full picture of employment in the United States.⁷

Decennial Census Digitization and Linkage Project

The Decennial Census Digitization and Linkage project, currently under way at the U.S. Census Bureau, will create a longitudinal database of records from decennial censuses from 1940 through 2020.⁸ Records from 1940, 2000, 2010, and 2020 have already been linked. Combining records from other censuses is challenging because the digitized microdata for 1960 through 1990 contain all variables from the censuses except the one piece of information crucial for linking the records—the respondent names. Genadek & Alexander (2019) outlined a plan to scan and digitize the names from microfilmed census records, so that the remaining years can be linked.

Genadek & Alexander (2019, p. 3) stated that the “resulting data resource will expand our understanding of population dynamics in the U.S. far beyond what is currently possible, providing transformational opportunities for research, education, and evidence-building across the social, behavioral, and economic sciences.” The linked data will, of course, have limitations. Because census data are available only for years ending in “0” and require a long processing time, the linked files lack both temporal granularity and timeliness. The decennial censuses have long undercounted certain populations, and undercounts before the U.S. Census Bureau used post-enumeration surveys to evaluate coverage were not as well studied (see Box 3-2). And, as discussed in Chapter 3, the quality of linkage information

___________________

⁶ See https://lehd.ces.census.gov/data for other data and statistics available from the program.

⁷ https://lehd.ces.census.gov/state_partners/ shows the set of states that participate in the partnership. Data are also obtained from the U.S. Office of Personnel Management to include federal employees. The previous National Academies of Sciences, Engineering, and Medicine report in this series (NASEM, 2023, p. 50) described the “time-consuming and daunting” process for negotiating data-sharing agreements with states to acquire data for the LEHD. Abowd & Vilhuber (2005) studied effects of linkage errors on individual job histories and aggregated statistics.

⁸ https://www.census.gov/programs-surveys/dcdl.html. The files are available to researchers with approved projects through the Federal Statistical Research Data Centers.

varies across population subgroups and across years, resulting in an inequitable distribution of linkage failures.⁹

Of course, many of the concepts and questions on the decennial census have changed over the decades, but full documentation detailing those changes is available.¹⁰ The interactive infographic provided by the U.S. Census Bureau (2021e) shows the race categories used by each census between 1790 and 2020 and maps them to the U.S. Office of Management and Budget (OMB) categories described in Box 3-3 (OMB, 1997).

As an example of the type of research that can be done with the linked census files, Leach, Van Hook, & Bachmeier (2018) followed immigrant parents, their children, and their grandchildren from 1940 to 2014. To include intercensal years, they also linked data from selected years of the CPS and the ACS. They noted, however, that linkage errors could lead to bias. About 70 percent of the children of immigrant parents observed in the 1940 Census were assigned a PIK, and their characteristics differed from those of children without PIKs. In addition, for individuals who cannot be linked, it may be unclear whether the linkage failure is because of insufficient linkage information or because the individual died or emigrated. Leach, Van Hook, & Bachmeier (2018) attempted to correct for these sources of potential bias by using weighting methods similar to those used to adjust for nonresponse.
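One common nonresponse-style adjustment is the weighting-class method, sketched below for linkage failure. This is an illustration of the general technique under stated assumptions, not the specific method of Leach, Van Hook, & Bachmeier (2018): the record layout, the `linked` flag, and the choice of adjustment cell are hypothetical.

```python
from collections import defaultdict


def linkage_weights(records, cell_key):
    """Weighting-class adjustment for linkage failure.

    `records` is a list of dicts, each with a boolean 'linked' field and a
    field named by `cell_key` that defines the adjustment cell. Within each
    cell, every linked record receives the weight
    (records in cell) / (linked records in cell), so that linked records
    also stand in for unlinked records in the same cell. Unlinked records
    receive a weight of 0.0.
    """
    totals = defaultdict(int)
    linked = defaultdict(int)
    for r in records:
        totals[r[cell_key]] += 1
        linked[r[cell_key]] += int(r["linked"])

    weights = []
    for r in records:
        cell = r[cell_key]
        weights.append(totals[cell] / linked[cell] if r["linked"] else 0.0)
    return weights
```

The adjustment removes bias only to the extent that linked and unlinked records within a cell are similar on the outcomes studied, and a cell with no linked records receives no representation at all. It also cannot distinguish individuals who could not be linked from those who died or emigrated.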

CONCLUSION 4-1: Longitudinally linked administrative records datasets provide a cost-efficient opportunity to study long-term outcomes, and they may have large sample sizes for key population subgroups that have low representation in other data sources. Careful curation and attention to linkage errors and data equity enhance the value of these datasets.

4.2 THE FRAMES PROJECT

The U.S. Census Bureau provides data and information about the people and economy of the United States. Some of that information comes from the decennial census and surveys that the U.S. Census Bureau conducts; other information comes from administrative records, private-sector data, or other sources. These data products are crucial for the functioning of the democracy, by providing freely available data and statistics to inform

___________________

⁹ For example, the set of persons obtaining Social Security Numbers (SSNs) has changed over time. The Social Security Act of 1935 excluded domestic and agricultural workers from the program. Domestic and agricultural workers, a large percentage of whom were Black, were less likely to have SSNs for earlier censuses and their records were thus more likely to be subject to linkage errors.

¹⁰ Bohme (1989) and U.S. Census Bureau (2002) provided historical context and listed the questions and instructions to enumerators for each census.

decisions made by businesses, policymakers, researchers, and ordinary citizens (NASEM, 2021b).

Motivated by declining survey response rates, increased data-collection costs, and demand for more timely and granular data, the U.S. Census Bureau has proposed a new vision for an “enterprise approach” to statistical data. Santos (2022, slide 3) articulated the goals of the approach: “Improved collaboration with stakeholders and partners, improved data quality, stronger computing power, proliferation of alternative unofficial data products, and new technologies.” The data ecosystem is designed to 1) “provide a cloud-centric data storage and computing platform for survey operations”; 2) provide sampling frames that are linkable to other sources and accessible for research purposes; 3) provide modernized data-collection and acquisition solutions that are cost effective, efficient, and scalable; and 4) broadly disseminate publicly available data products in a way that facilitates their use (Santos, 2022, slide 9). This is a long-term project, and, according to Santos, it will require “a decade or more of concerted effort to become sustainable and to achieve maturity.”

Figure 4-1 shows a schematic of the Frames project. Initial steps involve linking four internal U.S. Census Bureau frames: geospatial, business, job, and demographic (Ratcliffe, 2021a,b). These frames use information from the Master Address File/Topologically Integrated Geographic Encoding and Referencing system (MAF/TIGER, see Box 4-1), the Business Register, the inventory of jobs linked to businesses (underpinning the LEHD database), and administrative records databases used in conjunction with the 2020 Census. All the data sources can be used individually to produce statistics about society, but linking them will allow for insights that span the individual topics. As seen in Figure 4-1, further linkages are planned with other U.S. Census Bureau data resources, public records, and data acquired from private-sector sources.

Keller et al. (2022, p. 2) stated that linking multiple data sources at the U.S. Census Bureau “represent[s] a necessary evolution beyond the survey-only model that has reached scientific and practical limits in an era of increasing demand for more data, more often, and more urgently. It holds the promise of producing more timely, robust, and accurate findings and to more fully reflect the diversity of the nation’s racial and ethnic composition.”

Salvo (2022) commented that the Frames project will provide “the scaffold … for the capture and integration of massive amounts of information [leading to] a universal frame that could form the foundation for a transformative capability to integrate” data. He emphasized the importance of such data for local governments and mentioned, as one example, local planners’ needs for timely and granular data during the COVID-19 pandemic. To make this project valuable to planners at all levels of government,

**FIGURE 4-1** The U.S. Census Bureau’s *Frames* project.
SOURCE: Santos (2022, slide 11).

once the frames are integrated, data must be “curated” to make them consistent, accessible, and “actionable at a local level.” Salvo also recommended developing “use cases to frame discussions with researchers for research agenda development.”

One such use case concerns the challenges of improving researchers’ understanding of nursing home residents: “Nursing homes are businesses, nursing homes are places where people live, nursing homes have workers” but the “different dimensions of the nursing home picture are not integrated” (Salvo, 2022). Obtaining a comprehensive picture of elder care requires data from many sources, including census and survey data about demographics, income, and health from federal statistical agencies; administrative data from agencies such as the Centers for Medicare & Medicaid Services; information about nursing homes and their employees from sources such as the Business Register and the LEHD; data from state departments of public health and social services; and data from the private sector and nonprofit organizations such as the Kaiser Family Foundation.

Challenges in realizing the Frames vision include identifying data relevant to the particular problem to be addressed and the fitness for use of those data, as well as obtaining new, high-quality data. Harmonizing varying definitions of concepts and relevant geographies is also critical. Ratcliffe (2021b, slide 3) noted: “Frames exist in an uncoordinated and unintegrated environment” and “[n]o process exists that allows for the direct linkage of information contained in one frame with information in any other frame.”

Santos (2022) highlighted the importance of using a data-equity lens to improve policies and practices. The data-equity goals of the Frames project include improving coverage of underrepresented groups (capturing individuals who may be in one data source but not others) and increasing sample sizes for small population subgroups, thus enabling production of statistics about those subgroups.

All the individual data sources are incomplete, however, and their union may be incomplete as well. The LEHD, for example, contains only people working for employers in participating states, and may miss self-employed persons. Business files based on tax records will underrepresent new businesses and overrepresent failed businesses. Address files might not capture all new construction or housing abandonment, particularly in sparsely settled or unincorporated locations. Administrative records may be available for only some locations, some population subgroups, or some years. As discussed in Box 3-2, the decennial census differentially undercounts certain race and ethnicity groups. An ongoing evaluation program is important for assessing data-equity impacts of the Frames project.
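The point that a union of incomplete sources may itself be incomplete can be illustrated with a small sketch. The person identifiers, source names, and source contents below are invented for illustration and do not describe actual U.S. Census Bureau files; the sketch only shows how coverage of each source and of their union might be compared against a known target population.

```python
# Hypothetical toy population and sources; all identifiers are invented.
population = {"p1", "p2", "p3", "p4", "p5", "p6"}

sources = {
    "jobs_frame": {"p1", "p2", "p3"},         # e.g., may miss the self-employed
    "demographic_frame": {"p2", "p3", "p4"},
    "address_frame": {"p1", "p4", "p5"},
}

def coverage_rate(covered, target):
    """Share of the target population that appears in `covered`."""
    return len(covered & target) / len(target)

# Linking the sources can recover anyone who appears in at least one of them.
union = set().union(*sources.values())
# Persons absent from every source stay invisible after linkage.
missed_by_all = population - union

for name, members in sources.items():
    print(f"{name}: {coverage_rate(members, population):.0%}")
print(f"union of sources: {coverage_rate(union, population):.0%}")
print(f"missed by every source: {sorted(missed_by_all)}")
```

In practice the target population is not known exactly, which is one reason an ongoing evaluation program is needed to assess coverage.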

Discussing the potential use of administrative records in the decennial census, McClure, Santos, & Kooragayala (2017) noted:

The Census Bureau has researched the use of administrative records in enumeration for decades, yet the full implications of such a methodology are still unclear. How accurate is the methodology for different subpopulations? What assumptions about accuracy have been made? What are the costs, risks, and benefits of this approach? Understanding the proposed methodology and the substantive consequences of incorporating it in the census is as critical as understanding the benefits. This is especially true for subpopulations that may have their civil rights affected as a consequence of this new approach (p. viii).

McClure, Santos, & Kooragayala (2017, p. 12) also observed: “People who do not routinely interact with society’s public institutions are less likely to be represented in administrative records (i.e., they are more ‘off the grid’) … The limited information about these people that may still be found in these sources could be more likely to be incomplete or inaccurate (e.g., emergency room visits by undocumented immigrants or the homeless).”

One essential aspect of administrative data linkage projects is ensuring public trust, as was emphasized in the previous National Academies of Sciences, Engineering, and Medicine report in this series (NASEM, 2023). In a discussion of records-based alternatives to the decennial census, the National Research Council (1995, p. 62) noted that the “prospect of ongoing linkage of federal, state, and local government data would be opposed by many people.” Linkage of administrative sources requires acquiring and processing the data, but typically does not require obtaining consent from persons or businesses whose records are found in the data. The previous National Academies’ report emphasized that “transparency is critical to building the trust essential to engendering widespread support for a new data infrastructure” and “must be a stated requisite in the legal basis of a new data infrastructure, as well as part of that infrastructure’s data-governance framework” (NASEM, 2023, pp. 58–59). The Commission on Evidence-Based Policymaking (2017, p. 17) stressed that “[i]ndividual privacy and confidentiality must be respected in the generation and use of data and evidence” and “[t]hose engaged in generating and using data and evidence should operate transparently, providing meaningful channels for public input and comment and ensuring that evidence produced is made publicly available.”

The linkages in the Frames project can facilitate study of population groups missed in each source and can point the way to improving coverage and representation, although they cannot help with populations missed by all sources. It may be possible to use data in one frame to update information in another; for example, using information in the Business Register to update MAF listings (Ratcliffe, 2021a). But there are many challenges ahead for this work, including assessing coverage and the impact of linkage errors (see Conclusion 3-2), and it requires cooperative research across the U.S. Census Bureau.

Santos (2022) emphasized the importance of continuing to develop innovative methods of using and combining datasets and of encouraging cooperation among the divisions that house data. “Baking innovation into Census Bureau operations” will require new skills and adaptability, and Santos stressed the need for “human capital strategies, so that [the U.S. Census Bureau] can better recruit, develop, and retain a dynamic and diverse workforce.”

CONCLUSION 4-2: Linking administrative data and sampling frames can enable useful future data linkages for social science research and evidence-based policy analysis. However, combined data sources do not necessarily have either full population coverage for generating national statistics or sufficient sample sizes to investigate differences among population subgroups.

4.3 THE NATIONAL VITAL STATISTICS SYSTEM

Sections 4.1 and 4.2 described U.S. Census Bureau activities that link administrative records and census records. Another model for bypassing surveys involves acquiring and standardizing administrative records directly from state and local governments. This is the approach taken by the National Vital Statistics System (NVSS), which keeps track of all births and deaths in the United States.

The NVSS is the oldest national example of cooperative data sharing in the United States, dating back to 1880 (see Box 4-1). It is coordinated by the U.S. National Center for Health Statistics (NCHS) within the Centers for Disease Control and Prevention. Data are provided through contracts between NCHS and vital registration systems operated by the 50 states, two cities (Washington, DC and New York City), and five territories. The legal requirements for registering births and deaths rest with states, but states work together with NCHS to build a uniform system that provides national data (NCHS, 2021a).

Uniformity of data collection is promoted through use of standard certificates of death, fetal death, and live birth. These are revised periodically in cooperation with state vital statistics offices. Additionally, “model procedures for the uniform registration of the events are developed and recommended for nationwide use through cooperative activities of the jurisdictions and NCHS.”¹¹ These specify the duties of the state registrar, procedures for recording births and deaths, and regulations covering disclosure of information from vital records.

___________________

¹¹ Quoted from https://www.cdc.gov/nchs/nvss/about_nvss.htm, which also provides links to the standard forms, model procedures, and guidance for persons completing certificates.

Consequently, the same minimal set of information is collected in every state (some states collect additional information). Death records include information on the decedent’s residence, birthplace, surviving spouse, location of death, race, ethnicity, sex, educational attainment, marital status, and cause of death. The race and ethnicity categories accord with OMB standards discussed in Box 3-3 (OMB, 1997). The death certificate also contains an item asking for the decedent’s SSN, which facilitates linkage to other sources.

Several characteristics make the NVSS a model for cooperative data collections. First, it has extraordinarily high coverage of the population of births and deaths. Murphy et al. (2017, p. 3) stated that more than 99 percent of deaths are included in the system. This coverage was accomplished after a long collaborative effort: it took 53 years to get all states to contribute data (see Hetzel, 1997).

NCHS also has an ongoing program for quality improvement in data collection, processing, and dissemination. It conducts regular investigations of measurement error in demographic and cause-of-death information (e.g., see Section 3.5 and Hedegaard & Warner, 2021).

Since the NVSS is a census of all vital events, it is highly granular and can be used to produce statistics about small population subgroups. But, at present, the data are not timely: there is a lag between the vital events and the release of the final data file. The final mortality report for deaths in 2019 was published in July 2021 (Xu et al., 2021), although provisional data were available earlier.

A modernization program is under way “to transform the National Vital Statistics System into a tool for real-time public health surveillance.”¹² NCHS is working with states to improve the timeliness and quality of death data; the NCHS Modernization Tool Kit provides training materials, tools, and documentation to help jurisdictions establish and learn to use electronic death-reporting systems. These systems are expected to not only speed the production of data—one short-term goal is for NCHS to receive at least 80 percent of mortality records within 10 days of the event—but also promote more complete and more accurate information because data items can be validated as they are entered. The modernization is part of a larger effort within the CDC to collect more timely data and promote interoperability among data collections including vital records, electronic health records, and electronic laboratory reports (CDC, 2021a).
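The short-term timeliness goal amounts to a simple share calculation, sketched below. The record dates are invented and the function name is hypothetical; only the 10-day window and the 80 percent threshold come from the goal described above. Treating "within 10 days" as "10 days or fewer" is an assumption of this sketch.

```python
from datetime import date

def share_received_within(records, max_days=10):
    """Fraction of records received within `max_days` of the event date."""
    if not records:
        raise ValueError("no records supplied")
    on_time = sum(
        1 for event, received in records
        if (received - event).days <= max_days
    )
    return on_time / len(records)

# Invented (date of death, date received) pairs, for illustration only.
records = [
    (date(2022, 3, 1), date(2022, 3, 4)),
    (date(2022, 3, 1), date(2022, 3, 11)),
    (date(2022, 3, 2), date(2022, 3, 20)),
    (date(2022, 3, 5), date(2022, 3, 9)),
    (date(2022, 3, 6), date(2022, 4, 1)),
]

share = share_received_within(records)
meets_goal = share >= 0.80  # threshold taken from the stated short-term goal
print(f"{share:.0%} received within 10 days; goal met: {meets_goal}")
```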

___________________

¹² https://www.cdc.gov/nchs/nvss/modernization/goals-accomplishments.htm. See also NCHS (2021b).

CONCLUSION 4-3: The National Vital Statistics System can serve as a model for assembling state-administered data programs into coordinated, standardized national databases of administrative records that can be linked to other data sources.

4.4 LINKING DATA AT THE STATE OR REGIONAL LEVEL

The NVSS is perhaps the most complete and successful example of federal coordination of data that are submitted by states. Standardized certificates are used to ensure that submitted information is consistent across locations. For other data collections, for which there are no national standards or federal coordination, each state designs its own data collections to meet program administration needs. This lack of uniformity makes it challenging to use these data for national statistics or research that is national in scope. Because these data are collected for program administration, they exclude individuals who might have been eligible for the programs but did not enter the system. Negotiating data-sharing agreements is also a challenge (NASEM, 2023).

To provide insights on local issues, several state and regional collaboratives have formed to link state administrative data. These collaborations focus on data harmonization in subnational areas, which can lead to greater consistency across data collections. This section focuses on three examples: an integration of data about children and families in Illinois, current work in the State of Washington, and the multistate Coleridge Initiative.

Illinois Integrated Database of Child and Family Programs

Chapin Hall at the University of Chicago began building the Integrated Database of Child and Family Programs in the mid-1980s, to study the children’s services system in Illinois (Goerge, van Voorhis, & Lee, 1994; Kitzmiller, 2013). At the time, each agency serving children and families had separate datasets. The database integrates data from Illinois and Chicago agencies that administer the foster care system, investigate child abuse and neglect, and administer assistance and health insurance programs such as public housing, the Supplemental Nutrition Assistance Program, and Medicaid. Additional information is obtained from Chicago Public Schools, Chicago Police Department, the juvenile court system, birth certificates, and other sources. The linked database contains longitudinal information about the experiences of all families and children receiving child protective services since 1990.

This database has been widely used to research important child-welfare issues. Examples of such analyses are Goerge, Harden, & Lee (2008), on the consequences of teen childbearing for child abuse, neglect, and foster care placement; Goerge et al. (2009), who analyzed child care subsidy participation and employment outcomes among low-income families in Illinois, Maryland, and Texas using state administrative data linked with the 2001 ACS; Gennetian et al. (2016), on the association between timing and frequency of SNAP benefits and student outcomes in grades 5–8, as measured by school disciplinary records; and Herz et al. (2019), who studied youth who experience both the child welfare and juvenile justice systems. Most recently, a study of families who experienced services in multiple public systems highlighted issues faced by these families (Goerge & Wiegand, 2019).

Washington State Department of Social and Health Services

The Research and Data Analysis Division of the Washington State Department of Social and Health Services integrates data from dozens of administrative systems to support research and other analytic use cases. Data are integrated at the individual level into a repository referred to as the Integrated Client Data Repository (ICDR), which is designed to protect privacy and confidentiality. Additional agreements with state agency data suppliers define the governance processes in place to authorize analytic activities that use ICDR data. Examples of the types of data for Washington State residents contained within the ICDR, some dating back to the 1990s, include:¹³

- Medicaid and Medicare claims data spanning domains of physical health, mental health, substance use disorder, long-term care, and developmental disabilities;
- Child welfare system data;
- Food and cash assistance data;
- Vocational and supported employment services;
- Housing program and homelessness data;
- Vital records, including births and deaths;
- Employment and earnings data from the unemployment insurance system; and
- Criminal justice data spanning domains of arrest, jail booking, adjudication, incarceration, and community supervision.

___________________

¹³ https://aisp.upenn.edu/network-site/washington-state; Mancuso & Huber (2021). An extensive library of State of Washington health and human services publications is found at https://www.dshs.wa.gov/ffa/research-and-data-analysis. Its research projects are supported by ad hoc funding from state agency program partners, typically with a federal grant as the underlying source of the funding.

Most data sources are updated on at least a monthly basis. ICDR data are analyzed for a wide range of use cases, including:

- Quasi-experimental analysis of program and service impacts on client outcomes;
- Predictive modeling of populations at risk of adverse outcomes;
- Measurement of quality of services received according to defined standards of care;
- Analysis of disparities and differences in client experiences by race/ethnicity and other demographic characteristics;
- Clinical decision support for care management of high-risk Medicaid beneficiaries; and
- Ad hoc descriptive policy analysis.

Multistate Collaborations

The previous two examples combined data sources within a single state. The final example in this section describes a collaborative effort to establish a multistate data infrastructure. Many metropolitan areas straddle state boundaries, but data sources from those states may be in separate enclaves and in incompatible formats. Cunningham et al. (2022) noted that the Foundations for Evidence-Based Policymaking Act of 2018 calls for changes in the way federal data are accessed and used, and that similar changes are needed for state and regional data.

The Coleridge Initiative has organized collaborations that allow for regional data sharing and access.¹⁴ The Initiative does this by providing a secure cloud-based platform, the Administrative Data Research Facility, where confidential microdata can be accessed and linked. The provision of training programs to build the capacity of agency staff to work with the data is an important component of the initiative. Kreuter, Ghani, & Lane (2019) described a program that teaches government employees how to analyze confidential, individual-level data that originate from administrative datasets. The program includes modules on analytical design, database management, data visualization, record linkage, machine learning and text analysis, statistical inference, confidentiality, and data ethics.

Kuehn (2022b) identified the need for a multistate data infrastructure by focusing on the needs of Ohio, for which several metropolitan areas (in particular, Cincinnati, Toledo, and Youngstown) straddle state boundaries.

___________________

¹⁴ https://coleridgeinitiative.org; Cunningham et al. (2022); and Kuehn (2022a). As of June 2022, the Coleridge Initiative has worked with Arkansas, Connecticut, Illinois, Indiana, Kentucky, Maine, Michigan, Missouri, New Hampshire, New Jersey, Ohio, Rhode Island, Tennessee, Texas, and Vermont.

Kuehn (2022b, pp. 8–9) noted: “On the technical side, Ohio and its regional partners needed a secure, usable data platform that could flexibly host data from several states without threatening the states’ control of their own data…. Each [state] has different data governance practices that have resulted in different approaches to collaboration.” An example of the data produced is the Multi-State Postsecondary Dashboard, which examined, among other topics, the percentage of graduates from each major who are employed, and their locations (in-state or out-of-state) and earnings.¹⁵

Fischer et al. (2019, p. 677) outlined challenges for developing and maintaining an integrated data system. Foremost is gaining access to a service provider’s confidential records, which “requires the cultivation of trusting and mutually beneficial relationships.” Additionally, they noted:

Since all administrative data in an IDS [Integrated Data System] were originally collected for program purposes, not research, the attention to accuracy and reliability is not as high as would be expected for data collected in controlled research settings. As a secondary data source, the richness and quality of data in an IDS is dependent on the quality of underlying administrative records. Data quality standards are applied after the fact through examining aberrant patterns and addressing outliers, but adjustments are necessarily imperfect. Similarly, changes in technology used by data providers can result in changes to data already being supplied to an IDS. For example, data providers may have funders that have required them to change the type of information they collect or how they collect it. Ongoing communication with the data partner has been essential during these times of transition in order to guard against unintended data lapses or misinterpretation (Fischer et al., 2019, p. 679).

The Coleridge Initiative approach, in particular, deals with issues of harmonizing data across states in ways that could scale to larger projects.

State-level linkages have demonstrated the value of administrative data both for research and for state-level program monitoring and evaluation. While states have developed useful research and data privacy-protecting practices, cross-border population mobility and differing legal, technical, financial, and practical considerations across states make these initiatives difficult to scale to the national level. Multistate initiatives such as the Coleridge Initiative provide ideas for harmonizing data concepts and promoting data sharing, and these initiatives have potential for scaling to larger regions.

___________________

¹⁵ https://coleridgeinitiative.org/projects-and-research/multi-state-post-secondary-dashboard/. See also Cunningham (2021).

4.5 USING ADMINISTRATIVE RECORDS TO PRODUCE STATISTICS

As the examples in this chapter demonstrate, using administrative records is not simply a matter of grabbing a convenient dataset off the shelf (or from the cloud) and popping it into a statistical software package. The data user needs to understand the quality and properties of the administrative records and often must do substantial data cleaning and processing before combining administrative data with other data sources. The Federal Committee on Statistical Methodology noted:

Statistical agencies in many countries have extensive, well-established methods for identifying and reporting threats to quality in data collected and designed for statistical purposes, particularly sample surveys. Methods are less well-developed for dealing with threats to quality from sources other than surveys, such as administrative records and readings from sensors, and other data originally collected for nonstatistical purposes (2020, p. 1).

Using administrative data for statistical purposes requires an understanding of the processes used to create, collect, and process the data (Singh et al., 2020). Several frameworks have been proposed for assessing administrative data quality, including Daas et al. (2009); Iwig et al. (2013); Seeskin, Ugarte, & Datta (2019); Statistics Canada (2019); United Kingdom Statistics Authority (2019); and United Nations (2019). These include assessments of the components of quality described in Figure 1-1 and checklists for reporting on quality. Rothbard (2013) provided practical advice on preparing administrative records for analytical use.

Goerge & Lee (2002) discussed the importance of cleaning administrative data prior to linkage and analysis. They noted that administrative data often lack documentation about measurement and quality, and that intensive research is needed to understand the processes behind collecting, processing, and storing the data. Sometimes the original architects of administrative data have moved on to other projects and the institutional history has been lost. Culhane et al. (2010, p. 6) wrote that “many agencies are often too busy with business processes to assess their own data quality on a regular basis” and “an external hosting partner who reviews the data can provide an opportunity for data improvement.” Documentation on using administrative datasets for statistical purposes must be prepared by researchers if not supplied by the originating agency.

Boruch (2011) stressed the importance of evaluating and documenting sources of error in administrative records and gave a taxonomy of issues to consider. These include the meaning of key measures such as homelessness and disability, which may vary across programs or personal perspectives, or the distinction between urban and rural residence; the variation over time in definitions, such as changes in race categories; and the difficulty in collecting accurate information on topics such as income, for which the content of probing questions varies across data sources. Boruch (2011) also mentioned issues that might make linking more difficult, such as names versus nicknames, errors in reporting identification information such as SSNs, and coding errors that might occur during data entry.
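A minimal sketch of the kind of pre-linkage cleaning these issues call for is shown below. It is an illustration under stated assumptions, not a method described by Boruch (2011) or used by any agency: the nickname table is a tiny hypothetical lookup, the SSN check is a format-only screen (nine digits, not all zeros), and the matching rule is a simple deterministic one, whereas production linkage often relies on probabilistic methods.

```python
import re

# Tiny hypothetical nickname lookup; a real project would need a far larger table.
NICKNAMES = {"bill": "william", "liz": "elizabeth", "bob": "robert"}

def normalize_first_name(name):
    """Lowercase, drop non-letters, and map known nicknames to a formal name."""
    cleaned = re.sub(r"[^a-z]", "", name.lower())
    return NICKNAMES.get(cleaned, cleaned)

def clean_ssn(raw):
    """Return a 9-digit string, or None if the value fails a basic format check."""
    digits = re.sub(r"\D", "", raw or "")
    if len(digits) != 9 or digits == "000000000":
        return None
    return digits

def same_person(a, b):
    """Compare SSNs when both are usable; otherwise fall back to name and birth date."""
    ssn_a, ssn_b = clean_ssn(a.get("ssn")), clean_ssn(b.get("ssn"))
    if ssn_a and ssn_b:
        return ssn_a == ssn_b
    return (
        normalize_first_name(a["first"]) == normalize_first_name(b["first"])
        and a["last"].strip().lower() == b["last"].strip().lower()
        and a["dob"] == b["dob"]
    )

# Invented records: a nickname plus a truncated SSN on one side.
rec_a = {"first": "Bill", "last": "Jones", "dob": "1950-02-01", "ssn": "123-45-678"}
rec_b = {"first": "William", "last": "JONES ", "dob": "1950-02-01", "ssn": "123456789"}
print(same_person(rec_a, rec_b))
```

Because the truncated SSN fails the format check, the comparison falls back to the normalized name and birth date. A rule like this trades missed matches for false matches, and that balance would need to be evaluated for each pair of data sources.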

Section 4.2 addresses one aspect of data equity for data integration efforts: the potential of improving statistics for historically underrepresented population subgroups by obtaining data from multiple sources. Addressing other data-equity aspects, Santos (2022) emphasized the importance of engaging with data consumers as well as with persons and businesses who provide data through surveys or indirectly through administrative records, to better understand their needs and to increase trust and confidence. This raises important questions about equitable approaches for public data access, confidentiality of linked information, data ownership, and the effect of data-combination programs on public trust—trust from all parts of the population. These issues will be explored in future workshops in this series.

This chapter focuses on the value of databases constructed solely from administrative records. Chapters 5–8 discuss examples of integrating administrative records and other data sources with surveys to improve statistics about income, health, crime, and agriculture.

CONCLUSION 4-4: Administrative records are a valuable source of information for official statistics and social and economic research. Each administrative records dataset considered for use in creating national statistics needs to be understood in terms of both its original and its proposed uses. This includes assessing the dataset’s fitness for use, timeliness, continuing availability, population coverage, measurement of key concepts, and equity aspects.
