Practice 3: Use of Multiple Data Sources for Statistics That Meet User Needs
FEDERAL STATISTICAL AGENCIES AND PROGRAMS cannot be static. To provide information of continued relevance for public and policy use, they must regularly engage with a broad range of users to identify emerging needs and look for ways to develop new information that can serve broad purposes. To improve the quality and timeliness of their products, they must keep abreast of methodological and technological advances and be prepared to implement new procedures in a timely manner (see Practice 9). They must also continually seek ways to make their operations more efficient and less burdensome (see Practice 10).
Preparing for the future requires that agencies periodically assess the justification, scope, and frequency of existing data series, plan new or modified data series as required, and be innovative and open to new ways to improve their programs. Because of the decentralized nature of the federal statistical system, innovation often requires cross-agency collaboration (see Practice 13) and a willingness to implement different kinds of data collection efforts to answer different needs.
Two changes in policy and outlook can help statistical agencies foster the needed spirit of innovation. The first is to focus on the desired outputs of their programs by defining their primary business as that of providing relevant, accurate, and timely statistics obtained in a cost-effective manner. An output-oriented focus should help agencies justify and implement difficult decisions to modify or replace data collection and estimation programs that have lost their relevance, timeliness, or accuracy.
The second is to adopt as a matter of stated policy a paradigm of using multiple data sources to generate needed information as an expansion of the long-dominant paradigm of using probability sample surveys. This new paradigm, which federal statistical agencies are already embracing, recognizes the continued importance of surveys, both cross-sectional and longitudinal. At the same time, it explicitly recognizes the roles of administrative records and other third-party sources, along with the use of new methods for combining data from multiple sources, as key elements of a cost-effective strategy to serve users’ needs.42
In considering new data collection, estimation, and dissemination strategies for the future, statistical agencies must be mindful of tradeoffs among relevance, accuracy, timeliness, comparability over time and with other data sources, transparency, costs, and respondent burden. Given constrained budgets, it will not usually be possible to optimize all seven criteria at the same time, but using multiple data sources will enable statistical agencies to strike better balances among them.
ROLES FOR SURVEYS
Many current statistical programs rely on well-established probability sampling methods that draw representative samples of a population, such as household members or business establishments, interview the sample units, and produce estimates that account for known errors in population coverage and missing data and have a quantifiable level of uncertainty from sampling variability. Box III.1 provides a brief history of probability sampling for federal statistics and lists examples of long-running federal surveys.
Declining rates of response over the past 30 years in the United States (and in other countries), however, are making it increasingly difficult to contain the costs of data collection with traditional surveys in ways that do not risk compromising the quality of the data (see, e.g., Brick and Williams, 2013; de Leeuw and de Heer, 2002).43 User demands for timeliness and granularity of estimates also strain the ability of statistical agencies to respond using established survey techniques.
Survey researchers are actively seeking ways to maintain and improve both the quality and the cost-effectiveness of surveys. For example, more surveys are using multiple modes to facilitate response (Internet, smartphone, telephone, mail, in person), as well as using paradata44 to improve survey operations and facilitate “responsive” or “adaptive” survey designs (see National Research Council, 2013a).
42 See Lohr and Raghunathan (2017) on methods for combining survey and nonsurvey data with examples of applications.
43 Lower response rates reduce the effective sample size and increase the sampling error of survey estimates; lower rates may also increase nonresponse bias in survey estimates.
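The effect described in footnote 43, in which smaller effective samples inflate sampling error, can be illustrated with a minimal sketch (all figures invented, not from the source):

```python
import math

def se_proportion(p, n):
    """Standard error of an estimated proportion p with effective sample size n."""
    return math.sqrt(p * (1 - p) / n)

# Hypothetical survey: 10,000 sampled cases, estimating a 50% characteristic.
p, sampled = 0.5, 10_000

for response_rate in (0.90, 0.70, 0.50):
    n_eff = int(sampled * response_rate)   # completed interviews only
    se = se_proportion(p, n_eff)
    print(f"response rate {response_rate:.0%}: n = {n_eff:5d}, SE = {se:.4f}")
```

As the response rate falls from 90 to 50 percent, the standard error grows by roughly a third even before any nonresponse bias is considered.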
Surveys should remain an important component of federal statistical agencies’ portfolios for two major reasons: (1) some information is not readily ascertained except by asking questions; and (2) surveys can collect information on many characteristics at the same time, thereby permitting rich multivariate analysis. Yet the challenges to the survey paradigm make it essential to consider how use of other data sources can bolster the completeness, quality, and utility of statistical estimates while containing costs and reducing respondent burden (see National Academies of Sciences, Engineering, and Medicine, 2016, 2017b).
ROLES FOR ADMINISTRATIVE RECORDS
Administrative records include records of federal, state, and local government agencies that are used to administer a government program. Examples include U.S. Social Security Administration records of payroll taxes collected from workers and benefits paid out to retirees and other beneficiaries; state agency records of information provided by applicants for assistance programs and payments to applicants deemed eligible; and property tax records of local governments.
Administrative records are not generated probabilistically, as survey data are, but they are not unlike household or business censuses and can be evaluated in similar ways. Administrative records are designed to capture information for all instances of a specified population (e.g., program beneficiaries) according to a set of rules, typically based in statute or regulation, and, like censuses, they may have omissions or duplications, and the variables in the records may differ in accuracy.45 The records may also be stored in difficult-to-use formats, poorly documented, or not provided on a timely basis. Acquiring the records requires negotiations with the custodial agency, and their contents may change when program rules change. Yet efforts to develop error profiles for administrative records (see Practice 9) and productive relationships with the custodial agency (see Practice 7) can have sizable payoffs, as is evident in several well-established statistical uses of records.
44 Paradata are data about the data collection process that are gathered in real time, such as the length of time to complete a survey.
For example, administrative records are used to generate up-to-date population estimates by age, race and ethnicity, and gender. In turn, these estimates are used to adjust population survey weights for coverage errors and for many other purposes.46 Tax records are used instead of questionnaires for the Census Bureau’s economic censuses and surveys of nonemployer businesses. Administrative records are also increasingly used with survey data to produce model-based estimates with improved accuracy for small geographic areas or population groups.47
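The small-area, model-based use just described typically blends a noisy direct survey estimate with a more stable prediction built from administrative data. Below is a minimal precision-weighted composite in the spirit of the Fay-Herriot approach; all figures are invented for illustration:

```python
def composite_estimate(direct, var_direct, synthetic, var_model):
    """Precision-weighted composite of a direct survey estimate and a
    model-based ('synthetic') estimate built from administrative data.
    The direct estimate gets more weight when its sampling variance is small."""
    w = var_model / (var_model + var_direct)   # weight on the direct estimate
    return w * direct + (1 - w) * synthetic

# Hypothetical county poverty rate: a noisy direct survey estimate (18%)
# shrunk toward a stable regression prediction from tax records (14%).
est = composite_estimate(direct=0.18, var_direct=0.0016,
                         synthetic=0.14, var_model=0.0004)
print(round(est, 4))   # composite lies between the two inputs
```

The composite falls between the direct and synthetic values, closer to whichever source is more precise.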
There are many other statistical uses that agencies should consider for administrative records, for which the investment in data collection has already been made. In some instances, records could improve the cost-effectiveness and data quality of current statistical programs (e.g., by substituting administrative records for survey questions). In other instances, they could add richness to the combined dataset (e.g., by appending administrative records variables to matched survey records).48 Box III.2 presents examples of innovative uses of administrative records for federal statistics. National Research Council (2009e) provides a comprehensive strategy for using administrative records to improve income information in the Census Bureau’s Survey of Income and Program Participation.
45 For example, payments to beneficiaries may be more accurate than information provided at the time of application regarding a beneficiary’s characteristics.
46 See, e.g., National Research Council (2004a, 2007b).
47 See, e.g., National Research Council (2000c,d), on the Census Bureau’s Small-Area Income and Poverty Estimates (SAIPE) Program and recommended improvements.
48 Extant matches include: (1) matches of Social Security earnings histories and Medicare benefits with the Health and Retirement Study and other surveys to analyze retirement decisions and the effect of medical care use on income security (see National Research Council, 1997a; National Research Council and Institute of Medicine, 2012); and (2) matches of employer and employee survey data with state employment security agency records in the Census Bureau’s Longitudinal Employer-Household Dynamics Program to analyze business and employment dynamics (see National Research Council, 2007a). Access to matched datasets must be restricted to protect confidentiality (see Practice 8).
Some uses may require not only state-of-the-art confidentiality protection techniques (see Practice 8), but also explicit legal authority (see “Toward the Paradigm of Multiple Data Sources,” below).49
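The appending of administrative variables to matched survey records described above can be sketched as a simple keyed lookup. All identifiers, values, and field names here are hypothetical; real linkage operates on protected identifiers under strict confidentiality controls:

```python
# Hypothetical linkage: append an administrative earnings variable to
# survey records via a shared (already protected) identifier.
survey = [
    {"pid": "a1", "age": 34, "reported_income": 41_000},
    {"pid": "b2", "age": 61, "reported_income": 25_000},
    {"pid": "c3", "age": 47, "reported_income": None},   # item nonresponse
]
admin = {"a1": 43_250, "c3": 58_900}   # administrative earnings records

linked = []
for rec in survey:
    out = dict(rec)                                   # keep survey fields
    out["admin_earnings"] = admin.get(rec["pid"])     # None if no match
    linked.append(out)

match_rate = sum(r["admin_earnings"] is not None for r in linked) / len(linked)
print(f"matched {match_rate:.0%} of survey records")
```

Note that the administrative value can substitute for a missing survey response (the third record) as well as enrich matched records for analysis.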
ROLES FOR NONTRADITIONAL DATA SOURCES
Statistical agencies are currently exploring the use of data sources, other than surveys and administrative records, that hold promise to improve the relevance, accuracy, and timeliness of federal statistics. These nontraditional data sources include, among others, data gleaned from relevant Internet websites (e.g., price quotes or social media postings), extracted from sensors (e.g., traffic cameras), and obtained from the private sector (e.g., scanner data on consumer purchases). Often, these sources generate large volumes of data that require data mining and other computationally intensive techniques for extracting information (see National Research Council, 2008a, esp. App. H).50
49 Statistics Canada, the national Canadian statistical agency, which has full access to administrative datasets, asks respondents to the Canadian Income Survey (and its predecessor) for permission to use income tax records in place of questions: see http://www23.statcan.gc.ca/imdb/p2SV.pl?Function=getSurvey&SDDS=5200 [April 2017].
Some agencies are already using nontraditional data sources. For example, the Economic Research Service (in the U.S. Department of Agriculture) obtains expenditure data scanned by households from store receipts from a private vendor and has evaluated the quality of the data (Muth et al., 2016). The National Center for Health Statistics (NCHS), in its surveys of hospitals and other health care providers, obtains data from questionnaires, abstracts of samples of patient records, and providers’ electronic medical care claim records.51 See Box III.3 for other examples of current uses of nontraditional data sources.
Most nontraditional data sources pose significant challenges for statistical agencies in evaluating the accuracy and error properties of the information. For example, harvesting website data to develop up-to-the-minute consumer price indexes52 may offer significant timeliness and cost savings compared with traditional methods, but it is not clear how to adjust these data for consumer expenditures that occur offline so that they accurately represent the universe of purchases. More generally, information taken from the Internet usually cannot be described or evaluated according to either a probability survey paradigm or a rules-based administrative records paradigm. For example, people who post items to sell on an auction website do not constitute any specified population. Another challenge is that statistical agencies have no control over the consistency of nontraditional data over time or across vendors and sites, so relying heavily on such sources carries a high risk of compromising key time series if a vendor or site ceases operation or if there are marked changes in data content or population coverage.
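As one concrete illustration of the price-index use of harvested website data, elementary indexes for web-scraped prices are commonly computed as a Jevons index, the geometric mean of price relatives for items matched across periods. A minimal sketch with invented price quotes:

```python
import math

# Hypothetical price quotes for matched items scraped on two dates.
base = {"milk": 3.49, "bread": 2.19, "eggs": 4.29}
current = {"milk": 3.65, "bread": 2.19, "eggs": 4.99}

def jevons(base_prices, current_prices):
    """Elementary price index: geometric mean of price relatives for
    items observed in both periods (base period = 100)."""
    items = base_prices.keys() & current_prices.keys()   # matched items only
    relatives = [current_prices[i] / base_prices[i] for i in items]
    return math.prod(relatives) ** (1 / len(relatives)) * 100

print(f"index = {jevons(base, current):.1f}")
```

The matched-items restriction is exactly where the coverage problem noted above bites: items sold only offline, or delisted from the scraped site, never enter the index.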
Yet in an era when data users expect timeliness and when budgets are constrained, statistical agencies should actively explore means by which nontraditional data sources can contribute to their programs. Such means could include: (1) augmenting information obtained from traditional sources; (2) replacing information elements previously obtained from traditional sources; (3) providing earlier estimates that are later benchmarked to traditional sources; and (4) analyzing information streams to identify needed changes (e.g., in types of jobs, education majors) in statistical classifications and survey questions.
50 Such data are often referred to as “big data,” which are characterized by high volume, velocity, and variety and require new tools and methods to capture, curate, manage, and process efficiently. See https://unstats.un.org/bigdata/ and https://unstats.un.org/unsd/statcom/doc15/BG-BigData.pdf [April 2017].
51 See, for example, “What Does Participation in the NHCS [National Hospital Care Survey] Entail,” and related frequently asked questions at http://www.cdc.gov/nchs/nhcs/faq.htm [April 2017].
52 This is currently being done by the Billion Prices Project at the Massachusetts Institute of Technology; see http://bpp.mit.edu [April 2017].
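Point (3) above, benchmarking earlier estimates to later traditional-source totals, is often implemented as simple pro-rata scaling. A minimal sketch with invented figures:

```python
# Hypothetical pro-rata benchmarking: preliminary monthly estimates from a
# nontraditional source are scaled so they sum to a later survey benchmark.
preliminary = {"Jan": 102.0, "Feb": 98.0, "Mar": 110.0}   # early estimates
benchmark_total = 315.0                                    # later survey total

factor = benchmark_total / sum(preliminary.values())       # single scale factor
revised = {month: value * factor for month, value in preliminary.items()}

print({month: round(value, 2) for month, value in revised.items()})
```

Pro-rata scaling preserves the month-to-month pattern of the timely source while forcing agreement with the more accurate benchmark; more elaborate benchmarking methods (e.g., Denton-type adjustments) smooth the revisions instead.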
Just as more and more surveys use multiple data collection modes, so a growing number of statistical programs will likely benefit from using multiple data sources, including nontraditional sources. To garner acceptance by policy makers and the public, statistical agencies should invest resources in documentation and user training and education. Agencies may need to “wall off” data series that are derived from nontraditional sources by labeling them as experimental or for research use until their statistical characteristics can be fully understood. If it is not possible to evaluate a nontraditional source sufficiently to establish its quality and suitability for inclusion in a statistical program, then a statistical agency should not use the data, although it may assist users by informing them of the problems with the source.
INTEGRATION AND SYNCHRONIZATION OF DATA ACROSS AGENCIES
Statistical agencies that collect similar information should pursue integration of their microdata records for specified statistical uses as another way to improve data quality, develop new kinds of information, and increase cost-effectiveness. One cost-effective approach is for a large survey to provide the sampling frame and additional content for a smaller, more specialized survey. Currently, the National Health Interview Survey of NCHS serves this function for the Medical Expenditure Panel Survey of the Agency for Healthcare Research and Quality. Similarly, the American Community Survey serves this function for the National Survey of College Graduates that the Census Bureau conducts for the National Center for Science and Engineering Statistics (see National Research Council, 2008b).
Another cost-effective approach is to synchronize or harmonize similar data held by different agencies. For example, both the Bureau of Labor Statistics (BLS) and the Census Bureau maintain business establishment lists. The lists derive from different sources (state employment security records for BLS and a variety of sources, including federal income tax records, for the Census Bureau). Research has demonstrated that synchronization of the lists would improve the accuracy of the information and the coverage of business establishments in the United States (National Research Council, 2006b, 2007a).
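List synchronization ultimately rests on record linkage across the two establishment lists. A deliberately crude sketch of key-based matching follows; the names and the key construction are invented, and production linkage uses far more sophisticated standardization and probabilistic matching:

```python
import re

def key(name, zip_code):
    """Crude blocking key: normalized establishment name plus ZIP code.
    Illustration only; real list synchronization is much more elaborate."""
    norm = re.sub(r"[^a-z0-9]", "", name.lower())        # drop punctuation/case
    norm = re.sub(r"(inc|llc|corp)$", "", norm)          # strip legal suffixes
    return f"{norm}|{zip_code}"

bls_list = [("Acme Widgets, Inc.", "20901"), ("Beta Foods LLC", "20901")]
census_list = [("ACME WIDGETS INC", "20901"), ("Gamma Tools", "22030")]

bls_keys = {key(n, z) for n, z in bls_list}
census_keys = {key(n, z) for n, z in census_list}

print("on both lists:", sorted(bls_keys & census_keys))
print("BLS only:     ", sorted(bls_keys - census_keys))
print("Census only:  ", sorted(census_keys - bls_keys))
```

Establishments appearing on only one list are candidates for coverage improvement on the other, which is the accuracy gain the cited research points to.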
A major step toward synchronization was taken in the Confidential Information Protection and Statistical Efficiency Act (CIPSEA) of 2002. The act authorized the synchronization of business data among the three principal statistical agencies that produce the nation’s key economic statistics—the Bureau of Economic Analysis (BEA), BLS, and the Census Bureau (see Appendix B). However, synchronization of business establishment lists between BLS and the Census Bureau cannot currently be done because the Census Bureau is prohibited by law (Title 26 of the U.S. Code) from sharing with BLS (or BEA) any tax information of businesses or individuals that it may acquire from the Internal Revenue Service (IRS), even for statistical purposes.53
TOWARD THE PARADIGM OF MULTIPLE DATA SOURCES
There have been several initiatives in the past decade to further the use of multiple data sources by statistical agencies. The Federal Committee on Statistical Methodology established a Subcommittee on Administrative Records in 2008, subsequently renamed the Subcommittee on Administrative, Alternative, and Blended Data, which to date has developed examples and protocols for accessing, using, and evaluating administrative records (Federal Committee on Statistical Methodology, 2009, 2013). OMB guidance issued in 2014 states that, as a matter of practice, federal administrative records should be considered for use in federal statistics (U.S. Office of Management and Budget, 2014a; see also Appendix A). The United States participates in the United Nations Global Working Group on Big Data for Official Statistics, established in 2014.54 The bipartisan 15-member commission created by the Evidence-Based Policymaking Commission Act of 2016 will report in September 2017 on such topics as how to integrate administrative and survey data and make them available for research and evaluation while protecting privacy and confidentiality; how data infrastructure, database security, and statistical protocols should be modified toward these ends; and whether a federal clearinghouse should be created for government survey and administrative data.55
53 Efforts have been under way since CIPSEA was enacted to permit business data synchronization involving IRS records, but, to date, it has not occurred.
In moving to expand their use of administrative records or other nontraditional data sources together with surveys, statistical agencies should consider at least six factors in assessing the benefits and costs:
- the need for upfront investment to facilitate the most effective approach to acquiring and using administrative records or another nontraditional data source, accompanied by estimates of the likely longer-term cost savings and/or reductions in respondent burden, and of the likely timeliness and/or data quality improvements;
- the protocols and criteria to follow to ensure full understanding by the statistical agency of the properties of a specific nonsurvey data source (e.g., its population coverage, frequency of updating): see the quality frameworks discussed in Practice 9;
- changes to established processing and estimation methods required to incorporate a nonsurvey source to maximize quality and timeliness of the resulting estimates: see Practice 10;
- the means by which the confidentiality of linked or augmented datasets can be protected while allowing access for research purposes: see National Research Council (2005b) and Practice 8;
- the requirements for expanded documentation, user outreach, and user education to assure acceptance of the resulting estimates and understanding of their strengths and limitations: see Practice 4; and
54 See https://unstats.un.org/bigdata/ [April 2017].
55 See www.cep.gov [April 2017].
- the risks to the availability, consistency, and quality of statistical series if a provider discontinues or changes a nontraditional data source in significant ways and how those risks can be mitigated.
In the United States, a challenge that must first be addressed is the legality and feasibility of acquiring data sources not already “owned” by a statistical agency. National Academies of Sciences, Engineering, and Medicine (2017b:1) concluded that “legal and administrative barriers limit the statistical use of administrative datasets by federal statistical agencies.” The problem is particularly acute for records held by states; incentives (such as some federal-state cooperative statistics programs provide) are likely needed for states to share their records for federal statistical purposes. For nontraditional data sources, such as scanner data, the problem is less one of access than that such data vary greatly in their fitness for use in official statistics.
National Academies of Sciences, Engineering, and Medicine (2017b) recommended stepped-up efforts led by the Interagency Council on Statistical Policy to coordinate research to evaluate different nontraditional data sources, develop appropriate estimation methods based on combining data, develop quality metrics for combined estimates, implement state-of-the-art data privacy protection techniques, and similar matters (see Practice 13). The report also recommended (p. 2) that “a new entity or an existing entity should be designated to facilitate secure access to data for statistical purposes to enhance the quality of federal statistics.” The report argues that the current system—whereby individual statistical agencies must negotiate separately with other federal agencies, state agencies, and the private sector—is burdensome on all parties and precludes the necessary economies of scale for resolving the common challenges to realizing the potential of multiple data sources for statistics that meet user needs.