Practice 3: Use of Multiple Data Sources for Statistics That Meet User Needs
FEDERAL STATISTICAL AGENCIES AND PROGRAMS cannot be static. To provide information of continued relevance for public and policy use, they must regularly engage with a broad range of users to identify emerging needs and look for ways to develop new information that can serve broad purposes. To improve the quality and timeliness of their products, they must keep abreast of methodological and technological advances and be prepared to implement new procedures in a timely manner (see Practice 9). They must also continually seek ways to make their operations more efficient and less burdensome (see Practice 10).
Preparing for the future requires that agencies periodically assess the justification, scope, and frequency of existing data series, plan new or modified data series as required, and be innovative and open to new ways to improve their programs. Because of the decentralized nature of the federal statistical system, innovation often requires cross-agency collaboration (see Practice 13) and a willingness to implement different kinds of data collection efforts to answer different needs.
Two changes in policy and outlook can help statistical agencies foster the needed spirit of innovation. The first is to focus on the desired outputs of their programs by defining their primary business as that of providing relevant, accurate, and timely statistics obtained in a cost-effective manner. An output-oriented focus should help agencies justify and implement difficult decisions to modify or replace data collection and estimation programs that have lost their relevance, timeliness, or accuracy.
The second is to adopt as a matter of stated policy a paradigm of using multiple data sources to generate needed information as an expansion of the long-dominant paradigm of using probability sample surveys. This new paradigm, which federal statistical agencies are already embracing, recognizes the continued importance of surveys, both cross-sectional and longitudinal. At the same time, it explicitly recognizes the roles of administrative records and other third-party sources, along with the use of new methods for combining data from multiple sources, as key elements of a cost-effective strategy to serve users’ needs.42
In considering new data collection, estimation, and dissemination strategies for the future, statistical agencies must be mindful of tradeoffs among relevance, accuracy, timeliness, comparability over time and with other data sources, transparency, costs, and respondent burden. Given constrained budgets, it will not usually be possible to optimize all seven criteria at the same time, but using multiple data sources will enable statistical agencies to strike better balances among them.
ROLES FOR SURVEYS
Many current statistical programs rely on well-established probability sampling methods that draw representative samples of a population, such as household members or business establishments, interview the sample units, and produce estimates that account for known errors in population coverage and missing data and have a quantifiable level of uncertainty from sampling variability. Box III.1 provides a brief history of probability sampling for federal statistics and lists examples of long-running federal surveys.
Declining rates of response over the past 30 years in the United States (and in other countries), however, are making it increasingly difficult to contain the costs of data collection with traditional surveys in ways that do not risk compromising the quality of the data (see, e.g., Brick and Williams, 2013; de Leeuw and de Heer, 2002).43 User demands for timeliness and granularity of estimates also strain the ability of statistical agencies to respond using established survey techniques.
Survey researchers are actively seeking ways to maintain and improve both the quality and the cost-effectiveness of surveys. For example, more surveys are using multiple modes to facilitate response (Internet, smartphone, telephone, mail, in person), as well as using paradata44 to improve survey operations and facilitate “responsive” or “adaptive” survey designs (see National Research Council, 2013a).
42 See Lohr and Raghunathan (2017) on methods for combining survey and nonsurvey data with examples of applications.
43 Lower response rates reduce the effective sample size and increase the sampling error of survey estimates; lower rates may also increase nonresponse bias in survey estimates.
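The effect described in footnote 43, in which smaller effective samples inflate sampling error, can be illustrated with a minimal sketch (all figures invented, not from the source):

```python
import math

def se_proportion(p, n):
    """Standard error of an estimated proportion p with effective sample size n."""
    return math.sqrt(p * (1 - p) / n)

# Hypothetical survey: 10,000 sampled cases, estimating a 50% characteristic.
p, sampled = 0.5, 10_000

for response_rate in (0.90, 0.70, 0.50):
    n_eff = int(sampled * response_rate)   # completed interviews only
    se = se_proportion(p, n_eff)
    print(f"response rate {response_rate:.0%}: n = {n_eff:5d}, SE = {se:.4f}")
```

As the response rate falls from 90 to 50 percent, the standard error grows by roughly a third even before any nonresponse bias is considered.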
Surveys should remain an important component of federal statistical agencies’ portfolios for two major reasons: (1) some information is not readily ascertained except by asking questions; and (2) surveys can collect information on many characteristics at the same time, thereby permitting rich multivariate analysis. Yet the challenges to the survey paradigm make it essential to consider how use of other data sources can bolster the completeness, quality, and utility of statistical estimates while containing costs and reducing respondent burden (see National Academies of Sciences, Engineering, and Medicine, 2016, 2017b).
ROLES FOR ADMINISTRATIVE RECORDS
Administrative records include records of federal, state, and local government agencies that are used to administer a government program. Examples include U.S. Social Security Administration records of payroll taxes collected from workers and benefits paid out to retirees and other beneficiaries; state agency records of information provided by applicants for assistance programs and payments to applicants deemed eligible; and property tax records of local governments.
Administrative records are not generated probabilistically, as survey data are, but they are not unlike household or business censuses and can be evaluated in similar ways. Administrative records are designed to capture information for all instances of a specified population (e.g., program beneficiaries) according to a set of rules, typically based in statute or regulation, and, like censuses, they may have omissions or duplications, and the variables in the records may differ in accuracy.45 The records may also be stored in difficult-to-use formats, poorly documented, or not provided on a timely basis. Acquiring the records requires negotiations with the custodial agency, and their contents may change when program rules change. Yet efforts to develop error profiles for administrative records (see Practice 9) and productive relationships with the custodial agency (see Practice 7) can have sizable payoffs, as is evident in several well-established statistical uses of records.
44 Paradata are data about the data collection process that are gathered in real time, such as the length of time to complete a survey.
For example, administrative records are used to generate up-to-date population estimates by age, race and ethnicity, and gender. In turn, these estimates are used to adjust population survey weights for coverage errors and for many other purposes.46 Tax records are used instead of questionnaires for the Census Bureau’s economic censuses and surveys of nonemployer businesses. Administrative records are also increasingly used with survey data to produce model-based estimates with improved accuracy for small geographic areas or population groups.47
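The small-area, model-based use just described typically blends a noisy direct survey estimate with a more stable prediction built from administrative data. Below is a minimal precision-weighted composite in the spirit of the Fay-Herriot approach; all figures are invented for illustration:

```python
def composite_estimate(direct, var_direct, synthetic, var_model):
    """Precision-weighted composite of a direct survey estimate and a
    model-based ('synthetic') estimate built from administrative data.
    The direct estimate gets more weight when its sampling variance is small."""
    w = var_model / (var_model + var_direct)   # weight on the direct estimate
    return w * direct + (1 - w) * synthetic

# Hypothetical county poverty rate: a noisy direct survey estimate (18%)
# shrunk toward a stable regression prediction from tax records (14%).
est = composite_estimate(direct=0.18, var_direct=0.0016,
                         synthetic=0.14, var_model=0.0004)
print(round(est, 4))   # composite lies between the two inputs
```

The composite falls between the direct and synthetic values, closer to whichever source is more precise.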
There are many other statistical uses that agencies should consider for administrative records, for which the investment in data collection has already been made. In some instances, records could improve the cost-effectiveness and data quality of current statistical programs (e.g., by substituting administrative records for survey questions). In other instances, they could add richness to the combined dataset (e.g., by appending administrative records variables to matched survey records).48 Box III.2 presents examples of innovative uses of administrative records for federal statistics. National Research Council (2009e) provides a comprehensive strategy for using administrative records to improve income information in the Census Bureau’s Survey of Income and Program Participation.
45 For example, payments to beneficiaries may be more accurate than information provided at the time of application regarding a beneficiary’s characteristics.
46 See, e.g., National Research Council (2004a, 2007b).
47 See, e.g., National Research Council (2000c,d), on the Census Bureau’s Small-Area Income and Poverty Estimates (SAIPE) Program and recommended improvements.
48 Extant matches include: (1) matches of Social Security earnings histories and Medicare benefits with the Health and Retirement Study and other surveys to analyze retirement decisions and the effect of medical care use on income security (see National Research Council, 1997a; National Research Council and Institute of Medicine, 2012); and (2) matches of employer and employee survey data with state employment security agency records in the Census Bureau’s Longitudinal Employer-Household Dynamics Program to analyze business and employment dynamics (see National Research Council, 2007a). Access to matched datasets must be restricted to protect confidentiality (see Practice 8).
Some uses may require not only state-of-the-art confidentiality protection techniques (see Practice 8), but also explicit legal authority (see “Toward the Paradigm of Multiple Data Sources,” below).49
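The appending of administrative variables to matched survey records described above can be sketched as a simple keyed lookup. All identifiers, values, and field names here are hypothetical; real linkage operates on protected identifiers under strict confidentiality controls:

```python
# Hypothetical linkage: append an administrative earnings variable to
# survey records via a shared (already protected) identifier.
survey = [
    {"pid": "a1", "age": 34, "reported_income": 41_000},
    {"pid": "b2", "age": 61, "reported_income": 25_000},
    {"pid": "c3", "age": 47, "reported_income": None},   # item nonresponse
]
admin = {"a1": 43_250, "c3": 58_900}   # administrative earnings records

linked = []
for rec in survey:
    out = dict(rec)                                   # keep survey fields
    out["admin_earnings"] = admin.get(rec["pid"])     # None if no match
    linked.append(out)

match_rate = sum(r["admin_earnings"] is not None for r in linked) / len(linked)
print(f"matched {match_rate:.0%} of survey records")
```

Note that the administrative value can substitute for a missing survey response (the third record) as well as enrich matched records for analysis.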
ROLES FOR NONTRADITIONAL DATA SOURCES
Statistical agencies are currently exploring the use of data sources, other than surveys and administrative records, that hold promise to improve the relevance, accuracy, and timeliness of federal statistics. These nontraditional data sources include, among others, data gleaned from relevant Internet websites (e.g., price quotes or social media postings), extracted from sensors (e.g., traffic cameras), and obtained from the private sector (e.g., scanner data on consumer purchases). Often, these sources generate large volumes of data that require data mining and other computationally intensive techniques for extracting information (see National Research Council, 2008a, esp. App. H).50
49 Statistics Canada, the national Canadian statistical agency, which has full access to administrative datasets, asks respondents to the Canadian Income Survey (and its predecessor) for permission to use income tax records in place of questions: see http://www23.statcan.gc.ca/imdb/p2SV.pl?Function=getSurvey&SDDS=5200 [April 2017].
Some agencies are already using nontraditional data sources. For example, the Economic Research Service (in the U.S. Department of Agriculture) obtains expenditure data scanned by households from store receipts from a private vendor and has evaluated the quality of the data (Muth et al., 2016). The National Center for Health Statistics (NCHS), in its surveys of hospitals and other health care providers, obtains data from questionnaires, abstracts of samples of patient records, and providers’ electronic medical care claim records.51 See Box III.3 for other examples of current uses of nontraditional data sources.
Most nontraditional data sources pose significant challenges for statistical agencies in evaluating the accuracy and error properties of the information. For example, harvesting website data to develop up-to-the-minute consumer price indexes52 may offer significant timeliness and cost savings compared with traditional methods, but it is not clear how to adjust these data for consumer expenditures that occur offline so that they accurately represent the universe of purchases. More generally, information taken from the Internet usually cannot be described or evaluated according to either a probability survey paradigm or a rules-based administrative records paradigm. For example, people who post items to sell on an auction website do not constitute any specified population. Another challenge is that statistical agencies have no control over the consistency of nontraditional data over time or across vendors and sites, so relying heavily on such sources carries a high risk of compromising key time series if a vendor or site ceases operation or if there are marked changes in data content or population coverage.
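As one concrete illustration of the price-index use of harvested website data, elementary indexes for web-scraped prices are commonly computed as a Jevons index, the geometric mean of price relatives for items matched across periods. A minimal sketch with invented price quotes:

```python
import math

# Hypothetical price quotes for matched items scraped on two dates.
base = {"milk": 3.49, "bread": 2.19, "eggs": 4.29}
current = {"milk": 3.65, "bread": 2.19, "eggs": 4.99}

def jevons(base_prices, current_prices):
    """Elementary price index: geometric mean of price relatives for
    items observed in both periods (base period = 100)."""
    items = base_prices.keys() & current_prices.keys()   # matched items only
    relatives = [current_prices[i] / base_prices[i] for i in items]
    return math.prod(relatives) ** (1 / len(relatives)) * 100

print(f"index = {jevons(base, current):.1f}")
```

The matched-items restriction is exactly where the coverage problem noted above bites: items sold only offline, or delisted from the scraped site, never enter the index.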
Yet in an era when data users expect timeliness and when budgets are constrained, statistical agencies should actively explore means by which nontraditional data sources can contribute to their programs. Such means could include: (1) augmenting information obtained from traditional sources; (2) replacing information elements previously obtained from traditional sources; (3) providing earlier estimates that are later benchmarked to traditional sources; and (4) analyzing information streams to identify needed changes (e.g., in types of jobs, education majors) in statistical classifications and survey questions.
50 Such data are often referred to as “big data,” which are characterized by high volume, velocity, and variety and require new tools and methods to capture, curate, manage, and process efficiently. See https://unstats.un.org/bigdata/ and https://unstats.un.org/unsd/statcom/doc15/BG-BigData.pdf [April 2017].
51 See, for example, “What Does Participation in the NHCS [National Hospital Care Survey] Entail,” and related frequently asked questions at http://www.cdc.gov/nchs/nhcs/faq.htm [April 2017].
52 This is currently being done by the Billion Prices Project at the Massachusetts Institute of Technology; see http://bpp.mit.edu [April 2017].
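Point (3) above, benchmarking earlier estimates to later traditional-source totals, is often implemented as simple pro-rata scaling. A minimal sketch with invented figures:

```python
# Hypothetical pro-rata benchmarking: preliminary monthly estimates from a
# nontraditional source are scaled so they sum to a later survey benchmark.
preliminary = {"Jan": 102.0, "Feb": 98.0, "Mar": 110.0}   # early estimates
benchmark_total = 315.0                                    # later survey total

factor = benchmark_total / sum(preliminary.values())       # single scale factor
revised = {month: value * factor for month, value in preliminary.items()}

print({month: round(value, 2) for month, value in revised.items()})
```

Pro-rata scaling preserves the month-to-month pattern of the timely source while forcing agreement with the more accurate benchmark; more elaborate benchmarking methods (e.g., Denton-type adjustments) smooth the revisions instead.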
Just as more and more surveys use multiple data collection modes, so a growing number of statistical programs will likely benefit from using multiple data sources, including nontraditional sources. To garner acceptance by policy makers and the public, statistical agencies should invest resources in documentation and user training and education. Agencies may need to “wall off” data series that are derived from nontraditional sources by labeling them as experimental or for research use until their statistical characteristics can be fully understood. If it is not possible to evaluate a nontraditional source sufficiently to establish its quality and suitability for inclusion in a statistical program, then a statistical agency should not use the data, although it may assist users by informing them of the problems with the source.
INTEGRATION AND SYNCHRONIZATION OF DATA ACROSS AGENCIES
Statistical agencies that collect similar information should pursue integration of their microdata records for specified statistical uses as another way to improve data quality, develop new kinds of information, and increase cost-effectiveness. One cost-effective approach is for a large survey to provide the sampling frame and additional content for a smaller, more specialized survey. Currently, the National Health Interview Survey of NCHS serves this function for the Medical Expenditure Panel Survey of the Agency for Healthcare Research and Quality. Similarly, the American Community Survey serves this function for the National Survey of College Graduates that the Census Bureau conducts for the National Center for Science and Engineering Statistics (see National Research Council, 2008b).
Another cost-effective approach is to synchronize or harmonize similar data held by different agencies. For example, both the Bureau of Labor Statistics (BLS) and the Census Bureau maintain business establishment lists. The lists derive from different sources (state employment security records for BLS and a variety of sources, including federal income tax records, for the Census Bureau). Research has demonstrated that synchronization of the lists would improve the accuracy of the information and the coverage of business establishments in the United States (National Research Council, 2006b, 2007a).
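List synchronization ultimately rests on record linkage across the two establishment lists. A deliberately crude sketch of key-based matching follows; the names and the key construction are invented, and production linkage uses far more sophisticated standardization and probabilistic matching:

```python
import re

def key(name, zip_code):
    """Crude blocking key: normalized establishment name plus ZIP code.
    Illustration only; real list synchronization is much more elaborate."""
    norm = re.sub(r"[^a-z0-9]", "", name.lower())        # drop punctuation/case
    norm = re.sub(r"(inc|llc|corp)$", "", norm)          # strip legal suffixes
    return f"{norm}|{zip_code}"

bls_list = [("Acme Widgets, Inc.", "20901"), ("Beta Foods LLC", "20901")]
census_list = [("ACME WIDGETS INC", "20901"), ("Gamma Tools", "22030")]

bls_keys = {key(n, z) for n, z in bls_list}
census_keys = {key(n, z) for n, z in census_list}

print("on both lists:", sorted(bls_keys & census_keys))
print("BLS only:     ", sorted(bls_keys - census_keys))
print("Census only:  ", sorted(census_keys - bls_keys))
```

Establishments appearing on only one list are candidates for coverage improvement on the other, which is the accuracy gain the cited research points to.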
A major step toward synchronization was taken in the Confidential Information Protection and Statistical Efficiency Act (CIPSEA) of 2002. The act authorized the synchronization of business data among the three principal statistical agencies that produce the nation’s key economic statistics—the Bureau of Economic Analysis (BEA), BLS, and the Census Bureau (see Appendix B). However, synchronization of business establishment lists between BLS and the Census Bureau cannot currently be done because the Census Bureau is prohibited by law (Title 26 of the U.S. Code) from sharing with BLS (or BEA) any tax information of businesses or individuals that it may acquire from the Internal Revenue Service (IRS), even for statistical purposes.53
TOWARD THE PARADIGM OF MULTIPLE DATA SOURCES
There have been several initiatives in the past decade to further the use of multiple data sources by statistical agencies. The Federal Committee on Statistical Methodology established a Subcommittee on Administrative Records in 2008, subsequently renamed the Subcommittee on Administrative, Alternative, and Blended Data, which to date has developed examples and protocols for accessing, using, and evaluating administrative records (Federal Committee on Statistical Methodology, 2009, 2013). OMB guidance issued in 2014 states that, as a matter of practice, federal administrative records should be considered for use in federal statistics (U.S. Office of Management and Budget, 2014a; see also Appendix A). The United States participates in the United Nations Global Working Group on Big Data for Official Statistics, established in 2014.54 The bipartisan 15-member commission created by the Evidence-Based Policymaking Commission Act of 2016 will report in September 2017 on such topics as how to integrate administrative and survey data and make them available for research and evaluation while protecting privacy and confidentiality; how data infrastructure, database security, and statistical protocols should be modified toward these ends; and whether a federal clearinghouse should be created for government survey and administrative data.55
53 Efforts have been under way since CIPSEA was enacted to permit business data synchronization involving IRS records, but, to date, it has not occurred.
In moving to expand their use of administrative records or other nontraditional data sources together with surveys, statistical agencies should consider at least six factors in assessing the benefits and costs:
- the need for upfront investment to facilitate the most effective approach to acquiring and using administrative records or another nontraditional data source, accompanied by estimates of the likely longer-term cost savings and/or reductions in respondent burden, and of the likely timeliness and/or data quality improvements;
- the protocols and criteria to follow to ensure full understanding by the statistical agency of the properties of a specific nonsurvey data source (e.g., its population coverage, frequency of updating): see the quality frameworks discussed in Practice 9;
- changes to established processing and estimation methods required to incorporate a nonsurvey source to maximize quality and timeliness of the resulting estimates: see Practice 10;
- the means by which the confidentiality of linked or augmented datasets can be protected while allowing access for research purposes: see National Research Council (2005b) and Practice 8;
- the requirements for expanded documentation, user outreach, and user education to assure acceptance of the resulting estimates and understanding of their strengths and limitations: see Practice 4; and
54 See https://unstats.un.org/bigdata/ [April 2017].
55 See www.cep.gov [April 2017].
- the risks to the availability, consistency, and quality of statistical series if a provider discontinues or changes a nontraditional data source in significant ways and how those risks can be mitigated.
In the United States, a challenge that must first be addressed is the legality and feasibility of acquiring data sources not already “owned” by a statistical agency. National Academies of Sciences, Engineering, and Medicine (2017b:1) concluded that “legal and administrative barriers limit the statistical use of administrative datasets by federal statistical agencies.” The problem is particularly acute for records held by states; incentives (such as some federal-state cooperative statistics programs provide) are likely needed for states to share their records for federal statistical purposes. For nontraditional data sources, such as scanner data, the problem is less one of access than that such data vary greatly in their fitness for use in official statistics.
National Academies of Sciences, Engineering, and Medicine (2017b) recommended stepped-up efforts led by the Interagency Council on Statistical Policy to coordinate research to evaluate different nontraditional data sources, develop appropriate estimation methods based on combining data, develop quality metrics for combined estimates, implement state-of-the-art data privacy protection techniques, and similar matters (see Practice 13). The report also recommended (p. 2) that “a new entity or an existing entity should be designated to facilitate secure access to data for statistical purposes to enhance the quality of federal statistics.” The report argues that the current system—whereby individual statistical agencies must negotiate separately with other federal agencies, state agencies, and the private sector—is burdensome on all parties and precludes the necessary economies of scale for resolving the common challenges to realizing the potential of multiple data sources for statistics that meet user needs.