2 Statistical Methods for Combining Multiple Data Sources
Pages 15-44

The Chapter Skim interface presents what we've algorithmically identified as the most significant single chunk of text within every page in the chapter.


From page 15...
... As noted in Chapter 1, that paradigm would shift from sole reliance on probability surveys to a system that relies on probability surveys along with administrative and private-sector data, making use of the strengths of each data source. We begin by describing statistics that are currently produced or might be desired and summarizing some of the features of data sources that might be combined to produce those statistics.
From page 16...
... Probability surveys can be designed to measure the specific concepts of interest, but they are expensive, particularly those conducted through face-to-face interviews. As discussed in the panel's first report, both costs and nonresponse rates for probability surveys have increased in recent years.
From page 17...
... and this oversampling of smaller states allows the CPS to produce reliable state-level estimates, but it makes the design less efficient for producing national estimates because adults in large states are less likely to be included in the sample than adults in small states. The design of the National Crime Victimization Survey (NCVS)
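The efficiency cost of oversampling can be illustrated with Kish's approximate design effect for unequal weighting, n * sum(w^2) / (sum w)^2. The sketch below is only an illustration: the two-state population, the allocations, and all function names are hypothetical and are not CPS figures or CPS methodology.

```python
def kish_design_effect(weights):
    """Approximate design effect from unequal weighting: n * sum(w^2) / (sum w)^2."""
    n = len(weights)
    total = sum(weights)
    return n * sum(w * w for w in weights) / (total * total)


def stratum_weights(populations, allocation):
    """Base weights N_h / n_h, repeated once per sampled unit in stratum h."""
    weights = []
    for stratum, n_h in allocation.items():
        weights.extend([populations[stratum] / n_h] * n_h)
    return weights


# Hypothetical two-state population (assumed for illustration only).
POPULATIONS = {"large_state": 9_000_000, "small_state": 1_000_000}
EQUAL_ALLOCATION = {"large_state": 500, "small_state": 500}  # oversamples the small state
PROPORTIONAL_ALLOCATION = {"large_state": 900, "small_state": 100}

deff_equal = kish_design_effect(stratum_weights(POPULATIONS, EQUAL_ALLOCATION))
deff_prop = kish_design_effect(stratum_weights(POPULATIONS, PROPORTIONAL_ALLOCATION))
print(f"equal allocation design effect: {deff_equal:.2f}")
print(f"proportional allocation design effect: {deff_prop:.2f}")
```

Under these made-up numbers the equal allocation has a design effect of about 1.64 for a national mean, so its 1,000 interviews are worth roughly 610 interviews from a proportionally allocated sample, while each state still gets 500 interviews for its own estimate.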
From page 18...
... Administrative and private-sector data sources already exist and the cost to use them for statistical purposes may be lower than the cost to collect additional data from probability surveys. Nonsurvey data sources can also provide a fresh perspective on the redesign of federal surveys.
From page 19...
... Combining survey data with other data sources, or combining multiple administrative data sources, has many potential advantages over the survey paradigm.
From page 20...
... Census Bureau as main data collection organizations. MEPS provides a model for combining data sources, combining information across person, household, and provider level, and using information from parts of one component as a source of information for other components.
From page 21...
... Her arguments can be extended to nonfederal and non-administrative data sources as well. Most household surveys currently use methods 1 and 2, and some surveys use or are exploring methods 3 through 8 to make more efficient use of data from other sources.
From page 22...
... The National Center for Health Statistics was able to replace the National Nursing Home Survey and the National Home and Hospice Care Survey beginning in 2012 with administrative data from the Centers for Medicare & Medicaid Services. In yet other situations, it may be possible to combine information from administrative data sources with information from surveys.
From page 23...
... linked records from the tumor registry of Group Health Cooperative, a health insurance company, with records from the Washington State Cancer Registry. The record linkage enabled the researchers to identify and remove duplicated records from the concatenated databases, adding 35,166 new tumor cases from the registry to the Group Health Cooperative database.
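The idea of concatenating two databases and dropping duplicated records can be sketched as follows. This is a minimal illustration that uses exact matching on a normalized key; the field names (`name`, `dob`, `site`) are hypothetical, and an actual linkage of registry records would likely need probabilistic matching and clerical review rather than exact keys.

```python
def normalize_key(record):
    """Build a comparison key from hypothetical identifying fields."""
    return (
        record["name"].strip().lower(),
        record["dob"],
        record["site"].strip().lower(),
    )


def concatenate_without_duplicates(primary, secondary):
    """Append records from `secondary` whose key is not already in `primary`.

    Returns the combined list and the number of new records added.
    """
    seen = {normalize_key(r) for r in primary}
    combined = list(primary)
    added = 0
    for record in secondary:
        key = normalize_key(record)
        if key not in seen:
            seen.add(key)
            combined.append(record)
            added += 1
    return combined, added


# Made-up records for illustration.
insurer = [
    {"name": "Ann Lee", "dob": "1950-02-01", "site": "breast"},
    {"name": "Raj Patel", "dob": "1948-07-19", "site": "colon"},
]
registry = [
    {"name": " ann lee ", "dob": "1950-02-01", "site": "Breast"},  # duplicate of first record
    {"name": "Maria Soto", "dob": "1961-11-30", "site": "lung"},   # new case
]
combined, added = concatenate_without_duplicates(insurer, registry)
print(added, len(combined))
```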
From page 24...
... At the same time, record linkage methods come with concerns. Linked records have more information about individuals than the original data sources, which raises privacy concerns.
From page 25...
... in the linkage process (see, e.g., Schmidlin et al., 2015), record linkage may represent increased privacy risks to entities in the linked data sources.
From page 26...
... Personally identifying information used in linkage includes name, date of birth, Social Security Number and/or Medicare number, race, sex, state of birth, and state of residence. If an agency is able to find a survey participant in its own data files, information can be sent back to NHIS and linked with the original survey data.
From page 27...
... The feasibility files contain information about a survey participant's eligibility for linkage and whether a participant was successfully linked to an administrative data source, but do not contain any information about benefits or payments. Public-use linked mortality files containing a limited set of mortality variables for adult survey participants are also available for download from the NCHS Data Linkage website. Many researchers have used the linked data sources to investigate mortality and health care costs for NHIS respondents.
From page 28...
... at the Census Bureau has developed a probabilistic record linkage system in which a protected identification key (PIK) is created for each entity and the PIK is used to link records from different sources behind a secure firewall.
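Probabilistic record linkage is often formulated in the Fellegi-Sunter framework, in which each compared field adds a log-likelihood-ratio weight depending on whether it agrees. The sketch below assumes that framework for illustration and is not a description of the Census Bureau system; the fields, the m- and u-probabilities, and the thresholds are all made up.

```python
import math

# Hypothetical probabilities per field:
#   m = P(field agrees | records refer to the same entity)
#   u = P(field agrees | records refer to different entities)
M_U = {"name": (0.95, 0.01), "dob": (0.98, 0.003), "zip": (0.90, 0.05)}


def match_weight(rec_a, rec_b, m_u=M_U):
    """Sum of per-field agreement or disagreement weights (log base 2)."""
    total = 0.0
    for field, (m, u) in m_u.items():
        if rec_a[field] == rec_b[field]:
            total += math.log2(m / u)              # agreement weight, positive
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight, negative
    return total


def classify(weight, upper=8.0, lower=0.0):
    """Assumed thresholds: link, non-link, or send to clerical review."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "possible link"


a = {"name": "ann lee", "dob": "1950-02-01", "zip": "98101"}
same = {"name": "ann lee", "dob": "1950-02-01", "zip": "98101"}
typo_dob = {"name": "ann lee", "dob": "1950-02-10", "zip": "98101"}
other = {"name": "raj patel", "dob": "1948-07-19", "zip": "10001"}

for candidate in (same, typo_dob, other):
    w = match_weight(a, candidate)
    print(f"{w:6.2f} {classify(w)}")
```

In a design like the one described, only the resulting protected key, not the identifying fields themselves, would travel with the linked records.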
From page 29...
... 4 links survey responses from a probability sample of approximately 5,000 households with administrative data on SNAP participation and purchases, as well as information about the food items and prices that are accessible to the surveyed households. The linked information from SNAP is used to determine SNAP eligibility in the 30 days prior to the survey, resolve data discrepancies, and provide information on usage of the electronic benefit transfer card (U.S.
From page 30...
... Multiple Frame Methods
Record linkage usually requires that data for individual entities be available from the data sources, along with sufficient identifying information to allow records to be matched. For example, individual property tax records from county assessors are available on the Internet and can be linked with address-based records from survey data.
From page 31...
... Adjustments are made to the weights of households with both landline and cell phones so that they represent that part of the population in the combined samples. Multiple frame surveys are often used in situations in which the frames ...
6 A sampling frame is a list of population units from which the sample is drawn or a method for describing the population.
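The weight adjustment for households reachable through both frames can be sketched with a simple compositing factor: overlap households sampled from the landline frame have their weights multiplied by lambda, and those sampled from the cell frame by 1 - lambda, so the overlap domain is counted once. The field names, the value lambda = 0.5, and the sample records below are assumptions for illustration; surveys choose the compositing factor in various ways.

```python
def adjust_dual_frame_weights(sample, lam=0.5):
    """Scale weights of overlap units so the two samples together count them once.

    Each unit is a dict with hypothetical keys:
      'frame'  : 'landline' or 'cell' (the frame the unit was sampled from)
      'phones' : 'landline_only', 'cell_only', or 'both'
      'weight' : the single-frame sampling weight
    """
    adjusted = []
    for unit in sample:
        weight = unit["weight"]
        if unit["phones"] == "both":
            weight *= lam if unit["frame"] == "landline" else (1.0 - lam)
        adjusted.append({**unit, "weight": weight})
    return adjusted


# Made-up sample: each frame's overlap units estimate 300 dual-service households.
sample = [
    {"frame": "landline", "phones": "landline_only", "weight": 100.0},
    {"frame": "landline", "phones": "both", "weight": 150.0},
    {"frame": "landline", "phones": "both", "weight": 150.0},
    {"frame": "cell", "phones": "cell_only", "weight": 200.0},
    {"frame": "cell", "phones": "both", "weight": 300.0},
]
adjusted = adjust_dual_frame_weights(sample, lam=0.5)
overlap_total = sum(u["weight"] for u in adjusted if u["phones"] == "both")
population_total = sum(u["weight"] for u in adjusted)
print(overlap_total, population_total)
```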
From page 32...
... The first frame was the NASS list frame. The second frame, containing potential local food operations, was derived from web-based information and was used to measure coverage of the first frame. Multiple frame surveys can increase coverage of the population, and they have the potential to reduce costs if one or more of the frames is inexpensive to sample from.
From page 33...
... In this situation, the problem of combining information can be viewed as a missing data problem, and imputation methods can be used to fill in or impute the missing values in the combined dataset.

TABLE 2-1 Information from Three Sources, A, B, and C
Columns: Source, ID, Age, Sex, Medical Expenditures, Smoking Status
Records Linked from A and B: X X X X X
Records from A with No Linked Record from B: X X X
Records from B with No Linked Record from A: X X X X
Records from C: X X X
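Treating the combined file as a missing data problem can be sketched with a deliberately simple single imputation: fill a missing value with the mean of observed values from records in the same age-by-sex cell. The records, field names, and the choice of cell-mean imputation are assumptions for illustration; in practice, multiple imputation or model-based methods would usually be preferred so that imputation uncertainty is reflected in the estimates.

```python
from collections import defaultdict
from statistics import mean


def impute_cell_mean(records, target, cell_vars):
    """Fill missing `target` values with the mean of observed values in the same cell.

    Records with no observed donor in their cell are left missing.
    Imputed records are flagged so they can be distinguished from observed data.
    """
    donors = defaultdict(list)
    for r in records:
        if r.get(target) is not None:
            donors[tuple(r[v] for v in cell_vars)].append(r[target])

    completed = []
    for r in records:
        filled = dict(r)
        filled["imputed"] = False
        if filled.get(target) is None:
            cell_values = donors.get(tuple(r[v] for v in cell_vars))
            if cell_values:
                filled[target] = mean(cell_values)
                filled["imputed"] = True
        completed.append(filled)
    return completed


# Made-up combined file: some records lack medical expenditures.
combined = [
    {"id": 1, "age_group": "65+", "sex": "F", "expenditures": 4000.0},
    {"id": 2, "age_group": "65+", "sex": "F", "expenditures": 6000.0},
    {"id": 3, "age_group": "65+", "sex": "F", "expenditures": None},
    {"id": 4, "age_group": "18-64", "sex": "M", "expenditures": None},  # no donor in cell
]
completed = impute_cell_mean(combined, "expenditures", ("age_group", "sex"))
for row in completed:
    print(row["id"], row["expenditures"], row["imputed"])
```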
From page 34...
... Suppose that data source A contains demographic variables and information on health care expenditures, and data source B contains demographic variables and information on exercise habits for a different set of people. Statistical matching methods (Rodgers, 1984; Moriarity and Scheuren, 2001)
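A statistical match of this kind can be sketched as a nearest-neighbor donation on the shared demographic variables: each record in source A receives the exercise value of the source B record with the same sex and the closest age. The records and field names are made up, and this constrained nearest-neighbor rule is just one simple variant. Results from such a match rest on the assumption that expenditures and exercise are conditionally independent given the shared variables, which the data cannot check.

```python
def statistical_match(recipients, donors, exact=("sex",), distance_var="age",
                      donate=("exercise_hours",)):
    """Attach donor variables to each recipient from the closest donor.

    Donors must agree exactly on `exact` variables; ties on distance go to the
    first donor in list order. Recipients with no eligible donor are returned
    without the donated variables.
    """
    matched = []
    for r in recipients:
        pool = [d for d in donors if all(d[v] == r[v] for v in exact)]
        out = dict(r)
        if pool:
            best = min(pool, key=lambda d: abs(d[distance_var] - r[distance_var]))
            for v in donate:
                out[v] = best[v]
        matched.append(out)
    return matched


# Source A: demographics plus expenditures. Source B: demographics plus exercise.
source_a = [
    {"age": 34, "sex": "F", "expenditures": 1200.0},
    {"age": 61, "sex": "M", "expenditures": 5400.0},
]
source_b = [
    {"age": 30, "sex": "F", "exercise_hours": 4.0},
    {"age": 58, "sex": "F", "exercise_hours": 1.0},
    {"age": 65, "sex": "M", "exercise_hours": 2.5},
]
fused = statistical_match(source_a, source_b)
for row in fused:
    print(row)
```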
From page 35...
... Spatial and temporal components could be added by linking satellite imagery, environmental monitors, and weather and climate data. This dynamic linking of multiple surveys and administrative data sources could create spatiotemporal data representing the U.S.
From page 36...
... In addition, other statistical modeling methods can be used to combine aggregated statistics with each other or with individual record data when the data sources measure different variables. Small-area estimation methods are examples of statistical models that combine statistics estimated from a probability survey with statistics calculated from administrative data (see National Academies of Sciences, Engineering, and Medicine, 2017b, Box 3-3)
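One common form of small-area model, assumed here for illustration, is an area-level (Fay-Herriot-type) model in which the estimate for an area is a weighted average of the direct survey estimate and a synthetic prediction from administrative covariates, with weight gamma = model variance / (model variance + sampling variance). The numbers below are made up, and the sketch treats the synthetic prediction and the model variance as known; in practice both are estimated from the data.

```python
def composite_estimate(direct, sampling_var, synthetic, model_var):
    """Shrink a direct survey estimate toward a model-based synthetic prediction.

    gamma near 1: the survey estimate is precise, so it dominates.
    gamma near 0: the survey estimate is noisy, so the prediction dominates.
    Returns (estimate, gamma).
    """
    gamma = model_var / (model_var + sampling_var)
    return gamma * direct + (1.0 - gamma) * synthetic, gamma


# Hypothetical areas: (direct estimate, sampling variance, synthetic prediction).
areas = {
    "large_county": (10.0, 0.5, 6.0),   # large sample, small sampling variance
    "small_county": (10.0, 4.0, 6.0),   # small sample, large sampling variance
}
MODEL_VAR = 4.0  # assumed between-area model variance

results = {
    name: composite_estimate(direct, var, synthetic, MODEL_VAR)
    for name, (direct, var, synthetic) in areas.items()
}
for name, (estimate, gamma) in results.items():
    print(f"{name}: estimate={estimate:.2f}, gamma={gamma:.2f}")
```

The small county's estimate is pulled halfway toward the administrative-data prediction, while the large county's estimate stays close to its direct survey value.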
From page 37...
... There are numerous examples throughout the federal statistical system, as noted above, in such areas as crime and victimization rates, health status, and economic activity: multivariate hierarchical models can be used to combine data from multiple sources to create a systematic program of small-area estimation.
From page 38...
... However, when combining data from survey and nonsurvey sources, model assumptions may be needed for inference because the nonsurvey data sources lack a probabilistic selection structure for the units in the dataset (Elliott and Valliant, 2017)
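One simple example of such a model assumption is poststratification: the nonsurvey records are grouped into cells with known population counts, and the estimate assumes that, within each cell, the records behave like a random sample of that cell. The sketch below illustrates the arithmetic only; the cells, counts, and records are made up, and the within-cell assumption cannot be verified from the nonsurvey data alone.

```python
def poststratified_mean(sample, population_counts, cell_var, y_var):
    """Weight cell means by known population shares: sum_h (N_h / N) * ybar_h."""
    sums, counts = {}, {}
    for r in sample:
        cell = r[cell_var]
        sums[cell] = sums.get(cell, 0.0) + r[y_var]
        counts[cell] = counts.get(cell, 0) + 1

    empty = set(population_counts) - set(counts)
    if empty:
        raise ValueError(f"no sample records in cells: {sorted(empty)}")

    total = sum(population_counts.values())
    return sum(
        population_counts[cell] / total * sums[cell] / counts[cell]
        for cell in population_counts
    )


# Made-up nonprobability dataset that over-represents younger people.
records = [
    {"age_group": "young", "uses_app": 1},
    {"age_group": "young", "uses_app": 1},
    {"age_group": "young", "uses_app": 1},
    {"age_group": "young", "uses_app": 0},
    {"age_group": "old", "uses_app": 0},
]
POPULATION = {"young": 500, "old": 500}  # assumed known population counts

naive = sum(r["uses_app"] for r in records) / len(records)
adjusted = poststratified_mean(records, POPULATION, "age_group", "uses_app")
print(f"unadjusted mean: {naive:.3f}, poststratified mean: {adjusted:.3f}")
```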
From page 39...
... Many of those methods have been developed to augment data collected from the probability surveys that currently form the backbone of the federal statistical system. Some of the methods -- notably, record linkage -- can be applied to administrative and commercial data sources as well as to probability surveys.
From page 40...
... This can be done through record linkage; through multiple frame methods, in which it may be possible to identify the overlapping parts of the population; or through modeling and imputation methods, in which relationships among variables can be used both to study biases in different sources and to fill in missing values. More research is needed on other methods that can deal with missing data from multiple sources.
From page 41...
... Statistical methods in use for surveys typically produce static estimates: for example, the National Crime Victimization Survey produces estimates of victimization rates in each calendar year, and the CPS produces monthly unemployment statistics. Administrative data records and sensor data, however, may be updated much more frequently.
From page 42...
... RECOMMENDATION 2-3 Current statistical methods should be adapted to the extent possible and new methods should be developed to harness the statistical information from multiple data sources for analysis.
Structure Needed for Implementation
Though statistical methods for record linkage, multiple frame surveys, imputation, machine learning techniques, and hierarchical models for combining data are available, many of them need further research and adaptation.
From page 43...
... Such research will require the resources of multiple agencies, as well as cooperation with academia and businesses. The skills needed for the research include the traditional skills in probability survey design and analysis, but they also include knowledge of record linkage, machine learning, new statistical modeling techniques, and privacy expertise.
From page 44...
... RECOMMENDATION 2-5 Federal statistical agencies should develop partnerships with academia and external research organizations to develop methods needed for design and analysis using multiple data sources.


This material may be derived from roughly machine-read images, and so is provided only to facilitate research.