
Currently Skimming:

2 Types of Data and Methods for Combining Them
Pages 25-54

The Chapter Skim interface presents what we've algorithmically identified as the most significant single chunk of text within every page in the chapter.


From page 25...
... , which discusses issues of acquiring data, data governance, and key desired attributes of a new data infrastructure. Many of the data sources considered contain records on individual persons or businesses, raising concerns about informed consent, privacy, and confidentiality.
From page 26...
... Table 2-1 summarizes features of these data types that are related to their fitness for use.
Probability Samples
The 1930s were a period of devastating poverty and unemployment in the United States.
From page 27...
... population may be the set of records, of population. have high coverage; underrepresented but administrative Transaction records traffic sensors may because of records often have from a credit card cover only specific undercoverage or undercoverage of the company contain highways; cell phone nonresponse (see population of interest.
From page 28...
... social media account information. Potential for High High High if population and High Low, unless information Combining subgroups are well can be verified from Statistics defined.
From page 29...
... The U.S. Census Bureau alone conducts more than 100 surveys of households and businesses each year.5 Despite their widespread use, however, probability surveys face challenges that diminish their usefulness as a sole source of information.
From page 30...
... . Undercoverage and low response rates are of concern because entities that cannot be selected for the sample or that fail to respond to a survey can differ systematically from entities that participate, causing survey estimates to be biased.
From page 31...
... ; CPS Monthly = Current Population Survey Monthly (Chapters 2, 5); CPS ASEC = Current Population Survey Annual Social and Economic Supplement (Chapter 5)
From page 32...
... Response rates for the Household Pulse Surveys conducted between April and July 2020 were less than five percent.8
... will appear in the final sample. See Mercer, Lau, & Kennedy (2018)
From page 33...
... Probability surveys have provided the nation with useful statistics on numerous topics for more than 80 years, and the panel anticipates that these surveys will continue to be used to produce statistics in many topic areas, particularly at the national level. Some statistics, such as the percentage of persons looking for work last week or the percentage of criminal victimizations reported to the police, rely on information that can only be provided by individuals in the population -- a probability survey is often
From page 34...
... Alternative data sources may be able to improve the accuracy, timeliness, and granularity of statistics while reducing costs.
CONCLUSION 2-1: Probability surveys still have an important role to play in the production of official statistics but face challenges from nonresponse and high costs.
From page 35...
... Bureau of Justice Statistics (2021b); FBI Crime Data Explorer, https://cde.ucr.cjis.gov
10 Evidence that many needy families do not participate comes from probability surveys (see Chapters 3, 5)
From page 36...
... . Some of these data sources are similar in structure to administrative records: examples include credit card transactions, electronic health records, grocery store scanner data, point-of-sale retail sales data, and stock market transactions.
From page 37...
... Department of Agriculture uses grocery store scanner data and other private-sector data sources to study food access, health, and security.12 The U.S. Bureau of Labor Statistics has been exploring the use of scanner and other private-sector data to supplement or replace information for the Consumer Price Index that is currently collected through surveys (Konny, Williams, & Friedman, 2022; NASEM, 2022d)
From page 38...
... . Wearable fitness trackers generate data about wearers' heart-rate readings, sleep time, menstrual cycles, step counts, locations, and more.
From page 39...
... Examples of convenience samples
14 One way to obtain fitness-device data that can be generalized to the population is to ask persons in a probability survey to wear the devices. If everyone agrees, then one has a probability sample of persons wearing fitness trackers.
From page 40...
... .15 Researchers who estimate population characteristics from a convenience sample typically use statistical methods akin to nonresponse-adjustment methods for probability surveys, and they make the strong assumption that the methods remove any bias resulting from the volunteer nature of the sample (see Wu, 2022, and the literature referenced therein)
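The weighting idea in the excerpt above can be sketched with post-stratification, one common nonresponse-style adjustment. This is a minimal illustration, not a method taken from the report: the function names, group labels, and population shares are all hypothetical, and the sketch inherits the strong assumption the text warns about, namely that volunteers resemble non-volunteers within each group.

```python
from collections import Counter

def poststratification_weights(sample_groups, population_shares):
    """Weight each respondent by (population share) / (sample share) of its group.

    sample_groups: group label for each respondent in the convenience sample.
    population_shares: hypothetical known share of each group in the population.
    """
    counts = Counter(sample_groups)
    n = len(sample_groups)
    return [population_shares[g] / (counts[g] / n) for g in sample_groups]

def weighted_mean(values, weights):
    """Weighted mean of the observed values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Illustrative data only: younger people are overrepresented among volunteers.
groups = ["young", "young", "young", "old"]
values = [1, 1, 1, 0]
shares = {"young": 0.5, "old": 0.5}  # assumed population shares

weights = poststratification_weights(groups, shares)
adjusted = weighted_mean(values, weights)
```

The adjustment corrects only for imbalance on the grouping variable. If volunteers differ from the rest of the population within a group, the weighted estimate remains biased.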
From page 41...
... for discussions of these and other issues with social media data. These data sources, like many other convenience samples, can also be manipulated by an outside actor, which would be of particular concern for data used to produce official statistics.
From page 42...
... . CONCLUSION 2-2: Numerous data sources, including probability samples, administrative records, and private-sector data, could be used to produce official statistics if they meet standards for quality.
From page 43...
... Record-linkage techniques can be used to:
• Add variables measured in other data sources to the variables measured in a primary data source. This was the goal of the 1973 CPS record-linkage project mentioned above, and linkage allows researchers to study relationships among variables measured on the same individuals in separate data sources.
From page 44...
... Pairs with scores between the two cutoff values may undergo further review before a determination is made. When a probabilistic linkage method is used, uncertainty about linkages can also be incorporated into standard errors of statistics.
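The two-cutoff rule described above can be sketched as follows. The scoring here uses log-likelihood-ratio agreement weights in the style of Fellegi-Sunter linkage, which is an assumption on our part; the excerpt does not say how scores are computed. The field names, the m- and u-probabilities, and both cutoff values are hypothetical.

```python
import math

# Hypothetical per-field probabilities (illustrative values only):
#   m = P(field agrees | the two records are a true match)
#   u = P(field agrees | the two records are not a match)
FIELD_PROBS = {
    "last_name": (0.95, 0.01),
    "birth_year": (0.90, 0.05),
    "zip_code": (0.85, 0.10),
}

def match_score(rec_a, rec_b):
    """Sum agreement weights (positive) and disagreement weights (negative)."""
    score = 0.0
    for field, (m, u) in FIELD_PROBS.items():
        if rec_a[field] == rec_b[field]:
            score += math.log2(m / u)
        else:
            score += math.log2((1 - m) / (1 - u))
    return score

def classify(score, lower=0.0, upper=8.0):
    """Apply two cutoffs (assumed values): link, non-link, or send to review."""
    if score >= upper:
        return "link"
    if score <= lower:
        return "non-link"
    return "review"

a = {"last_name": "Rivera", "birth_year": 1980, "zip_code": "20001"}
b = {"last_name": "Rivera", "birth_year": 1980, "zip_code": "20001"}
c = {"last_name": "Rivera", "birth_year": 1975, "zip_code": "30301"}
d = {"last_name": "Nguyen", "birth_year": 1975, "zip_code": "30301"}

decisions = [classify(match_score(a, other)) for other in (b, c, d)]
```

With these assumed values, a pair agreeing on every field is linked, a pair agreeing on none is rejected, and a pair agreeing only on last name falls between the cutoffs and goes to review.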
From page 45...
... • Create longitudinal datasets by linking records belonging to the same person over time, for example, merging high school records with information on college completion.
• Check the accuracy of information in a data source by comparing it to other sources.
From page 46...
... Combining statistics can improve population coverage when data sources include information on different subsets of the population. For example, income tax records have detailed information about filing units (often, but not necessarily, households)
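As a rough illustration of combining statistics across subsets, suppose two sources each supply a mean for a disjoint part of the population, along with that part's size. The function name and all numbers below are hypothetical, and the sketch assumes the subsets do not overlap and together cover the whole population.

```python
def combined_mean(parts):
    """Population mean from (subpopulation_size, subpopulation_mean) pairs.

    Assumes the subpopulations are disjoint and jointly cover the population.
    """
    total_size = sum(size for size, _ in parts)
    if total_size <= 0:
        raise ValueError("total population size must be positive")
    return sum(size * mean for size, mean in parts) / total_size

# Illustrative values only: one source covers 80 units, another the other 20.
overall = combined_mean([(80, 50.0), (20, 10.0)])
```

Neither source alone covers everyone, so each on its own would misstate the overall mean. Weighting by subpopulation size recovers it, provided the sizes and the no-overlap assumption hold.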
From page 47...
... If a survey participant is accurately linked with an administrative record, the merged data can be analyzed as if all variables came from
19 Measurement errors can affect multiple-frame probability surveys, too. If list frame respondents provide data over the internet, and area frame respondents are interviewed in person, the differing modes of data collection may affect responses.
From page 48...
... ; and Wu (2022) reviewed methods that use statistical models to combine information from multiple data sources.
From page 49...
... program uses statistical models to compute estimates for small areas (areas in which the survey sample size is too small to calculate a reliable estimate using the survey data alone)
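One common form of small area estimate is a composite that shrinks a noisy direct survey estimate toward a model-based prediction. The sketch below shows that general idea under stated assumptions; it is not the specific model used by the program named in the report, and the function name and all inputs are hypothetical.

```python
def composite_estimate(direct, sampling_var, model_pred, model_var):
    """Shrinkage estimate for one small area.

    direct:       survey estimate computed from the area's own sample
    sampling_var: variance of the direct estimate (large when the sample is small)
    model_pred:   prediction from a model using auxiliary data (assumed given)
    model_var:    variance of the model error (assumed known here)
    """
    if sampling_var < 0 or model_var < 0 or sampling_var + model_var == 0:
        raise ValueError("variances must be non-negative and not both zero")
    # Weight on the direct estimate grows as the survey becomes more precise.
    gamma = model_var / (model_var + sampling_var)
    return gamma * direct + (1 - gamma) * model_pred

# Illustrative values only: a small sample makes the direct estimate noisy.
estimate = composite_estimate(direct=20.0, sampling_var=3.0,
                              model_pred=12.0, model_var=1.0)
```

When the area's sample is small, the sampling variance is large and the estimate leans on the model. With a large sample, it stays close to the direct survey estimate.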
From page 50...
... Imputation can also combine information from multiple data sources. For example, Raghunathan et al.
From page 51...
... CONCLUSION 2-4: Statistical methods such as small area estimation, imputation, and combining statistics for subpopulations can integrate information from multiple data sources without requiring individual records to be linked.
2.3 OPPORTUNITIES AND CHALLENGES FOR COMBINING DATA FROM MULTIPLE SOURCES
As will be demonstrated in Chapters 4–8, multiple data sources can improve the timeliness, granularity, and usefulness of data for estimating statistics currently calculated from probability surveys.
From page 52...
... Geographic information may be inaccurate or unavailable in social media data or other data sources.
• Entity alignment.
From page 53...
... Many of the new data sources and methods for data combination require skill sets beyond those needed for conducting traditional probability surveys. These include expertise in record linkage, statistical modeling for combining data, machine learning, data quality assessment, computer science, information systems management, and remote sensing technology.


This material may be derived from roughly machine-read images, and so is provided only to facilitate research.