National Academies Press: OpenBook

Innovations in Federal Statistics: Combining Data Sources While Protecting Privacy (2017)

Chapter:4 Using Private-Sector Data for Federal Statistics

Suggested Citation:"4 Using Private-Sector Data for Federal Statistics." National Academies of Sciences, Engineering, and Medicine. 2017. Innovations in Federal Statistics: Combining Data Sources While Protecting Privacy. Washington, DC: The National Academies Press. doi: 10.17226/24652.

4

Using Private-Sector Data for Federal Statistics

Recent years have witnessed an explosion of data from many sources, some of which are referred to as “big data” (e.g., Daas et al., 2015; Manyika et al., 2011). The term refers to the vast amounts of data that are now available in electronic form and are potentially accessible to analysis, including data that previously existed but were not centrally accessible (such as sales data and medical records) and new kinds of data for phenomena that were not previously measured on a consistent basis but now can be, using new kinds of measurements (such as sensors of natural and artificial phenomena—weather and traffic). IBM has estimated that 2.5 exabytes (2.5 million terabytes) of data are produced every day.1 As a comparison, the U.S. Library of Congress has roughly estimated that its entire printed collection of 26 million volumes totals 208 terabytes.2 Some of these new data come from digital records of government agencies (e.g., the health care transaction records of the Centers for Medicare & Medicaid Services). But many of them come from private-sector enterprises (e.g., Manyika et al., 2011). Indeed, a whole set of new enterprises are using large digital data resources as the basis of their business models (e.g., Uber, AirBnB, LinkedIn).
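The scale comparison above can be checked with a few lines of arithmetic. This is only an illustrative sketch: it uses the figures quoted in the text and assumes decimal (SI) units, and the variable names are ours.

```python
# Figures quoted in the text; decimal (SI) unit conversion is an assumption.
TB_PER_EXABYTE = 1_000_000          # 1 exabyte = 1 million terabytes
MB_PER_TB = 1_000_000               # 1 terabyte = 1 million megabytes

daily_output_tb = 2.5 * TB_PER_EXABYTE   # IBM estimate of data produced per day
loc_collection_tb = 208                  # Library of Congress printed collection
loc_volumes = 26_000_000                 # number of printed volumes

# How many "printed Libraries of Congress" are produced each day
collections_per_day = daily_output_tb / loc_collection_tb

# Implied average size of one printed volume, in megabytes
mb_per_volume = loc_collection_tb * MB_PER_TB / loc_volumes

print(f"About {collections_per_day:,.0f} printed collections per day")
print(f"About {mb_per_volume:.0f} MB per volume")
```

Under these assumptions, one day of data production is on the order of twelve thousand times the size of the printed collection.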

___________________

1 See https://www-01.ibm.com/software/data/bigdata/what-is-big-data.html [November 2016].

2 See https://blogs.loc.gov/thesignal/2012/04/a-library-of-congress-worth-of-data-its-all-in-how-you-define-it [November 2016].

For this report’s purpose, we consider two kinds of private-sector data: private-sector structured data and private-sector data that have high dimensions, either in the number of observations or records or the number of attributes of the observations. Examples of high-dimensional data include streaming data production (e.g., utility meters, traffic cameras, and other sensors), Internet behavior documentation (e.g., browser search terms), and social media postings (e.g., data from Twitter, Facebook, LinkedIn). Examples of structured data include consumer information data, such as those from Zillow and Experian and other credit bureau data.

Some of these new data, whether from government or private-sector sources, could be used to create new statistics by themselves; others could be, and are being, used in conjunction with traditional statistical data. Some are stored in a form that permits useful statistical analysis immediately; others are stored in forms that would require significant processing prior to their statistical use.

In this chapter we first review the different kinds of private-sector data that are available and how the characteristics of these data affect their potential utility and usability for federal statistics. Next we briefly review efforts by national statistical offices around the world to examine and experiment with using these data sources to produce official statistics. We then review current work in the United States to examine and evaluate these new data sources for federal statistics. We conclude the chapter with a discussion of the challenges in using these data for federal statistics, including issues of access and quality.

DIMENSIONS OF NEW DATA SOURCES

We distinguish three dimensions of the new data resources: who owns or controls them (i.e., government agencies [federal, state, local] or private-sector entities), the purpose for which the data were created (e.g., to record transactions, to capture output from sensors, or to communicate with others through social media), and the form of the data as stored (i.e., structured numeric data, semi-structured data, unstructured text, pixel data). In this chapter we deal primarily with private-sector data. Table 4-1 details these categories of new data resources.

The data sources shown in Table 4-1 vary in their “readiness” for use in federal statistics in terms of the likely time and effort it would take to clean and format them in order to produce usable statistics. As the second column of Table 4-1 indicates, private firms use surveys to assess their customers’ satisfaction or conduct broader surveys of a target population for market research or to make estimates of media use (e.g., Nielsen). The weaknesses in the survey paradigm (see Chapter 2) have also become very evident to private survey firms, which have generally lower response rates than do government surveys. In fact, many firms have abandoned the probability survey paradigm for opt-in Internet panels (Baker et al., 2010).

TABLE 4-1 Types and Examples of Private-Sector Data Sources

Categories (the column headings that follow the row-label column “Definition and Examples”) and their definitions:
  • Structured Data from Censuses and Probability Surveys: Data from a population or a sample of that population used to estimate the population’s characteristics through the systematic use of statistical methodology
  • Structured Data from Administrative Records: Data collected by private companies from transactions, process control, or financial or human resource records
  • Other Structured Data: Data that are highly organized and can easily be placed in a database or spreadsheet, though they may still require substantial scrubbing and transformation for modeling and analysis
  • Semi-Structured Data: Data that have structure, but also permit flexibility in structure so that they cannot be placed in a relational database or spreadsheet; the scrubbing and transformation for modeling and analysis is usually more difficult than for structured data
  • Unstructured Data: Data, such as in text, images, and videos, that do not have any structure so that information of value must first be extracted and then placed in a structured table for further processing and analysis

Private-Sector Examples
  • Customer satisfaction surveys
  • Marketing research surveys
  • Media use surveys
  • Academic surveys
  • Data produced by businesses
    • Commercial transactions
    • Banking and stock records
    • Credit card records
    • Medical records
  • University and other nonprofit grant transactions
  • E-commerce transactions
  • Mobile phone location sensors
  • Global Positioning System sensors
  • Utility company sensors
  • Weather, pollution sensors
  • Extensible Markup Language (XML) files
  • Data from computer systems
    • Logs
    • Web logs
  • Mobile phone content: text messages
  • E-mail
  • Internet of things (a)
  • Sport activity sensors (from watches, etc.)
  • Social network data (Facebook, Twitter, Tumblr, etc.)
  • Internet blogs and comments
  • Documents
  • Pictures (Instagram, Flickr, Picasa, etc.)
  • Videos (YouTube, etc.)
  • Internet searches
  • Traffic webcams
  • Security/surveillance videos/images
  • Satellite images
  • Drones
  • Radar images

(a) The Internet of things refers to electronics embedded in devices and machines that allow them to be connected to the Internet to directly send and receive data.

In addition to government and private-sector data, surveys and censuses are also conducted by academic researchers. The data from these surveys are sometimes combined with administrative records to produce valuable information. For example, the Health and Retirement Study, conducted by the University of Michigan, obtains earnings records from the Social Security Administration and Medicare claims and summary information from the Centers for Medicare & Medicaid Services that are matched to respondents’ survey data to produce statistics and analysis about Americans’ physical and financial well-being.
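The matching of survey responses to administrative records described above can be illustrated with a minimal sketch. Everything here is hypothetical: the field names, the `respondent_id` key, and the sample values are invented for illustration, and real linkages of this kind rely on consent procedures and protected identifiers that are not modeled.

```python
def link_records(survey_rows, admin_rows, key="respondent_id"):
    """Attach administrative fields to survey rows that share the same key.

    Returns (linked_rows, unmatched_keys). Survey values take precedence
    if a field name appears in both sources.
    """
    admin_by_key = {row[key]: row for row in admin_rows}
    linked, unmatched = [], []
    for row in survey_rows:
        match = admin_by_key.get(row[key])
        if match is None:
            unmatched.append(row[key])   # no administrative record found
        else:
            linked.append({**match, **row})
    return linked, unmatched


# Hypothetical sample data
survey = [
    {"respondent_id": "r1", "self_rated_health": "good"},
    {"respondent_id": "r2", "self_rated_health": "fair"},
    {"respondent_id": "r3", "self_rated_health": "poor"},
]
admin = [
    {"respondent_id": "r1", "annual_earnings": 41000},
    {"respondent_id": "r2", "annual_earnings": 28500},
]

linked, unmatched = link_records(survey, admin)
print(linked)
print("Unmatched:", unmatched)
```

Tracking the unmatched keys matters because respondents without an administrative match may differ systematically from those who are linked.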

As shown in the third column of Table 4-1, private firms also generate their own administrative records, which may be similar in structure to government administrative records. In the private sector, administrative records are often transactions, such as credit card purchase records or payroll documents. Sometimes these private-sector administrative records are used to produce statistics on their own, such as the National Employment Report from Automatic Data Processing, Inc. (ADP), which precedes the Bureau of Labor Statistics (BLS) release of the employment situation each month.3

The other three categories for private-sector data sources, shown in the last three columns of Table 4-1, vary in the structure of the data and how difficult they are to clean and transform into usable numeric form to produce statistics. By structured data we mean numeric data, often ordered into rectangular or fixed relational formats. The best structured data for statistical use have metadata attached to them, which document the format and meaning of each variable. However, even with these attributes, structured data generally need to be transformed for analytic purposes. Structured data in the private sector often include highly detailed geospatial data, such as those from mobile phones, traffic sensors, and Global Positioning System (GPS) devices, and these data may be available in real time. Some similar sensor data, including traffic monitoring sensors, may also be created by government agencies (see Table 3-1 in Chapter 3).

Semi-structured data can be best described as data that can be turned into formatted numeric data by being coded and classified into numeric categories based on information available from the data themselves. Examples of semi-structured data include Extensible Markup Language (XML) files, e-mails, documents, mobile data content, and log data from computer systems.

___________________

3 ADP works in collaboration with Moody’s Analytics in using ADP’s large payroll dataset to predict private-sector employment prior to the BLS release. ADP processes the payrolls of about half a million private establishments in the United States, which employ nearly 20 percent of private-sector workers. Moody’s Analytics adjusts the ADP data to match those from BLS.

Unstructured data include digital videos, digitized pictures, and digital sound recordings, as well as digitized text. Some common forms of private-sector unstructured data include text data from social networks (Facebook, Twitter, etc.), pictures (Instagram, etc.), videos (YouTube, surveillance cameras, etc.), satellite images, traffic webcams, data from drones, etc. These data are often the most difficult data to scrub and transform for statistical purposes as they require complicated transformations based on the specific data source.

Overall, large amounts of high-dimensional data resources are held in the private sector by firms that are themselves information-based enterprises. This observation leads to issues of access for federal statistical purposes, which we address later in this chapter and further in Chapter 6. The table also makes clear that the new data resources arise not from the design of a statistician, but as part of other processes. Sometimes the processes generating the data produce information that may be relevant to official statistics, but this is not their primary purpose. Hence, although the data have been collected by these enterprises, they are not, for the most part, immediately usable for statistical purposes or analysis. For some, much processing work would have to be done to create structured numeric data that have statistical utility. Finally, because the data were not designed for a statistical purpose, they tend to be rather lean, that is, not consisting of a large number of attributes describing the measurement unit (e.g., a person or company). Instead, they measure only what is needed by the process producing them for the firm or other entity. Hence, there is a need to blend these new data resources with traditional survey data in new statistical analyses if they are to be used to improve any existing official statistics. Although blending data sources holds the potential to improve federal statistics, there is no guarantee that it will do so; thus, careful evaluation of data sources is necessary (see below).

USING PRIVATE-SECTOR DATA SOURCES FOR STATISTICS

The potential opportunities to use new data resources for building national statistics have been recognized by many countries of the world with the creation of the U.N. Working Group on Big Data4 in March 2014. The working group acknowledges that “using Big data for official statistics is an obligation for the statistical community based on the Fundamental Principles [of Official Statistics (see Box 2-1)] to meet the expectation of society for enhanced products and improved and more efficient ways of working” (U.N. Economic and Social Council, 2014, p. 1). The goal of the group is to find promising uses of such data for official statistics, specifically focusing on uses for GPS devices, automated teller machines, scanning devices, sensors, mobile phones, satellites, and social media. The working group has created principles for access to big data sources to ensure fair treatment of businesses supplying these data.

___________________

4 The full members of the working group are Australia, Bangladesh, Cameroon, China, Colombia, Denmark, Egypt, Indonesia, Italy, Mexico, Morocco, Netherlands, Oman, Pakistan, Philippines, Tanzania, the United Arab Emirates, the United States, the U.N. Economic Commission for Europe, the U.N. Economic and Social Commission for Asia and the Pacific, the U.N. Global Pulse, the International Telecommunications Union, the Organization for Economic Cooperation and Development (OECD), the World Bank, and the Statistical Centre for the Cooperation Council for the Arab Countries of the Gulf.

To assess how national statistical offices are seeking to use these new data sources, the U.N. Statistical Commission (UNSC) conducted a survey of 93 national statistical offices. The national statistical offices of countries similar to the United States5 were most interested in using big data for “faster, more timely statistics” (88%), “reducing response burden” (75%), and creating “new products and services” (72%). These new products and more timely statistics were more important than other factors for the use of big data, such as “modernization of the statistical production process” (69%) and cost reduction (63%) (U.N. Economic and Social Council, 2016). Although many countries are interested in various big data sources for official statistics, very few have yet been able to actually produce official statistics based on these sources.

Academic and private-sector organizations have created statistics based on web-scraped data from e-commerce sites. One example is the Billion Prices Project (see Box 4-1), which uses prices of products on the Internet to create a daily Consumer Price Index (CPI) for 22 countries (Cavallo and Rigobon, 2016).6
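To illustrate how scraped prices can be turned into an index, the sketch below computes a Jevons index, that is, the geometric mean of price relatives for products observed in both periods. This is a generic textbook formula chosen for illustration, not a description of the Billion Prices Project’s methodology, and the product names and prices are invented.

```python
import math


def jevons_index(base_prices, current_prices):
    """Price index (base = 100) from the geometric mean of price relatives.

    Only products present in both periods are used, which is one simple
    way to handle items that appear on or disappear from a website.
    """
    matched = [p for p in base_prices if p in current_prices]
    if not matched:
        raise ValueError("no products observed in both periods")
    log_sum = sum(math.log(current_prices[p] / base_prices[p]) for p in matched)
    return 100.0 * math.exp(log_sum / len(matched))


# Hypothetical scraped prices for two days
day_0 = {"milk": 2.00, "bread": 4.00, "eggs": 3.00}
day_1 = {"milk": 2.20, "bread": 4.00, "soap": 1.50}

index = jevons_index(day_0, day_1)
print(f"Index on day 1: {index:.2f}")
```

Here only milk and bread are matched across the two days, so eggs and soap do not affect the result. A production index would also need product weights and quality adjustment, which this sketch omits.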

___________________

5 The countries in this definition are those that are members of OECD: Australia, Austria, Belgium, Canada, Chile, Czech Republic, Denmark, Estonia, Finland, France, Germany, Greece, Hungary, Iceland, Ireland, Israel, Italy, Japan, Korea, Latvia, Luxembourg, Mexico, Netherlands, New Zealand, Norway, Poland, Portugal, Slovak Republic, Slovenia, Spain, Sweden, Switzerland, Turkey, the United Kingdom, and the United States.

6 The 22 countries are Argentina, Australia, Brazil, Canada, Chile, China, Colombia, France, Germany, Greece, Ireland, Italy, Japan, Korea, Netherlands, Russia, South Africa, Spain, Turkey, the United Kingdom, the United States, and Uruguay. See http://www.pricestats.com/inflation-series?chart=1836 [November 2016].

Statistics Netherlands has been able to use big data sources to create national statistics. It has drawn on data from road sensors for transportation and traffic statistics (Puts et al., 2016). Because a large number of sensors detect vehicles in about 20,000 highway loops, Statistics Netherlands is able to collect around 230 million records a day. The data are anonymous (the sensors do not record identifiable information, such as license plate numbers), but they do allow for estimates of what kind of vehicle was observed, based on the length of the vehicle traveling over the sensor, when vehicles entered and exited highways, and the time of day of the observation. After the data are received and transformed, they are cleaned and adjusted for possible errors (for example, if a sensor was not functioning properly), and estimates are created for the total number of vehicles on the highways. These estimates can be produced extremely quickly if needed. In early 2016, the Netherlands experienced glazed frost, and Statistics Netherlands was able to produce estimates of how the glazed frost had affected traffic within 2 days.7
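The general shape of such a pipeline (classify each detection by vehicle length, discard records from sensors known to be malfunctioning, then tally) can be sketched as follows. The length cut-offs, class names, and sensor identifiers are hypothetical and are not taken from Statistics Netherlands. The sketch also simply drops bad records, whereas the process described above adjusts for them.

```python
from collections import Counter

# Hypothetical length cut-offs in metres, for illustration only.
CAR_MAX_M = 5.6
MEDIUM_MAX_M = 12.2


def classify(length_m):
    """Assign a coarse vehicle class from the measured length."""
    if length_m < CAR_MAX_M:
        return "car"
    if length_m < MEDIUM_MAX_M:
        return "van_or_small_truck"
    return "large_truck"


def count_by_class(records, faulty_sensors):
    """Count vehicles per class, skipping faulty sensors and invalid lengths."""
    counts = Counter()
    for sensor_id, length_m in records:
        if sensor_id in faulty_sensors or length_m <= 0:
            continue  # cleaning step: ignore unusable detections
        counts[classify(length_m)] += 1
    return counts


# Hypothetical detections: (sensor id, vehicle length in metres)
records = [
    ("loop_A", 4.2),
    ("loop_A", 16.5),
    ("loop_B", 4.8),    # loop_B is flagged as malfunctioning below
    ("loop_C", 7.0),
    ("loop_C", -1.0),   # impossible length, dropped
]
counts = count_by_class(records, faulty_sensors={"loop_B"})
print(dict(counts))
```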

___________________

7 See http://nos.nl/artikel/2079372-helft-minder-verkeer-door-ijzel.html [November 2016].

Another example of using high-dimensional data for national statistics comes from a partnership with private-sector mobile phone service providers. Ahas and colleagues (2011) created estimates of tourism statistics in Estonia using GPS-based data from mobile phones. Private-sector data sources are also being actively evaluated to provide new indicators of sustainability, especially for developing countries (U.N. Global Pulse, 2016). In fact, economists have used luminosity from satellite images as an estimator of gross domestic product (GDP) in developing countries (Chen and Nordhaus, 2010). Luminosity measured on 1° × 1° grid cells could provide important information on such factors as economic output where population or economic statistics are lacking, particularly in war-torn countries. However, luminosity adds very little value for countries that have high-quality statistical systems (Chen and Nordhaus, 2010).

Other emerging uses of high-dimensional data combine them with more traditional statistics created by government statistical agencies. Marchetti and colleagues (2015) created estimates of poverty for small areas by blending mobile phone data with other data from the national statistical office in Italy. Statistics Canada (2016b) has begun to use satellite imagery data as an input for agricultural statistics, replacing a survey. Chessa (2016) used retail outlet scanner data to cover a part of the product prices needed for the CPI. The Colombia National Statistics Office (2016) reported blending satellite digital images to improve land use statistics and land coverage statistics. The U.N. Global Pulse (2014) explored using transformed Twitter data to provide real-time food pricing estimates. Daas and Puts (2014) blended social media sentiment data with traditional data sources to measure consumer confidence.

Many big data projects are still in the pilot phase. One example is the work of the Australian Bureau of Statistics (ABS) on using satellite surface reflectance data to classify crop type and estimate crop production. ABS is still trying to work out many important challenges, such as ensuring the reliability of the image data over time, aligning data to statistical boundaries, determining the proper level of granularity for the data, and identifying the most accurate statistical methods for estimating quantities of interest (Australian Bureau of Statistics, 2015).

In the United States, a number of federal statistical agencies have been exploring and researching private-sector data sources, such as credit card transactions, other information from commercial providers, and information from Internet sources. Some federal statistical agencies are blending private-sector high-dimensional data with traditional data sources. The Bureau of Justice Statistics (BJS) is currently running a pilot project that is web-scraping data from online articles in order to try to improve estimates for arrest-related deaths (see Box 4-2). BLS currently uses scanner data as part of the input for its CPI estimates (see Horrigan, 2013).

Other federal statistical agencies are using private-sector sources to augment information that could be obtained through surveys. The Economic Research Service (ERS) has purchased Nielsen and IRI scanner data, which are linked with individual household details, including demographic characteristics of residents, purchases, and prices. The information can be further linked with other geospatial and store characteristic data to get a more complete picture of the food environment for households.

CONCLUSION 4-1 Enormous amounts of private-sector data that are being generated every day have the potential to improve the timeliness and detail of national statistics.

RECOMMENDATION 4-1 Federal statistical agencies should systematically review their statistical portfolios and evaluate the potential benefits of using private-sector data sources.

CHALLENGES TO USING PRIVATE-SECTOR DATA SOURCES FOR FEDERAL STATISTICS

Given the many different data types shown in Table 4-1 (above) and the many different potential private sources for these data, there is similarly a wide range of challenges for agencies seeking to acquire and use those data for federal statistics. Although these data sources hold the potential to add value to official statistics, there are many methodological and logistical issues that would need to be addressed before that potential can be realized. In this section we discuss two of the major challenges, access and quality of the data, and we will explore them more fully in our second report.

Access

The approaches for accessing private-sector data resources are different from those for government-owned data. As noted in Chapter 3, U.S. federal statistical agencies typically develop a bilateral memorandum of understanding or an interagency agreement to codify the terms under which data can be shared between them. However, asking companies to share their data with federal agencies does not start from the same basic trust or common mission that exists among agencies. Although some companies publish their data and allow free access (e.g., Twitter), other companies sell data services and technology platforms. Companies may be reluctant to share their data for several reasons (Groves, 2013): (1) liability for possible data breaches if their data are linked with government records; (2) increased attention to confidentiality issues and to the data private firms have been collecting without much notice from the public; and (3) the possibility that other companies could use their data to create a profitable product, which they would be unable to capitalize on due to the collaborative agreement.

For companies that are willing to provide data, several approaches are possible. As noted above, ERS has purchased Nielsen and IRI scanner data for food policy research. And as described in Chapter 3, the Department of Housing and Urban Development purchased state and local county tax assessment data from CoreLogic, a private-sector firm that aggregates these data from local sources and sells them to interested parties. Statistical agencies can also form public-private partnerships with private firms to obtain access to their data. A public-private partnership is defined as a voluntary collaborative agreement between the public and private sectors. These partnerships are distinguished from other forms of public-private cooperation in that the partnership agreements contain defined roles, responsibilities, and rights and are typically characterized by long terms because of the need for longitudinal data (Robin et al., 2016). Data from private companies normally include information from data collection by either active (survey) or passive (web-scraping) methods; administrative and similar data used for billing customers and targeting services; and transactional data.

Public-private partnerships are typically implemented through long-term contracts. There are four main approaches for access to and use of the data in public-private partnerships: (1) the company providing the data analyzes the data internally and then shares the relevant statistics with the agency; (2) the company transfers the data to the agency for the agency to compute the statistics; (3) the data are transferred to a trusted third party for analysis; and (4) the statistical agency’s functions, including data collection and processing, are outsourced to the private firm.

An example of the first type of partnership was used in Mexico where Telefónica analyzed its call detail records in order to assess the effectiveness of public health alerts for the spread of infectious diseases (Frias-Martinez and Frias-Martinez, 2012). Telefónica compared call detail records in the area of a health alert to a hypothetical model where no alert was given for the same area. Thus, by looking at the difference in mobility between the hypothetical model without health alerts and actual mobility with the health alerts, Telefónica was able to gather information about the effectiveness in reduction of infectious diseases due to health alerts, which it subsequently shared with public agencies.

In the second approach listed above, transfer of datasets is a sharing agreement that involves the physical transfer of databases to the statistical agency under a strict protocol that clearly specifies the terms and conditions and includes each party’s responsibilities and penalties for not following the agreement. BLS is currently negotiating with some large companies to provide payroll and other internal company data from which BLS will extract relevant information, rather than asking the company to complete its surveys. Although one advantage of this type of agreement is that statistical offices can analyze the data themselves, it is important to note that many agencies may not have the internal capacity to work with private-sector data (Robin et al., 2016).

In the third approach listed above, the provider’s data are transferred to a third party, which analyzes them and delivers the results to the statistical agency. An example of this type of partnership is found in Estonia, where the national statistical office formed a public-private partnership to create travel statistics based on cell phone call detail records through the analytics company Positium and the central bank of Estonia, Eesti Pank. Positium has been working with mobile network operators for more than 10 years and has demonstrated that it is a trusted third party between the Estonian national statistical system and the telecom providers. Positium manages important concerns in using the detailed records, including preservation of business secrets, protection of users’ privacy, and compliance with privacy legislation. It also offers benefits to the Estonian statistical system, as it possesses the technical ability to safely handle data provided by the mobile network operators on its private servers (Robin et al., 2016).

The last approach listed above, outsourcing a statistical agency’s functions, can be described as a process in which activities conducted by statistical offices are outsourced to a contractor. This approach is usually adopted for efficiency. It can include traditionally collected data as well as nonofficial data sources that are freely available. This approach is quite common for U.S. federal statistical agencies: of the $7.4 billion spent annually on statistical activities across the federal government, approximately $1.5 billion was designated for private contractors in fiscal 2016 (U.S. Office of Management and Budget, 2015b). Often this work involves survey data collection, but it may also include such activities as frame development, sample design, analysis, and report preparation.

Public-private partnerships offer a number of potential benefits to statistical agencies in that they permit access to private data sources, but there are also important risks and challenges in using those sources. Most of the private data provided in some form to statistical offices from public-private partnerships contain important business data about a firm’s customers and strategy that could have negative effects for the data provider if accidentally released or breached. Privacy and ethical issues are also important to consider in public-private partnerships as data often contain personally identifiable information, which is information that can be used to distinguish or trace an individual’s identity, either alone or when combined with other information that is linked or linkable to a specific individual.8 In addition, a firm’s customers or clients can be extremely sensitive about other uses of their data. For example, mobile network operators may be concerned that some customers will change providers simply on the basis that mobile network operators are holding their call data records (Infas, 2010).

___________________

8 See OMB circular A-130, p. 33; available: https://obamawhitehouse.archives.gov/sites/default/files/omb/assets/OMB/circulars/a130/a130revised.pdf [February 2017].

Finally, incentives and sustainability for both parties need to be considered. Even if there are short-term benefits for both parties, long-run costs may become an issue as new methods of data collection become available that lead to issues of compatibility and completeness for longitudinal datasets. Moreover, statistical agencies may fear becoming dependent on an outside provider who may discontinue providing data at any time or could raise prices when it becomes clear an agency has no other source for the data.

From a company’s perspective, there are two primary access issues to consider: the privacy and confidentiality of the data and the profit objective, which come into play in different ways depending on the arrangement between the firm and the statistical agency. If a company has individual credit card data that could be used to assist in the construction of statistical measures, such as GDP or retail sales in the United States, the firm could work with the statistical agency in a couple of different ways with likely different implications. One possibility would be for the private firm and the statistical agency to jointly develop an index, which the company would sell to the statistical office. Privacy concerns would be minimized by providing aggregate statistics to the agency, but there could be implications for potential profits because such an index would also have value to others in the private sector. The statistical office would likely be unable to compensate the private firm sufficiently to keep it from also selling the index to other companies in the private sector.9

The second possibility is for a company to sell its raw credit card data to the statistical agency to analyze and combine with the agency’s other information. In this approach, the company and the statistical agency could then each develop their own separate indexes, and the company could sell its index to others without necessarily revealing the same information the statistical agency would publish. However, in this case, the firm would be very concerned about risks to the privacy of its clients and losing control of its data.10 We discuss issues of privacy and data security in detail in the next chapter, but the main point here is that a company’s privacy concerns and profit objective collide and make the form of engagement with a statistical agency complicated. There is likely no simple solution, but greater engagement between statistical offices and the private sector will be needed to try to meet the challenge.

___________________

9 There is also the potential issue of prerelease ownership and access. If early access to the statistics is potentially of value (e.g., to investors), then loss of control over release could be a disadvantage: that is, there could be a risk that the private partner could profit from sharing the statistics before their official release.

10 Using secure multiparty computing platforms, which we note in Chapter 5 and will discuss further in report 2, may address these concerns.

Data Quality

We began this chapter noting a wide range of domains in which alternative data sources have the potential to contribute to federal statistics, but these sources are not typically simple substitutes for federal surveys, and careful evaluations of quality are needed. Google Flu Trends was designed to predict influenza incidence reports from the Centers for Disease Control and Prevention (CDC), but it represents a cautionary tale in the use of private data sources for national statistics. Although it performed well initially, in early 2013 Google Flu Trends was predicting nearly twice as many doctor visits due to influenza-like illnesses as the number reported by the CDC (Lazer et al., 2014) (see Box 4-3). Other examples have shown how biased data lead to serious problems in prediction models (see Lum and Isaac, 2016).

High-dimensional data sources present a variety of other quality challenges for statistical uses, including coverage of the population and measurement issues. In terms of coverage of the population, there are often concerns about sample bias with these data sources, in part because such data often exist only for the “haves” and not the “have-nots.” In addition, social media data on Twitter, for example, are available only for those who choose to use the application (Couper, 2013). Measurement issues also arise with these data sources because, unlike a carefully designed and tested survey question, social media and some other data often are collected without a set stimulus. Similarly, it is difficult to determine how much of a social media post reflects someone’s “true” values and beliefs (Couper, 2013). Even seemingly objective and straightforward scanner data can be fraught with measurement issues (see Box 4-4).

There have been some discussions on how to possibly address these issues (see, e.g., Struijs and Daas, 2014). It may be possible to create weights to reduce coverage bias based on the information that users provide in their social media profiles, which can include useful information about age, gender, or social group. However, considerably more research is needed in this area.
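The weighting idea described above can be illustrated with a minimal post-stratification sketch. This is only an illustration under stated assumptions, not a method taken from the cited work: the function names, the age groups, the population shares, and the sample records are all hypothetical. Each record receives a weight equal to its group’s population share divided by its group’s share of the sample, so that weighted group shares match the population.

```python
from collections import Counter


def poststratification_weights(sample_groups, population_shares):
    """Return one weight per record: population share / sample share of its group.

    Both arguments are hypothetical inputs: `sample_groups` lists the group
    label taken from each user's profile, and `population_shares` maps each
    label to its known share of the target population.
    """
    counts = Counter(sample_groups)
    n = len(sample_groups)
    return [population_shares[g] / (counts[g] / n) for g in sample_groups]


def weighted_mean(values, weights):
    """Weighted average of `values` using the supplied weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Made-up example: younger users are over-represented in the sample.
groups = ["18-34"] * 6 + ["35+"] * 4
outcome = [1] * 6 + [0] * 4              # 1 = user reports some behavior
shares = {"18-34": 0.3, "35+": 0.7}      # assumed population shares

weights = poststratification_weights(groups, shares)
print(sum(outcome) / len(outcome))       # unweighted share of users with outcome 1
print(weighted_mean(outcome, weights))   # share after reweighting to the population
```

In this made-up sample the unweighted estimate is pulled toward the over-represented younger group, and reweighting moves it toward the assumed population mix. The adjustment works only for characteristics that appear in profiles and are reported accurately, and it cannot correct for groups that are absent from the data altogether, which is consistent with the need for further research.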

CONCLUSION 4-2 The data from private-sector sources vary in their fitness for use in national statistics. Systematic research is necessary to evaluate the quality, stability, and reliability of data from each of these alternative data sources currently held by private entities for their intended use.

We discuss fitness for use further in Chapter 6, and we will discuss quality frameworks evaluating fitness for use in our second report. Because of the many sources and potential challenges with private-sector data, as well as the limited resources of the federal statistical agencies, it is necessary that this research be conducted as efficiently and effectively as possible. We note in Chapter 2 that the Interagency Council on Statistical Policy assists OMB in coordinating the federal statistical system. Since this council is composed of the heads of the principal statistical agencies, it is the logical entity, along with OMB, to oversee the development and implementation of such a research agenda by the agencies in a collaborative and complementary manner.

RECOMMENDATION 4-2 The Federal Interagency Council on Statistical Policy should urge the study of private-sector data and evaluate both their potential to enhance the quality of statistical products and the risks of their use. Federal statistical agencies should provide annual public reports of these activities.

We provide some additional discussion of data quality issues for alternative data sources in Chapter 6, and the panel will address this issue more deeply in its second report. Although evaluation of specific data sources is best done at the program level, there is a need across the decentralized federal statistical system for greater leveraging of limited resources for research and development of new methods and assessing the quality of data from new sources. Sustainable access to these data sources is fundamental for federal statistical agencies to make progress in evaluating the quality and usefulness of these data sources for federal statistics. Hence, we end this chapter with key questions facing the future use of high-dimensional data for federal statistics:

  • Can the United States develop a sustainable mechanism and environment to permit federal statistical agency access to private-sector high-dimensional data for statistical purposes?
  • If such access is sustained, how can the quality of these data sources be evaluated for the benefit of all statistical uses of the data?
  • If such access is sustained, how can federal statistical agencies detect changes in the data created by the data holders, which may affect statistical estimates?
Federal government statistics provide critical information to the country and serve a key role in a democracy. For decades, sample surveys with instruments carefully designed for particular data needs have been one of the primary methods for collecting data for federal statistics. However, the costs of conducting such surveys have been increasing while response rates have been declining, and many surveys are not able to fulfill growing demands for more timely information and for more detailed information at state and local levels.

Innovations in Federal Statistics examines the opportunities and risks of using government administrative and private sector data sources to foster a paradigm shift in federal statistical programs that would combine diverse data sources in a secure manner to enhance federal statistics. This first publication of a two-part series discusses the challenges faced by the federal statistical system and the foundational elements needed for a new paradigm.
