Currently Skimming:

3 Implications of Using Multiple Data Sources for Information Technology Infrastructure and Data Processing
Pages 45-60

The Chapter Skim interface presents what we have algorithmically identified as the most significant single chunk of text from each page of the chapter.


From page 45...
... We begin with a brief overview of the IT issues that federal statistical agencies face with their current systems. We then review the nature of the architecture that will be needed by statistical agencies and by the panel's recommended new entity.
From page 46...
... federal statistical system requires that every statistical agency have its own IT system, both because of tradition and because of the laws that authorize the agencies. In recent years, with greater efforts toward centralizing IT systems within departments and the passage of the Federal Information Technology Acquisition Reform Act (P.L.
From page 47...
... The Census Bureau has embarked on a new census enterprise data collection and processing system in conjunction with the reengineering of the 2020 census: the goal is to reduce the more than 100 systems it operates for data collection and processing to a single unified approach. The Census Bureau is similarly seeking to consolidate the 30 different applications it uses to disseminate information by creating a new Center for Enterprise Dissemination Services and Consumer Innovation that will centrally disseminate information through application programming interfaces (APIs) as well as interactive web tools, data visualizations, and mapping tools.
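As a concrete illustration of dissemination through APIs, the sketch below queries the Census Bureau's public Data API for a single published estimate. The dataset path and variable code are illustrative examples of the API's query format, not a prescription; current datasets and variables are listed at api.census.gov.

```python
# A minimal sketch of consuming statistics disseminated through an API,
# using the public Census Data API's query format. The dataset path and
# variable code below are illustrative; consult https://api.census.gov
# for current datasets.
import json
import urllib.request

URL = ("https://api.census.gov/data/2019/acs/acs1"
       "?get=NAME,B01001_001E&for=state:*")

with urllib.request.urlopen(URL) as resp:
    rows = json.load(resp)  # first row is the header: ["NAME", "B01001_001E", "state"]

header, data = rows[0], rows[1:]
for name, population, fips in data[:5]:
    print(f"{name} (FIPS {fips}): population {population}")
```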
From page 48...
... When data from multiple sources are aggregated into a single database, one popular paradigm in the computing industry follows the centralized model described above: the traditional "data warehouse." This was the expected structure of the National Data Center proposed many years ago (Kraus, 2013)
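To make the centralized model concrete, here is a minimal sketch of the data-warehouse idea: records from several upstream sources are standardized and loaded into one shared database that can be queried in a single place. The table layout and source names are hypothetical.

```python
# A minimal sketch of the centralized "data warehouse" model: records
# from several sources are standardized and loaded into one shared
# database that analysts query in one place. Fields are hypothetical.
import sqlite3

warehouse = sqlite3.connect("warehouse.db")
warehouse.execute("""
    CREATE TABLE IF NOT EXISTS person_income (
        person_id TEXT,
        year      INTEGER,
        income    REAL,
        source    TEXT   -- which upstream system supplied the row
    )
""")

def load(source_name, records):
    """Extract-transform-load step for one upstream source."""
    warehouse.executemany(
        "INSERT INTO person_income VALUES (?, ?, ?, ?)",
        [(r["id"], r["year"], float(r["income"]), source_name) for r in records],
    )
    warehouse.commit()

load("survey",    [{"id": "p1", "year": 2023, "income": 41000}])
load("tax_admin", [{"id": "p1", "year": 2023, "income": 39500}])

# All sources are now queryable together.
for row in warehouse.execute(
        "SELECT source, AVG(income) FROM person_income GROUP BY source"):
    print(row)
```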
From page 49...
... As in so many aspects of business, it sometimes makes sense to outsource this responsibility to a service provider with particular expertise in the task. For data processing, such outsourcing has been made especially easy through the "cloud." The basic idea is that data centers are owned and managed by service providers, which often share the same data center facilities across multiple enterprises.
From page 50...
... The panel noted in its first report that data breaches and identity theft pose risks to the public and that a continuing challenge for federal statistical agencies is to produce data products that safeguard privacy. Even if strong access and data release practices are designed to satisfy privacy requirements, it is difficult to guarantee against a data breach.
From page 51...
... Federal statistical agencies will need to consider the governance, functionality, and flexibility of the system, as well as the implications for protecting privacy and addressing data providers' concerns regarding privacy.

DATA PROCESSING ISSUES

Moving to a paradigm of integrating multiple data sources for federal statistics will necessitate a greater focus on data curation by federal statistical agencies, which requires the "processes and activities needed for principled and controlled data creation, maintenance, and management, together with the capacity to add value to data" (Miller, 2014, p.
From page 52...
... In many uses of multiple data sources for the federal statistical system, the panel assumes that updates of the data sources will be used to update statistics.
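The sketch below illustrates, under hypothetical assumptions, what using source updates to update statistics can look like in the simplest case: a published mean is adjusted in place when a source reissues a record, rather than being recomputed from scratch.

```python
# A minimal sketch of updating a statistic incrementally when its data
# source is updated. The running-mean bookkeeping is standard; the
# scenario is hypothetical.
class RunningMean:
    def __init__(self):
        self.n = 0
        self.total = 0.0

    def add(self, value):
        self.n += 1
        self.total += value

    def revise(self, old_value, new_value):
        # The source reissued a record: adjust the total in place.
        self.total += new_value - old_value

    @property
    def mean(self):
        return self.total / self.n if self.n else float("nan")

stat = RunningMean()
for v in [100.0, 200.0, 300.0]:   # initial delivery from the source
    stat.add(v)
print(stat.mean)                  # 200.0

stat.revise(300.0, 330.0)         # the source later revises one record
print(stat.mean)                  # 210.0
```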
From page 53...
... Federal statistical agencies are well acquainted with data cleaning and transformation in the context of survey data. A major difference between what they have been doing and what would be required in the envisaged new system is that they currently often build in data cleaning checks at the acquisition stage so that they can collect more accurate information from the household or business respondent directly.
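As an illustration of an acquisition-stage check, the sketch below validates a response while the respondent is still available to correct it. The edit rules and field names are hypothetical.

```python
# A minimal sketch of an acquisition-stage edit check: the instrument
# validates a response while the respondent can still be asked to
# correct it. Rules and field names are hypothetical.
EDIT_RULES = {
    "age":          lambda v: 0 <= v <= 120,
    "hours_worked": lambda v: 0 <= v <= 168,   # hours in a week
}

def check_response(record):
    """Return the fields that fail their edit rule, so the interviewer
    or web instrument can prompt for correction on the spot."""
    return [field for field, rule in EDIT_RULES.items()
            if field in record and not rule(record[field])]

print(check_response({"age": 34, "hours_worked": 40}))    # []
print(check_response({"age": 34, "hours_worked": 400}))   # ['hours_worked']
```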
From page 54...
... Current surveys often collect detailed descriptions of jobs and industries, which coders then review and classify into North American Industry Classification System (NAICS) or Standard Occupational Classification (SOC) codes. Federal statistical agencies have been developing and using sophisticated tools to streamline these kinds of coding tasks and will need to develop and apply similar tools with new data sources.
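The toy sketch below illustrates the general shape of such an autocoding tool: a text classifier that maps free-text job descriptions to occupation codes. The training examples are fabricated, and production systems use far richer models and training data.

```python
# A toy sketch of autocoding: a text classifier mapping free-text job
# descriptions to SOC-style occupation codes. Training data here are
# fabricated for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

descriptions = [
    "registered nurse in a hospital icu",
    "home health nurse visiting patients",
    "software developer writing web applications",
    "programmer building mobile apps",
]
codes = ["29-1141", "29-1141", "15-1252", "15-1252"]  # SOC-style labels

coder = make_pipeline(TfidfVectorizer(), LogisticRegression())
coder.fit(descriptions, codes)

print(coder.predict(["pediatric nurse at a clinic"]))  # likely ['29-1141']
```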
From page 55...
... It is not enough, for example, to record a manual review and error correction step at the level of a dataset: information needs to be recorded on the individual manual corrections applied. The panel recognizes that federal statistical agencies currently have thorough documentation and good metadata, often including paradata, for their surveys.
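A minimal sketch of correction-level provenance follows, with hypothetical field names: each manual edit is logged with who changed what, when, and why, rather than a single dataset-level note that a review occurred.

```python
# A minimal sketch of provenance at the level of individual manual
# corrections. Field names and the example record are hypothetical.
import datetime

correction_log = []

def apply_correction(record, field, new_value, editor, reason):
    correction_log.append({
        "record_id": record["id"],
        "field":     field,
        "old_value": record[field],
        "new_value": new_value,
        "editor":    editor,
        "reason":    reason,
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    })
    record[field] = new_value

row = {"id": "r-1017", "industry_code": "5415", "income": -42000}
apply_correction(row, "income", 42000, "analyst_7",
                 "sign error flagged by range edit")
print(correction_log[0]["old_value"], "->", correction_log[0]["new_value"])
```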
From page 56...
... This "fitness-for-use" issue is much more complex when using multiple data sources, as multiple data sources often provide more information than traditional single data sources. Frequently, the user may not be interested in the provenance of the entire dataset, but only in a particular value.
From page 57...
... This consideration increases in importance as more complex and more computationally intensive processes are used to generate statistical products.

CONCLUSION 3-3 Creating statistics using multiple data sources often requires complex methodology to generate even relatively simple statistics.
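As one illustration of why even a simple statistic can require nontrivial methodology, the sketch below estimates a single population total from two overlapping sources using a classical dual-frame estimator (Hartley, 1962); naively summing the two sources would double-count their overlap. The numbers are fabricated.

```python
# A minimal sketch of why a "simple" statistic can need complex
# methodology: one population total estimated from two overlapping
# sources (a survey frame A and an administrative list B) with
# Hartley's dual-frame estimator. All numbers are fabricated.

def dual_frame_total(y_a_only, y_b_only, y_ab_from_a, y_ab_from_b, theta=0.5):
    """Hartley (1962) dual-frame estimator of a population total.

    y_a_only    -- estimated total for units found only in source A
    y_b_only    -- estimated total for units found only in source B
    y_ab_from_a -- estimate for the overlap domain, computed from A
    y_ab_from_b -- estimate for the overlap domain, computed from B
    theta       -- mixing weight for the two overlap estimates
    """
    return y_a_only + y_b_only + theta * y_ab_from_a + (1 - theta) * y_ab_from_b

# Summing both sources outright would count the overlap twice; the
# estimator blends the two overlap estimates instead.
print(dual_frame_total(1200.0, 800.0, 500.0, 540.0))  # -> 2520.0
```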
From page 58...
... Key to the acceptance of the effort is Palantir Technologies' data integration approach, which retains complete data provenance throughout the entire data creation and manipulation cycle. Palantir Technologies supported a variety of core capabilities, including security and auditing mechanisms, collaboration environments, and report generation tools.
From page 59...
... See Box 3-1 for a brief description of a pilot study that illustrates an exemplary migration.

RECOMMENDATION 3-1 Because technology changes continuously and understanding those changes is critical for the statistical agencies' products, federal statistical agencies should ensure that their information technology staff receive continuous training to keep pace with these changes.

