Currently Skimming: Pages 71-97

The Chapter Skim interface presents what we've algorithmically identified as the most significant single chunk of text within every page in the chapter.


From page 71...
... The federal statistical agencies should retain, preserve, and make accessible machine- and human-readable metadata -- including survey instruments and the provenance of any administrative data -- used in the production of official statistics. In addition, because paradata help to provide a better understanding of the quality of survey data, the federal statistical agencies should retain, preserve, and make accessible both machine- and human-readable paradata necessary for evaluating data quality.
From page 73...
... Regarding the quality of inputs to official statistics, it is becoming more common for federal statistical agencies to use alternatives to survey data as inputs to statistics production. The primary alternatives are administrative data, from federal and state programs, and digital trace data from information stored on the Internet and through other technological means.
From page 74...
... The third topic addresses transparency in data dissemination and involves the extent to which the federal statistical agencies interact with users to find out what they would like to know about the production of a set of official statistics so that the statistics may be best used. This includes what they would like to know about the data collection processes, the data treatment, the estimation processes, and the validation carried out on the official
From page 75...
... This is a somewhat more general notion, because it can include additional considerations, such as relevance and timeliness.2 However, even in this somewhat more general approach, estimating the biases and variances of the inputs going into the production of a set of official statistics is key. In recent years, due to the higher costs of collecting survey data, primarily as a result of the increasing rates of unit nonresponse, other sources of data are increasingly being used in the production of official statistics.3 In particular, national statistical offices have increasingly used administrative data to produce official statistics.
From page 76...
... As mentioned, in addition to administrative data, federal statistical agencies are also considering -- and currently use to a modest extent -- digital trace data in the production of official statistics. One example is that the Australian Bureau of Statistics is using supermarket scanner data (since 2011)
From page 77...
... .7 The problem in doing this is that, given the different origins of administrative data and digital trace data, it is not clear what might be meant by an analog to total survey error.8 In addition to the lack of clarity about which component parts need to be measured, or how to measure them, it may not be known whether or how such information should be combined. How might one proceed?
From page 78...
... Administrative Data: Estimating Standard Error

It has now become standard practice for federal statistical agencies to use estimates of the standard error of survey-based official statistics, often in the form of coefficients of variation, for standard aggregates, for example, weighted means and sums. Given that considerable variability is often attributable to nonresponse (both unit and item)
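As a concrete illustration of the kind of estimate described above, the sketch below computes a survey-weighted mean and its coefficient of variation for a few made-up records. It assumes a simple with-replacement, Taylor-linearization variance; agency production systems typically rely on design-specific formulas or replicate weights instead, and the function name and data here are hypothetical.

import numpy as np

def weighted_mean_cv(y, w):
    """Weighted mean of y with weights w, with an approximate standard
    error and coefficient of variation under a with-replacement design
    (Taylor linearization of the ratio sum(w*y)/sum(w))."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    n = len(y)
    mean = np.sum(w * y) / np.sum(w)
    # Linearized contribution of each record to the ratio estimator
    z = w * (y - mean) / np.sum(w)
    var = n / (n - 1) * np.sum((z - z.mean()) ** 2)
    se = var ** 0.5
    return mean, se, se / mean

# Hypothetical respondent incomes and survey weights
mean, se, cv = weighted_mean_cv([41000, 52000, 38500, 67000],
                                [120.5, 98.0, 143.2, 75.9])
print(f"estimate={mean:,.0f}  SE={se:,.0f}  CV={cv:.3f}")
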
From page 79...
... . However, it is now becoming common, as in the case of the Small Area Income and Poverty Estimates program, for some of the predictors used by federal statistical agencies in their models to come from administrative data.
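To make the idea of administrative-data predictors in a model-based estimate concrete, here is a deliberately simplified sketch: direct survey estimates for a few areas are regressed on a single predictor derived from program records, and the fitted model supplies an estimate for an area whose own sample is too small to support a reliable direct estimate. This is not the SAIPE methodology, only an illustration of the general idea; all numbers and names are invented.

import numpy as np

# Hypothetical direct (survey-based) poverty rates and an administrative
# predictor (e.g., a rate derived from program records) for five areas
direct_est = np.array([14.2, 9.8, 21.5, 12.0, 17.3])
admin_pred = np.array([13.0, 10.5, 19.8, 11.2, 18.0])

# Ordinary least squares fit of the direct estimates on the predictor
X = np.column_stack([np.ones_like(admin_pred), admin_pred])
beta, *_ = np.linalg.lstsq(X, direct_est, rcond=None)

# Model-based estimate for an area with too little survey sample of its own
new_area_admin = 15.5
print("predicted rate:", round(beta[0] + beta[1] * new_area_admin, 1))
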
From page 80...
... Web-scraped Data: Quality and Other Issues

If the primary dataset used to develop a set of official statistics is scraped or otherwise collected from the Internet, the analogy to total survey error for survey data is even less clear than it is for administrative data.
From page 81...
... If one is estimating regression coefficients for a model-based estimate that are assumed to be constant throughout the population, and one is using digital trace data for one or more predictors, then so long as the dependent variable being fit comes from high-quality frame data, one can assess and understand the error of such an estimate. On the other hand, if one is trying to estimate a population parameter using only digital trace data, for example using some type of ratio estimate, the error properties of the estimate may be more difficult to assess.
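A minimal sketch of the second situation, with invented numbers: a total observed in a digital trace source is scaled up by the ratio of a known population benchmark to the source's coverage. Because no sampling design underlies the trace data, there is no design-based variance formula to attach to the result, which is exactly the difficulty noted above.

# Hypothetical digital trace data: transactions observed for part of the
# population, scaled by a benchmark ratio to estimate a population total
trace_total_spend = 8.4e9   # total spending observed in the trace source
trace_units = 2.1e6         # households represented in the trace source
benchmark_units = 7.5e6     # known household count from a frame or census

ratio_estimate = trace_total_spend * (benchmark_units / trace_units)
print(f"ratio estimate of the population total: {ratio_estimate:.3e}")

# Unlike a design-based survey estimate, there is no randomization here
# that justifies a variance formula; coverage and selectivity of the
# trace source drive the (unknown) error of the estimate.
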
From page 82...
... Currently, most data collection, processing, and transformation is conducted using software, although some manually constructed survey instruments, manual entry of paper forms, and expert input to data cleaning may also be used. This section primarily discusses software tools that assist in making the software code that conducts these actions transparent, but will also elaborate on how to make manual steps as transparent as possible.
From page 83...
... The concepts and tools related to a version control system -- including code review, continuous integration and testing, and collaboration and issue tracking -- are not magic solutions but require discipline and good
From page 84...
... The LEHD code also successfully avoids use of any hard-coded but secret parameters, a feature that is incorporated into each code review of released code, making it technically possible to easily publish the code. Given the focus on data for federal statistical agencies, we focus on tools that have been developed in the area where software and data intersect.
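The sketch below illustrates the general practice described above (not the actual LEHD code): confidential tuning parameters are read at run time from an external file that is kept out of the published repository, so the processing code itself contains nothing secret and can be released. The file name, environment variable, and parameter name are all hypothetical.

import json
import os

def load_confidential_params(path=None):
    """Read secret tuning parameters (e.g., noise-infusion factors) from an
    external, unpublished file rather than hard-coding them."""
    path = path or os.environ.get("CONFIDENTIAL_PARAMS", "params_secret.json")
    with open(path) as f:
        return json.load(f)

def apply_noise(value, params):
    """Apply a multiplicative factor taken from the confidential parameters."""
    return value * params["noise_factor"]

if __name__ == "__main__":
    params = load_confidential_params()  # the file stays outside version control
    print(apply_noise(1000.0, params))
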
From page 85...
... Many other statistical agencies use GitHub for similar purposes. To evaluate the role these tools might play in the federal statistical agencies in the future, there will be a need for greater access to computer science expertise and an examination of best practices in programming and data curation.
From page 86...
... All such documents and practices increase both internal and, when published, external transparency. Many agencies spend many person-hours crafting the latter, but little is known about the use of programming style guides.22 As the Decennial Census's Disclosure Avoidance System has shown, it is possible to develop new systems, including sensitive ones such as disclosure avoidance systems, in a public and transparent manner.
From page 87...
... In certain secure environments within statistical agencies, all use of the software generates a logfile for audit purposes, but in the context of transparency and reproducibility, simpler logfiles may be sufficient. Compared to the data being generated, logfiles are generally much sparser and thus easier to archive together with the generated data, acting as a form of metadata, or in some cases, paradata.
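A minimal sketch of the kind of lightweight run log described above, assuming a Python-based processing step: each run records what was read, which treatments were applied, and what was written, and the resulting logfile can be archived next to the generated data as metadata or, for process details, paradata. The file names and messages are hypothetical.

import logging

logging.basicConfig(
    filename="state_estimates_2024_run.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

# Record the provenance of one processing run alongside its outputs
logging.info("input: survey_responses_2024.csv (12,431 records)")
logging.info("edit rule: top-code reported income at the 99th percentile")
logging.info("imputation: hot-deck for item nonresponse on 3 variables")
logging.info("output: state_estimates_2024.csv (51 rows)")
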
From page 88...
... Recommendation 4-2: To facilitate transparency, agencies that produce federal statistics are encouraged to develop coding style guides, and to make available documentation and specifications for software systems, subject to any security concerns. Where possible, code (for example, code used for data collection or processing)
From page 89...
... Given this, a statistical process is how each of these is achieved within each of the subject-matter areas in each of the federal statistical agencies. Statistical software languages (e.g., SAS, SPSS, Stata, or R)
From page 90...
... More broadly, in order to understand and meet user needs in data products, documentation, dissemination systems, and archiving, agencies must develop mechanisms to solicit more frequent input from their user community and facilitate ongoing dialogue with them. A number of the federal statistical agencies have given limited effort to understanding what their users need in terms of transparency, accessibility, and usability of data products to enable optimal use of official estimates and associated input datasets.
From page 91...
... 2. Statistical agencies could survey their user communities to solicit specific input before changes are made to data collection techniques, estimates, data products, dissemination systems, or Web pages to ensure that data users' needs will still be met after proposed changes are implemented; and they could also involve members of the user community in reviewing and providing feedback as these changes are actually implemented.
From page 92...
... In NCSES, such reports are often available internally in draft form, but either they are never reviewed and therefore are not made publicly available, or they are viewed as being too technical for public release. We understand that the demand for such documents may be limited, but for a small subset of the user community such documents can be extremely important for providing detailed information to researchers on how data treatments were implemented and how estimates were produced, and in

25 https://www.ire.org/resources/listservs/
From page 93...
... This will create a real-time opportunity for agencies to see how their data are being used, which in turn will help them become more responsive. This helps all statistical agencies meet a requirement in the Evidence Act to get feedback from the public on the utility of their data.
From page 95...
... For an input dataset, this necessitates providing the units of analysis as defined by the rows, and the variables defining the columns, including the questions and transformations underlying each variable, what the various responses are, and what they mean. If the dataset contains output estimates, what the rows and columns signify again needs to be stated.
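A sketch of what such a description can look like in machine-readable form, assuming a small establishment survey: the record states what a row represents and, for each column, the underlying question or transformation and the meaning of coded responses. The variable names, question wording, and codes are all invented for illustration.

import json

codebook = {
    "unit_of_analysis": "one row per sampled establishment",
    "variables": {
        "emp_total": {
            "question": "How many employees did this establishment have on March 12?",
            "transformation": "reported count; imputed when missing",
            "type": "integer",
        },
        "rd_performer": {
            "question": "Did this establishment perform R&D in the reference year?",
            "transformation": "none",
            "codes": {"1": "Yes", "2": "No", "9": "Not reported"},
        },
    },
}

# Human-readable rendering of the same machine-readable record
print(json.dumps(codebook, indent=2))
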
From page 96...
... In the statistical community, efforts to manage and use metadata to describe data, datasets, and the methodology used to create them began in the 1970s. In the 1980s the efforts expanded to include research data libraries, data archives, and national statistical offices; and they expanded even more widely in the 1990s as the online digital revolution exploded.
From page 97...
... Metadata stored in formal databases as numbers, codes, or entries from controlled vocabularies are designed to be machine readable; they are active when used to control the execution of some system in a particular way. In this report, statistical metadata (hereafter, metadata)
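A minimal sketch of "active" metadata in the sense just described, under invented names: a machine-readable record, using codes from a small controlled vocabulary of disclosure treatments, drives how a value is processed before release rather than merely documenting it.

# Hypothetical controlled vocabulary of disclosure treatments
SUPPRESSION_RULES = {
    "NONE": lambda v: v,
    "ROUND_100": lambda v: round(v, -2),
    "SUPPRESS": lambda v: None,
}

# Machine-readable metadata record for one published cell
cell_metadata = {"variable": "rd_expenditure", "treatment": "ROUND_100"}

def publish(value, metadata):
    """Apply the treatment named in the metadata record before release."""
    return SUPPRESSION_RULES[metadata["treatment"]](value)

print(publish(123456, cell_metadata))  # -> 123500
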

