4 Assessments of Quality, Methods for Retaining and Reusing Code, and Facilitating Interaction with Users
Pages 73-94

From page 73...
... Regarding the quality of inputs to official statistics, it is becoming more common for federal statistical agencies to use alternatives to survey data as inputs to statistics production. The primary alternatives are administrative data, from federal and state programs, and digital trace data from information stored on the Internet and through other technological means
From page 74...
... However, the literature to date on how to assess the quality of administrative data, or digital trace data, is not fully developed, certainly not to the extent that it is for sample survey data. The associated transparency issue, then, is what should be retained when using administrative or digital trace data, to make whatever quality assessments one needs to make in order to permit use of such an approach to estimation.
From page 75...
... Administrative data are collected as a byproduct of the administration of a governmental program, often by collecting information to determine eligibility for the program, the size of the benefit, and information to help distribute the associated benefits. In addition to administrative data, also under consideration for use by statistical agencies are data collected from the Internet and other "technology" sources, including transaction data, social media entries, and sensor data, which is referred to here collectively as digital trace data.
From page 76...
... As mentioned, in addition to administrative data, federal statistical agencies are also considering -- and currently use to a modest extent -- digital trace data in the production of official statistics. One example is that the Australian Bureau of Statistics is using supermarket scanner data (since 2011)
From page 77...
... The problem in doing this is that, given the different origins of administrative data and digital trace data, it is not clear what might be meant by an analog to total survey error. In addition to the lack of clarity as to what the component parts are that need to be measured, or how to measure them, it may not be known whether or how such information should be combined. How might one proceed?
From page 78...
... Administrative Data: Estimating Standard Error
It has now become standard practice for federal statistical agencies to use estimates of the standard error of survey-based official statistics, often in the form of coefficients of variation, for standard aggregates, for example, weighted means and sums. Given that considerable variability is often attributable to nonresponse (both unit and item)
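As a point of reference for what such a quality measure looks like on the survey side, here is a minimal sketch that computes a weighted mean and its coefficient of variation, assuming with-replacement sampling and no stratification, clustering, or nonresponse adjustment; the function name and data are illustrative, not drawn from the report.

```python
import numpy as np

def weighted_mean_cv(y, w):
    """Weighted (Hajek) mean, a linearized standard error, and the
    coefficient of variation, assuming with-replacement sampling and
    no clustering or stratification (a deliberate simplification)."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    n = len(y)
    mean = np.sum(w * y) / np.sum(w)
    # Linearized residuals of the ratio sum(w*y) / sum(w)
    z = w * (y - mean) / np.sum(w)
    var = n / (n - 1) * np.sum(z ** 2)
    se = np.sqrt(var)
    return mean, se, se / mean

# Hypothetical toy data: five responses with survey weights
incomes = [42000, 55000, 39000, 61000, 48000]
weights = [1200, 950, 1100, 800, 1000]
est, se, cv = weighted_mean_cv(incomes, weights)
print(f"estimate={est:,.0f}  SE={se:,.0f}  CV={cv:.1%}")
```

The coefficient of variation is simply the estimated standard error divided by the estimate, which is why it is a convenient summary across aggregates of very different magnitudes.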
From page 79...
... However, it is now becoming common, as in the case of the Small Area Income and Poverty Estimates program, for some of the predictors used by federal statistical agencies in their models to come from administrative data.
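To make the modeling idea concrete, the following toy sketch regresses direct survey estimates on a predictor taken from administrative records; the variable names and figures are invented for illustration, and a production small-area model such as SAIPE's would also account for each area's sampling variance.

```python
import numpy as np

# Toy area-level model: direct survey estimates of a poverty rate regressed
# on an administrative-records proxy.  All numbers are invented.
direct_est = np.array([0.18, 0.11, 0.25, 0.09, 0.15])   # survey-based rates
admin_pred = np.array([0.16, 0.12, 0.22, 0.10, 0.14])   # administrative proxy

X = np.column_stack([np.ones_like(admin_pred), admin_pred])
beta, *_ = np.linalg.lstsq(X, direct_est, rcond=None)    # ordinary least squares
model_est = X @ beta

print("fitted coefficients:", np.round(beta, 3))
print("model-based estimates:", np.round(model_est, 3))
# A production small-area model (e.g., Fay-Herriot) would additionally weight
# each area by its sampling variance and blend direct and model estimates.
```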
From page 80...
... Web-scraped Data: Quality and Other Issues
If the primary dataset used to develop a set of official statistics is scraped or otherwise collected from the Internet, the analogy to total survey error is even less clear than it is for administrative data.
From page 81...
... If one is estimating regression coefficients for a model-based estimate that are assumed to be constant throughout the population, and one is using digital trace data for one or more predictors, so long as the dependent variable being fit is from high-quality frame data one can assess and understand the error of such an estimate. On the other hand, if one is trying to estimate a population parameter using only digital trace data, for example using some type of ratio estimate, the error properties of the estimate may be more difficult to assess.
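As a hypothetical illustration of the second case, the sketch below applies a naive ratio adjustment that scales a digital-trace count to a known benchmark; the scenario and numbers are invented, and the point is that a variance computed from the trace data alone would not capture the coverage and selection errors that can dominate such estimates.

```python
# Hypothetical ratio estimate from digital trace data (all numbers invented).
# A trace-data count is scaled by the ratio of a known benchmark total to the
# trace count in a domain where both are observed.

def ratio_scaled_total(trace_covered, benchmark_covered, trace_target):
    """Scale the trace count for the target domain by the benchmark-to-trace
    ratio observed in the covered domain."""
    return (benchmark_covered / trace_covered) * trace_target

estimate = ratio_scaled_total(trace_covered=1_800_000,
                              benchmark_covered=2_400_000,
                              trace_target=900_000)
print(f"scaled estimate: {estimate:,.0f}")
# Any standard error computed from the trace data alone would understate the
# true uncertainty, because the units appearing in the trace are not a known
# probability sample of the target population.
```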
From page 82...
... Currently, most data collection, processing, and transformation are conducted using software, although some manually constructed survey instruments, manual entry of paper forms, and expert input to data cleaning may also be used. This section primarily discusses software tools that assist in making the software code that conducts these actions transparent, but it will also elaborate on how to make manual steps as transparent as possible.
From page 83...
... Such release management increases transparency, as all decisions, from a developer's change to feedback by the reviewers and the tests that have been run by the quality team, are fully documented by the associated tools. Further, in case any issues arise, this kind of release management makes it easy to track how and why every change was made.
From page 84...
... The LEHD code also successfully avoids use of any hard-coded but secret parameters, a feature that is incorporated into each code review of released code, making it technically possible to easily publish the code. Given the focus on data for federal statistical agencies, we focus on tools that have been developed in the area where software and data intersect.
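Returning to the point above about avoiding hard-coded secret parameters: one common way to achieve this (a generic sketch, not the LEHD project's actual mechanism) is to read any confidential values from an external configuration source, so that the published code contains no secrets. The parameter and file names here are hypothetical.

```python
import json
import os
from pathlib import Path

# Generic sketch: confidential tuning parameters live outside the repository,
# so publishing the code reveals no secret values.
def load_noise_scale() -> float:
    """Read a confidential parameter from an environment variable or an
    unpublished config file; the code itself contains no secret constants."""
    if "NOISE_SCALE" in os.environ:
        return float(os.environ["NOISE_SCALE"])
    config_path = Path(os.environ.get("SECRET_CONFIG", "/secure/params.json"))
    return float(json.loads(config_path.read_text())["noise_scale"])
```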
From page 85...
... Many other statistical agencies use GitHub for similar purposes. To evaluate the role these tools might play in the future at the federal statistical agencies, there will be a need for greater access to computer science expertise and an examination of best practices in programming and data curation.
From page 86...
... Code can contain copious amounts of documentation, can be structured (using programming style guides) to be more easily understandable even without prose documentation, and can be accompanied by high-level and detailed documentation and software-agnostic specifications.
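As an invented example of these practices (not drawn from any agency's codebase), a short routine written to a style guide, with a docstring that records its inputs, outputs, and the specification it implements, can often be read without separate prose documentation:

```python
def topcode_income(income_dollars: float, ceiling_dollars: float = 250_000.0) -> float:
    """Return the income value used for publication, top-coded at a ceiling.

    Implements the (hypothetical) disclosure-avoidance specification DA-7:
    reported incomes above ``ceiling_dollars`` are replaced by the ceiling so
    that extreme values cannot identify individual respondents.
    """
    if income_dollars < 0:
        raise ValueError("income cannot be negative")
    return min(income_dollars, ceiling_dollars)

print(topcode_income(1_200_000.0))   # prints 250000.0
```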
From page 87...
... In certain secure environments within statistical agencies, all use of the software generates a logfile for audit purposes, but in the context of transparency and reproducibility, simpler logfiles may be sufficient. Compared to the data being generated, logfiles are generally much sparser and thus easier to archive together with the generated data, acting as a form of metadata, or in some cases, paradata.
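A minimal sketch of the simpler kind of logfile the text refers to might look like the following; the file names, fields, and parameters are illustrative only, and the idea is that a small plain-text log written next to each generated dataset can travel with it as lightweight metadata or paradata.

```python
import hashlib
import json
import logging
from pathlib import Path

# Illustrative only: write a plain-text logfile next to a generated dataset so
# the log can be archived with the data as lightweight metadata/paradata.
out = Path("estimates_2024q1.csv")

log = logging.getLogger("production_run")
handler = logging.FileHandler(out.with_suffix(".log"))
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
log.addHandler(handler)
log.setLevel(logging.INFO)

out.write_text("area,estimate\nA,1.23\nB,4.56\n")   # stand-in for real output
log.info("wrote %s (%d bytes, sha256=%s)", out.name, out.stat().st_size,
         hashlib.sha256(out.read_bytes()).hexdigest()[:12])
log.info("run parameters: %s", json.dumps({"reference_period": "2024Q1"}))
```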
From page 88...
... Recommendation 4-2: To facilitate transparency, agencies that produce federal statistics are encouraged to develop coding style guides, and to make available documentation and specifications for software systems, subject to any security concerns. Where possible, code (for example, code used for data collection or processing)
From page 89...
... Given this, a statistical process is how each of these is achieved within each of the subject-matter areas in each of the federal statistical agencies. Statistical software languages (e.g., SAS, SPSS, Stata, or R)
From page 90...
... More broadly, in order to understand and meet user needs in data products, documentation, dissemination systems, and archiving, agencies must develop mechanisms to solicit more frequent input from their user community and facilitate ongoing dialogue with them. A number of the federal statistical agencies have devoted only limited effort to understanding what their users need in terms of transparency, accessibility, and usability of data products to enable optimal use of official estimates and associated input datasets.
From page 91...
... 2. Statistical agencies could survey their user communities to solicit specific input before changes are made to data collection techniques, estimates, data products, dissemination systems, or Web pages to ensure that data users' needs will still be met after proposed changes are implemented; and they could also involve members of the user community in reviewing and providing feedback as these changes are actually implemented.
From page 92...
... For example, a long-time NCSES data user might wish to understand what changes have been made to the most recent version of a survey; a journalist might wish simply to download the current value to support her article; a non-NCSES data user might happen upon NCSES data found in data.gov and as a result visit NCSES's Website to search across topics and subtopics to learn more about an issue of interest; or an analyst might want easy access to an internal archived dataset to reproduce a specific statistic to check on computational reproducibility. One must also be cognizant that there will always be users who are comfortable with the current Website and navigation and will find their experience disrupted by the implementation of a new structure.
From page 93...
... This will create a real-time opportunity for agencies to see how their data are being used, which in turn will help them become more responsive. This helps all statistical agencies meet a requirement in the Evidence Act to get feedback from the public on the utility of their data.

