5 Preserving Privacy Using Technology from Computer Science, Statistical Methods, and Administrative Procedures
Pages 79-108

The Chapter Skim interface presents what we've algorithmically identified as the most significant single chunk of text within every page in the chapter.


From page 79...
... As in the previous chapter, we focus only on individual privacy breaches to the data providers: that is, breaches that could not have occurred if the individual's or the organization's data had not been in the confidential dataset. Inferences based on facts about the population as a whole are not viewed as individual privacy breaches.
From page 80...
... An individual privacy breach occurs when one can learn something about a participant in the study that cannot be learned about someone not in the study. For example, suppose the holders of the medical data were to release poorly "anonymized" individual medical records of the study participants, and suppose it were possible to determine from the records, say, the date on which someone entered the study. It might then be possible to learn specific details about that person's medical history, either by tracing individual records back to the corresponding participants or through more mathematically subtle means.
From page 81...
... It is important to note that these individual privacy threats persist even if all security threats are eliminated.

Security Threats

Securing Data

Many security threats can be addressed by maintaining data in a secure form -- partitioned or encrypted -- which includes protection against data loss.
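As a small illustration of keeping data encrypted at rest, the sketch below uses the third-party Python `cryptography` package; the package choice and the sample record are assumptions of this sketch, not tools or data named in the report. The point is only that a lost disk or copied file reveals nothing without the separately stored key.

```python
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # store the key separately from the data
cipher = Fernet(key)

record = b"respondent_id=4821, income=54000"   # hypothetical confidential record
token = cipher.encrypt(record)       # what would be written to disk or transmitted
print(cipher.decrypt(token))         # only a key holder can recover the record
```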
From page 82...
... Technology for secure multiparty computing could, in some situations, even permit a statistical agency to compute the desired aggregate result without ever actually learning all the detailed data in each of the data sources.

Securing Computation

As we describe in Chapter 3, statistical analyses using multiple data sources can be accomplished in a wide variety of computing architectures.
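As a toy sketch of the secure multiparty computing idea mentioned above, additive secret sharing is one common building block: each confidential value is split into random shares, parties exchange only shares, and only the final total is revealed. The party count, values, and function name below are illustrative assumptions, not part of the report.

```python
import secrets

PRIME = 2**61 - 1   # shares are combined modulo a large prime

def split_into_shares(value, n_parties):
    """Additively secret-share an integer: any n_parties - 1 shares look random."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

confidential_values = [12, 7, 30, 25]   # hypothetical values held by one data provider
per_party = list(zip(*(split_into_shares(v, 3) for v in confidential_values)))

# Each of the 3 parties sums only the shares it holds; individual values stay hidden.
subtotals = [sum(shares) % PRIME for shares in per_party]
aggregate = sum(subtotals) % PRIME
assert aggregate == sum(confidential_values)   # only the aggregate is learned
```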
From page 83...
... Cryptographic protocols of this type are instances of secure multiparty computation, also known as secure function evaluation. To study or implement secure multiparty computation, one needs to consider three factors: 1.
From page 84...
... Using current statistical techniques, analysts often examine data, for example, to identify outliers and data errors. If and when these processes can be described in algorithmic terms, they too could be carried out using advanced cryptographic techniques.
From page 85...
... For example, one may have a large amount of sensitive data, such as genomic data, stored on the untrusted server. Even if the data are encrypted, the server can observe data access patterns, potentially allowing the server to infer which medical test is being applied to the genomic data.
From page 86...
... 28) note that the formal guidance agencies are given for analyzing and mitigating related information security risks is "voluminous, proscriptive, specific, actionable, frequently updated, and integrative into systems of legal audit and certification." In contrast, the guidance for identifying and mitigating individual privacy risks is "general, abstract, infrequently updated, and self-directed," which can often lead to "inconsistent identification of privacy risks and ineffective application of privacy safeguards" (p.
From page 87...
... Inferences based on facts about the population as a whole are not viewed as individual privacy breaches. Different adversarial goals may require different resources, so to fully specify an attack, one also has to specify the resources to which the adversary has access.
From page 88...
... depends on the data of a specific individual in the dataset, which has the potential to result in an individual privacy breach when combined with other sources of information. Finally, data analysis results in observable actions, such as the publication of statistics and technical papers describing the findings.
From page 89...
... The queries may be specified ahead of time, for example, when a government agency decides on a set of tables to release and the statistical estimates themselves have noise added to them. Alternatively, given a specified set of quantities of the dataset to be revealed while preserving privacy -- such as means and variances -- synthetic data may be generated whose approximate means and variances (and other prespecified quantities)
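As one illustration of this idea, the sketch below privatizes a variable's mean and variance with Laplace noise and then samples synthetic records from a distribution with those moments. It is a simplified sketch under assumptions of this example (values clipped to [0, 1], the privacy budget split evenly, the function name invented), not a method the report prescribes.

```python
import numpy as np

def synthetic_from_moments(values, epsilon, n_synthetic, rng=None):
    """Generate synthetic data whose mean and variance approximate the originals.

    Assumes each value lies in [0, 1]; half the budget is spent on the mean and
    half on the variance, each of which has sensitivity on the order of 1/n.
    """
    rng = rng or np.random.default_rng()
    n = len(values)
    noisy_mean = np.mean(values) + rng.laplace(scale=(1.0 / n) / (epsilon / 2))
    noisy_var = np.var(values) + rng.laplace(scale=(1.0 / n) / (epsilon / 2))
    noisy_var = max(noisy_var, 1e-6)          # keep the variance usable
    return rng.normal(noisy_mean, np.sqrt(noisy_var), size=n_synthetic)

original = np.clip(np.random.default_rng(2).beta(2, 5, size=5000), 0, 1)
synthetic = synthetic_from_moments(original, epsilon=1.0, n_synthetic=5000)
print(round(original.mean(), 3), round(synthetic.mean(), 3))
```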
From page 90...
... "Overly accurate" means having error on the order of the square root of n, where n is the size of the set under attack. In fact, when n is very small, it is possible to launch a simple attack requiring 2^n queries; in this case, the attack works even if the noise is on the order of n itself. For example, 11 billion estimates are produced from the American Community Survey; it is worth considering the possibility that these estimates could be used to carry out a reconstruction attack on some portion of the respondents.
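To make the reconstruction threat concrete, the following is a minimal sketch (not drawn from the report) of a Dinur-Nissim-style attack: given noisy answers to many random subset-count queries, a least-squares fit recovers most of a hidden binary attribute when the noise is well below the square root of n. The population size, query count, and noise scale are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100                                      # respondents (illustrative)
secret = rng.integers(0, 2, size=n)          # hidden binary attribute, one bit per person

m = 4 * n                                    # number of released noisy statistics
queries = rng.integers(0, 2, size=(m, n))    # each row selects a random subset of people
noise = rng.normal(scale=2.0, size=m)        # noise well below sqrt(n) ~ 10
answers = queries @ secret + noise           # noisy subset counts, as an agency might release

# Reconstruction: solve the noisy linear system, then round to {0, 1}.
estimate, *_ = np.linalg.lstsq(queries, answers, rcond=None)
recovered = (estimate > 0.5).astype(int)
print("fraction of bits recovered:", (recovered == secret).mean())
```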
From page 91...
... Specifically, we review the major approaches that statistical agencies have used, including applying statistical disclosure limitation methods and perturbing the data and creating synthetic datasets for analysis, as well as noting the weaknesses of these approaches. We also review different approaches that limit access to data depending on its sensitivity and that require researchers to go to data enclaves in order to analyze the data.
From page 92...
... , the choice of category coarsenings may reveal information about the database. That is, had the data been different, the coarsenings chosen would have been different; thus, the choice itself may constitute an individual privacy breach.
From page 93...
... Furthermore, in order to obtain estimates of the added uncertainty in statistical values arising from the synthetic generation of data, multiple synthetic datasets are often generated, complicating the statistical estimation and potentially increasing the risk of privacy loss.

Weaknesses of Statistical Disclosure Limitation Methods

Although the methods discussed may protect the data against some attacks, they cannot be guaranteed to protect the privacy of respondents, and they all negatively affect the utility of the data for researchers (see, e.g., Reiter, 2012)
From page 94...
... In addition, synthetic data cannot be used to discover any information not in the model, and one might therefore ask why the model itself is not simply published.

Approaches for Tiered Access

In addition to the methods described above, the other major approach statistical agencies use to protect the privacy of data is to restrict access to and use of the data through legally binding contracts and administrative procedures.
From page 95...
... There are multiple methods for determining at what level data should be tagged. One possibility is an interview with a panel that includes privacy experts, such as an Institutional Review Board.
From page 96...
... The most sensitive data would be held in enclaves and require strict data logs, as major harms could result if a person were identified.

Data Enclaves

Federal statistical agencies provide access to some microdata files through the Federal Statistical Research Data Centers (FSRDCs)
From page 97...
... Differentially private algorithms ensure that the only possible harms are group harms: the outcome of any analysis is almost equally likely independent of whether any individual joins, or refrains from joining, the dataset. If the same output occurs whether or not a particular person is in the dataset, then that person cannot suffer any privacy loss.
From page 98...
... TABLE 5-1  Privacy Controls over Data Life Cycle

Collection/Acceptance
Procedural: Collection limitation; data minimization; data protection officer; Institutional Review Boards; notice and consent procedures; purpose specification; privacy impact assessments
Economic: Collection fees; markets for personal data; property right assignment
Educational: Consent education; transparency; nutrition labels; public education; privacy icons
Legal: Data minimization; notice; notice and consent; purpose specification
Technical: Computable policy

Transformation
Procedural: Process for correction
Educational: Metadata; transparency
Legal: Right to correct or amend; safe harbor de-identification standards
Technical: Aggregate statistics; computable policy; contingency tables; data visualizations; differentially private data summaries; redaction; statistical disclosure limitation techniques; synthetic data

Retention
Procedural: Audits; controlled backups; purpose specification; data security assessments; tethering
Educational: Data asset registries; notice; transparency
Legal: Breach reporting requirements; retention and destruction requirements; data integrity and accuracy requirements
Technical: Computable policy; encryption; key management (and secret sharing); federated databases; personal data stores
From page 99...
... (continuation of the preceding row of Table 5-1)
Procedural: ...specification; registration; restrictions on use by data controller; risk assessment
Technical: ...interactive query systems; secure multiparty computation

Post-Access (audit, review)
Procedural: Audit procedures; ethical codes; tethering
Economic: Fines
Educational: Privacy dashboard; transparency
Legal: Civil and criminal penalties; data use agreements (terms of service)
Technical: Computable policy; immutable audit logs; personal data stores
From page 100...
... Differentially private algorithms ensure that any event is essentially equally likely regardless of which of two adjacent databases the algorithm is run on. "Essentially equally likely" is measured by a privacy loss parameter, usually denoted epsilon (ε).
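As a concrete illustration (not drawn from the report), the Laplace mechanism is a standard way to release a count with epsilon-differential privacy: noise with scale sensitivity/epsilon is added to the true count, so a smaller epsilon means more noise and stronger protection. The function name and example values below are assumptions of this sketch.

```python
import numpy as np

def dp_count(true_count, epsilon, sensitivity=1.0, rng=None):
    """Release a count with epsilon-differential privacy via the Laplace mechanism.

    Adding or removing one person changes the count by at most `sensitivity`,
    so Laplace noise with scale sensitivity / epsilon masks any individual's
    presence up to privacy loss epsilon.
    """
    rng = rng or np.random.default_rng()
    return true_count + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

print(dp_count(true_count=1280, epsilon=0.5))   # noisier, stronger privacy
print(dp_count(true_count=1280, epsilon=5.0))   # more accurate, weaker privacy
```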
From page 101...
... Safe outputs: After using the data, researchers cannot take any results out of the safe settings unless statistical disclosure techniques have been applied to the output to reduce the risk of re-identification. All of the Five Safes aim to maximize researcher access to important data while preventing certain forms of individual privacy breaches or threats.
From page 102...
... It also provides a formal measure of privacy loss and a calculus for computing how privacy losses compound over multiple data analyses. In addition, it yields a collection of simple privacy-preserving building blocks that can be combined to yield differentially private versions of sophisticated analytical techniques from statistics and machine learning.
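A minimal sketch of the calculus mentioned above (an illustration, not an agency implementation): under basic composition, the privacy losses of successive epsilon-differentially private analyses simply add, so a data curator can track spending against an overall budget. The class name and budget values are assumptions of this sketch.

```python
class PrivacyBudget:
    """Track cumulative privacy loss under basic (additive) composition."""

    def __init__(self, total_epsilon):
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon):
        # Refuse an analysis that would exceed the agreed total budget.
        if self.spent + epsilon > self.total_epsilon:
            raise RuntimeError("privacy budget exhausted; no further releases")
        self.spent += epsilon
        return self.total_epsilon - self.spent

budget = PrivacyBudget(total_epsilon=1.0)
budget.charge(0.25)                              # first tabulation
budget.charge(0.25)                              # second tabulation
print("remaining epsilon:", budget.charge(0.25)) # third tabulation
```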
From page 103...
... Unlike other techniques, differential privacy provides the tools needed to measure and bound the cumulative risk as the data are used and reused, permitting the informed enforcement of a privacy "budget." In other words, all systems are eventually in trouble: differential privacy allows one to monitor the trouble and decide when to stop. Conceptually, there is no difference between multiple releases of synthetic datasets and interactive query systems (sometimes the query is simply "give me a synthetic dataset capturing the following attributes")
From page 104...
... Statistical agencies currently use a number of statistical disclosure methods to protect the confidentiality of their data; however, these methods are increasingly susceptible to privacy breaches given the proliferation of external data sources and the availability of high-powered computing that could enable inferences about people or entities in a dataset,
From page 105...
... 90) As federal statistical agencies move forward with linking multiple data sets, they must simultaneously address quantifying and controlling the risk of privacy loss.
From page 106...
... Pilot studies or test cases will be valuable in identifying the variety of issues that affect agencies and the users of their data, including effects on the timeliness of production, the scope of statistical products produced, the utility of the resulting estimates, and the usability of microdata by external researchers. Comparisons will need to be made between agencies' current procedures and state-of-the-art differentially private algorithms, at various levels of epsilon and across a variety of federal statistical datasets, to evaluate the effects on the accuracy of results and the utility of the resulting data.
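The kind of accuracy-versus-epsilon comparison described above can be prototyped very simply; the sketch below (illustrative values, not an agency benchmark) shows how the expected error of a single Laplace-protected count grows as epsilon shrinks.

```python
import numpy as np

rng = np.random.default_rng(1)
for epsilon in (0.1, 0.5, 1.0, 2.0):
    # Error of a count query with sensitivity 1 under the Laplace mechanism.
    draws = rng.laplace(scale=1.0 / epsilon, size=100_000)
    print(f"epsilon={epsilon}: mean absolute error ~ {np.abs(draws).mean():.1f}")
```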
From page 107...
... Note, however, that even synthetic data are subject to the fundamental law of information recovery: if a synthetic dataset permits overly accurate estimates of too many statistics, then privacy could be destroyed. For this reason, query-response systems may deliver more accuracy for queries actually of interest than what can be achieved using synthetic datasets, at the same level (epsilon)

