– 10 –
Panel Discussion on Key Privacy Issues

At the Census Bureau’s request, the workshop planning committee took care to include a dedicated block of time for presenting arguments about the privacy side of the privacy-accuracy balance that motivates the development of disclosure avoidance for the census. A group of privacy researchers was assembled to participate in a panel discussion; after each panelist offered opening remarks, an extended floor discussion period followed.

Daniel Goroff (Alfred P. Sloan Foundation) moderated the discussion, stating in opening remarks that the workshop presentations on the broad applications of census data were genuinely impressive, conducted with great care and dedication to data quality, and naturally concerned with the utility of the census data. He said that the “threats” to the quality and utility of census data were manifold—sampling error, coding error, imputation for missingness, suppression of counts or content—and now, understandably, noise injection to preserve privacy might be viewed as a threat. However, he said, “all of those threats might hardly matter” without the foundation of census data collection, which is the “truthful and representative and safe participation” of the responding public. Those respondents will reasonably ask why they should participate in the census and whether they should answer truthfully, but should they decline or dissemble when answering the census, all of the previously mentioned threats are worsened, particularly because refusal or dissembling is very unlikely to be randomly distributed in the population. (In this sense, he said, a critical advantage of differential privacy–based approaches is that the main protection they afford—through noise injection—is randomly distributed by design, and “you can see exactly what’s going to happen” with it.)

Goroff said that many people may not feel particularly vulnerable answering the inquiries on the decennial census forms, but there are many reasons why other people would feel threatened: the elderly living alone, undocumented immigrants, residents of public housing, and same-sex couples among them, not to mention people simply receiving disinformation about the census. So he posed the challenge of this workshop as asking: What can we truthfully tell people about participating in the census, and how can we help our fellow residents and our fellow census users understand exactly how data releases can actually protect both utility and privacy by trading some of one against the other?

As illustration, Goroff walked through a brief explanation of the guarantee made by differential privacy, interpreting ϵ in terms of privacy. The formal theory of differential privacy is posited in terms of the difference in information conveyed by “neighboring” datasets x and x′ that differ by at most one line, a single individual’s data record. An analyst—perhaps an adversary—would have no firm knowledge of whether the dataset z they see is equal to x (containing the particular person’s information) or x′ (omitting the particular person’s information). The lemma displayed by Goroff holds that a query M(z) from an ϵ-differential privacy mechanism can only change the odds ratio Pr(z = x)/Pr(z = x′)—the analyst’s prior belief about whether the particular individual’s record is in the dataset or not—by a factor between exp(−ϵ) and exp(ϵ). (A corollary is that, for small ϵ, the prior and posterior odds can differ by at most about 100ϵ percent.) In plainer language, exp(ϵ) is a bound on the change in the odds of an individual person’s record being included. The analyst’s initial odds ratio of 1 (even odds, 50:50) changes by about 10 percent if the query is protected by (ϵ = 0.1)-differential privacy and could change to about 20:1 in an (ϵ = 3)-differential privacy framework. That said, Goroff noted, “I’m not sure that I would recommend knocking on your neighbor’s door and telling them about e to the ϵ,” but the point is that the concepts can be made less mysterious, and it can be intuitively explained that differential privacy “just makes it hard to tell that it was you who participated as an individual.”
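Goroff’s arithmetic is easy to verify. The short Python sketch below (an editorial illustration, not something presented at the workshop) computes the bound he described: an ϵ-differentially private release can shift the analyst’s odds by at most a factor of exp(ϵ) in either direction.

```python
import math

def posterior_odds_bounds(prior_odds: float, epsilon: float) -> tuple:
    """Bounds on the analyst's posterior odds after one epsilon-DP query.

    An epsilon-differentially-private mechanism can shift the odds that a
    particular record is in the dataset by at most a factor of exp(epsilon)
    in either direction.
    """
    return prior_odds * math.exp(-epsilon), prior_odds * math.exp(epsilon)

# Goroff's examples, starting from even odds (1:1):
for eps in (0.1, 3.0):
    low, high = posterior_odds_bounds(1.0, eps)
    print(f"epsilon = {eps}: odds stay within [{low:.3f}, {high:.2f}]")
# epsilon = 0.1 -> [0.905, 1.11]: roughly a 10 percent change in the odds
# epsilon = 3.0 -> [0.050, 20.09]: the odds could move to about 20:1
```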

Goroff closed by noting that the preceding workshop presentations had amply demonstrated that there is some statistical “bias” in the 2010 Demonstration Data Product (DDP) numbers—“it’s not hard to find.” He reiterated that the bias is due not to the differential privacy itself but rather to the post-processing performed to eliminate negative numbers. He argued that a point deserving further discussion is giving researchers access to the statistically unbiased data, negative numbers included. He also suggested that the Census Bureau should be open about the fact that it will add random noise to the counts it reports, but explain that “perhaps these are actually small” compared with the “usual” threats to data utility like failure to participate in the census at all. With that, he introduced the panelists in turn for their opening comments.
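Goroff’s distinction between the noise itself, which is unbiased by design, and the post-processing that removes negative values, which is the source of the bias, can be illustrated with a small simulation. The sketch below is a hypothetical example using Laplace noise on a single small count; it is not the Census Bureau’s TopDown Algorithm, which uses different mechanisms and constraints.

```python
import random

random.seed(0)
TRUE_COUNT = 2    # a hypothetical small-block population
SCALE = 2.0       # Laplace scale b = 1/epsilon for a sensitivity-1 count query

def laplace(b: float) -> float:
    # The difference of two independent exponentials is Laplace(0, b).
    return random.expovariate(1.0 / b) - random.expovariate(1.0 / b)

N = 200_000
noisy = [TRUE_COUNT + laplace(SCALE) for _ in range(N)]
clamped = [max(0.0, x) for x in noisy]

print(sum(noisy) / N)    # ~2.00: the raw noisy counts are statistically unbiased
print(sum(clamped) / N)  # ~2.37: forcing non-negativity biases small counts upward
```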

10.1 PRIVACY AND CENSUS PARTICIPATION

danah boyd (Microsoft Research and Data & Society) opened by expressing gratitude to the workshop presenters for their passion in using data to address a variety of challenges and noted that she has spent her career “living the other side of the equation,” trying to find remedies to all the ways in which data have been used to magnify longstanding inequities and do serious harm to people. The issues that arise in discussing these tradeoffs between privacy and utility are “going to mostly affect people of color, poor people, other marginalized members of the community”—“and that’s what makes all of this a little awkward,” because the consequences of these tradeoffs can be so grave. Those consequences include voting rights and deportation, denial of access to services, and being kicked out of housing. She said that this technical consultation is so important because, in such a charged atmosphere, the issues need to be considered “from a technical place.”

The tradeoff between privacy and utility began long before the particular moment of this workshop, she said. The Census Bureau has long been engaged in different disclosure avoidance routines that have affected the data in ways that are unknown to the public. Working through what is happening in this moment, as the approach to disclosure avoidance is revised, is therefore important to do publicly. She said that her request to census data user stakeholders is to think through all the different layers of this puzzle. boyd said that her role is to “defend privacy,” which she said feels “unfair” in many ways because “I don’t believe I should be speaking on behalf of all of those who have been surveilled and abused by data collection, all of those who were afraid of participating.”

She framed the broader privacy “problem” as involving two challenges, one about individuals’ willingness to participate in the census in the first place (due to fears around privacy and confidentiality) and the second about whether the data could be used (or abused) in a harmful way. On the first point, boyd said that all of the various parties who have studied census outreach—studies by Pew Research, the National Association of Latino Elected and Appointed Officials (NALEO) Educational Fund, or the Census Bureau’s own waves of Census Barriers, Attitudes, and Motivators (CBAMS) studies—have concluded that confidentiality ranks among the most critical factors in determining whether someone will take part in the census or not. The strength of that common finding is really important because those studies also factor in other relevant barriers to participation such as experience, exposure, and marginalization.

boyd said that what scared her about the 17 percent figure—the confirmed reidentification rate in the simulated database reconstruction attack that the Census Bureau mounted against itself—is that 17 percent may be the minimum bar. That attack drew only on data sources available at the time of the 2010 Census; in the present context the number could be considerably higher, and as someone who has followed the growth of commercial data for several years, boyd said that she was “terrified” of what the threat could be. Her personal sense is that, given what she has seen in different kinds of commercial data, a 40 to 60 percent reidentification rate is possible. She added that we still don’t know what the threat to privacy is because the data to date don’t permit a full and fair evaluation.

As an example of deliberate attempts to stir up fear concerning the census, boyd briefly displayed a dual English–Spanish infographic (taking care to superimpose the word “FAKE” in bold, red type across it) that she said had been circulated publicly. It was disguised as an advisory from the American Civil Liberties Union (ACLU) urging readers to “REMAIN SILENT” and “PROTECT YOUR INFORMATION” by refusing to complete any census form, threatening that U.S. Immigration and Customs Enforcement “Will Find You and Deport You!” The graphic underscores the severity of the problem: there are groups at work who actively want to deter census participation, telling potential respondents that their data are going to be used by government officials to directly and meaningfully harm them. To be sure, boyd noted that the Census Bureau has built a network of hundreds of thousands of “community partners,” the kind of people and groups who can help counter negative attacks like the one communicated by this infographic, but the Census Bureau is hard-pressed to counter every possible attack like this, particularly with the “huge, extraordinary level of fear” surrounding the political climate of the 2020 Census.

boyd said that she has “always been a big fan” of William O’Hare’s work on tracking undercounts of small children (Section 9.2), and that work is both fascinating and sobering because of the realization that these are often households where some but not all of the residents were counted. The census form arrived or the interview was completed, but the children missing in those households are missing because they were just not listed. Advocates can try to fill in the gaps, but the broader question needs to be resolved: what are all of the reasons why people would not include members of their households? She said that other examples in this vein include mixed-citizenship households, same-sex couples, and living situations where not everyone’s name is on the lease or the residents may be in violation of local housing regulations. boyd also recounted anecdotes from her own work, talking to people about non-participation. In New York City, she said that there are many people living in Section 8 public housing who are terrified of giving full information to the city (or the census) about who is living in their homes, for fear of being kicked out. These are people who are concerned about being subject to constant surveillance: video surveillance in the buildings, concerns about surveillance of electric and water utility usage. They fear that the information that they provide might be directly used against them, but the prospect of being reidentified from public statistical information is terrifying as well. boyd said that in Detroit, she spoke with a group of community advocates, young people who do not want to identify everyone living in their apartments to the City of Detroit because not all the names appear on the leases. Some of these people claim other reasons, such as registering their cars “at home” rather than in the city for tax purposes (so why link oneself to the census in the city?), but it would often come back to fear of getting in trouble for violating their lease. Trying to convince people to participate in the presence of genuine fear about participating is difficult.

The second major part of the broader privacy problem is the way in which data may be abused. Briefly, boyd displayed a slide listing a variety of the “commercial data” that is commonly available today: court data (criminal and civil), debt data (credit card or medical debt), health data, insurance data, and vehicle data, not to mention “sketchy” data (such as those generated by tracking in smartphone apps) and illegally obtained, publicly leaked data, such as those exposed through breaches at major banks and credit agencies. Census data might only link with these commercial data sources “at statistical levels right now,” but boyd said that this is the whole point: the data ostensibly only for statistical purposes can still be used in harmful ways. These commercial data entities and data brokers want to drill down into personal information—individual elements—and they want to do so to be able to sell that matched data back to various entities. The concern, boyd said, is that those entities buying and using commercial data may include law enforcement, housing providers, employers, and credit brokers. Can we promise people confidentiality, boyd asked, and guarantee that the information about their same-sex partner will not be used to deny them a job in industries where discrimination is still rampant? Can we promise people that different aspects of mixed-race heritage or the configuration of their household won’t get them in trouble with insurers?

She closed by acknowledging the “elephant in this room,” which is the very public concern about citizenship in the 2020 Census. No matter how much we talk about how there is no longer a citizenship question on the 2020 Census form, “we also know that the [Census] Bureau is being dictated to actually figure out a way to provide” that information, which is a terrifying prospect for many people. We need to remember, she said, that we are talking about the most marginalized parts of our community when we talk about privacy, which is why the balance is so critical.

10.2 SEVERITY OF THE REIDENTIFICATION THREAT

Asked by Goroff to comment about the severity of the reidentification threat, Omer Tene (International Association of Privacy Professionals) indicated that the threat is longstanding and very real. He said that attention started being given to this issue about 20 years ago through Latanya Sweeney’s research. Sweeney’s matching of what were at the time thought to be completely anonymized healthcare data records to local voter registration records and the ability to reidentify individuals from the resulting linked data (most notably William Weld, then the governor of Massachusetts) was a key moment. So too was the linkage attack demonstrated by Arvind Narayanan and Vitaly Shmatikov, who utilized the database of movie ratings by 500,000 subscribers that the streaming service Netflix had released publicly to spur researchers to suggest improvements to Netflix’s recommendation engine. By linking those data (also thought to be anonymous, with personal identifiers stripped from the file) to an auxiliary information source (public ratings on the Internet Movie Database, IMDb), Narayanan and Shmatikov were able to reveal personal information about the original Netflix subscribers. A few years after that, Yves-Alexandre de Montjoye and colleagues at MIT demonstrated that four spatiotemporal points—in this case, geolocation “pings” from cellular phones—were enough to uniquely identify 95 percent of individuals in mobility trace data. Recently, the Census Bureau itself added to this literature, with Garfinkel et al. (2019) discussing the way in which public census tables may be used to reidentify individuals.

The bottom line, Tene suggested, is that almost any collection of indirect identifiers (even those thought to be anonymized) can create a “fingerprint” by which individuals can be identified or reidentified. A recent paper by Rocher et al. (2019) suggested that more than 99 percent of Americans would be identifiable in any dataset using 15 demographic attributes, and the decennial census publishes information on several of those. Tene noted that, in the workshop session preceding this panel, William O’Hare had closed his remarks by saying that he himself did not feel any great sensitivity about his age, race, and sex being known. That might be true, and decennial census information might not seem incredibly sensitive in itself, but in the presence of very rich commercial datasets and possibilities for data linkage, knowledge of basic census information on age, race, and sex is key to revealing one’s health conditions, prescriptions, purchases, Internet click-stream, locations visited, and more. These are data that even people who are not in vulnerable categories would probably prefer to keep to themselves.
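The “fingerprint” idea Tene described can be made concrete with a toy linkage of the kind demonstrated by Sweeney and by Narayanan and Shmatikov: join a “de-identified” release to a public roster on shared quasi-identifiers, and any combination that is unique in the roster reidentifies the record. The sketch below is an editorial illustration; all names and values are invented.

```python
# Toy "de-identified" release: direct identifiers stripped, quasi-identifiers kept.
released = [
    {"zip": "02138", "birth_year": 1945, "sex": "M", "sensitive": "diagnosis A"},
    {"zip": "02139", "birth_year": 1982, "sex": "F", "sensitive": "diagnosis B"},
]
# Toy auxiliary source (a voter file or commercial dataset) with names attached.
roster = [
    {"name": "Pat Doe", "zip": "02138", "birth_year": 1945, "sex": "M"},
    {"name": "Lee Roe", "zip": "02139", "birth_year": 1982, "sex": "F"},
    {"name": "Sam Poe", "zip": "02139", "birth_year": 1982, "sex": "M"},
]

KEYS = ("zip", "birth_year", "sex")
index = {}
for person in roster:
    index.setdefault(tuple(person[k] for k in KEYS), []).append(person["name"])

# Any quasi-identifier combination that is unique in the auxiliary source
# re-attaches a name to the "anonymized" record.
for record in released:
    matches = index.get(tuple(record[k] for k in KEYS), [])
    if len(matches) == 1:
        print(matches[0], "->", record["sensitive"])
```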

Like boyd, Tene said that he can think of many situations where people would be concerned about even the basic census information becoming known: domestic violence victims who don’t want to be located (but who might if the dataset is reidentified); people who are sensitive about how they self-identify in terms of their race, their sex, and their partners; people with an interest in fudging their residence location for some reason such as voting or sending their children to school in some other district. Like boyd, Tene said that the debate over the citizenship question and the inclusion of that information (even if the question is not asked) poses “a very grave issue for undocumented immigrants.”

He closed by noting that concerns over privacy might produce another impact on the utility of the data beyond the tradeoff between privacy and accuracy in disclosure avoidance. Knowledge of reidentification risk may incentivize respondents to do their own disclosure avoidance, providing incorrect answers or failing to answer all questions.

10.3 CONTEXT AND PRIVACY

Goroff introduced Helen Nissenbaum (Cornell Tech) to provide some contextual framework. Nissenbaum noted in her introduction that her doctorate is in philosophy and so she would endeavor to “stay in my lane” and defer to her colleagues’ census-specific knowledge. Instead, she said that the theory on which she has been working for several years—contextual integrity—offers a lens through which privacy rights can be understood, and privacy’s value to individuals and to society can be evaluated. In many ways, Nissenbaum said, contextual integrity opens up solution spaces that might not have even suggested themselves if privacy were only viewed as a kind of on/off switch.

The theory is based on the concept of information flow, a neutral way of describing the manner in which information or data moves through society. This flow is grounded in five parameters:

<subject><recipient><information type><transmission principle>

(the fifth parameter, <sender>, was shown in the second position on Nissenbaum’s introductory slide but is omitted here because in the examples she presented, <subject> and <sender> are one and the same). She said that she did not have time to elaborate in detail, but clarified that the particularly mysterious <transmission principle> is “the terms under which information is passed from one entity to another.” In this framework, the information flow of income tax returns may be rendered:

<U.S. citizens><Internal Revenue Service><gross annual income><required, confidentiality assured per law>

(“Citizens of the United States are obliged to reveal gross annual income to the Internal Revenue Service under conditions of confidentiality, except as required by law”). Likewise, the basic decennial census problem could be formulated:

<household residents><U.S. Census Bureau><census information><required, confidentiality of legal identity>

(“Household residents are required to convey to the U.S. Census Bureau answers to questions posed on the census questionnaire with assurances of strict confidentiality regarding legal identity”).
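Nissenbaum’s parameterization lends itself to a simple structured representation. The sketch below is one hypothetical way to encode her five-parameter flows (restoring the <sender> slot omitted above); the field values transcribe her census example.

```python
from typing import NamedTuple

class InformationFlow(NamedTuple):
    """A contextual-integrity information flow and its five parameters."""
    subject: str
    sender: str                  # identical to subject in the examples shown
    recipient: str
    information_type: str
    transmission_principle: str  # the terms under which the information passes

census_flow = InformationFlow(
    subject="household residents",
    sender="household residents",
    recipient="U.S. Census Bureau",
    information_type="census information",
    transmission_principle="required; confidentiality of legal identity",
)
print(census_flow.transmission_principle)
```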

Appropriate information flow follows entrenched norms or rules, and we live with all of these rules in society. Nissenbaum said that the theory of contextual integrity asks why these are the rules and why we have certain resulting expectations about the way data flow. The answer is that these rules are good (and “the ones we want to promote” and defend) because they are legitimate rules: they serve purposes and are subject to fundamental values. In the census example, the purposes range from apportionment and redistricting to fund allocation, but they also include the general concept of “understanding the nation” and providing useful data to crucial social sectors. One expressly forbidden purpose is the use of the data for administrative, individual-level decision making. The fundamental values that the purposes serve include the principle that individual respondents must not be harmed through disclosures of the information they provide and that maintaining trust is essential to securing honest participation.1

Nissenbaum argued that this framework is useful for framing the census problem, even though the “information type” and “transmission principle” parameters have evolved over time. The first census in 1790 asked just five questions (including age only to the precision of above or below 16) and required that the results be posted in a public place for maximum scrutiny. A growth in content and finer-grained collection of age information through the first half of the 1800s produced the 32-question ledger used in the 1910 Census (and much tighter restriction on who gets access to the census information). Nissenbaum also noted some “shameful moments with regard to the census” that eroded some of the appropriateness in this information flow structure: the provision of information to the U.S. Department of Justice and the military about draft-age men in World War I and the role of Census Bureau data in implementing the internment of Japanese Americans in World War II. In those cases, the <recipient> (the Census Bureau) broke its commitment and, in Nissenbaum’s view, undermined fundamental values of privacy, security, and trust. Nissenbaum indicated that the issue is whether the historical “shameful moments” (inappropriately providing raw data to parties who were not entitled to it) are equivalent to the current situation, in which it has been demonstrated that, using aggregate and public products from the 2010 Census, it is possible to recreate individual data records and attach legal identifiers.

___________________

1 As illustration of how long and with what fervor these values have guided U.S. census conduct, Nissenbaum briefly displayed a passage from President Taft’s March 15, 1910, proclamation of the 1910 Census (see https://www.census.gov/2010census/news/pdf/1910_census.pdf), adding emphasis to key points:

The sole purpose of the census is to secure general statistical information regarding the population and resources of the country, and replies are required from individuals only in order to permit the compilation of such general statistics. The census has nothing to do with taxation, with army or jury service, with the compulsion of school attendance, with the regulation of immigration, or with the enforcement of any national, state, or local law or ordinance, nor can any person be harmed in any way by furnishing the information required. There need be no fear that any disclosure will be made regarding any individual person or his affairs. For the due protection of the rights and interests of the persons furnishing information, every employee of the Census Bureau is prohibited, under heavy penalty, from disclosing any information which may thus come to his knowledge.

Nissenbaum asked rhetorically about the solution space here and suggested that doing nothing to change the disclosure limitation routines used in the census is not the same as “maintaining the status quo.” That status quo is untenable because of the factors already mentioned at the workshop: the new science of big data and ready access to computing power and auxiliary data. Instead, she said, contextual integrity suggests a number of other areas to look at as a solution space. In terms of the <recipient> and <transmission principle> parameters, tighter limits could be put on “who gets what” and who gets access to the data and products. On <information type>, options include paring back the number and content of questions, coarsening the granularity of categories in the data products, and implementing differential privacy–based routines.

10.4 LEGAL PROTECTIONS OF PRIVACY

After having heard the word “tradeoff” invoked many times in the second-day workshop sessions he attended, Paul Ohm (Georgetown Law Center) began by saying that he wanted to disabuse the audience of a fundamental premise flowing through the entire conversation thus far: that our enterprise concerning questions of privacy is finding a perfect calibration between privacy and utility. He argued that there is an important (and possibly dominant) thread in privacy scholarship that casts it as a “strictly utilitarian endeavor,” but he indicated that he wanted to argue from a different position and a very different conception of privacy. Particularly as it concerns discussions about the government, Ohm said that privacy is about fundamental human rights and fundamental notions of respect for other individuals, not a “quaint, airy, philosophical” notion.

Ohm said that he had not heard mention made of the Fair Information Practice Principles (FIPPs) in the workshop discussion. Though often described as a “European conception” of privacy, the FIPPs were first formulated in the United States in the 1960s and early 1970s. Ohm said that the principles were first articulated in response to a growing fear that the U.S. federal government was moving toward the creation of a national data center, a centralized “giant database in the sky” that would house all government information about every American citizen. Ohm said that the concern was that “a lot of very well-meaning social scientists” were advocating for this national center based on “the benefits for (a), our research, and (b), the nation,” and this in turn prompted the growth of the privacy movement in the United States. Since that time, Ohm said, the FIPPs have been “encapsulated worldwide,” perhaps most prominently in the European Union and the European Economic Area’s General Data Protection Regulation (GDPR) and the 2018 California Consumer Privacy Act (CCPA). When it comes to private enterprise and corporations, Ohm said that the United States has never fully embraced the FIPPs model: “corporations do run roughshod over individual privacy,” and privacy debates in the corporate sense tend to be basic utilitarian tradeoffs between cost and benefit.

Ohm warned, on the other hand, that it would be a mistake to think that the FIPPs do not permeate the federal laws of the United States. When it comes to privacy debates where the government is concerned, “we have responded differently” as a nation. In particular, the FIPPs were encapsulated in the Privacy Act of 1974, which deals with systems of records collected by government authorities. How effective or meaningful the act’s protections are in practice is a debatable question, but Ohm’s point is that the 1974 law crystallized the idea of “this is what the U.S. government owes its people” and “this is the respect for fundamental human rights that privacy requires.” Ohm contended that the Census Act, Title 13 of the U.S. Code, is another example of the FIPPs undergirding federal law. The Census Act has a promise of confidentiality embedded within it for the instrumental purpose of gaining respondent trust and encouraging their honest participation. Ohm contended that Title 13 also represents the idea that “when we’re talking about the way the U.S. government treats information about every one of us,” then “we need to hold them to a higher standard.”

In short, Ohm argued, he and other privacy scholars don’t view the problem as finding “precisely the exact scalpel to allow exactly the maximal amount of data utility while also protecting this abstract thing called privacy,” nor is that interpretation consistent with his interpretation of the Census Act and the Privacy Act.

Apologizing for the double negative, Ohm echoed Phil Leclerc’s (Section 2.1) and Nissenbaum’s basic point that “the Census Bureau is not allowed to do nothing,” that the status quo is untenable, that the new data processing world is such that doing disclosure avoidance as it has been done in past censuses is a violation of the legal prohibition on providing identifiable information. Ohm commended the Census Bureau for saying as much as they have about the way things were done in the 1970 through 2010 Censuses, including table suppression and data swapping. Ohm said that we should be comparing differential privacy to that prior disclosure limitation regime—a regime that he contended was inherently non-transparent, “where you were trusting the Census Bureau to do right by privacy” while still allowing enough of both privacy protection and utility. His conclusion is that there was always a little bit of “black magic” involved in that routine, a bit of a notion that “we’ll know exactly when the data feels like it’s reidentifiable enough” to warrant a swap here or an imputation there.

From Ohm’s perspective, outside researchers and census data users really had only two possible positions they could take in that historical status quo: (1) treat the data as if they are really accurate, under the assumption that “they’re not doing many swaps at all” even though that assumption could never be proven, or (2) realize that the data are quite inaccurate—maybe biased, maybe not, but certainly inaccurate—and simply pretend that the data are accurate. Ohm argued that neither of these worlds is really acceptable.

Where the historical status quo is opaque and “black magic,” Ohm argued that the differential privacy approach is transparent and “flipping the light switch on”: opening up the mechanism, releasing the source code, having an open debate about “this thing called ϵ,” engaging researchers on incorporating newly protected data into their routines and developing techniques for assessing errors, and—in general—applying science to this enterprise of privacy. Ohm said that “this is why I’m a little befuddled, frankly, by the fear” expressed in many of the presentations. How, he asked, can the new, transparent, open, and scientific process be seen as threatening in any way except only that it is “different”? What Ohm has heard expressed by the user community “amounts to a fear of the unknown,” but he contended that the differential privacy–based disclosure avoidance system is “a move to the known” (or at least “to the much more knowable”).

Ohm argued that the Census Bureau should be commended for respecting fundamental human rights, respecting the law, and creating “such a welcome change” in “implementing this very difficult thing that they’ve been asked to do.” He also said that he is overjoyed about what the Census Bureau’s work means for the research in the field of statistically provable privacy, serving as “the shot in the arm that this field needed.” He closed by noting his expectation that “landmark advances in privacy protection” would be made based on what the Census Bureau has put in motion now and that the Census Bureau is “probably going to find ways to extract more utility out of these techniques as well.”

10.5 LESSONS FROM PRIVACY WORK IN HEALTH DATA

Making a brief introduction, Goroff joked that, on one hand, health data can help find cures and improve lives, while on the other it has become clear from this discussion that “my personal health data can be used against me.” How is census participation different? Daniel Barth-Jones (Mailman School of Public Health, Columbia University) quickly responded that, for one thing, census information is being collected on the entire population simultaneously, on a compulsory basis, which is quite a different situation from the health context where there is a great deal of ambiguity as to whether a particular person’s information is in the dataset or not.

Barth-Jones noted his background as both an epidemiologist and a researcher in statistical disclosure control, and said that his comments would try to address the larger context here, on the important historical and societal debate now underway. His dual fields are on a “public policy collision course,” precisely because the “epidemiologic triad” that drives everything in that field of study—characteristics of persons, places, and times—is exactly the type of information that is now serving as “quasi-identifiers” in reidentification studies (and attacks). Census data, he argued, are an invaluable public good, serving as the foundation for many political, social, and scientific purposes. Knowing that Ohm would argue against being so “utilitarian,” Barth-Jones said that the task at hand here is very much one of balancing accuracy and privacy protections. His contribution here would be a big-picture perspective on the “nuance” required in that balancing.

Barth-Jones said that two particularly relevant concepts from the arena of medical research ethics are beneficence—maximizing possible benefits and minimizing possible harms—and justice (the equitable distribution of the benefits of research or of the risks associated with that research). A third concept, rarity, simply describes the characteristic of being unusual relative to other parties and is a major factor (at the individual and group levels) in both the distribution of risks and benefits associated with reidentification.

So, Barth-Jones said, this is definitely a tradeoff, and the “inconvenient truth” is that we cannot have both perfect privacy protection and perfect information quality at the same time. He said that he thinks of the tradeoff between disclosure protection and information as operating on a logarithmic scale: relatively small losses in information quality can make it possible to achieve very important privacy protections and reduce reidentification risks by orders of magnitude.

Barth-Jones said that the discussion was reminiscent of a question that he posed in his contribution to the Harvard Law School Petrie-Flom Center’s Online Symposium on the Law, Ethics, and Science of Reidentification Demonstrations in 2013, a question on “ethical equipoise”:2

Is it an ethically compromised position, particularly in the coming age of personalized medicine, if we end up purposefully masking the racial, ethnic or other group membership status information (e.g. American Indians or [Latter-Day Saints] Church members, etc.) for certain individuals, or for those with certain rare genetic diseases/disorders, in order to protect them against supposed re-identifications? In making this ethical determination, we must, of course, recognize that by doing so, we would also deny them the benefits of research conducted with de-identified data that could help address their health disparities, find cures for their rare diseases, or facilitate “orphan drug” research that would otherwise not be economically viable.

___________________

2 See https://blog.petrieflom.law.harvard.edu/2013/05/29/public-policy-considerations-for-recent-re-identification-demonstration-attacks-on-genomic-data-sets-part-1-re-identification-symposium/. The excerpt on Barth-Jones’ workshop slides is structured slightly differently from the original blog publication.

Barth-Jones said that specific language choices matter greatly in these discussions, and that language is crucial for establishing trust and maintaining transparency. The Census Bureau’s pledges and the pledges of Title 13 speak of “confidentiality,” but he argued that it’s not clear that the public will understand the connection to “differential privacy” or “formal privacy.” In particular, he worried about misunderstanding of the meaning of “differential.” Barth-Jones added that things get more complicated because we tend to speak in terms of “privacy guarantees” and “disclosure avoidance” when both those terms imply more of an elimination of reidentification risk than is actually the case. He referred to Chris Clifton’s work that argues that the relationship between the privacy-loss budget ϵ and actual reidentifiability is not easy to chart. Barth-Jones noted that the Census Bureau’s own work in this regard, the reprise of the database reconstruction attack using DAS-privatized microdata for different ϵ settings (Figure 2.2) “started to plateau fairly strongly” for larger values of ϵ, leaving levels of reidentification risk that are beyond de minimis. We can talk about “disclosure avoidance,” he said, but “perhaps ‘disclosure reduction’ is better language.”

Barth-Jones cast the difference between traditional statistical disclosure limitation work and differential privacy as the difference between focusing on a limited set of quasi-identifiers—data details that are assessed for their replicability, accessibility, and distinguishability in evaluating reidentification risk—and making the assumption that “everything is personally identifiable information.” That is, differential privacy makes the assumption that all data elements are potentially knowable by intruders and are consequently equally useful for reidentification and equally sensitive (or capable of inflicting privacy harm). He argued that the particular context of a census, returning to the opening point about data collection being performed on the entire population, is such that the stronger assumptions of differential privacy need to be considered.

“What’s to love about differential privacy?” Barth-Jones asked—and answered, with several points. He cited privacy guarantees and the “mathematical elegance” of the theory as chief among the advantages. He also said that the theory’s broad assumptions about intruder knowledge and capabilities treat those attackers as “nearly omniscient, nearly omnipotent in their computing capabilities, and constantly co-conspiring,” which creates a “level of distrust” that is actually useful given the extent of harm that the attackers can inflict. He also cited the just-mentioned assumptions that all pieces of information are potentially harmful for reidentification and attribute inference. He added that the theory is attractive because it is composable and imposes consistency, though he did not elaborate on the meaning of those points. Barth-Jones closed his discussion of the positive aspects of differential privacy by stating that it is difficult to think of a use case that needs or demands the application of differential privacy more than a complete-population decennial census, and that there is no organization that he would trust more to implement differential privacy well than the Census Bureau.
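Composability, which Barth-Jones cited without elaboration, means that privacy loss adds across releases: k independent ϵ-differentially private queries are jointly (kϵ)-differentially private. The toy sketch below, assuming simple Laplace count queries rather than the Census Bureau’s mechanism, shows the flip side that motivates the worry about repeated runs noted just below: averaging many releases of the same count washes the noise out, which is precisely the accumulation the composed budget accounts for.

```python
import random
import statistics

random.seed(1)
TRUE = 100.0   # the protected count (hypothetical)
EPS = 0.5      # per-release privacy budget for a sensitivity-1 count query

def laplace(b: float) -> float:
    # The difference of two independent exponentials is Laplace(0, b).
    return random.expovariate(1.0 / b) - random.expovariate(1.0 / b)

def release(eps: float) -> float:
    return TRUE + laplace(1.0 / eps)

# Sequential composition: k independent releases are jointly (k * EPS)-DP,
# and averaging them recovers the protected count with shrinking noise.
for k in (1, 10, 100):
    avg = statistics.fmean(release(EPS) for _ in range(k))
    print(f"k = {k:3d}  total epsilon = {k * EPS:5.1f}  averaged release = {avg:7.2f}")
```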

That said, he also needed to ask the dual question, and privacy “guarantees” (this time, pointedly in quotes) topped the list of “what’s not to love about differential privacy,” hearkening back to the earlier discussion of “guarantee” having some overly strong connotations. Other disadvantages mentioned by Barth-Jones included the complexity of communicating and describing the approach to the public (related to the previously mentioned trust and transparency issues). He also acknowledged that the same very broad and strong assumptions that make differential privacy desirable in the new threat environment serve to impose serious “accuracy costs.” The Census Bureau’s need to impose certain invariants in the process also incurs “accuracy costs.” On a related point, he commented that pure differential privacy strictly enforces the “privacy” part of the equation but “only optionally enforces the ‘accuracy’ side” through judicious selection of ϵ. Finally, he noted, differential privacy approaches are still potentially susceptible to some avenues of attack. Repeated instantiations of the process (such as multiple runs of the planned 2020 Census Disclosure Avoidance System) might still be revelatory if they emerged in full, and “correlated observations don’t receive the same guarantees.”

Barth-Jones closed with quick mention of three “not-so-random concerns” about the 2020 Census DAS methodology going forward. He noted particular interest in the off-spine geographic layer of ZIP Code Tabulation Areas (ZCTAs), which are very important and commonly used in analyzing health data. He expressed worries about “subtraction geographies,” or the potential for attackers to target small areas by subtracting overlaid geographic levels with partially overlapping borders. Finally, he noted concern about the “competition” that would flare up between “individuals, groups, researchers, and politicians” for shares of the privacy-loss budget ϵ.
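The “subtraction geographies” worry can be illustrated with two made-up published totals for areas that differ by a sliver of blocks; none of these numbers come from actual census products.

```python
# Hypothetical published totals for two overlapping geographies that
# differ only by a small sliver of blocks (numbers invented):
tract_total = 4_183  # a census tract
zcta_total = 4_180   # a ZCTA covering the same territory minus one block

# With exact, unprotected counts, subtraction isolates the sliver:
print(tract_total - zcta_total)  # 3 residents of the left-out block

# Under differential privacy, each published total carries independent
# noise, so the difference inherits the noise of both releases; the
# smaller the sliver, the less the subtraction reveals relative to noise.
```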

10.6 DISCUSSION

Goroff opened the discussion period with a provocative question: do we really need census blocks? He clarified that many of the application areas that he hears about really strike him as instances of over-fitting and that, even if we were “just counting rocks and stars and fish and had no privacy concerns whatsoever,” the application of differential privacy techniques might still be desirable, if only to prevent over-fitting.

  • boyd answered first that this group isn’t in a position to say whether we “need” census blocks and that that is a question for the broader community. She added her understanding that systematic delineations of census blocks are a relatively recent phenomenon and that, though they are used heavily in redistricting, there are camps of census stakeholders on both sides. Some of them have wanted to eliminate blocks (or block-level tabulations) for quite some time, while others staunchly defend their production. To boyd, the question boils down to the level of detail we can stomach letting go of and decisions about whether eliminating some block-level tabulations would improve matters for other stakeholders. boyd added that another area that should be discussed is the aggregation of responses to the race question that involve more than three of the major race categories. Is the full six-way combination necessary to tabulate, and at what geographic detail, or could we stomach removing that level of specificity? boyd said that she was surprised that there hasn’t been more public conversation along these lines and wondered aloud what it would mean to put out a Federal Register notice and ask bluntly whether we can do without tables at the census block level or splits of the race question by 3+ categories.
  • Barth-Jones answered that the question of blocks is a salient point. In making these privacy-accuracy tradeoffs, something has to give in the process, and restricting what is tabulated at the block level strikes him as an instance where some give and take can occur.
  • Nissenbaum agreed that something has to give and that the idea of ending block-level tabulation deserves consideration. She added that another alternative that might be considered is restricting access to the more detailed data products. Perhaps that’s an alarming concept, she said, but that method of building in more accountability for what people do with the data products should be on the table as something that could give.

Tene and Ohm both took turns in response, though both addressed topics other than census blocks. Tene agreed with what the fellow panelists said, but added that he thought Ohm’s aversion to tradeoffs as a utilitarian sort of equation is “overstated.” The GDPR provides for companies and organizations to “process data for their legitimate interest and balance that against the fundamental rights of individuals.” The Federal Trade Commission’s “unfairness doctrine” as it applies to privacy cases has a cost-benefit analysis, weighing what harm is done but also what benefit is on the other side. He also said that even the deidentification rule pursuant to the Health Insurance Portability and Accountability Act (HIPAA) allows activities using information that is no longer considered “personal health information” if it is deidentified to some degree, which amounts to an implicit tradeoff. Tene said that ϵ in differential privacy settings is inherently a utility-privacy tradeoff. In his response, Ohm recalled his time as the White House privacy appointee to the Commission on Evidence-Based Policymaking, which he said was a struggle (as one of two true “privacy people” on a commission of 12). The cultural clash notwithstanding, which he thought was similar to that playing out in the workshop room, the commission ultimately agreed to a framework that sounds like what he understood Nissenbaum to be driving towards: “the notion that you can inject friction in meaningful ways that serves multiple purposes simultaneously,” in ways that force everyone to consider the person on the other side not just as a burden but as a person whose individual rights deserve respect. While he didn’t agree with everything in the commission’s final report, Ohm said that he signed it because of the points in it that speak about the value and virtue of that friction, and also because it speaks to the next generation (today’s graduate students) that will make advances in the more privacy-friendly world we’re discussing.

Eddie Hunsinger (California Department of Finance) asked if the Census Bureau was setting too high a bar with differential privacy in the 2020 Census. Beyond the effects and bias necessarily introduced by the “cosmetic” post-processing, basic things like the age structure and total population of small villages are also impacted (and, in some senses, threatened) by the pure differential privacy aspect of noise infusion. He asked whether the panelists were concerned about private-sector data brokers emerging in place of the Census Bureau, eroding people’s privacy whether the Census Bureau is providing data or not. (Goroff sought to queue up three questions from the audience in a burst and then have the panelists react, but time was such that the answers were more omnibus or generic; hence, there was no direct answer to Hunsinger’s question.)

Gwynne Evans-Lomayesva (National Congress of American Indians, NCAI) noted that NCAI is currently engaged in efforts to try to bridge distrust of the government and ensure a complete count in 2020. All of the panelists had spoken about the importance of ensuring privacy in that regard, but she noted Randall Akee’s findings earlier in the workshop (Section 8.1) of entire American Indian and Alaska Native (AIAN) communities being erased in the 2010 DDP. What happens after 2020 if communities look at the privatized data resulting from the census and conclude that it doesn’t match the reality they know? How does that affect the balance, and what is the acceptable privacy loss for those communities? Barth-Jones said that the situation with AIAN communities is particularly salient and important to discuss, as it involves a lot of the complexity he was alluding to. The really different tradeoffs for that community will involve protecting that community’s essential “rarity” while also reconciling their privacy needs at the individual and group levels. He cautioned that “it’s almost impossible to reach a perfect solution for all parties concerned.”

Nancy Krieger (Harvard T.H. Chan School of Public Health) asked about bracketing (in other words, transitioning from old disclosure avoidance regimes to the new regime). She noted that she would hate to have an overgeneralized claim made about all data users, all researchers, and all public health stakeholders that we don’t care about privacy rights or government surveillance, because that certainly is not true. That said, those researchers and stakeholders are slowly becoming aware that switching to differential privacy–based methods will result in some “discontinuities,” that is, differences that mean real things to people and their communities. Krieger said that there need to be real resources for training and education about this transition because it can’t be a matter of waiting for the new, next, more privacy-familiar generation to “come up and say great, whoopee, we never knew what the old data were like, anyway.” She noted her own personal alarm that many census data users are not yet aware of the change that is coming and asked the panel for their thoughts about what attention and what resources are going to be available to actually make this transition happen.

boyd said that she did not view this workshop as a one-and-done session to find errors in the 2010 DDP, and does not expect that the Census Bureau views it that way either. She commended the participants for pointing out important issues in meaningful advocacy and the Census Bureau for trying to do this as a technical consultancy. She expressed hopes that the whole data user community would continue to engage, challenge, and improve it. She said it is becoming clear to her that the data user community and the Census Bureau too often operate in isolation without a feedback mechanism between them, which is one reason why this is important. boyd said that she accepted Krieger’s point that local governments and public health officials don’t have the resources that they need, but also noted that she finds it “really deeply disturbing” that the United States has so much money invested in surveillance. We can talk about the census budget, she said, but we also need to talk about the military surveillance budget and the law enforcement surveillance budgets. She added that she finds it funny that, on this stage and in this environment, “I definitely feel like a radically pro-privacy person,” but in her usual sphere of operations, “I’m usually the radical [for] ‘give data access’ ”—to wit, in her usual communities, the instinct has become one of only putting out the bare-minimum, absolutely required files, and her sense is “we need to back away from that frame.” She urged the Census Bureau and the data user community to both continue their consultancies and to bring in the perspective of the affected communities.

Ohm pushed back slightly on the idea that education and training was going to be a difficult, resource-intensive lift; “I want to disabuse you of the notion that differential privacy is rocket science.” He said that he has an undergraduate degree in computer science and doesn’t “pretend to be the world’s expert in differential privacy” but has read enough papers to understand the basic question and finds the material far less complex than learning multivariate regression or other statistical methods for the first time. So, he said, he disagreed with the notion that the adjustment to the new system would be so onerous as to stop the work. “I think this community will adapt.”

Nissenbaum acknowledged Evans-Lomayesva and Krieger’s “really excellent concerns,” and again suggested restructuring access to the full census tabulations by restricting it to those “who are responsible and can swear to only utilizing it for the particular purposes.” Goroff then closed the session with an appeal to continue this discussion a little less formally and saluted this community for having the tough conversations and working on the difficult challenges.

Suggested Citation:"10 Panel Discussion on Key Privacy Issues." National Academies of Sciences, Engineering, and Medicine. 2020. 2020 Census Data Products: Data Needs and Privacy Considerations: Proceedings of a Workshop. Washington, DC: The National Academies Press. doi: 10.17226/25978.
×
Page 127
Suggested Citation:"10 Panel Discussion on Key Privacy Issues." National Academies of Sciences, Engineering, and Medicine. 2020. 2020 Census Data Products: Data Needs and Privacy Considerations: Proceedings of a Workshop. Washington, DC: The National Academies Press. doi: 10.17226/25978.
×
Page 128
Suggested Citation:"10 Panel Discussion on Key Privacy Issues." National Academies of Sciences, Engineering, and Medicine. 2020. 2020 Census Data Products: Data Needs and Privacy Considerations: Proceedings of a Workshop. Washington, DC: The National Academies Press. doi: 10.17226/25978.
×
Page 129
Suggested Citation:"10 Panel Discussion on Key Privacy Issues." National Academies of Sciences, Engineering, and Medicine. 2020. 2020 Census Data Products: Data Needs and Privacy Considerations: Proceedings of a Workshop. Washington, DC: The National Academies Press. doi: 10.17226/25978.
×
Page 130
Suggested Citation:"10 Panel Discussion on Key Privacy Issues." National Academies of Sciences, Engineering, and Medicine. 2020. 2020 Census Data Products: Data Needs and Privacy Considerations: Proceedings of a Workshop. Washington, DC: The National Academies Press. doi: 10.17226/25978.
×
Page 131
Suggested Citation:"10 Panel Discussion on Key Privacy Issues." National Academies of Sciences, Engineering, and Medicine. 2020. 2020 Census Data Products: Data Needs and Privacy Considerations: Proceedings of a Workshop. Washington, DC: The National Academies Press. doi: 10.17226/25978.
×
Page 132
Suggested Citation:"10 Panel Discussion on Key Privacy Issues." National Academies of Sciences, Engineering, and Medicine. 2020. 2020 Census Data Products: Data Needs and Privacy Considerations: Proceedings of a Workshop. Washington, DC: The National Academies Press. doi: 10.17226/25978.
×
Page 133
Suggested Citation:"10 Panel Discussion on Key Privacy Issues." National Academies of Sciences, Engineering, and Medicine. 2020. 2020 Census Data Products: Data Needs and Privacy Considerations: Proceedings of a Workshop. Washington, DC: The National Academies Press. doi: 10.17226/25978.
×
Page 134
Suggested Citation:"10 Panel Discussion on Key Privacy Issues." National Academies of Sciences, Engineering, and Medicine. 2020. 2020 Census Data Products: Data Needs and Privacy Considerations: Proceedings of a Workshop. Washington, DC: The National Academies Press. doi: 10.17226/25978.
×
Page 135
Suggested Citation:"10 Panel Discussion on Key Privacy Issues." National Academies of Sciences, Engineering, and Medicine. 2020. 2020 Census Data Products: Data Needs and Privacy Considerations: Proceedings of a Workshop. Washington, DC: The National Academies Press. doi: 10.17226/25978.
×
Page 136
Suggested Citation:"10 Panel Discussion on Key Privacy Issues." National Academies of Sciences, Engineering, and Medicine. 2020. 2020 Census Data Products: Data Needs and Privacy Considerations: Proceedings of a Workshop. Washington, DC: The National Academies Press. doi: 10.17226/25978.
×
Page 137
Suggested Citation:"10 Panel Discussion on Key Privacy Issues." National Academies of Sciences, Engineering, and Medicine. 2020. 2020 Census Data Products: Data Needs and Privacy Considerations: Proceedings of a Workshop. Washington, DC: The National Academies Press. doi: 10.17226/25978.
×
Page 138
Suggested Citation:"10 Panel Discussion on Key Privacy Issues." National Academies of Sciences, Engineering, and Medicine. 2020. 2020 Census Data Products: Data Needs and Privacy Considerations: Proceedings of a Workshop. Washington, DC: The National Academies Press. doi: 10.17226/25978.
×
Page 139
Suggested Citation:"10 Panel Discussion on Key Privacy Issues." National Academies of Sciences, Engineering, and Medicine. 2020. 2020 Census Data Products: Data Needs and Privacy Considerations: Proceedings of a Workshop. Washington, DC: The National Academies Press. doi: 10.17226/25978.
×
Page 140
Suggested Citation:"10 Panel Discussion on Key Privacy Issues." National Academies of Sciences, Engineering, and Medicine. 2020. 2020 Census Data Products: Data Needs and Privacy Considerations: Proceedings of a Workshop. Washington, DC: The National Academies Press. doi: 10.17226/25978.
×
Page 141
Suggested Citation:"10 Panel Discussion on Key Privacy Issues." National Academies of Sciences, Engineering, and Medicine. 2020. 2020 Census Data Products: Data Needs and Privacy Considerations: Proceedings of a Workshop. Washington, DC: The National Academies Press. doi: 10.17226/25978.
×
Page 142
Suggested Citation:"10 Panel Discussion on Key Privacy Issues." National Academies of Sciences, Engineering, and Medicine. 2020. 2020 Census Data Products: Data Needs and Privacy Considerations: Proceedings of a Workshop. Washington, DC: The National Academies Press. doi: 10.17226/25978.
×
Page 143
Suggested Citation:"10 Panel Discussion on Key Privacy Issues." National Academies of Sciences, Engineering, and Medicine. 2020. 2020 Census Data Products: Data Needs and Privacy Considerations: Proceedings of a Workshop. Washington, DC: The National Academies Press. doi: 10.17226/25978.
×
Page 144
Next: 11 Census Bureau's Responses and Own Analyses of 2010 Demonstration Data Products »
2020 Census Data Products: Data Needs and Privacy Considerations: Proceedings of a Workshop Get This Book
×
 2020 Census Data Products: Data Needs and Privacy Considerations: Proceedings of a Workshop
Buy Paperback | $60.00 Buy Ebook | $48.99
MyNAP members save 10% online.
Login or Register to save!
Download Free PDF

The Committee on National Statistics of the National Academies of Sciences, Engineering, and Medicine convened a 2-day public workshop from December 11-12, 2019, to discuss the suite of data products the Census Bureau will generate from the 2020 Census. The workshop featured presentations by users of decennial census data products to help the Census Bureau better understand the uses of the data products and the importance of these uses and help inform the Census Bureau's decisions on the final specification of 2020 data products. This publication summarizes the presentation and discussion of the workshop.

READ FREE ONLINE

  1. ×

    Welcome to OpenBook!

    You're looking at OpenBook, NAP.edu's online reading room since 1999. Based on feedback from you, our users, we've made some improvements that make it easier than ever to read thousands of publications on our website.

    Do you want to take a quick tour of the OpenBook's features?

    No Thanks Take a Tour »
  2. ×

    Show this book's table of contents, where you can jump to any chapter by name.

    « Back Next »
  3. ×

    ...or use these buttons to go back to the previous chapter or skip to the next one.

    « Back Next »
  4. ×

    Jump up to the previous page or down to the next one. Also, you can type in a page number and press Enter to go directly to that page in the book.

    « Back Next »
  5. ×

    Switch between the Original Pages, where you can read the report as it appeared in print, and Text Pages for the web version, where you can highlight and search the text.

    « Back Next »
  6. ×

    To search the entire text of this book, type in your search term here and press Enter.

    « Back Next »
  7. ×

    Share a link to this book page on your preferred social network or via email.

    « Back Next »
  8. ×

    View our suggested citation for this chapter.

    « Back Next »
  9. ×

    Ready to take your reading offline? Click here to buy this book in print or download it as a free PDF, if available.

    « Back Next »
Stay Connected!