Currently Skimming:

Appendix B: Concepts and Methods for De-identifying Clinical Trial Data
Pages 203-256

The Chapter Skim interface presents what we've algorithmically identified as the most significant single chunk of text within every page in the chapter.


From page 203...
... (2014a,b), to examine how to ...
1 This background report was commissioned by the Institute of Medicine Committee on Strategies for Responsible Sharing of Clinical Trial Data, written by Khaled El Emam, University of Ottawa, and Bradley Malin, Vanderbilt University.
From page 204...
... It should be recognized that de-identification is not, by any means, the only privacy concern that needs to be addressed when sharing clinical trial data. In fact, there must be a level of governance in place to ensure that the data will not be analyzed or used to discriminate against or stigmatize the participants or certain groups (e.g., religious or ethnic)
From page 205...
... Examples of such releases include the publicly available clinical trial data from the International Stroke Trial (IST) (Sandercock et al., 2011)
From page 206...
... . The different approaches for sharing clinical trial IPD are summarized in Figure B-1.
From page 207...
... FIGURE B-1  Different approaches for sharing clinical trial data. [Figure labels recovered from the extracted text: microdata, online portal, public, formal request; a scale running from "least control by sponsor, limited constraints on QI" to "most control by sponsor, significant constraints on QI"; risks listed: deliberate re-identification, inadvertent re-identification, accidental release and re-identification.] NOTE: QI = qualified investigator.
From page 208...
... For the disclosure of clinical trial data, the HIPAA Privacy Rule de-identification standards offer a practically defensible foundation even if they are not a regulatory requirement.
From page 209...
... The Privacy Rule requires that a number of these data elements be "removed." However, there may be acceptable alternatives to actual removal of values as long as the risk of reverse engineering the original values is very small. Compliance with the Safe Harbor standard also requires that the sponsor ...
FIGURE B-2  The two de-identification standards in the HIPAA Privacy Rule.
From page 210...
... , potentially reducing the utility of the data. Many meaningful analyses of clinical trial data sets require the dates and event order to be clear.
From page 211...
... Harbor, the HIPAA Privacy Rule provides for an alternative in the form of the Expert Determination method. This method has three general requirements:
• The de-identification must be based on generally accepted statistical and scientific principles and methods for rendering information not individually identifiable.
From page 212...
... Unique and Derived Codes Under HIPAA
According to the 18th item in Safe Harbor (see Box B-2), "any unique identifying number, characteristic, or code" must be removed from the data set; otherwise it would be considered personal health information.
From page 213...
... Therefore, because the identified data exist with the sponsor, the data provided to the QI cannot be considered de-identified. This is certainly not practical because the original data are required for legal reasons (e.g., clinical trial data need to be retained for an extended period of time whose duration depends on the jurisdiction)
From page 214...
... Arguably, the term "anonymization" would be the appropriate term to use here given its more global utilization. However, to remain consistent with the HIPAA Privacy Rule, we use the term "de-identification" in this paper.
From page 215...
... .[3] Furthermore, privacy statutes and ...
3 This statement does not apply to genomic data. See the summary of evidence on genomic data later in this paper for more detail.
From page 216...
... Although participants may consider certain types of attribute disclosure to be a privacy violation, it is not considered so when the objective is anonymization of the data set. Technical methods have been developed to modify the data to protect against attribute disclosure (Fung et al., 2010)
From page 217...
... For example, the background knowledge may be available because the adversary knows a particular target individual in the disclosed clinical trial data set, an individual in the data set has a visible characteristic that is also described in the data set, or the background knowledge exists in a public or semipublic registry. Examples of quasi-identifiers include sex, date of birth or age, locations (such as
From page 218...
... Therefore, our focus is on these two types of variables.
Classifying Variables
An initial step in being able to reason about the identifiability of a clinical trial data set is to classify the variables into the above categories.
From page 219...
... , then both the zip code and date of birth are knowable. Knowability will depend on whether an adversary is an acquaintance of a data subject.
From page 220...
... This is an important decision because the techniques often used to protect direct identifiers distort the data and their truthfulness significantly. Is it possible to know which fields will be used for analysis at the time that de-identification is being applied?
From page 221...
... How Is Re-identification Probability Measured?
Measurement of re-identification risk is a topic that has received extensive study over multiple decades.
From page 222...
... If a direct identifier does exist in a clinical trial data set, then by definition it will be considered to have a very high risk of re-identification. Strictly speaking, the probability is not always 1.
From page 223...
... . In general, there is a trade-off between the level of detail provided for a data concept and the size of the corresponding equivalence classes, with more detail being associated with smaller equivalence classes.
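The trade-off between detail and equivalence class size can be illustrated with a minimal sketch. The records, the choice of quasi-identifiers, and the helper names `equivalence_class_sizes` and `decade` are all hypothetical and are not taken from the chapter; the sketch only shows that generalizing year of birth to a decade merges records into larger classes.

```python
from collections import Counter

# Hypothetical records: (sex, year_of_birth, three_digit_zip). Illustrative only.
records = [
    ("F", 1961, "372"),
    ("F", 1964, "372"),
    ("M", 1961, "372"),
    ("M", 1968, "372"),
    ("F", 1975, "370"),
    ("F", 1979, "370"),
]


def equivalence_class_sizes(rows, key):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(key(row) for row in rows)


def decade(year):
    """Generalize a year to the first year of its decade (1964 -> 1960)."""
    return year - year % 10


# Full detail: every record is unique on (sex, year, zip3).
detailed = equivalence_class_sizes(records, key=lambda r: r)

# Less detail: year of birth generalized to a decade.
generalized = equivalence_class_sizes(
    records, key=lambda r: (r[0], decade(r[1]), r[2])
)

print("smallest class, detailed:", min(detailed.values()))
print("smallest class, generalized:", min(generalized.values()))
```

With these made-up records, the detailed view has six classes of size 1, while the generalized view has three classes of size 2.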
From page 224...
... By definition, the average risk for a data set will be no greater than the maximum risk for the same data set.
From page 225...
... The maximum risk is still 1, but the average risk has declined to 0.33. The average risk will be more sensitive than the maximum risk to modifications to the data.
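A minimal sketch of how maximum and average risk can be computed from equivalence class sizes follows. It assumes the data set is treated as the whole population rather than a sample, and that each record in a class of size f carries a risk of 1/f. The rows are hypothetical and were chosen only so that the result matches the figures quoted above; they are not the chapter's own example, and `risk_metrics` is an illustrative name.

```python
from collections import Counter


def risk_metrics(quasi_identifier_rows):
    """Return (maximum risk, average risk) for a list of quasi-identifier tuples.

    Assumes each record in an equivalence class of size f has risk 1/f.
    """
    sizes = Counter(quasi_identifier_rows)
    n_records = sum(sizes.values())
    # Maximum risk is driven by the smallest equivalence class.
    max_risk = 1 / min(sizes.values())
    # Averaging 1/f over all records simplifies to (number of classes) / (number of records).
    avg_risk = len(sizes) / n_records
    return max_risk, avg_risk


# Hypothetical data: one unique record plus two classes of four records each.
rows = [("M", "50-59")] + [("F", "50-59")] * 4 + [("M", "60-69")] * 4

max_risk, avg_risk = risk_metrics(rows)
print(max_risk, round(avg_risk, 2))
```

A single unique record keeps the maximum risk at 1 no matter how large the other classes are, which is why the average moves more readily than the maximum when the data are modified.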
From page 226...
... in the disclosed clinical trial data set with the maximum probability of re-identification. Therefore, it is prudent to protect against such an adversary by measuring and managing maximum risk.
From page 227...
... • The trial participants self-reveal that they are taking part in a particular trial, for example, on social networks or on online forums. If it is not possible to know who is in the data set, the trial data set can be considered to be a sample from some population.
From page 228...
... When the trial data set is treated as a sample, the maximum and average risk need to be estimated from the sample data. The reason is that in a sample context, the risk calculations depend on the equivalence class size in the population as well.
From page 229...
... will depend on the security and privacy controls that the data recipient has in place and the contractual controls that are being imposed as part of the data sharing agreement. The second term, Pr(re-id | attempt)
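The "second term" wording suggests the usual factorization of a joint probability into a marginal and a conditional. As a hedged reconstruction (the excerpt does not show the full expression, and reading the first term as the probability that an attempt is made is an assumption based on the surrounding text), it can be written as:

```latex
\Pr(\text{re-id}, \text{attempt}) = \Pr(\text{attempt}) \times \Pr(\text{re-id} \mid \text{attempt})
```

Under this reading, the recipient's controls and contractual terms act on the first factor, while the properties of the data act on the second.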
From page 230...
... Note that these figures are averages and may be adjusted to account for variation. For a nonpublic data release, then, there are three types of attacks for which the re-identification risk needs to be measured and managed.
From page 231...
... Historically, data custodians (particularly government agencies focused on reporting statistics) have used the "minimum cell size" rule as a threshold for deciding whether to de-identify data (Alexander and Jabine, 1978; Cancer Care Ontario, 2005; Health Quality Council, 2004a,b; HHS, 2000; Manitoba Center for Health Policy, 2002; Office of the Information and Privacy Commissioner of British Columbia, 1998; Office of the Information and Privacy Commissioner of Ontario, 1994; OMB, 1994; Ontario Ministry of Health and Long-Term Care, 1984; Statistics Canada, 2007)
From page 232...
... The variability is due, in part, to different tolerances for risk, the sensitivity of data, whether a data sharing agreement is in place, and the nature of the data recipient. A minimum cell size criterion amounts to a maximum risk value.
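The correspondence between a minimum cell size and a maximum risk value can be sketched as follows. This assumes the same 1/f risk model, in which a record in an equivalence class of size f has re-identification probability 1/f; the function names and the cell sizes shown are illustrative, not values prescribed by the chapter.

```python
def max_risk_for_cell_size(k):
    """Maximum risk implied by requiring every equivalence class to have at least k records."""
    if k < 1:
        raise ValueError("cell size must be at least 1")
    return 1 / k


def meets_cell_size(class_sizes, k):
    """True if every equivalence class has at least k records."""
    return min(class_sizes) >= k


# Illustrative cell sizes only.
for k in (3, 5, 11, 20):
    print(k, round(max_risk_for_cell_size(k), 3))
```

For example, under this model a minimum cell size of 5 corresponds to a maximum risk threshold of 0.2.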
From page 233...
... The choice of a metric is a function of whether the clinical trial data set will be released publicly. For public data sets, it is prudent to use maximum risk in measuring risk and setting thresholds.
From page 234...
... This dimension encompasses the motives and the capacity of the QI to re-identify the data, considering such issues as conflicts of interest, the potential for financial gain from re-identification, and whether the data recipient has the skills and financial capacity to re-identify the data (a checklist is available in El Emam et al.
From page 235...
... In such cases, the sharing of clinical trial data is not considered as invasive to privacy as it is in cases in which consent is not sought. Multiple levels of notice and consent can exist for disclosure of de-identified data.
From page 236...
... However, this method does not ensure that the risk of re-identification is very small, and therefore the data will still be considered personal health information. For public data releases, there are no contracts and no expectation that any mitigating controls will be in place.
From page 237...
... The reason is that this approach is quite common and is being adopted to de-identify clinical trial data.
The HIPAA Privacy Rule's Safe Harbor Standard
We first consider the variable list in the HIPAA Privacy Rule Safe Harbor method.
From page 238...
... This will be the case if the data set is a random sample from the population. If these assumptions are met, the applicability of Safe Harbor to a clinical trial data set will be defensible, but only if there are no international participants.
From page 239...
... FIGURE B-6  Inhabited three-digit zip codes with fewer than 20,000 inhabitants from the 2010 U.S. census.
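The Safe Harbor rule for zip codes, under which the first three digits may be kept only when the combined area contains more than 20,000 people and are otherwise replaced with 000, can be sketched as below. The function name `safe_harbor_zip3` and the population figures are hypothetical; a real implementation would need census counts for every three-digit prefix.

```python
def safe_harbor_zip3(zip_code, zip3_population, threshold=20_000):
    """Generalize a U.S. zip code to three digits, or to "000" for sparsely populated areas.

    zip3_population maps a three-digit prefix to its total population.
    Prefixes missing from the table are treated conservatively as population 0.
    """
    zip3 = zip_code.strip()[:3]
    population = zip3_population.get(zip3, 0)
    # Keep the prefix only when the combined area has more than `threshold` people.
    return zip3 if population > threshold else "000"


# Made-up population counts, for illustration only.
populations = {"372": 1_200_000, "036": 15_000}

print(safe_harbor_zip3("37203", populations))
print(safe_harbor_zip3("03601", populations))
print(safe_harbor_zip3("99999", populations))
```

Treating unknown prefixes as unpopulated is a design choice of this sketch: it errs toward suppression when the lookup table is incomplete.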
From page 240...
... More important, a number of de-identification standards proposed by sponsors have followed similar approaches for sharing clinical trial data from participants globally (see the standards at ClinicalStudyDataRequest.
From page 241...
... Masking techniques include the following: (1) removal of the direct identifiers, (2)
From page 242...
... data from the source database. Importing data from the source database may be a simple or complex exercise, depending on the data model of the source data set.
From page 243...
... Alternatively, some of the assumptions about acceptable data utility may need to be renegotiated with the data users. Step 10: Perform diagnostics on the solution.
From page 244...
... To ensure that the de-identification solution is truly within this narrow operating range, it is necessary to perform a pilot evaluation on one or more representative clinical trial data sets and compare the before and after analysis results using exactly the same analytic techniques. Obtaining similar results for a de-identified clinical trial data set that is intended for public release will be more challenging than disclosing the data set to a QI with strong mitigating controls.
From page 245...
... Governance
Governance is necessary for the sponsor to manage the risks when disclosing clinical trial data, and it requires that a set of additional practices be in place. What would be characterized as high-maturity sponsors will have a robust governance process in place.
From page 246...
... If additional funding is available, those who conduct these attacks can purchase and use commercial databases to re-identify data subjects.
Appropriate Contracts
Additional governance elements become particularly important when a sponsor discloses data to a QI under a contract.
From page 247...
... The global process ensures consistency across all data releases. This process must then be enacted for each clinical trial data set, and this may involve some customization to address specific characteristics of a given data set.
From page 248...
... An internal sponsor ethics review council will include a privacy professional, an ethicist, a lay person representing the participants, a person with knowledge of the clinical trials business at the sponsor, and a brand or public relations person. For public data releases, there is no analysis protocol or a priori approval process, and therefore it will be challenging to provide assurances about attribute disclosure.
From page 249...
... and approximate age of the individual. Although such information may be permitted within a Safe Harbor de-identification framework, a statistical assessment of the potential identifiability of such information would indicate that such ancillary information might carry an unacceptably high risk of re-identification.
From page 250...
... 2013a. Privacy-enhancing technologies for medical tests using genomic data.
From page 251...
... 2013. Are clinical trial data shared sufficiently today?
From page 252...
... 2013. Project data sphere to make cancer clinical trial data publicly available.
From page 253...
... 2012. Guidance regarding methods for de-identification of protected health information in accordance with the Health Insurance Portability and Accountability Act (HIPAA)
From page 254...
... 2013. Preparing for responsible sharing of clinical trial data.
From page 255...
... 2011. The International Stroke Trial database.
From page 256...
... 2013. Secure use of individual patient data from clinical trials.


This material may be derived from roughly machine-read images, and so is provided only to facilitate research.