In Chapter 2, the committee recommends a framework for the US Food and Drug Administration (FDA) regulatory decision-making process in which scientific evidence plays a critical role, together with other factors including ethical considerations and the perspectives of patients and other stakeholders. This chapter focuses on the evaluation of the scientific evidence and on how FDA should use evidence in its decisions. Just as courts determine when evidence is admissible and which standard of proof to apply in a given case, scientific evidence must be evaluated for its quality and applicability to the public health question that is the focus of regulatory decision-making. FDA needs to base its decisions on the best available scientific evidence related to that question. Different people, however, can interpret and judge scientific evidence in various ways. Decisions in which there is disagreement among experts about what decisions are best supported by a given body of evidence are among the most difficult that FDA must make. For these decisions to properly incorporate all the relevant uncertainties and values, the regulators need to understand the bases of the various judgments that the experts are making. As has been shown in many difficult cases that FDA has had to decide, evidence does not speak for itself.
This chapter will categorize and discuss the sources of technical disagreements between experts about the kinds of data that FDA typically deals with. It will start with a short primer on approaches to statistical inference, with an introduction to Bayesian methods, followed by a discussion of the distinctions between scientific data and evidence. It then discusses why scientists sometimes disagree about the evidence of a drug’s benefits and risks and how their disagreements may affect regulatory decision-making.
Although the terms data and evidence are often used interchangeably, data is not a synonym for evidence. The Compact Oxford English Dictionary defines data as “facts and statistics collected together for reference or analysis” and evidence as “the available body of facts or information indicating whether a belief or proposition is true” (Oxford Dictionaries, 2011). The difference is whether the information is being used to draw scientific conclusions about a specific proposition. In the context of a drug study, the “proposition” is a hypothesis about a drug effect, often stated in the form of a scientific question, such as “Do broad-spectrum antibiotics increase the risk of colitis?” In the broader context of FDA’s regulatory decisions, the proposition may be implicit in the public health question that prompts the need for a regulatory decision, such as “Does the risk of colitis caused by broad-spectrum antibiotics outweigh their benefits to the public’s health?” In this way, evidence is defined with respect to the questions developed in the first step of the decision-making framework described in Chapter 2.
Statistical methods help to ascertain the “strength of the evidence” supporting a given hypothesis by measuring the degree to which the data support one hypothesis rather than the other. The evidence in turn affects the likelihood that either hypothesis is true. The most common scientific hypothesis in the realm of drug evaluation is the “null hypothesis”—that in a given treated population, the drug has no effect relative to a comparator treatment. For the concept of evidence to have meaning, however, there must be at least one other hypothesis under consideration, such as that the drug has some effect.
A small change in the scientific hypotheses being compared can change the strength of the evidence provided by a given set of data. For example, if the question above changed from whether broad-spectrum antibiotics produce any increase in the risk of colitis to whether they produce a clinically important increase in that risk—say, an increase of more than 10 percent—the strength of the evidence provided by the same data could change. Where one observer might see a 4 percent increase in risk as strong evidence of some excess risk, another could regard it as strong evidence against a 10 percent increase in risk.1 Agreement on the strength of the evidence therefore requires agreement on the hypotheses being contrasted and on the public health questions that give rise to them.
1Confusion can result from use of the word significant to describe an effect that is both statistically significant and clinically relevant; the latter is often termed clinically significant. The two uses should remain separate.
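The point that the same data can favor one hypothesis while disfavoring another can be made concrete with a likelihood-ratio calculation. The sketch below uses entirely hypothetical numbers (an observed 4-point risk increase with a standard error of 1.5 points, under a normal approximation) to show how one estimate can be strong evidence for some excess risk over no effect, and even stronger evidence against a 10-point increase:

```python
# Sketch: the same data can favor "some effect" over "no effect" while
# disfavoring "a 10-point increase". All numbers are hypothetical.
from statistics import NormalDist

def likelihood_ratio(observed, h1, h2, se):
    """Ratio of the data's likelihood under hypothesized effect h1 vs. h2,
    treating the estimated risk increase as approximately normal."""
    n = NormalDist(0, 1)
    return n.pdf((observed - h1) / se) / n.pdf((observed - h2) / se)

observed = 4.0   # observed risk increase, percentage points (hypothetical)
se = 1.5         # standard error of the estimate (hypothetical)

lr_some_vs_none = likelihood_ratio(observed, h1=4.0, h2=0.0, se=se)
lr_some_vs_ten = likelihood_ratio(observed, h1=4.0, h2=10.0, se=se)
print(f"Evidence for a 4-point effect vs. no effect:      {lr_some_vs_none:.0f}:1")
print(f"Evidence for a 4-point effect vs. a 10-point one: {lr_some_vs_ten:.0f}:1")
```

The specific effect sizes and standard error are illustrative assumptions, not figures from any study; the calculation merely shows why agreement on the hypotheses being contrasted matters.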
Good science, together with proper statistics, has a dual role. The first role is to decrease uncertainty about which hypotheses are true; the second is to properly measure the remaining uncertainty. These are carried out in part through a process called statistical inference. Statistical inference involves the process of summarizing data, estimating the uncertainty around the summary, and using the summary to reach conclusions about the underlying truth that gave rise to the data.
The two main approaches to statistical inference are the standard “frequentist” approach and the Bayesian approach. Each has distinctive strengths and weaknesses as a basis for decision-making; including both approaches in the technical and conceptual toolbox can be extraordinarily important in making proper decisions in the face of complex evidence and substantial uncertainty. The frequentist approach to statistical inference is familiar to medical researchers and is the basis for most FDA rules and guidance. The Bayesian approach is less widely used and understood; however, it has many attractive properties that can both elucidate the reasons for disagreements and provide an analytic model for decision-making. This model allows decision-makers to combine the chance of being wrong about risks and benefits with the seriousness of those errors to support optimal decisions.
The frequentist approach employs such measures as P values, confidence intervals, and type I and II errors, as well as practices such as hypothesis-testing. Evidence against a specified hypothesis is measured with a P value. P values are typically used within a hypothesis-testing paradigm that declares results “statistically significant” or “not significant”, with the threshold for significance usually being a P value less than 0.05. By convention, type I (false-positive) error rates in individual studies are set in the design stage at 5 percent or lower, and type II (false-negative) rates at 20 percent or below (Gordis, 2004).
In the colitis example, if the null hypothesis posits that broad-spectrum antibiotics do not increase the risk of colitis, a P value less than 0.05 would lead one to reject that null hypothesis and conclude that broad-spectrum antibiotics do increase the risk of colitis. The range of that elevation statistically consistent with the evidence would be captured by the confidence interval. If the P value exceeded 0.05, several conclusions could be supported, depending on the location and width of the confidence interval: either that a clinically negligible effect is likely or that the study cannot rule out either a null or a clinically important effect and thus is inconclusive. In the drug-approval setting, the FDA regulatory threshold of “substantial evidence”2 for effectiveness is generally defined as two well-controlled trials that have achieved statistical significance on an agreed-upon endpoint, although there can be exceptions (Carpenter, 2010; Garrison et al., 2010).
221 USC § 355(d) (2010).
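The frequentist summary described above can be sketched in a few lines. The counts below are hypothetical, not from any actual trial, and the Wald normal approximation is used for simplicity:

```python
# Sketch of the standard frequentist summary for a hypothetical two-arm trial.
from math import sqrt
from statistics import NormalDist

def risk_difference_summary(events_trt, n_trt, events_ctl, n_ctl, alpha=0.05):
    """Risk difference, Wald confidence interval, and two-sided P value,
    using the normal approximation to the binomial."""
    p1, p2 = events_trt / n_trt, events_ctl / n_ctl
    diff = p1 - p2
    se = sqrt(p1 * (1 - p1) / n_trt + p2 * (1 - p2) / n_ctl)
    nd = NormalDist()
    p_value = 2 * (1 - nd.cdf(abs(diff / se)))
    zcrit = nd.inv_cdf(1 - alpha / 2)
    return diff, (diff - zcrit * se, diff + zcrit * se), p_value

# Hypothetical counts: 30/1000 colitis cases on antibiotics vs. 15/1000 on comparator
diff, (lo, hi), p = risk_difference_summary(30, 1000, 15, 1000)
print(f"risk difference = {diff:.3f}, 95% CI ({lo:.3f}, {hi:.3f}), P = {p:.3f}")
```

With these invented counts the P value falls below 0.05 and the confidence interval excludes zero but also excludes a 10-percentage-point increase, echoing the point made earlier that the same data can bear differently on different hypotheses.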
Hypothesis-testing provides a yes-or-no verdict that is useful for regulatory purposes, and its value has been demonstrated over time, both procedurally and inferentially. Its emphasis on pre-specification of endpoints, study procedures and analytic plans has regulatory and often inferential benefits. But hypothesis tests, P values, and confidence intervals do not provide decision-makers with an important measure—the probability that a hypothesis is right or wrong. In settings where a difficult balancing of various decisional consequences must be made in the face of uncertainty about both the presence and magnitude of benefits and risks, the probability that a given hypothesis is true plays a central role. The failure to assign a degree of certainty to a conclusion is a weakness of the frequentist approach when it is used for regulatory decisions (Berry et al., 1992; Etzioni and Kadane, 1995; IOM, 2008; Parmigiani, 2002).
In contrast, the Bayesian approach to inference allows a calculation, on the basis of results from an experiment, of how likely a hypothesis is to be true or false. However, this calculation is premised on an estimated probability that a hypothesis is true prior to the conduct of the experiment, a probability that is not uniquely scientifically defined and about which scientists can differ. Both despite this limitation and because of it, Bayesian approaches can be very useful complements to traditional frequentist analyses and can yield insights into the reasons why scientists disagree, a topic that will be discussed in more depth later in this chapter.
The use of Bayesian approaches is not new to FDA. FDA’s Center for Devices and Radiological Health (CDRH) has published guidance for the use of Bayesian statistics in medical device clinical trials (FDA, 2010a) and FDA has used Bayesian approaches in regulatory decisions. A 2004 FDA workshop on the use of Bayesian methods for regulatory decision-making included extensive discussion by FDA scientists, as well as Center for Drug Evaluation and Research (CDER) and CDRH leadership, of ways in which Bayesian approaches could enhance the science of premarketing approval.3 Campbell (2011), director of the CDRH Biostatistics division, discussed the uses of Bayesian methods for FDA decision-making, and presented 17 requests for premarketing approval submitted to and approved by the CDRH for medical devices that used Bayesian methods. Although Bayesian methods have been little used by CDER, Berry (2006) discusses how a Bayesian meta-analysis served as the basis for a CDER approval of Pravigard™ Pac (co-packaged pravastatin and buffered aspirin) to lower the risk of cardiovascular events. Bayesian sensitivity analyses were used to help evaluate the literature investigating the possible association between antidepressants and suicidal outcomes (Laughren, 2006; Levenson and Holland, 2006), elaborated later in Kaizar (2006). Finally, FDA staff has recently proposed Bayesian methodology for analysis of safety endpoints in clinical trials (McEvoy et al., 2012).
3Published papers from the workshop are available in the August 2005 issue of Clinical Trials (2:271-378).
The Bayesian approach does not use a P value to measure evidence; rather, it uses an index called the Bayes factor (Goodman, 1999; Kass and Raftery, 1995). The Bayes factor encodes mathematically the principle presented earlier—that the role of evidence is to help adjudicate between two or more competing hypotheses. The Bayes factor modifies the probability that a hypothesis is true. Decision-makers can then use that probability to characterize the likelihood that their decisions will be wrong. In its simplest form, Bayes theorem can be expressed as the following equation (Goodman, 1999; Kass and Raftery, 1995):
Odds that a hypothesis is true after new evidence = Odds that a hypothesis is true before new evidence × Strength of the new evidence (the Bayes factor)
The Bayes factor is sometimes regarded as the “weight of the evidence” comparing how strongly the data support one hypothesis (or combination of hypotheses) to another (Good, 1950; Kass and Raftery, 1995). Most important is the role that the Bayes factor plays in Bayes theorem; it modifies the probability that a given hypothesis is true. This concept that a hypothesis has a certain “truth probability” has no counterpart in standard frequentist approaches.
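The odds form of Bayes theorem shown above takes only a few lines to compute. In the sketch below, the Bayes factor value is an arbitrary illustration, not a figure from any study:

```python
# Minimal sketch of the odds form of Bayes theorem described above.
def posterior_probability(prior_prob, bayes_factor):
    """Update a hypothesis's probability given a Bayes factor expressed
    as the strength of the new evidence in favor of the hypothesis."""
    prior_odds = prior_prob / (1 - prior_prob)
    post_odds = prior_odds * bayes_factor     # odds form of Bayes theorem
    return post_odds / (1 + post_odds)        # convert odds back to probability

# An illustrative Bayes factor of 10 moves a 25% prior to roughly 77%:
print(round(posterior_probability(0.25, 10), 2))
```

The same evidence (the same Bayes factor) moves different priors to different posteriors, which is the mechanism behind the disagreements discussed later in this chapter.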
There is not a one-to-one relationship between P values and Bayes factors, because the magnitude of an observed effect and the prior probabilities of hypotheses also can affect the Bayes factor calculation itself. But in most common statistical situations, there exists a strongest possible Bayes factor, and that can be defined as a function of the observed P value. That relationship can be used to calculate the maximum chance that the non-null hypothesis is true as a function of the P value and a prior probability (Goodman, 2001; Royall, 1997).
Assume that the null hypothesis is that a given drug does not cause a given harm and that the alternative hypothesis is that it elevates the risk of that harm. Table 3-1 shows how a given P value (translated into the strongest Bayes factor) alters the probability of the hypothesis of harm. For example, if a new randomized controlled trial (RCT) yields a P value of 0.03 for a newly reported adverse effect of a drug and there was deemed to be only a 1 percent chance before the RCT of that unsuspected adverse effect being caused by the drug, the new evidence increases the chance of the causal relationship to at most 10 percent (see Table 3-1). A regulatory decision predicated on the harm being real would therefore be wrong at least 90 percent of the time.
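The kind of calculation behind Table 3-1 can be sketched as follows. The bound used here, exp(z²/2) as the strongest Bayes factor obtainable from a two-sided P value under a normal-theory test, is the standard one discussed by Goodman (2001); the P value and priors match the worked example above:

```python
# Sketch of the "strongest possible Bayes factor" calculation used in
# Table 3-1. For a two-sided P value from a normal-theory test, the Bayes
# factor favoring the alternative is at most exp(z**2 / 2) (Goodman, 2001).
from math import exp
from statistics import NormalDist

def max_posterior_probability(p_value, prior_prob):
    """Upper bound on the posterior probability of a drug effect,
    given a two-sided P value and a prior probability of the effect."""
    z = NormalDist().inv_cdf(1 - p_value / 2)  # z-score for the P value
    max_bf = exp(z * z / 2)                    # strongest evidence the P value allows
    prior_odds = prior_prob / (1 - prior_prob)
    post_odds = prior_odds * max_bf
    return post_odds / (1 + post_odds)

# P = 0.03 with a 1% prior yields at most about a 10% chance of harm;
# with a 25% prior, at most about 78%:
print(round(max_posterior_probability(0.03, 0.01), 2))
print(round(max_posterior_probability(0.03, 0.25), 2))
```

Because this is an upper bound, the actual posterior probability implied by any fully specified Bayesian analysis would be no higher than the values computed here.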
Without a formal Bayesian interpretation, that high probability of error would not be apparent from any standard analysis. Using conventional measures, such a study might report that “a previously unreported association of tinnitus was observed with the drug, OR [odds ratio] = 3.5, 95% CI [confidence interval] 1.1 to 11.1, P = 0.03”. This statement does not actually indicate how likely it is
TABLE 3-1 Maximum Change in the Probability of a Drug Effect as a Function of P Value and Bayes Factor, Calculated by Using Bayes’ Theorem
[Table 3-1 columns: P Value in the New Study; Prior Probability of an Effect, %b; Maximum Probability After the New Study, %. The body of the table is not reproduced here.]
aThe qualitative descriptor of the strength of the evidence is made on the basis of the quantitative change in the probability of truth of a non-null drug effect.
bThe prior truth probabilities of 1%, 25%, or 50% are arbitrarily chosen to span a wide range of strength of prior evidence. The shaded prior probability illustrates the minimum prior probability required to provide a 95% probability of a drug effect after observing a result with the reported P value.
SOURCE: Modified from Goodman (1999).
that the drug actually raises the risk of tinnitus. For that, a prior probability and the Bayes factor are needed. If the mechanism or some preliminary observations justified a 25 percent prior chance of a harmful effect, the same evidence would raise that to at most a 78 percent chance of harm—that is, at least a 22 percent chance that the drug does not cause that harm. Table 3-1 shows that after observing P = 0.03 for an elevated risk of harm, to be 95 percent certain that the elevation was real, the prior probability of a risk elevation would have to have been at least 67 percent before the study. That might be the case if there was an established mechanism for the adverse effect, if other drugs in the same class were known to produce the effect, or if a prior study showed the same effect.
In practice, however, there exist no conventions or empirical data to determine exactly how to assign such prior probabilities, although the elicitation of prior probabilities from experts has been much studied (Chaloner, 1996; Kadane
and Wolfson, 1998). FDA informally incorporated the notion of a prior by including “biologic plausibility” in its decision-making about how to respond to drug-safety signals that arise in the course of pharmacovigilance, as in its March 2012 draft guidance (FDA, 2012):
CDER will consider whether there is a biologically plausible explanation for the association of the drug and the safety signal, based on what is known from systems biology and the drug’s pharmacology. The more biologically plausible a risk is, the greater consideration will be made to classifying a safety issue as a priority.
As the quoted guidance demonstrates, biologic plausibility and other forms of external evidence are currently accommodated qualitatively; Bayesian approaches allow that to be done quantitatively, providing a formal structure by which both prior evidence and other sources of information (for example, on common mechanisms underlying different harms or on their relationship to disease processes) should affect decisions.
This discussion illustrates a number of important issues:
• Given new evidence, the probability that a drug will be harmful can vary widely depending on the strength of the prior or external information, represented as a prior probability distribution.
• The chance that a drug will be harmful, based on P values for a harmful effect in the borderline significant range (0.01–0.05), is often far lower than is suspected, unless there are fairly strong reasons to believe in the harm before the study.
• The Bayesian approach allows the calculation of intermediate levels of certainty (for example, less than 95 percent) that might be sufficient for regulatory action, particularly for drug harms.
• Without agreed-upon conventions or empirical bases for assigning prior probabilities, the prior probabilities derived from a given body of evidence will differ among scientists, resulting in different conclusions from the same data.
The probability that a given harm will be caused by a drug is a key attribute in regulatory decision-making. How sure regulators must be to take a given action varies according to the consequences of decisions. In some cases, 95 percent certainty might be needed, in others 75 percent, and in still others less than 50 percent. The Bayesian approach provides numbers that feed into that judgment (Kadane, 2005).
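The dependence of the required certainty on the consequences of error can be made concrete with a toy expected-loss rule. The cost figures below are purely hypothetical placeholders, and the rule is a standard decision-theoretic sketch rather than any FDA procedure:

```python
# Toy decision rule: act when the expected loss of inaction exceeds the
# expected loss of acting. All cost figures are hypothetical placeholders.
def certainty_threshold(cost_false_alarm, cost_missed_harm):
    """Probability of harm above which acting minimizes expected loss:
    act when p * cost_missed_harm > (1 - p) * cost_false_alarm."""
    return cost_false_alarm / (cost_false_alarm + cost_missed_harm)

# If a missed harm is 3x as costly as a false alarm, act above 25% certainty;
# if the costs are equal, act above 50%:
print(certainty_threshold(1, 3))   # 0.25
print(certainty_threshold(1, 1))   # 0.5
```

The posterior probabilities that a Bayesian analysis supplies are exactly the quantity this rule consumes, which is the sense in which the Bayesian approach "provides numbers that feed into that judgment."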
Despite these advantages, one of the weaknesses of Bayesian calculations is that there is no unique way to assign a prior probability to the strength of external evidence, particularly if that evidence is difficult to quantify, such as biologic
plausibility. Although it may be impossible to assess subtle differences in prior probability, even crude distinctions can be helpful, such as whether the prior evidence justifies probability ranges of 1–5 percent, 15–50 percent, 60–80 percent, or 90+ percent. Such categorizations often provide fine enough discrimination to be useful for decision-making. In the absence of agreement on prior probabilities, “non-informative” prior distributions can be used that rely almost exclusively on the observed data, and sensitivity analyses with different kinds of prior probabilities from different decision-makers can be conducted (Emerson et al., 2007; Greenhouse and Wasserman, 1995). At a minimum, these prior probabilities should be elicited and their evidential bases made explicit so that this potential source of disagreement can be better understood, and perhaps diminished.
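A sensitivity analysis over such crude prior ranges might be sketched as follows; the fixed Bayes factor of 10 is an arbitrary illustrative value, not an analysis of any particular drug:

```python
# Sketch of a prior sensitivity analysis: posterior chance of harm across
# the crude prior ranges mentioned above, for a fixed illustrative Bayes
# factor of 10 favoring the hypothesis of harm.
def posterior(prior, bf):
    """Posterior probability from the odds form of Bayes theorem."""
    odds = prior / (1 - prior) * bf
    return odds / (1 + odds)

BF = 10  # hypothetical evidence strength
for label, lo, hi in [("1-5%", 0.01, 0.05), ("15-50%", 0.15, 0.50),
                      ("60-80%", 0.60, 0.80), ("90%+", 0.90, 0.99)]:
    print(f"prior {label:>7}: posterior {posterior(lo, BF):.2f}-{posterior(hi, BF):.2f}")
```

Even this coarse sweep shows whether a regulatory conclusion is robust to disagreement about the prior: if every plausible prior range yields a posterior above (or below) the decision threshold, the disagreement is immaterial.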
The difference between Bayesian and frequentist approaches can go well beyond the incorporation of prior evidence, extending to more complex aspects of how the analytic problem is structured and analyzed. Madigan et al. (2010) provide a comprehensive suite of Bayesian methods to analyze safety signals arising from a broad range of study designs likely to be employed in the postmarketing setting.
When new information arises that puts into question a drug’s benefits and risks, FDA’s decision-makers often face sharp disagreements among scientists over how to interpret that information in the context of pre-existing information and over what regulatory action, if any, should be taken in response to the new information. Such disagreements are often unavoidable, and moving forward with appropriate decision-making is difficult if the underlying reasons for them are unknown or misunderstood. The committee identified a number of reasons for the disagreements about scientific evidence that occur among scientists. Those reasons, which are listed in Box 3-1, are discussed below.
Different Prior Beliefs About the Existence of an Effect
People’s beliefs about the plausibility of an effect of a drug are determined, in part, by their knowledge and interpretation of prior evidence about the drug’s benefits and risks (Eraker et al., 1984). That knowledge shapes their responses to new evidence. Prior evidence can come directly from earlier clinical studies of the drug’s effects, from studies of drugs in the same class that demonstrate the effect, and from information about the drug’s mechanism of action. Newly observed evidence might be interpreted as indicating a higher chance that a drug is harmful if earlier studies have also demonstrated the harm. If other drugs in the same class have been associated with a particular adverse effect, the drug has a higher prior probability of causing that effect than a drug in a class whose members have not produced such an effect. If a drug has a mechanism of action that has been implicated in a particular adverse effect, it has a higher prior probability of causing that effect than a drug for which such a mechanism is implausible. For example, the prior probability that a topical steroid would produce significant internal injury would be very low because what is known about the absorption, metabolism, and physiologic actions of topical steroids makes it difficult to imagine how such an injury could occur, but the prior probability of an adverse dermatologic effect would be much higher.

BOX 3-1
Why Scientists Disagree About the Strength of Evidence Supporting Drug Safety

1. Different weights given to pre-existing mechanistic or empirical evidence supporting a given benefit or risk.

Quality of the New Study

2. Different views about the reliability of the data sources.
3. Different confidence in the design’s ability to eliminate the effect of factors unrelated to drug exposure.
4. Different views on the appropriateness of statistical models.

Relevance of the New Evidence to the Public Health Question

5. Different views of the hypotheses needing evaluation.
6. Different assessments of the transportability of results.

Synthesizing the Evidence

7. Different ideas about how to weigh and combine all the available evidence from disparate sources relevant to the public health question.

Appropriate Regulatory Response to the Body of Evidence

8. Different opinions among scientists regarding the thresholds of certainty needed to justify concern or regulatory action, which can affect how they view the evidence.
Evidential bases of prior probability can take two forms: an assessment of the evidence supporting the mechanistic explanation of a proposed effect and the cumulative weight of previous empirical studies. Marciniak, of the FDA Office of New Drugs (OND) Division of Cardiovascular and Renal Products, discussed mechanism directly in a letter that was provided for a July 2010 FDA Advisory Committee meeting related to Avandia (Marciniak, 2010):
Others have speculated that rosiglitazone could increase MI [myocardial infarction] rates through its effects upon lipids or by the same mechanism whereby it increases HF [heart failure] rates. There are no clinical studies establishing these mechanisms. We propose that there is a third mechanism for which there is some evidence from clinical studies. The third possible mechanism is the following: The Avandia label states that “In vitro data demonstrate that rosiglitazone is predominantly metabolized by Cytochrome P450 (CYP) isoenzyme 2C8, with CYP2C9 contributing as a minor pathway.” The published literature suggests that rosiglitazone may also function as an inhibitor of CYP2C8. … Allelic variants of the CYP2C9 gene have been associated in epidemiological studies with increased risk of myocardial infarction and atherosclerosis. … Recently, CYP2C8 variants has also been associated with increased risk of MI. … CYP2C9 and 2C8 catalyze the metabolism of arachidonic acid to vasoactive substances, providing one potential mechanism for affecting cardiac disease. Interference with cigarette toxin metabolism is another. … Rosiglitazone effects upon CYP2C8 and CYP2C9 could be the mechanism for its CV adverse effects. Regardless, there are several possible mechanisms for CV toxicity of rosiglitazone.
The passage above describes a mechanism that is, as its author acknowledges, fairly speculative. There is no suggestion or claim that such a mechanism would definitely or even probably produce adverse cardiovascular effects. Rather, this particular exposition is exploratory and aimed at establishing that such an effect is possible rather than probable. Those who have a good understanding of this particular set of pathways might interpret the explanation differently and establish a different starting point for the probability of such an effect. It is unlikely, though, that such evidence could garner general consensus for a high prior probability of the effect.
Mechanistic explanations generally provide weak evidence when they are offered post hoc to support an observed result. They carry more weight when they are proposed before such an effect is observed. Misbin (2007) raised questions about the safety of rosiglitazone on the basis of its effects on body weight and lipids—both well-established risk factors for cardiovascular disease—long before any risk of myocardial infarction (MI) was seen in any studies.
Another, more subtle way in which mechanistic considerations can affect inferences is in the choice of endpoints, as illustrated in Marciniak’s discussion of the wisdom of combining silent and clinical MIs into a single endpoint (Marciniak, 2010):
There is additional evidence from RECORD [the Rosiglitazone Evaluated for Cardiac Outcomes and Regulation of Glycemia in Diabetes trial] that the MI risk for rosiglitazone is real rather than a random variation:
We prospectively excluded silent MIs from our primary analysis because we had concerns that silent MIs might represent a different disease mechanism than symptomatic MIs, e.g., could they represent
gradual necrosis from diabetic microvascular disease rather than an acute event with coronary thrombosis in an epicardial coronary artery?
Whether silent and clinical MIs should be combined—a critical decision in assessing the evidence—is framed here as contingent on whether they represent different manifestations of the same pathophysiologic process. What is important to recognize is that the numbers arising from an analysis that excludes silent MIs are only as credible as the underlying mechanistic explanation. This example shows how a mechanistic explanation can affect the analyses, especially exploratory analyses, even if it is not explicitly invoked as the evidential basis of a claim.
Even if two scientists agree about what evidence new data provide, they might disagree about the probability of a higher drug risk if they assess the strength of the prior evidence differently. Such a disagreement might appear outwardly to be about the new evidence when in fact it is about the prior probability. That phenomenon is captured quantitatively by Bayes theorem, as previously noted (Fisher, 1999); sensitivity analyses with different priors can illustrate the plausible range of chances that the drug poses unacceptable safety risks.
Quality of the New Study
Standard approaches to evaluating evidence rely on the use of evidence hierarchies, which traditionally emphasize the type of study design as the main determinant of evidential quality; an example is the US Preventive Services Task Force guidance (AHRQ, 2008). Many scientists judge a study on the basis of its type of design above all other considerations. The type of study design, however, is only one of the factors that should be taken into account in assessing the quality of a study and thereby the quality of the evidence from the study. In addition to the type of study, such other aspects as the source and reliability of the data, study conduct, whether there are missing or misclassified data, and data analyses influence the quality of the evidence generated by a study. Some of these factors are reflected in the Grading of Recommendations Assessment, Development and Evaluation (GRADE) approach to evidence assessment (Guyatt et al., 2008). Those factors and their role in disagreements among scientists are discussed below.
Different Views about the Reliability of the Data Source
Most evidence hierarchies assume that data in a study are generated for research purposes and that outcome measures are specified in advance. Much postmarketing research about a drug’s benefits and risks, however, whether an RCT or an observational study, depends at least in part on data gathered with systems developed for other purposes. For example, billing data that happen to
include diagnoses, or RCTs that were designed to assess outcomes other than safety-related outcomes, could be used in the postmarketing setting. One source of disagreement among scientists is the reliability of the data sources that are used for a study.
Data are gathered and captured electronically in many settings and provide important evidence about exposures, covariates, and outcomes. A number of health-monitoring systems or (linked) databases are or could be used for drug- or vaccine-safety investigation, including the Adverse Event Reporting System (AERS), Sentinel, Vaccine Safety Datalink, Post-licensure Rapid Immunization Safety Monitoring, the Health Maintenance Organization (HMO) Research Network, health plan records, data from the Centers for Medicare and Medicaid Services (CMS) and the Department of Veterans Affairs (VA), disease registries, pharmacy records and prescriber databases, hospital administrative databases, and cohort studies.
Reliability concerns include the measurement quality, completeness, and accuracy of the data. The conduct of high-quality studies using electronic data requires local knowledge about how care is delivered, how the computerized systems operate, and how they change. Problems with data quality affect the quality of evidence, decreasing precision and increasing bias in a study. (Formal definitions of bias and precision are presented later in this chapter.) Some of the issues are discussed below.
The quality of databases varies. In the case of the AERS database, for instance, reporting of adverse events is incomplete, and the quality of the information about the adverse events that are reported may be poor. There is no information about denominators, such as the number of people taking a drug, which is necessary for estimating event rates. Despite such limitations, however, a database of adverse-event reports can provide sufficient evidence of a drug’s harm, especially when the reported harm is rare, unrelated to the indication for using the drug, and distinctive enough for most of or all the reports to be attributed to the drug. More than half of the 36 drugs withdrawn from the US market since 1956 were withdrawn on the basis of safety evidence from case reports like those included in AERS (Saunders et al., 2010). For example, after a request by FDA, the manufacturer of the statin cerivastatin (Baycol®) withdrew it from the market because of the number of reports of rhabdomyolysis (a breakdown of muscle fibers that can result in kidney failure) (Furberg and Pitt, 2001; Lanctot and Naranjo, 1995; Staffa et al., 2002). Reports of that adverse event occurred at more than 50 times the frequency associated with other drugs in the same class, and the event was unrelated to the indication for cerivastatin therapy (Staffa et al., 2002).
When databases are used for dual purposes, changes for one purpose may affect the quality of data used for the other. Hospitals, health plans, and other sources of care often change computerized systems, typically to optimize them for administrative purposes. With each of those changes, the quality of the data
and their ability to capture events, exposures, or covariates for investigations of drug safety can change as well. Estimates of the reliability and validity of various methods and approaches may not stay accurate when the underlying systems change. Therefore, if data are to be used for drug safety research, continuing quality-control analyses are essential.
Considerations Regarding Data on Drug Exposures
Closed systems of care, such as health plans, tend to provide the most complete information on medical care. The denominators of membership are known, and entry into and exit from the cohort of patients can be reasonably well defined, allowing calculation of the risk of adverse events. Health insurance databases are likely to capture most drug exposures and serious adverse events requiring medical care, although the complete ascertainment of outcomes may require the use of multiple administrative files.
Computerized pharmacy files are likely to provide more complete and accurate information about drug use than medical records or patient surveys. Information about the date of a prescription, the number of days of supply, and the refill date for a chronic-disease medication often permits an assessment of drug exposure during a specific time window, assuming that the patient is taking the medication.4 Computerized drug data will provide less reliable and valid estimates of exposure to medications that are used as needed and medications that are available over the counter. Drug-use information might be missing for inpatient medications, medications received from family members or friends, and medications purchased outside the system of care.

4Except for drugs that may have resale value on the street, patients typically do not refill prescriptions for drugs that they are not taking (Lau et al., 1997).

Considerations for Data on Outcomes

Problems arise in efforts to capture information about events of interest. The more disparate the sources of care, the more dangerous it is to rely on a single administrative data source for the conduct of a study. In the setting of health plans that own hospitals, inpatient diagnostic codes are generally available in administrative records, but codes for out-of-plan hospitalizations (such as a hospitalization that occurs when a patient is away from home) might not be available unless billing records include sufficient diagnostic information. Similarly, medical records of veterans might be complete in the VA’s data systems for hospitalizations in the VA system of hospitals but might lack information on hospitalizations in non-VA hospitals or on drugs prescribed by non-VA providers.

Whether the data come from a single source or multiple sources, the diagnostic codes used in the administrative files are subject to error. For instance, a hospital discharge diagnosis of hypertension has been associated with a decreased risk of in-hospital death even though hypertension is a risk factor for adverse
cardiovascular outcomes, including death (Jencks et al., 1988). That paradoxical finding arises from the fact that there are fewer discharge diagnoses on fatal hospitalizations and such diagnoses as hypertension tend to be omitted; as a result, patients discharged alive will probably have more discharge diagnoses than those who died during their hospitalization. In one study, a comparison between hospital discharge diagnoses and six major cardiovascular events adjudicated according to accepted diagnostic criteria revealed levels of agreement between 44 percent and 86 percent (Ives et al., 1995). Diagnostic coding matters for reimbursement, so some diagnoses, such as heart failure, appear with surprising frequency in the absence of evidence (Psaty et al., 1999). In a recent study of the association between opioid use and fracture risk, only 67 percent of fractures identified with administrative diagnostic or X-ray data were actually incident fractures (Saunders et al., 2010). Agreement between death-certificate causes of death and deaths adjudicated on the basis of medical records, interviews with witnesses, questionnaires to physicians, and autopsies is only modest (coronary heart disease: kappa statistic, 0.61; 95% CI, 0.58–0.64; death from stroke: kappa statistic, 0.59; 95% CI, 0.54–0.64) (Ives et al., 2009).
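The kappa statistics cited above quantify agreement beyond what chance alone would produce. A minimal sketch with hypothetical counts (the 2×2 table below is illustrative, not the Ives et al. data):

```python
def cohens_kappa(a, b, c, d):
    """Cohen's kappa for a 2x2 agreement table.

    a: both sources say 'event'; b: source 1 only;
    c: source 2 only; d: both sources say 'no event'.
    """
    n = a + b + c + d
    p_observed = (a + d) / n
    # Agreement expected by chance alone, from the marginal totals
    p_expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical comparison of 200 deaths classified by death certificate
# versus formal adjudication
print(round(cohens_kappa(40, 10, 10, 140), 3))  # → 0.733
```

Raw percentage agreement in this table is 90 percent, but much of that agreement would occur by chance because most deaths fall in the "no event" cell; kappa corrects for that.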
Diagnostic codes can also change. The International Classification of Diseases codes are used worldwide and provide consistency in information on effects, but the codes are periodically updated, and the updates can affect health data both within a study over time and in comparisons among different studies. In addition, nonstandard definitions of endpoints, economic incentives for listing particular diagnoses, and insufficient detail about key variables of interest can affect data quality.
Data Quality in Primary Research
Issues about data quality can arise in research even when data-gathering and quality control are parts of the design. The problem occurs particularly in the classification of cause-specific events. Questions and disputes over the extent and possible effect of data-quality issues arose in the discussion of the RECORD trial with respect to rosiglitazone-related risks. The questions included whether events were properly adjudicated, recorded, or missed; whether followup was sufficient; whether handling of withdrawals was appropriate; how disagreements were settled; whether unclear or incomplete case-report forms were handled properly; and whether cotreatments were recorded. Below are comments bearing on some of those issues and noting the role of judgment in assessing the likely effects of the problems (Marciniak, 2010):
Our assignments regarding bias involve varying levels of subjectivity. While we believe we have strong, documented justifications for some assignments, such as our unacceptable handling [of] cases, for other assignments our judgment calls are not unquestionable. For this reason we have provide[d] copies of the relevant case report forms (CRFs—redacted for personal and institutional identifiers) for a selection of problem cases in Appendix 1. We have also provided short summaries of many of the other problem cases in Appendix 3 and short summaries of all cases for which we made a different CV death, MI, or stroke assignment than GSK in Appendices 5-7. …
Our review of the trial conduct appears to confirm that, as the protocol issues suggest, biases did arise in RECORD. The trial conduct issues reinforce our belief that RECORD can not provide any reassurances regarding rosiglitazone CV safety.
In contrast, Ellis Unger, deputy director of the Office of New Drug-I in OND, disagreed with the judgments made by Marciniak (in the OND Division of Cardiovascular and Renal Products) (Unger, 2010):
For the upper bound of the 95% CI for the relative risk of death to exceed 1.2, there would need to have been a differential of approximately 16 deaths between subjects lost to follow-up in the rosiglitazone and control groups. …
Such striking imbalances may be plausible, but they seem highly unlikely. I disagree, therefore, with Dr. Marciniak’s interpretation of all-cause mortality. I deem the results of RECORD to be reassuring with respect to all-cause mortality, an endpoint essentially unaffected by ascertainment bias in an open-label study.
There may be some merit in re-adjudicating MIs in RECORD; however, there are reasons why diagnostic criteria are strictly defined and enshrined in the protocol, reasons why adjudication committees are actually committees (i.e., more than a single individual), and reasons why scrupulous blinding is essential for these committees to perform their duties correctly.
My view of MIs in RECORD is that the findings are neither reassuring nor concerning. I am not surprised that, using modified criteria, Dr. Marciniak was able to increase the number of MIs by 18%; I am somewhat concerned that nearly all of them were in the rosiglitazone group.
What is particularly important to note about these disputes is that they have a direct connection with the estimated quantitative risk and its attendant uncertainty. However, such disputes cannot always be settled by reviewing records or by repeating procedures. They typically involve some degree of missing information whose potential impact can be assessed only with sensitivity analyses. How much the various assumptions should be allowed to vary in those analyses is a matter of judgment. So data-quality issues can have a central, sometimes irresolvable role in creating disagreement among scientists about numerical results; at best, the plausible range of estimates that would be consistent with their qualitative disagreement can be calculated with sensitivity analyses.
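The kind of sensitivity analysis described above can be sketched with hypothetical numbers: under extreme assumptions about how missing outcomes would have been classified, one can bound the range of estimates that remain consistent with the data.

```python
def risk_ratio(events_t, n_t, events_c, n_c):
    """Risk ratio for a two-arm comparison."""
    return (events_t / n_t) / (events_c / n_c)

# Hypothetical trial: 60/1000 events on the drug, 40/1000 on control,
# with 20 unadjudicated (missing) outcomes in each arm.
n_t = n_c = 1000
events_t, events_c, missing = 60, 40, 20

# Extreme assumptions bound the estimate: count every missing outcome as an
# event in one arm and as a non-event in the other, then reverse.
rr_low = risk_ratio(events_t, n_t, events_c + missing, n_c)
rr_high = risk_ratio(events_t + missing, n_t, events_c, n_c)
print(rr_low, rr_high)  # → 1.0 2.0
```

Here the observed risk ratio is 1.5, but the missing data alone leave a plausible range of 1.0 to 2.0; how far the assumptions should be allowed to vary within that range is exactly the judgment call the text describes.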
Confidence in a Design’s Ability to Eliminate Bias
The science of drug safety concerns questions of causal, not just statistical,
relationships. That is, the important drug safety question is whether drug exposure actually causes an adverse outcome, not simply whether such an outcome occurs more frequently in people who choose to take the drug. Whether or not an observed increase in risk is likely to be causally related to drug exposure depends on a variety of non-statistical judgments about the design of the study, the analytic methods, and the underlying biologic mechanisms. Those judgments focus on whether something other than the drug itself could be causing the increase in risks—or in benefits. If the evidence pointing to such a relationship has been generated by a well-designed, well-conducted clinical trial in which drug treatment has been randomly assigned and there is adequate size and time for adverse effects to appear, confidence is typically fairly high that the difference in drug exposure is the cause of any differences in benefits or risks. However, if deviations from initial randomization occur (such as those caused by dropouts, missing data, or poor adherence), the conclusion of causality will rest heavily on judgments about the appropriateness of analytic procedures, the plausibility of alternative causes given the study designs, and knowledge of drug action and the natural history of disease. These issues assume even greater prominence in the analysis of observational studies. Those considerations are not always objectively quantifiable and can be the subject of disagreement and debate among scientists.
Two main determinants of the inherent quality of a study are precision and bias (Figure 3-1). Precision is the magnitude of variability in an estimated benefit or risk that can be ascribed to the play of chance. It is the only determinant that has a clearly quantifiable effect on the strength of evidence. The confidence limits or intervals around an estimate of benefits or risks are a quantitative indication of the precision of a study. The more precise a result, the stronger the evidence it will provide for one hypothesis versus another. In practice, study sample size is the prime determinant of precision—a large study produces an estimate of benefits or risks that has a small confidence interval, indicating high precision.
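The relation between sample size and precision can be illustrated with a simple Wald confidence interval for an event proportion; the 10 percent event rate and the sample sizes below are hypothetical.

```python
import math

def wald_ci(p, n, z=1.96):
    """Approximate 95% Wald confidence interval for an event proportion."""
    half_width = z * math.sqrt(p * (1 - p) / n)
    return (p - half_width, p + half_width)

# The same hypothetical 10% adverse-event rate, estimated at two sample sizes
for n in (400, 10_000):
    lo, hi = wald_ci(0.10, n)
    print(f"n={n}: width of 95% CI = {hi - lo:.4f}")
# n=400 gives a width of 0.0588; n=10,000 narrows it to 0.0118
```

Quadrupling precision requires a roughly 25-fold increase in sample size here, since the interval width shrinks only with the square root of n.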
FIGURE 3-1 Illustration of the contributions of different types of errors to the average effect of a drug in a study, the true effect of a drug in a study setting, and the true effect of a drug in nonstudy settings. The total error is the difference between the effect of a drug observed in the study and the true effect of the drug in nonstudy settings. If the bias is large, the confidence interval around the average effect in a study (represented as the random error) will not include the true effect in the study setting.
aThe average effect in the study is a hypothetical value that would be seen if the study were conducted multiple times.
Bias is the difference between the average effect of many hypothetical repetitions of a given study and the true effect in the population being studied. If the study draws research participants randomly from a target population (that is, the population likely to be prescribed the drug), the quality of evidence is determined, in part, by the degree of bias in the results. Unlike precision, bias cannot be eliminated by increasing sample size; only proper design or analysis can control or eliminate it. The presence of bias is not apparent in the numerical results of a study; it can only be discerned from close examination of the design and conduct of the study, and even then it may not be evident. A study without bias is said to have high internal validity. The three main types of bias that affect the internal validity of a study are confounding, selection bias, and information bias, which are described in Box 3-2.
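A minimal numerical sketch of confounding, with hypothetical prevalences, shows why increasing sample size does not remove bias: the probabilities below fix the value that the crude estimate converges to as the sample grows, so a larger study only narrows the confidence interval around the wrong value.

```python
# Hypothetical confounded comparison: a risk factor C raises the outcome risk
# and also makes exposure to the drug more likely; the drug itself does
# nothing (true risk ratio = 1.0). All numbers are illustrative.
p_c = 0.5                      # prevalence of the confounder
p_exposed = {1: 0.8, 0: 0.2}   # P(takes the drug | C)
p_outcome = {1: 0.3, 0: 0.1}   # P(adverse outcome | C), same with or without drug

def crude_risk(exposed):
    """P(outcome | exposure status), marginalizing over the confounder."""
    weight = {1: p_c, 0: 1 - p_c}
    num = sum(weight[c] * (p_exposed[c] if exposed else 1 - p_exposed[c]) * p_outcome[c]
              for c in (0, 1))
    den = sum(weight[c] * (p_exposed[c] if exposed else 1 - p_exposed[c])
              for c in (0, 1))
    return num / den

crude_rr = crude_risk(True) / crude_risk(False)
print(round(crude_rr, 2))  # → 1.86, although the true risk ratio is 1.0
```

Only a design or analysis that accounts for C (randomization, stratification, or adjustment) removes this distortion; collecting more of the same data does not.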
Confidence in the Transportability of Results
The Concept of Transportability
A study estimate of the benefit or risk associated with a drug can deviate from the results that patients would actually experience in wider clinical practice if the study participants were not representative of the wider target population. That disparity can occur when a study is conducted in hospitalized patients but the results are used to estimate the risks in outpatients or when a study is conducted in patients who do not have comorbidities or cotreatments but the drug will be used in patients who have both. The transportability5 of study results, also known as the external validity or generalizability of a study, is determined by the difference between the effect seen in the people studied and the effect in the wider target population.
The concept of transportability captures what is at stake in the traditional efficacy-versus-effectiveness distinction (Gordis, 2004). Traditionally, efficacy is a measurement of the beneficial effect of the drug with respect to a specific endpoint of interest under conditions that are optimized to favor an accurate assessment of the drug’s benefits and risks. Effectiveness is a measurement of the beneficial effects of a drug as it is used in the less controlled conditions of clinical practice. However, neither efficacy nor effectiveness is an absolute concept, and their distinction is less clear than commonly supposed. For example, a double-blind randomized controlled trial (RCT) conducted in one country might produce an estimate of a drug’s efficacy under conditions that might be optimal for that country but not relevant or applicable to the United States. There is no one unique set of “real-world conditions”—estimates of a drug’s effectiveness may vary among many populations and settings. The appropriate and informative question about a study is not whether it is “generalizable” in the binary sense, but to which populations and settings its results are transportable, to what degree, and what the determinants of that transportability are.

5The term transportability is used in this report, rather than external validity or generalizability, because the committee thinks that it better reflects a nonbinary characteristic. Different effects can occur in a variety of settings, and study results may be transportable to some populations or settings but not others, so transportability may not be a simple binary property.

BOX 3-2
Explanations of the Three Main Types of Bias

Confounding occurs when the populations compared in a study differ in important predictors of the outcome being studied other than an exposure of interest (such as exposure to a drug), that is, when another risk factor is associated with both the exposure and the outcome of interest and is a cause of the outcome. For instance, a disease state may affect both the use of a drug and the clinical outcome of interest.

Selection bias results when the exposure affects participation (“selection”) in the study or analysis and selection is associated with the outcome of interest. For example, if the use of a drug increases both the risk of harm and the probability that people using the drug will drop out of the study and be lost to follow-up, the risk of harm is likely to be underestimated because the people whose data are used in estimating the incidence of the adverse event (that is, those not lost to follow-up) are less likely to have experienced the harm simply because they remained in the study.

Other types of selection bias that affect estimates in randomized controlled trials and observational studies include missing data and nonresponse bias, healthy-worker bias, and self-selection bias (Hernán and Robins, 2012). Although the terms selection bias and confounding are sometimes used interchangeably outside epidemiology, it is valuable to use the terms to refer to the two different types of bias (see Figure 3-1).

Information bias, caused by certain patterns of measurement error, occurs when “the association between treatment and outcome is weakened or strengthened as a result of the process by which the study data are measured” (Hernán and Robins, 2012). Errors in measuring and classifying exposures, outcomes, and confounders can influence the strength and direction of effect estimates.
Nontransportability is caused by different distributions of “effect-modifiers” in the study and target settings (Hernán and Robins, 2012). For example, if women are more likely than men to experience an adverse effect as a result of taking a drug, sex would be an effect-modifier. Effect-modifiers may include characteristics of the patients (such as severity of disease or comorbidities), nonadherence, cotreatments, and cointerventions, such as the monitoring that typically takes place in clinical trials. For example, the risk of an adverse effect may be lower in a study in which patients are closely monitored than in a setting in which such monitoring is not part of clinical care. Variations in dosage and administration of the treatment may also present different or additional risks relative to those identified in the trial (Weiss et al., 2008).
The scale on which effect-modification is measured also matters. The public health question typically depends on the degree of additional absolute risk incurred by drug exposure. If two populations are at different baseline risk for an adverse effect, a relative risk of 2 will be more dangerous for the high-risk group than for the low-risk group. This difference will not appear as effect-modification on the multiplicative scale, which is most often used in epidemiology, but it will appear as effect-modification on the additive scale, the scale relevant to public health decisions. So if multiplicative models are being used for analysis, close attention must be paid to the variation in baseline risks from one population to another when transportability is assessed.
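The arithmetic can be sketched with hypothetical baseline risks: the relative risk is identical in both groups, but the absolute excess differs 50-fold.

```python
# Two hypothetical populations with the same relative risk (2.0) but very
# different baseline risks for the adverse event.
relative_risk = 2.0
baseline = {"low-risk group": 0.001, "high-risk group": 0.05}

# Excess absolute risk attributable to the drug in each population
risk_difference = {name: relative_risk * p0 - p0 for name, p0 in baseline.items()}
for name, rd in risk_difference.items():
    # No effect-modification on the multiplicative scale (RR = 2.0 in both),
    # but a 50-fold difference in excess cases on the additive scale.
    print(f"{name}: {rd * 1000:.0f} extra cases per 1000 treated")
```

A multiplicative model fitted to these two populations would report a single, homogeneous relative risk and thereby hide the public health difference between 1 and 50 excess cases per 1,000 patients treated.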
The assessed risks in a given population can also differ according to how an adverse effect is elicited from the patient. Whether a study depends on passive reporting of adverse events or actively asks patients about specific adverse events can affect the reported frequency several-fold (Bent et al., 2006; Ioannidis et al., 2006).
Assessing the transportability of the results of any study requires clinical, pathophysiologic, and epidemiologic knowledge of the factors that can change a drug’s benefits or harms and of how the factors are distributed in the study and in community settings. RCTs often do not have adequate power to detect such effect-modifiers statistically, and relevant effect-modifiers (such as co-treatments) may be absent or unmeasured. In the absence of such information, conducting a study in the community setting is the best way to obtain direct knowledge about a drug’s effect in routine clinical practice. In the absence of high-quality information from a community-based setting, disputes about transportability—the relevance of study information to the public health context—can be among the most difficult to resolve because the empirical evidence base may be thin and claims based on clinical experience or claimed knowledge of biologic mechanisms hard to adjudicate. The experience with the RECORD trial shows the complexity of
this issue; some criticisms of the RECORD trial arose because of weaknesses that could be partly attributed to the attempt to conduct it in pragmatic fashion in a community setting.
Transportability is not typically treated as a formal source of error and is not taken into account in traditional evidence hierarchies, which focus almost exclusively on precision and bias. But a public health perspective requires that the focus be on the effects of a drug as it is used in the general population. For that, transportability is a key consideration. From the perspective of FDA’s decisions, therefore, transportability should be treated as a potential source of error, along with bias and imprecision, as displayed graphically in Figure 3-1. That approach leads to treating the transportability of study results as a formal contributor to the relevance of evidence for a given decision, rather than as a minor qualifier.
Issues related to transportability were raised repeatedly (using the more familiar term generalizability) in the FDA briefing document for the rosiglitazone hearings. One of FDA’s statistical reviewers questioned the relevance of research done with the UK General Practice Research Database (GPRD) (Yap, 2010):
While the GPRD database captures information for a large number of subjects, the generalizability of these data to the U.S. population might be difficult given varying prescribing practices, risk factors, and medical practices.
Critiquing a study that used a multistate Medicaid database, the FDA statistical reviewer stated (Yap, 2010):
Cohort eligibility required that a patient have at least one inpatient claim. This led to a huge reduction in cohort size from approximately 307,000 individuals to approximately 95,000. In addition, the cohort was restricted to patients receiving Medicaid services. Findings from this restrictive cohort might not be generalizable to the intended population. … The diabetes population studied comprises mostly older and generally sicker patients thus raising concerns of generalizability of results to healthier and younger diabetic populations.
Finally, the transportability of another study was criticized for having been done in Canada (Yap, 2010):
The population studied comprises patients aged 65 or older residing in Quebec, Canada. Therefore, the results reported in the publication cannot be generalized to a population of patients below 65 years of age. In addition, results might not be fully generalizable to non-Canadian population given varying baseline characteristics and differences in access to care.
In the text above, the evidential basis for claiming nontransportability is unclear at best, and therefore difficult to adjudicate. Such assertions may be
reasonable, but whether they should be accepted and how much they should affect the assessment of a study require more detailed explanation of why the differences noted would be expected to modify the drug effect and by how much. For example, is it literally true that evidence derived from people over 65 years old cannot be applied to those who are younger? What is the evidence for that claim, and how big is that effect? What if the target population had an age range of 60–64 years? Those are the kinds of questions that must be asked because differences between the study and target populations will always exist, even if the differences are only between past and future members of the same community.
In summary, if the “true effect of a drug” is defined in public health terms as its benefit–risk profile when it is used in medical practice in the general population of patients for whom the drug is used, or in a subset of that population, all three sources of error—bias, imprecision, and nontransportability—must be considered as contributing to the relevance of evidence generated by studies. Claims of nontransportability should be supported with evidence that the differences between settings or populations would be expected to introduce differences in effects that are clinically meaningful.
Randomized Controlled Trials Versus Observational Studies
Observational studies are a major source of evidence related to drug safety and are playing an increasing role in FDA’s oversight of drug safety (Hamburg, 2011). Such designs play a relatively minor role in establishing drug efficacy in the preapproval stage at FDA, where RCTs are regarded as fulfilling the statutory requirement of “adequate and well-controlled” studies to support a marketing claim. However, as discussed in this section, the relative value and quality of evidence from the two classes of designs can be quite different for efficacy versus safety endpoints.
The different quality attributed to evidence from observational and randomized designs was a central aspect of the rosiglitazone debate. It is often stated that odds ratios (ORs) less than 2, and certainly less than 1.5, cannot be reliably regarded as different from unity if they are generated by an observational study. That claim is based on a sense that there are unknowable and uncontrollable biases in all observational studies that are not discernible or controllable even with close examination of study details. The issue is described in the following passage from the Avandia memorandum from Dal Pan, director of the FDA Office of Surveillance and Epidemiology (Dal Pan, 2010), in which this viewpoint is contested:
The results of the observational studies strengthen the concern over the risks of rosiglitazone, especially when compared to pioglitazone. Observational drug safety studies are often criticized because they lack the experimental design rigor of a controlled clinical trial. Specifically, there is often concern that patients who are prescribed a particular medicine are different from those who are prescribed an alternative treatment, in ways that may be correlated with the outcome of
interest. This phenomenon is known as channeling bias, and is often a concern when measures of relative risk are below 2.0, when the effect of unmeasured confounders could account for the observed findings. While this concern is generally valid, it should not be automatically invoked to dismiss the results of observational studies in which the measure of relative risk is below 2.0. Data from the CMS observational study, for example, indicate that rosiglitazone and pioglitazone recipients were similar with regard to multiple cardiac and non-cardiac factors, a finding that suggests minimal channeling bias. Furthermore, the risk estimates from the observational studies are generally similar to those from the meta-analyses of clinical trials. Thus, dismissing the results of the observational studies simply because the observed measures of risk may be due to channeling bias may not be appropriate.
Traditional evidence hierarchies that rely on the type of study design to classify evidence generally focus on the strengths and weaknesses of those designs with respect to the evaluation of therapeutic efficacy (Barton et al., 2007; Owens et al., 2010). Study designs are ranked according to their capacity to generate unbiased evidence about efficacy endpoints, and considerations of transportability are given no weight.
The RCT design, in theory, produces the highest confidence that observed differences are caused by drug exposure, not by ancillary characteristics that might be associated with drug exposure. Such ancillary characteristics are known as confounders. In the context of a perfectly designed and conducted RCT—one without patient dropout, missing data, or nonadherence—causal inference effectively becomes statistical inference. That is, there is confidence that any quantitative differences between groups in the endpoints evaluated were due to the randomized intervention. The likelihood that a statistical hypothesis about association is true becomes equivalent to the likelihood that a hypothesis about causality is true.
As conduct deviates from ideal design, however, the certainty about a causal hypothesis decreases with the decreasing certainty that the design or analysis has adequately “controlled” for other causal factors. Such deviations include patient dropout, crossover between treatment arms, loss to followup, missing data, nonadherence to treatment, differential cotreatment, and differential measurement (Dal Pan, 2010). Those aspects of study design and conduct must be assessed to determine the evidential value of an RCT, especially if oversight of the study might have been complicated by its being conducted at multiple, overseas sites (Frank et al., 2008; GAO, 2010a, 2010b; Greene et al., 2006; Manion et al., 2009). Observational studies are affected by similar issues, with biases induced by treatment selection in place of those caused by deviation from treatment assignment.
Various characteristics of safety endpoints, with other constraints, may favor the strength of observational evidence over evidence from RCTs in generating valid and reliable evidence needed for benefit–risk assessments in the postmarketing setting. In a standard RCT, participants are randomly assigned to different treatment arms. The groups are then compared for observed risks of developing a particular health outcome, such as myocardial infarction. The most important property of an RCT is that—in large samples—the baseline distribution of risk factors, both known and unknown, is expected to be equal among groups. Under a number of assumptions—including complete event ascertainment, no differential loss to followup, and perfect treatment compliance—the estimate of the drug’s efficacy is unbiased and provides high confidence that any observed association between the drug and the efficacy endpoint is due to the drug.
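The balancing property of randomization can be illustrated with a small simulation; the 30 percent prevalence, the coin-flip assignment, and the cohort size are illustrative assumptions.

```python
import random

random.seed(12345)  # fixed seed for reproducibility; any seed shows the pattern

# Hypothetical cohort in which 30% of patients carry an unmeasured risk factor
n = 100_000
has_risk_factor = [random.random() < 0.30 for _ in range(n)]
# Treatment assigned by coin flip, independent of the risk factor
treated = [random.random() < 0.5 for _ in range(n)]

n_treated = sum(treated)
prev_treated = sum(r for r, t in zip(has_risk_factor, treated) if t) / n_treated
prev_control = sum(r for r, t in zip(has_risk_factor, treated) if not t) / (n - n_treated)
print(round(prev_treated, 3), round(prev_control, 3))  # both close to 0.30
```

Because assignment ignores the risk factor, its prevalence is nearly identical in the two arms even though no one measured it; that is what randomization buys, and what deviations such as dropout and nonadherence erode.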
Observational studies of efficacy, in contrast, are often subject to a variety of biases, the most common of which is known as confounding by indication (Vandenbroucke and Psaty, 2008). Confounding by indication, also known as “channeling bias”, occurs when treatment assignment is based on a risk factor for an outcome. In clinical medicine, patients are typically treated to improve their chances of a beneficial outcome. That makes it difficult to separate the effect of the drug itself from that of the patient’s condition that led to the drug’s use, that is, the drug’s indication. When physicians treat some patients differently for reasons related to their risk for various outcomes, confounding by indication is likely. For example, if sicker patients choose medical care for a given condition more often than surgery because they are afraid that they will not survive the operation and if sicker patients are more likely to die whether or not they receive surgery, observational studies of surgical versus medical care will show that surgery is safer than medical care even if the two treatments are equally efficacious.
Confounding by contraindication is the corresponding concern in studies that evaluate safety endpoints, although it is not as common as confounding by indication. If an adverse effect of a drug is known, physicians might avoid prescribing it for patients who are at higher risk for that effect (for example, the use of nonsteroidal anti-inflammatory drugs and GI bleeding, or the use of aspirin and anticoagulants together). If the use of the drug increases the risk of a particular adverse effect and patients who were at higher risk for that effect avoid taking the drug, the results will be biased toward the hypothesis of no effect. The findings of such a study may mistakenly indicate that the use of the drug does not increase the risk of the adverse event or, worse, that it prevents it. Such a treatment approach is in the patient’s best interest, but it makes observational studies of the harms associated with drugs difficult to conduct well. Confounding by contraindication, however, is not a major concern in studies of unexpected adverse events. If the risk itself or the factors that affect it are unknown, treatment cannot be based on avoidance of the risks (Golder et al., 2011), although it could be based on correlates of those unknown risks (for example, age or disease severity).
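How contraindication can make a harmful drug look protective can be sketched numerically; all prevalences and risks below are hypothetical.

```python
# Hypothetical setting: the drug truly doubles the risk of the adverse event,
# but prescribers avoid it in patients at high baseline risk
# (confounding by contraindication). All numbers are illustrative.
true_rr = 2.0
share_high_risk = 0.3
p_prescribed = {"high": 0.1, "low": 0.5}   # P(gets the drug | risk group)
baseline_risk = {"high": 0.10, "low": 0.02}

def crude_risk(exposed):
    """P(adverse event | exposure status), marginalizing over risk group."""
    weight = {"high": share_high_risk, "low": 1 - share_high_risk}
    num = den = 0.0
    for g in ("high", "low"):
        p_e = p_prescribed[g] if exposed else 1 - p_prescribed[g]
        risk = baseline_risk[g] * (true_rr if exposed else 1.0)
        num += weight[g] * p_e * risk
        den += weight[g] * p_e
    return num / den

crude_rr = crude_risk(True) / crude_risk(False)
print(round(crude_rr, 2))  # → 0.96: the harmful drug appears mildly protective
```

The unexposed group is dominated by high-risk patients who were steered away from the drug, so the crude comparison inverts the true doubling of risk.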
Empirical evidence suggests that the findings of observational studies of harms can be similar to the findings of RCTs for the same drugs and harms (Vandenbroucke, 2006). In a recent study of safety outcomes, Golder et al. (2011) compared the results of meta-analyses of RCTs with the results of meta-analyses
of observational studies. Their meta-analysis of the meta-analyses included 58 drug–adverse event comparisons. The ratio of ORs was used as the method of comparison, and the RCTs were associated with only a slightly higher estimate of risk (ratio of odds ratios, 1.03; 95% CI, 0.93–1.15). Of the 58 comparisons, 64 percent agreed completely (same direction and same level of significance), although some of the studies had low statistical power. This large meta-analysis provides empirical support for the claim that, in large samples, observational studies can yield findings on adverse events that are similar to those of RCTs. However, more empirical research is needed into the factors that determine concordance between observational and randomized studies of harms.
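The ratio-of-odds-ratios comparison used in studies of this kind can be sketched on the log scale, where each standard error is recoverable from a reported confidence interval. The code below is a minimal illustration with hypothetical numbers (it is not a reanalysis of the Golder et al. data), and it assumes the two estimates are statistically independent:

```python
import math

def ratio_of_odds_ratios(or_rct, ci_rct, or_obs, ci_obs, z=1.96):
    """Compare an RCT odds ratio with an observational odds ratio.

    Each CI is a (lower, upper) 95% interval; the standard error of each
    log OR is recovered from the CI width on the log scale, and the two
    estimates are treated as independent."""
    se_rct = (math.log(ci_rct[1]) - math.log(ci_rct[0])) / (2 * z)
    se_obs = (math.log(ci_obs[1]) - math.log(ci_obs[0])) / (2 * z)
    log_ror = math.log(or_rct) - math.log(or_obs)
    se_ror = math.sqrt(se_rct**2 + se_obs**2)
    ror = math.exp(log_ror)
    ci = (math.exp(log_ror - z * se_ror), math.exp(log_ror + z * se_ror))
    return ror, ci

# Hypothetical pair of estimates for the same drug-harm question:
# RCT meta-analysis OR 1.30 (1.05-1.61), observational OR 1.25 (1.10-1.42).
ror, ci = ratio_of_odds_ratios(1.30, (1.05, 1.61), 1.25, (1.10, 1.42))
print(f"ratio of odds ratios: {ror:.2f} (95% CI {ci[0]:.2f}-{ci[1]:.2f})")
```

A ratio near 1.0 with a confidence interval spanning 1.0, as in the hypothetical output here, indicates no detectable discrepancy between the randomized and observational estimates.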
A number of features of safety endpoints and the postmarketing context strengthen the role of observational studies in generating valid and reliable evidence needed to answer public health questions of interest. Differences in the frequency of efficacy and safety endpoints and the timescale on which they occur affect comparative judgments about the quality of evidence generated by RCTs and observational studies. Adverse effects resulting from the use of a drug may be severe but rare, and the sample size of preapproval RCTs, or achievable postmarketing RCTs, may be insufficient to detect rare or delayed outcomes. Preapproval RCTs are also likely to miss adverse effects resulting from chronic use or those arising after a long latent period, whereas observational studies, particularly those based on existing data, can typically provide longer followup. Observational studies based on data sources collected from large populations with long followup can often report a greater number of adverse events than typical RCTs. However, any design with long followup, whether it is concurrent or non-concurrent, needs to be scrutinized very carefully for the extent and pattern of missing data; over time, the problems with selective retention or reporting can be substantial in all designs.
A second way in which the strength of evidence from an RCT for safety can be weakened is when the adverse effects are unknown or unforeseeable. Such endpoints by definition cannot be pre-specified (Claxton et al., 2005). It has been shown that the quality and consistency of the reports and measurements of non-specified endpoints are often poor (Lilford et al., 2003; Thomas and Petersen, 2003). This problem will affect prospective observational studies as well, but it nevertheless can narrow the internal validity gap between randomized and non-randomized designs.
Potential confounders of efficacy endpoints may differ from confounders of safety endpoints; if only the former are measured, effect estimates of safety endpoints may not be appropriately adjusted for confounders (see for example Camm et al., 2011). This is likely to affect observational studies more than RCTs, whose results will typically require less adjustment, if any, depending on the extent and patterns of deviations from randomization and degree and determinants of missing information.
Finally, as noted previously, the transportability of evidence from obser-
vational studies to populations of interest may be superior to that of evidence produced by an RCT. Because RCTs often restrict eligibility to patients for whom the anticipated benefit is thought to outweigh (known) drug risks, RCTs are incapable of detecting adverse events that may arise only in the populations excluded from the trial, which are often characterized by a wider array of comorbidities, different disease severity, concomitant treatments, or other risk factors (such as age, sex, low socioeconomic status, poor monitoring of dose, adherence, or outcomes) that may modify the effects of treatment. Observational studies can include people who are more representative of those who receive the treatment of interest in the general population and in diverse care settings. Thus, less restrictive eligibility criteria typically used in observational studies can increase the transportability of the resulting effect estimates.
The eligibility criteria for observational studies, however, are sometimes restricted in an attempt to limit the magnitude of confounding (Psaty and Siscovick, 2010). For example, consider an observational study to compare cardiovascular risk in initiators vs noninitiators of statin therapy in a particular population. Suppose that all patients in that population who have LDL cholesterol greater than 4.9 mmol/L already receive statin therapy. The observational study should exclude current users and thus restrict participation to patients who have LDL cholesterol less than 4.9 mmol/L; otherwise, it would be difficult to adjust the effect estimate for confounding by concentration of LDL cholesterol. The desire for increased transportability in observational studies should be tempered by the need to ensure internal validity.
There is no study in which all measurements are perfectly reliable or in which many judgments have not already been made before study data are analyzed. In studies of drug safety, there is a long documented history of underreporting, selective reporting, or misclassification of harms (Ioannidis and Lau, 2001; Lilford et al., 2003; Talbot and Walker, 2004). Missing data are common in such studies, and the validity of statistical methods to account for missing data rests on assumptions that often cannot be confirmed by using the data. Data quality affects not only estimates of harms themselves but also the measurement of other risk factors for those harms, such as cotreatments, comorbidities, and patient-specific characteristics. It is often critical to understand the exact operational procedures by which harms were identified and reported and other key data recorded if one is to judge properly whether data on harms reported in a study are reliable. That degree of detailed operational knowledge is often not available to those outside the study, or, if the extent of that knowledge differs among scientists, their assessments of the reliability of any ensuing inferences may differ as well.
Disagreements About the Choice of Statistical Analysis
Judgments about the most appropriate statistical model for a given study
depend on many implicit and often unverifiable assumptions, both statistical and biologic, and scientists often disagree about which statistical methods are most appropriate. Different ways of coding the same data can change the strength of the evidence that they provide. For example, dichotomizing a continuous variable or combining harms of different severities can lead to vastly different estimates of effect size. The apparent strength of evidence can also depend heavily on how issues of multiplicity (that is, testing statistically for multiple endpoints) are treated. If researchers evaluate many adverse events statistically and only a few are observed to have increased risks, the strength of evidence for those adverse events depends on whether the “data” are treated as all the comparisons taken together or as each taken separately for the specific adverse events whose risks seem to be increased. Sometimes this problem is handled through multiplicity adjustments, but there is no ideal or universally agreed-on solution for deciding how much the analytic strategy should depend on patterns seen after the data have been observed.
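The multiplicity problem can be illustrated with a simple simulation. In this hypothetical sketch, a drug has no true effect on any of 20 monitored adverse events, yet unadjusted testing will often flag some of them by chance; a Bonferroni correction, one common though not universally preferred adjustment, controls the family-wise error rate:

```python
import random

random.seed(42)
# Hypothetical monitoring of 20 adverse events for a drug with NO true
# effect on any of them: each p-value is uniform on (0, 1) under the null.
p_values = [random.random() for _ in range(20)]

alpha = 0.05
naive_hits = [p for p in p_values if p < alpha]            # each test taken alone
bonferroni_hits = [p for p in p_values if p < alpha / 20]  # all tests taken together

print(f"flagged without adjustment: {len(naive_hits)}")
print(f"flagged with Bonferroni:    {len(bonferroni_hits)}")
# With 20 independent null tests at alpha = 0.05, the chance of at least
# one false positive is 1 - 0.95**20, about 64 percent.
print(f"family-wise false-positive probability: {1 - 0.95**20:.2f}")
```

The point is not that Bonferroni is the right adjustment, only that the apparent strength of evidence for any single flagged event depends on whether the whole family of comparisons is taken into account.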
Decision-makers cannot be expected to be expert in the many technical issues involved in statistical modeling (see Chapter 2 for more discussion of needed expertise for decision-making). The intricacies and nuances of statistical modeling highlight the importance of having inputs from several statisticians or others with deep technical understanding, just as the input from multiple scientists familiar with the content is routine. Data do not always speak for themselves—they speak through the filter of statistical models—and getting input from multiple experts in statistical analysis and modeling can be critical in understanding the extent to which the models being used are introducing clarity or distortion.
A further source of disagreement about statistical analyses is whether to analyze the data from a study according to the intention-to-treat (ITT) perspective or “as treated”. Assuming that all confounders are identified and well measured, the simplest approach to comparing two treatments is an analysis that follows the ITT principle. In RCTs, an ITT analysis measures the effect of being assigned to a treatment; when all research participants initiate the treatment, an ITT analysis measures the effect of treatment initiation. For ITT analyses of large RCTs, only data on each individual’s treatment assignment and outcome are needed (Hernán and Hernandez-Diaz, 2012).
The observational analogue to ITT analysis needs to adjust for potential confounders. An observational ITT estimate will have only a causal interpretation as the effect of treatment initiation if all confounders have been appropriately identified, measured, and included in the analysis. Adjustment methods include stratification, outcome regression, standardization, matching, restriction, inverse-probability (IP) weighting, and g-estimation (Hernán and Robins, 2006).
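Of the adjustment methods listed, stratification followed by standardization is the simplest to illustrate. In the hypothetical cohort below there is a single measured confounder (disease severity); the crude comparison is biased because treated and untreated patients differ in severity, while the standardized estimate recovers the within-stratum effect (all numbers are invented for illustration):

```python
# Hypothetical observational cohort with a single confounder (severity).
# Each stratum records (n treated, events treated, n untreated, events untreated).
strata = {
    "mild":   (400, 20, 100, 4),   # mild patients are mostly treated
    "severe": (100, 30, 400, 80),  # severe patients mostly avoid the drug
}

def standardized_risk_difference(strata):
    """Stratify on the confounder, then standardize to the total population:
    average the stratum-specific treated-vs-untreated risk differences,
    weighting each stratum by its share of all patients."""
    total = sum(nt + nu for nt, _, nu, _ in strata.values())
    rd = 0.0
    for nt, et, nu, eu in strata.values():
        weight = (nt + nu) / total
        rd += weight * (et / nt - eu / nu)
    return rd

# The crude (unstratified) comparison mixes the strata and is confounded.
crude_treated = sum(et for _, et, _, _ in strata.values()) / sum(nt for nt, _, _, _ in strata.values())
crude_untreated = sum(eu for _, _, _, eu in strata.values()) / sum(nu for _, _, nu, _ in strata.values())
print(f"crude risk difference:        {crude_treated - crude_untreated:+.3f}")   # looks protective
print(f"standardized risk difference: {standardized_risk_difference(strata):+.3f}")  # harmful within strata
```

In this sketch the crude comparison suggests the drug is protective, while within every stratum it raises risk; only the standardized estimate has a causal interpretation, and only under the untestable assumption that severity is the sole confounder.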
Instrumental-variable approaches can also be used to estimate the effect of treatment initiation in observational studies (Hernán and Robins, 2006). An instrumental variable is a variable on which exposure, but not the outcome, depends. In an RCT, the “instrument” is the randomization itself, which determines drug treatment but by itself has no relationship with the outcome. “Natural experiments” often have an embedded instrument that causes groups to be treated or exposed in different ways unrelated to the group characteristics. The most common instrument is geography, that is, different regions of the country (or care settings) that use different treatment regimens for essentially equivalent patients. The instrumental-variable method, however, relies on strong assumptions, the primary one being the validity of the instrument itself, and these always have to be examined closely.
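The logic of randomization as an instrument can be sketched with the classic Wald estimator, which divides the instrument’s effect on the outcome by its effect on treatment uptake. The numbers below are hypothetical:

```python
def wald_iv_estimate(y_mean_z1, y_mean_z0, t_rate_z1, t_rate_z0):
    """Classic Wald instrumental-variable estimator.

    z is the instrument (randomized assignment, or region in a natural
    experiment), t is the treatment actually received, y is the outcome.
    The effect of treatment is the instrument's effect on the outcome
    divided by its effect on treatment uptake."""
    itt_outcome = y_mean_z1 - y_mean_z0
    itt_uptake = t_rate_z1 - t_rate_z0
    if itt_uptake == 0:
        raise ValueError("instrument does not shift treatment uptake")
    return itt_outcome / itt_uptake

# Hypothetical trial: assignment raises uptake from 10% to 90% and the
# event rate by 4 percentage points.
effect = wald_iv_estimate(y_mean_z1=0.16, y_mean_z0=0.12,
                          t_rate_z1=0.90, t_rate_z0=0.10)
print(f"estimated effect of treatment received: {effect:.3f}")
```

The estimate (0.04 / 0.80 = 0.05) is valid only if the instrument affects the outcome solely through treatment, which is exactly the assumption that must be examined closely outside the randomized setting.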
When people drop out of a study or are otherwise lost to followup, their outcomes cannot be ascertained. As a result, regardless of whether the study is an observational study or an RCT, the ITT effect cannot be calculated directly. Loss to followup forces investigators to make untestable assumptions about why people were lost to followup. If one assumes that the people lost and not lost to followup are perfectly comparable, one would restrict the ITT analysis to participants on whom there was complete followup. A safer approach is to adjust for measured predictors of loss to followup that also predict the outcome (NRC, 2010). Such adjustments can be appropriately achieved with longitudinal outcome models by regression if the factors are non–time-varying or by inverse probability weighting otherwise.
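Inverse probability weighting for loss to followup can be sketched as follows. In this hypothetical cohort, a measured risk factor predicts both dropout and the outcome, so a completers-only analysis is biased; weighting each completer by the inverse of the estimated probability of remaining in the study within their stratum corrects the estimate, under the untestable assumption that dropout is random within strata:

```python
# Hypothetical cohort: (outcome, completed followup, high-risk stratum).
# High-risk patients both drop out more often and have more events,
# so restricting to completers biases the event rate downward.
records = [
    *[(1, True, True)] * 6, *[(0, True, True)] * 4,      # high risk, retained
    *[(None, False, True)] * 10,                          # high risk, lost
    *[(1, True, False)] * 10, *[(0, True, False)] * 70,  # low risk, retained
    *[(None, False, False)] * 20,                         # low risk, lost
]

def ip_weighted_mean(records):
    # Estimate P(complete followup | stratum) from the data.
    retention = {}
    for stratum in (True, False):
        in_stratum = [r for r in records if r[2] == stratum]
        completed = [r for r in in_stratum if r[1]]
        retention[stratum] = len(completed) / len(in_stratum)
    # Each completer stands in for 1 / P(complete) patients in their stratum.
    num = sum(r[0] / retention[r[2]] for r in records if r[1])
    den = sum(1 / retention[r[2]] for r in records if r[1])
    return num / den

completers = [r[0] for r in records if r[1]]
print(f"completers-only event rate: {sum(completers) / len(completers):.3f}")
print(f"IP-weighted event rate:     {ip_weighted_mean(records):.3f}")
```

Here the weighted rate exceeds the completers-only rate because the high-risk stratum is underrepresented among completers; the correction works only for predictors of dropout that were actually measured.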
The magnitude of the ITT effect depends on the type and patterns of nonadherence, which may vary among studies, whether they are observational studies or RCTs. Dependence of the ITT effect on nonadherence makes the effect particularly unfit for safety and noninferiority studies. One alternative to estimating the ITT effect is estimating the effect of treatment if all participants had adhered to the intended treatment regimen. In RCTs, that approach would estimate the effect of treatment if no one had deviated from the protocol. Such an effect is sometimes referred to as the effect of continuous treatment. To estimate the effect of continuous treatment, whether in observational studies or in RCTs, one needs to compare groups of people according to the treatment they actually received rather than the treatment to which they were assigned (an as-treated analysis) and make untestable assumptions about the time-varying reasons why people adhere or do not adhere to treatment. Specifically, valid estimation of the effect of continuous treatment requires that all time-varying factors that predict both adherence to treatment and the outcome of interest be measured reasonably well.
In RCTs, as-treated comparisons ignore the randomization assignment and therefore involve comparisons of groups that are not necessarily balanced with respect to prognostic factors. As-treated estimates can be confounded in RCTs. The problem with using ITT in safety analyses was noted in the FDA Avandia briefing documents (Graham and Gelperin, 2010a):
The primary analysis for RECORD was intention-to-treat (ITT), which is generally accepted as the preferred analytic method for trials conducted to show
efficacy. It is conservative in that poor study execution or inadequate follow-up will serve to make it more difficult to show a difference from the null. For purposes of safety, where a safety concern has been raised and is under evaluation, the ITT approach is protective of the drug at the potential expense of patient safety. Patients who drop out of a study and for whom outcomes might not be counted, and patients who stop the drug and hence are probably not at the same risk of a cardiovascular event off the drug as they were while on it, will bias the estimated event rates towards the null under an ITT approach. In studies for safety, the preferred analytic approach is on-treatment.
The assertion above that the on-treatment approach is preferred for safety analyses shows how difficult it is to assess such studies. Both the ITT and on-treatment approaches can introduce bias into the effect estimate, and on-treatment analyses are not the only alternative to ITT; these are situations in which causal inference methods are most appropriate (Ten Have et al., 2008).
Another difficulty is that the predictors of adherence may be affected by whether a patient took treatment earlier in the followup. In that setting, a simple as-treated analysis with standard adjustment (regression) may be biased, and adjustment via IP weighting or g-estimation is required (Toh and Hernán, 2008). Those methods can be used to estimate the effect of dynamic treatment regimens (for example, take treatment A until toxicity appears, and then switch to treatment B). Instrumental-variable estimation can also be used to estimate the effect of continuous treatment. Unlike all other methods, instrumental-variable estimation does not require measurement of the joint predictors of adherence and the outcome. It is less controversial for RCTs, because the randomized assignment is a known instrument, than for observational studies, in which it must be justified (Gelfand and Mallick, 1995).
In summary, when dealing with observational or RCT data involving loss to followup and nonadherence, all analytic approaches rely on untestable assumptions, which may influence effect estimates in unknown ways. One way to assess the sensitivity of effect estimates to such assumptions is to conduct both ITT analysis to estimate the effect of treatment assignment (with and without adjustment for loss to followup) and analyses adjusted for adherence to estimate the effect of continuous treatment (via the statistical approaches mentioned). Many approaches, most notably ITT, that are deemed conservative when used for efficacy determinations can be anticonservative when used for safety analyses in that statistical signals of drug harm can be missed.
Relevance of New Evidence to the Public Health Question
In evaluating the evidence that a study provides in support of a regulatory decision, it is important to consider the relevance of the study to the public health question that motivates the decision. This section discusses aspects of a study and analysis that affect the relevance of a study to the public health question.
The relevance of a study to a regulatory decision depends on the hypotheses that the study is designed to test and on how suitable the hypotheses are for providing evidence about the public health question of interest. The questions “Does a given drug cause excess harm?” and “Does a given drug provide benefits?” seem straightforward but need to be refined further to become testable scientific hypotheses. A testable scientific hypothesis must specify the intervention or exposure, the study population, the setting, the comparator, and the outcomes. It is rare for two studies to pose the scientific question, explicitly or implicitly, in exactly the same way. For example, in studies investigating adverse cardiovascular effects, one study could define the adverse-event endpoint as myocardial infarction or death, and another might develop a composite endpoint that includes those plus unstable angina, hospitalization, and stroke. Even when all such endpoints are measured in a given study, there might be disagreement about whether or how they should be combined into a composite endpoint. The timing of the adverse event relative to the drug exposure might also be an issue; the relevant time window might vary among studies, reflecting disagreement among scientists. Such disagreements are often manifested as arguments among scientists about whether particular aspects of study design are “right” or “wrong”. A better way to frame the disagreement, however, is that the different studies address different questions. The real issue, with respect to regulatory decision-making, is which questions are most important from the standpoint of the regulatory decision. Are the questions that the study addresses similar enough to the public health questions of interest?
Trial interpretation is most profoundly affected by the underlying hypothesis in the case of “noninferiority” trials. Superiority trials for efficacy are the most familiar type of design used in the drug-approval process (Erik, 2007). The objective of a superiority trial is to generate evidence that a particular drug is superior to a comparator, which is often a placebo but could be an active treatment (Lesaffre, 2008). The incentives for high-quality design and conduct in such studies are strong because a poorly conducted study can bias the result toward a finding of no difference. Because of their incentives for scrupulous study conduct and clear interpretation, superiority trials are generally preferred for establishing efficacy.
However, when well-evaluated therapies are accepted as effective for a serious indication or condition, it can be difficult or impossible to withhold them or difficult for a new treatment to exceed them appreciably in efficacy. Therefore, a commonly used approach to the evaluation of efficacy is the noninferiority design, which attempts to show that the new experimental therapy is not worse than the standard therapy by a particular margin. The margin needs to be small enough for it to be assumed that the new therapy is still superior to placebo even if a placebo treatment is not included in the study (Fleming et al., 2011).
The FDA draft guidance on noninferiority studies, praised by the Government Accountability Office (2010c), calls the design and conduct of such trials a “formidable challenge” (FDA, 2010b). Fleming (2008) lists three conditions that
permit reliable estimates of the efficacy of an experimental therapy in an active-control noninferiority trial: the effect estimates of the standard therapy that is used as the active control should be of substantial magnitude, precisely estimated, and relevant to the setting of the current trial. From 2002 to 2009, noninferiority designs were used in 43 (25 percent) of 175 new drug applications (NDAs) for new molecular entities; more than half of the 43 NDAs that used noninferiority designs were for antibiotics, and other drug classes for which they were used included anticoagulants (GAO, 2010c).
Recently FDA has begun to use the noninferiority trial design for the study of safety. For instance, the PRECISION (Prospective Randomized Evaluation of Celecoxib Integrated Safety vs Ibuprofen or Naproxen) trial is randomizing 20,000 patients who have osteoarthritis or rheumatoid arthritis to receive celecoxib, ibuprofen, or naproxen (Becker et al., 2009). The TIDE (Thiazolidinedione Intervention with Vitamin D Evaluation) trial is another example (Juurlink, 2010). FDA has also required a series of noninferiority safety trials of long-acting beta-agonists (Chowdhury and Dal Pan, 2010).
Noninferiority studies are particularly problematic for evaluating safety endpoints (Fleming, 2008; Kaul and Diamond, 2006, 2007). Low-quality study conduct, such as poor compliance with treatment regimens, usually biases a superiority trial toward a finding of “no difference” between treatments—a conservative bias for efficacy studies (Temple and Ellenberg, 2000). In contrast, the bias in safety studies evaluating noninferiority among treatments is anticonservative: a more dangerous drug could be incorrectly deemed “equally safe” or “equally effective” relative to its comparator. Furthermore, choosing a large “noninferiority margin” can result in a finding that two treatments are “equally safe” even if their risks are substantially different (Fleming, 2008).
The most critical shortcoming of the noninferiority trial for safety is its fundamental logic. In the efficacy realm, a “noninferiority” verdict can imply some degree of efficacy vs placebo because the observed effect is clearly within the efficacy margin. But in the safety realm, there is no such margin, and the logic is different. A “noninferiority” verdict connotes that the degree of possible inferiority in safety, which by itself might not be acceptable, is not so great as to outweigh the drug’s benefit in some other domain, such as convenience or tolerability. It therefore has embedded within it an implicit benefit–risk calculus. The noninferiority margin encodes the degree of extra risk that is considered acceptable for the drug’s purported benefits. Whether the drug actually has such benefits or whether that degree of risk increase is indeed what should be deemed acceptable may not have been directly addressed in setting the noninferiority margin. Setting that margin is a process that is best conducted by individuals without conflict of interest and who have the requisite scientific and regulatory expertise; noninferiority trials can take decision-making out of the hands of regulators and embed the assessment of the benefit–risk balance within the logic and mathematics
of the noninferiority analysis. That approach is undesirable in that sound public policy requires that regulators make explicit and transparent assessments of the acceptability of a drug’s benefit–risk balance. The problem was noted with respect to the RECORD trial of rosiglitazone; it was stated in the FDA briefing document that “the non-inferiority design, with a clinically excessive margin of 20 percent also contributes to masking rosiglitazone risk” (FDA, 2010c). In other words, the noninferiority verdict can encode unacceptable benefit–risk tradeoffs in the statistical verdict and the noninferiority margin.
The solution to this problem is twofold. First, all noninferiority trials must pay close attention to both design and conduct to ensure that they do not bias results toward an equivalence, or noninferiority, finding. Noninferiority margins should be established or reviewed by non-conflicted groups with regulatory, ethics, and scientific expertise. Second, and potentially more important, the binary “noninferiority” verdict of such trials should not dominate the regulatory decision-making process. Rather, the estimated difference between the treatments being compared, together with its uncertainty, should be taken as the relevant result, and regulatory decisions should be based directly on it. If the trials are combined in a meta-analysis, that is the information that should be used. This is also a domain in which Bayesian approaches can be helpful; they can be used to calculate the probability that either the risk or the benefit–risk margin is within an acceptable range (Kaul and Diamond, 2006).
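A Bayesian calculation of the kind suggested by Kaul and Diamond can be sketched with a large-sample normal approximation on the log hazard ratio scale and a flat prior. The estimate, confidence interval, and candidate margins below are hypothetical:

```python
import math

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def prob_within_margin(hr_hat, ci95, margin, z=1.96):
    """Approximate posterior probability that the true hazard ratio lies
    below `margin`, using a normal model on the log scale with the standard
    error recovered from the reported 95% CI and an (implicit) flat prior."""
    log_hr = math.log(hr_hat)
    se = (math.log(ci95[1]) - math.log(ci95[0])) / (2 * z)
    return normal_cdf((math.log(margin) - log_hr) / se)

# Hypothetical safety result: estimated HR 1.10 (95% CI 0.95-1.27).
for margin in (1.0, 1.2, 1.3):
    print(f"P(true HR < {margin}): {prob_within_margin(1.10, (0.95, 1.27), margin):.2f}")
```

Presenting the probability that the excess risk lies below each of several candidate margins, rather than a single pass/fail noninferiority verdict, keeps the benefit–risk tradeoff visible to the regulator instead of burying it in the choice of margin.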
Different Criteria for Weighing or Synthesizing Evidence Among Studies
Meta-analysis is a method of combining the results from various RCTs or observational studies. Meta-analysis synthesizes information quantitatively and provides an opportunity to evaluate the consistency of findings among studies. Heterogeneity among RCTs can also be quantified, and its sources can be evaluated and sometimes identified (Thompson and Sharp, 1999). The method of meta-analysis is an observational study design, and the units of analysis are the studies or RCTs included in the meta-analysis. Key features of high-quality meta-analyses resemble those of other observational studies and include prespecified hypotheses, entry criteria, sampling frames, data collection, and high-quality measures of exposures and outcomes. The appropriate methods of analysis require that within-trial comparisons be preserved and that, when estimates based on different trials are combined, each be weighted by its precision, that is, more precise estimates are given larger weight.
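Precision weighting can be sketched as follows: in a fixed-effect (inverse-variance) meta-analysis, each trial’s log odds ratio is weighted by the inverse of its variance, so more precise estimates dominate the pooled result. The trial estimates below are hypothetical:

```python
import math

def fixed_effect_meta(estimates):
    """Fixed-effect (inverse-variance) pooling of trial-level estimates.

    `estimates` is a list of (log odds ratio, standard error) pairs, one
    per trial; within-trial comparisons are preserved and each estimate is
    weighted by its precision, 1 / se**2."""
    weights = [1.0 / se**2 for _, se in estimates]
    pooled = sum(w * lor for (lor, _), w in zip(estimates, weights)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se

# Hypothetical trials reported as (log OR, SE).
trials = [(0.30, 0.20), (0.10, 0.10), (0.25, 0.15)]
pooled, se = fixed_effect_meta(trials)
lo, hi = pooled - 1.96 * se, pooled + 1.96 * se
print(f"pooled OR: {math.exp(pooled):.2f} "
      f"(95% CI {math.exp(lo):.2f}-{math.exp(hi):.2f})")
```

A random-effects model, which adds a between-trial variance component, would be more appropriate when heterogeneity among trials is substantial; the fixed-effect sketch here shows only the precision-weighting principle.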
The data from meta-analyses can be incorporated into analyses that characterize the overall benefit–risk profile of a drug. Meta-analyses can also be used to identify and validate the possibility that one group may respond to a medication differently from another group. If groups do differ in their response in important ways, the benefit–risk profiles in the groups can be estimated separately as well.
Traditionally, many meta-analyses have used published study results. The
published studies of an intervention may have included a variety of populations; the intervention may vary among studies; and, even though the outcomes may have been similar, the definitions of endpoints and elements used in a composite primary outcome may have varied from one study to another. The potential sources of heterogeneity include not only the intervention (timing, drug, dose, and duration) and the outcome (timing, type, methods of ascertainment, and validation) but also study quality (concealment of randomization, crossovers, noncompliance, and blinding), patients (severity of illness, age, sex, ethnicity, and setting), and the presence of cointerventions. How different is “too different” to combine is ultimately a matter of scientific judgment, and one for which the reasons and supporting data must be provided. In Chapter 2, suggestions are made for how FDA can play a role in minimizing this heterogeneity to facilitate valid evidence synthesis.
Meta-analyses of RCTs for safety outcomes present difficulties if the studies originally focused on efficacy outcomes. It has long been recognized that the reporting of harms in RCTs is poor (Ioannidis and Lau, 2001), and the Consolidated Standards of Reporting Trials (CONSORT, http://www.consort-statement.org) has been expanded to facilitate their proper reporting (Ioannidis et al., 2004). Problems that afflict the primary reporting of risk outcomes inevitably affect meta-analyses. That those problems continue and are encountered by FDA was documented in a 2011 report by FDA scientists that outlined the challenges of using meta-analysis to study drug risk (Hammad et al., 2011). The problems included:
• High and differential patient dropout.
• Unblinded studies or failure of blinding.
• Inconsistent definitions and selective gathering or reporting of adverse events.
• Failure to document compliance and to measure actual drug exposure.
• Followup too short to detect important adverse events.
• Populations too homogeneous to identify important adverse events or interactions.
• Publication and reporting bias.
• Qualitative or quantitative heterogeneity.
• Incomplete and biased reporting of group results.
• Combining studies of drug “classes”, obscuring critical within-class differences.
• Relevance of unpublished data (particularly relevant to FDA), such as discordance between data accessible by FDA and other information published on the same studies.
• Effects of use of different statistical models on results, particularly if data are sparse, as they often are in the case of uncommon safety outcomes.
The list above indicates that meta-analysis of safety outcomes, even of outcomes from RCTs, is generally less reliable than meta-analysis of efficacy outcomes. That view was reflected in the filings of FDA in relation to rosiglitazone. Jenkins, director of OND, stated in a memorandum (2010) that
in weighing the available data for rosiglitazone the primary signals of concern arise from meta-analyses of controlled clinical trials that were not designed to rigorously collect CV outcome data and observational studies. Data from these sources provided risk estimates of a magnitude that fall well short of what has traditionally been considered a level that would support scientific and regulatory inferences, even in the face of nominal statistical significance.
Meta-analyses of observational studies have a somewhat different but no less important suite of problems. If the observational studies are designed to address risk, meta-analyses may improve the capacity to identify and characterize the potential harms associated with a drug (Golder et al., 2011). But if there is substantial risk of confounding, meta-analysis will not eliminate bias by pooling results. The potential of each study for confounding must be evaluated before it is included in a meta-analysis. Finally, both publication bias and reporting bias in observational studies can be severe, particularly if they were not designed to capture a specific adverse event (Chan et al., 2004).
Meta-analyses for drug safety can provoke intense disagreement among experts because of uncertainty about the completeness and quality of reporting (of either the risks or the studies themselves) and because of the many judgments that need to be made about whether to combine studies, which studies to combine, and how to account for the many sources of heterogeneity and bias that can affect individual and collective trial estimates. Meta-analysis for safety is less straightforward than meta-analysis for efficacy, and meta-analyses conducted by different investigators on the same drug-safety question may reach different conclusions.
To set the stage for high-quality future evaluations of a drug’s benefits and risks, FDA could lay the groundwork for future meta-analyses performed by themselves or others. While meta-analysis is often a retrospective effort to combine evidence already gathered, steps can be taken before studies are done to facilitate meaningful data synthesis later, avoiding some of the problems of meta-analyses noted earlier. FDA is well positioned to take these steps to improve the reliability of meta-analyses for risk outcomes in the postmarketing context. For risk outcomes of concern identified in the premarketing phase or as part of the postmarketing lifecycle review of drug safety, FDA can include in the benefit and risk assessment management plan key design characteristics to raise the quality, completeness, and consistency of adverse-event gathering and reporting.
The core element of the approach described above is a prospective plan for conducting meta-analyses related to key questions of benefits and harms. A “prospective meta-analysis” of RCTs is designed with consistent approaches to
defining, capturing, and reporting adverse events to ensure the validity of later meta-analyses of the trials. Prospective meta-analyses have been designed and published (for example, Baigent et al., 2005; Psaty et al., 2009; Reade et al., 2010), and the method continues to be refined (PMA [Cochrane Prospective Meta-Analysis Methods Group], 2010). There are FDA precedents for this; plans have been described to combine data from the noninferiority safety studies required for the long-acting beta-agonists used to treat asthma (Chowdhury et al., 2011). On the basis of its rosiglitazone experience, FDA revised its guidance for the approval of diabetes medications, which now includes a requirement for prospective meta-analysis (FDA, 2008).
Prospectively planned meta-analysis can reduce the heterogeneity among studies. In the observational setting of genome-wide association studies (Psaty et al., 2009), many consortia develop prospective analytic plans; work to harmonize the outcomes, exposures, and covariates; and use meta-analysis to combine association results from many studies. The coordinated, prospectively planned meta-analyses of the genetics consortia have provided results that are as efficient as and virtually identical with those of a cohort-adjusted pooled analysis of individual-patient data (Lin and Zeng, 2010).
FDA can play a similar role in facilitating the performance of meta-analyses that use individual patient data (IPD), potentially greatly enhancing the value of this kind of evidence synthesis. IPD meta-analysis is a form of data pooling in which the analyst has access to the original data of each study rather than merely the summary effect estimates, and can therefore adjust for covariates, investigate subgroup effects with stronger confounding control than study-level summaries permit, and adjust better for design differences between studies (Cooper and Patall, 2009; Fisher et al., 2011; Jones et al., 2009; Kufner et al., 2011). IPD meta-analyses are superior to retrospective meta-analyses that use the published results of the same studies, but it is often difficult to gain access to IPD for all studies. Even when IPD can be obtained, their value is low if key variables are not defined or coded similarly among the studies. The most successful IPD meta-analyses are planned prospectively, with researchers agreeing a priori on data definitions, data standards, and data-sharing—a process that has also been called collaborative meta-analysis (Darby et al., 2011; Davies et al., 2011). For drugs identified before marketing as requiring special postmarketing scrutiny because of concerns about the benefit–risk profile, an alternative to sponsoring a single large postmarketing safety study would be for FDA, as part of the approval process or shortly after drug approval, to convene a meeting of researchers in the relevant field to agree on standardized definitions of outcomes and key variables, data standards, and agreements and procedures for data-sharing, with the aim of making postmarketing IPD meta-analyses of benefit and risk possible and maximally informative. That would also diminish the selective reporting and publication that have demonstrably impaired the quality of some meta-analyses (Turner et
al., 2008); 26 of the 42 studies that Nissen and Wolski (2007) used in their meta-analysis of rosiglitazone risk were unpublished at the time.
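Many of the judgment calls that distinguish one meta-analysis from another, such as how to handle trials with zero or near-zero event counts and whether to use fixed- or random-effects pooling, can be made concrete in a short computation. The following sketch is purely illustrative (the trial counts are hypothetical, not drawn from any study discussed here); it pools log odds ratios by inverse-variance weighting and estimates between-trial heterogeneity with the DerSimonian–Laird method:

```python
import math

# Hypothetical adverse-event counts per trial:
# (events in drug arm, N drug, events in comparator arm, N comparator)
trials = [(4, 500, 1, 500), (7, 800, 3, 790), (2, 300, 2, 310)]

def log_odds_ratio(a, n1, c, n0):
    # A 0.5 continuity correction guards against zero cells,
    # which are common in sparse safety data
    a, b = a + 0.5, n1 - a + 0.5
    c, d = c + 0.5, n0 - c + 0.5
    est = math.log(a * d / (b * c))
    var = 1 / a + 1 / b + 1 / c + 1 / d
    return est, var

ests, variances = zip(*(log_odds_ratio(*t) for t in trials))

# Fixed-effect pooling: inverse-variance weighted average
w = [1 / v for v in variances]
fixed = sum(wi * e for wi, e in zip(w, ests)) / sum(w)

# DerSimonian-Laird estimate of between-trial variance tau^2
q = sum(wi * (e - fixed) ** 2 for wi, e in zip(w, ests))
tau2 = max(0.0, (q - (len(trials) - 1)) /
           (sum(w) - sum(wi ** 2 for wi in w) / sum(w)))

# Random-effects pooled log odds ratio and 95% confidence interval
wr = [1 / (v + tau2) for v in variances]
re = sum(wi * e for wi, e in zip(wr, ests)) / sum(wr)
se = math.sqrt(1 / sum(wr))
ci = (math.exp(re - 1.96 * se), math.exp(re + 1.96 * se))
print(f"pooled OR {math.exp(re):.2f}, "
      f"95% CI {ci[0]:.2f}-{ci[1]:.2f}, tau^2 {tau2:.3f}")
```

Changing any of these choices—the continuity correction, the exclusion of a single trial, or the pooling model—can move the pooled estimate and its interval, which is precisely why two competent analysts can reach different conclusions from the same trials.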
Different Thresholds for Regulatory Action
Even if scientists are not the final policy makers, they often have opinions about what regulatory decision should be made. This can shade their assessment of the strength of the evidence, particularly when that assessment involves many qualitative judgments about adequacy of confounding control, relevance of differences among studies, and the like. Even if they assess the evidence similarly, differences in their recommendations may be due to differences in their views about what level of certainty, measured in a Bayesian fashion, is sufficient for various decisions.
Standard approaches to statistical inference do not provide tools for assessing intermediate levels of certainty (for example, 70 percent certain), although steps to prevent drug harm may be justified even when only moderate certainty about that harm exists, depending on the degree of the drug benefit and magnitude and seriousness of the harm. Statistical significance, therefore, is not always sufficiently nuanced for such policy decisions.
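A Bayesian posterior probability expresses precisely such an intermediate level of certainty. The minimal sketch below uses entirely hypothetical numbers: a conjugate normal update of a skeptical prior on a log hazard ratio, yielding the probability that the drug increases risk.

```python
import math

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Hypothetical trial summary: estimated log hazard ratio and standard error
est, se = math.log(1.4), 0.20

# Skeptical normal prior centered on "no effect" (log HR = 0)
prior_mean, prior_sd = 0.0, 0.30

# Conjugate normal update: precision-weighted average of prior and data
post_prec = 1.0 / prior_sd ** 2 + 1.0 / se ** 2
post_mean = (prior_mean / prior_sd ** 2 + est / se ** 2) / post_prec
post_sd = math.sqrt(1.0 / post_prec)

# Posterior probability that the drug increases risk (log HR > 0)
p_harm = 1.0 - normal_cdf((0.0 - post_mean) / post_sd)
print(f"P(harm) = {p_harm:.2f}")  # about 0.92 with these hypothetical inputs
```

A regulator could then weigh whether, say, 92 percent certainty of harm justifies a labeling change even if it would not justify withdrawal, a distinction that a bare significance test cannot express.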
It is also important to note that in the postmarketing setting FDA has a number of regulatory options. As discussed in Chapter 2, the options might require differing weights of scientific evidence and therefore differing degrees of certainty. For example, all other things being equal, a higher standard of evidence is required for withdrawal of a drug than for a labeling change. Differing opinions about what the regulatory outcome should be are evident in the testimony of scientists in the rosiglitazone case. The following series of exchanges from the Avandia memos and hearings reflects such differences.
Graham and Gelperin, in the FDA Office of Surveillance and Epidemiology, in their presentation to a July 13–14, 2010, joint meeting of FDA’s Endocrinologic and Metabolic Drugs Advisory Committee and Drug Safety and Risk Management Advisory Committee on rosiglitazone (Graham and Gelperin, 2010b), noted that
• The cost of a wrong decision is not symmetric.
o If rosiglitazone increases cardiovascular risk, [a] wrong decision will cost thousands of lives.
o If rosiglitazone doesn’t increase cardiovascular risk, [a] wrong decision causes no real patient harm.
Parks (2010), director of the Division of Metabolism and Endocrinology Products, stated in a memorandum:
Although I have argued … that each of the data sources does not provide sufficient evidence for me to conclude risks outweighing benefits for rosiglitazone
to recommend its withdrawal, I believe the data sources meet the regulatory requirements to modify safety labeling for this drug.
Parks (2010) further stated:
Some might ask why I don’t just recommend the drug’s withdrawal given that the safety signal is sufficient enough to justify its relegation to second-line or even last-option therapy. After all, withdrawal would effectively eliminate any chances for the drug to continue to do harm. While I cannot dispute that fact, I believe withdrawal of rosiglitazone in the setting of scientific uncertainty is an inappropriate display of FDA’s authority to make a decision for all healthcare providers because of concern that these trained professionals can not reasonably decide on or take responsibility for the use of this drug. I am also concerned that such an action would set an unsettling precedent for future regulatory decisions or may be referenced in legal challenges to the FDA to withdraw other drugs based on meta-analyses and observational studies of similar uncertainty for drug risk.
Jenkins (2010), in OND, stated in a memorandum:
In my view the available data for ischemic CV risk of rosiglitazone, while concerning, do not rise to the level that would support a regulatory conclusion that the benefits of the drug as a treatment for Type 2 diabetes no longer outweigh its risks, which is the statutory finding FDA must reach to withdraw approval of a drug. Such decisions as this require a careful balance between placing the threshold for action too high or too low. If the threshold for action is placed too high there is greater protection against actions based on false positive results, but there is also a greater risk that patients will be subjected to undue harm by continued availability of a harmful drug. On the other hand, if the threshold for action is placed too low there is a greater chance of actions based on false positive results with the unintended consequence that physicians and patients do not have access to a safe and effective drug.
One aspect of those statements that is both interesting and admirable is the reasonably clear separation among the perceived strength of the evidence, the degree of attendant uncertainty, and the threshold for regulatory action. That is often not the case; an action may be portrayed as an inevitable consequence of a particular analytic result (such as statistical significance) and thereby produce pressure to distort the evidential base itself. What is absent here, however, is a formal quantification of the uncertainty alluded to.
The preceding section discussed eight broad reasons why scientists can look at the same data and disagree about the credibility of a conclusion that a drug
is beneficial or harmful (see Box 3-1). There are few normative guidelines for the many issues raised in this chapter; all have to be judged in context. If the underlying reasons for disagreements are not properly expressed or elicited, however, it will be difficult to reach consensus on the appropriate regulatory action. Quite often, a debate about one issue (such as what harm endpoint is appropriate to consider) transmutes into a debate about another (such as whether the relationships are statistically significant or what statistical model to use). To permit informed and productive discussion of potential regulatory actions or design choices, the nature of the scientific differences must be identified and explicitly stated. Scientists’ views on the underlying questions should be made explicit and documented, both to identify the sources of disagreement and work toward resolving them and to give decision-makers the context of the opinions. Clear answers to such questions should also be made available to all stakeholders to facilitate understanding of the sources of potential disagreements and the rationale behind a decision.
Those considerations often are unarticulated and are expressed in the form of disagreements about factors far afield from the actual differences. That lack of clarity makes it extraordinarily difficult for the involved scientists and decision-makers to understand the reasons for the disagreements, adjudicate them, and make decisions. Understanding the root causes of scientific disagreement about the harms of a drug is one of the most difficult and important tasks facing a decision-maker, but it is a necessary precondition for proper regulatory decisions. The three-stage decision-making process and the Benefit and Risk Assessment Management Plan (BRAMP) document recommended by the committee in Chapter 2 provide FDA with a formal mechanism for ensuring that scientists’ views and reasoning are elicited and made publicly available.
In addition to direct elicitation of the reasons for disagreements, which were well outlined in the rosiglitazone case, adherence to principles of reproducible research—an emerging set of standards or principles for presentation of complex and scientific findings—would be of substantial help to FDA in enforcing a transparency standard for all results on which regulatory decisions will be made. Principles of reproducible research have been outlined for epidemiologic research (Peng et al., 2006), clinical research (Laine et al., 2007), and molecular biology (Baggerly, 2010; Carey and Stodden, 2010), and are increasingly embraced as standards to facilitate the post-publication peer review of all biomedical research.
In the ideal reproducible research, analyses are presented in such a way that the reader of results can understand most of or all the process that occurred from the gathering of the data to the reporting of specific analyses. At a minimum, that requires provision of study protocols with statistical-analysis plans,
statistical code, and information about how decisions were made to produce the analytic dataset from the raw measured data. Optimally, it involves some form of data-sharing. Such data-sharing permitted the reanalysis of the RECORD trial that was presented to FDA in the rosiglitazone case. The review revealed numerous discrepancies and judgment calls in the original study—from the definition of a clinical event to the choice of analytic method—and those discrepancies and judgments affected the weight that the results were given in the regulatory decision-making process. For critical research that is to be the basis of regulatory decisions, whether primary studies like RECORD or meta-analyses, FDA should develop internal standards for adherence to reproducible-research principles so that the basis of the many judgments can be examined and adjudicated by scientists and regulators when disputes over data interpretation and its implications arise.
Going a step beyond reproducibility, FDA is well positioned to help ensure the accurate public reporting of risk information submitted to it as part of the premarketing approval process. Such information is often, but not always, published after approval and included in postmarketing safety assessments. FDA scientists themselves have identified the discordance between published data and data submitted to FDA as a problem for the validity of postmarketing safety meta-analyses (Hammad et al., 2011), and there are numerous examples of underreporting or delayed reporting of harms that had previously been reported to regulatory authorities (for example, Carragee et al., 2011; Lee et al., 2008; Melander et al., 2003; Vedula et al., 2009). FDAAA addressed this problem by requiring that all clinical trials submitted for new drug approval or for new labeling be registered at inception at ClinicalTrials.gov and that the summary results of all prespecified outcomes be posted within one year of drug approval for new drugs or three years for new indications (Miller, 2010; Wood, 2009). However, recently reported evidence has shown that compliance with this aspect of FDAAA has been low (Law et al., 2011). In addition, FDA policy on the reporting of studies submitted for drugs that are not approved has not been settled (Miller, 2010). Finally, publishing summary results is not equivalent to sharing primary data, which allows reanalyses. New approaches are needed to facilitate the publication of safety data submitted to FDA for approved drugs and to find ways to release similar data for drugs that are not approved but whose information might be extremely valuable for the interpretation of safety information on approved drugs in the same class.
Some of FDA’s most difficult decisions are those in which experts disagree about how compelling the evidence that informs the public health question is. Understanding the nature and sources of those disagreements and their implications for
FDA’s decisions is key to improving the agency’s decision-making process. For example, experts can disagree about the plausibility of a new risk (or decreased benefit) on the basis of different assessments of prior evidence, the quality of new data, the adequacy of confounding control in the relevant studies, the transportability of results, the appropriateness of the statistical analysis, the relevance of the new evidence to the public health question, how the evidence should be weighed and synthesized, or the threshold for regulatory actions.
FDA should use the framework for decision-making proposed in Recommendation 2.1 to ensure a thorough discussion and clear understanding of the sources of disagreement about the available evidence among all participants in the regulatory decision-making process. In the interest of transparency, FDA should use the BRAMP document proposed in Recommendation 2.2 to ensure that such disagreements and how they were resolved are documented and made public.
Such methods as Bayesian analyses or other approaches to integrating external relevant information with newly emerging information could provide decision-makers with useful quantitative assessments of evidence. An example would be sensitivity analyses of clinical-trial data that illustrate the influence of prior probabilities on estimates of probabilities that an intervention has unacceptable safety risks. These approaches can inform judgments, allow more rational decision-making, and permit input from multiple stakeholders and experts.
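A sensitivity analysis of the kind described can be sketched in a few lines. All numbers below are hypothetical; the sweep shows how the posterior probability that a drug carries excess risk shifts as the prior on the log relative risk moves from skeptical to vague.

```python
import math

def posterior_p_harm(est, se, prior_mean, prior_sd):
    # Conjugate normal update on the log relative risk
    prec = 1.0 / prior_sd ** 2 + 1.0 / se ** 2
    mean = (prior_mean / prior_sd ** 2 + est / se ** 2) / prec
    sd = math.sqrt(1.0 / prec)
    z = (0.0 - mean) / sd
    # P(log relative risk > 0), i.e., probability of excess risk
    return 1.0 - 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical trial result: log relative risk 0.30, standard error 0.15
est, se = 0.30, 0.15

# Sweep the prior from skeptical (tight around no effect) to vague
probs = []
for prior_sd in (0.1, 0.3, 1.0):
    p = posterior_p_harm(est, se, prior_mean=0.0, prior_sd=prior_sd)
    probs.append(p)
    print(f"prior sd {prior_sd}: P(excess risk) = {p:.2f}")
```

Presenting such a sweep alongside the primary analysis shows decision-makers and stakeholders how much a conclusion depends on assumptions about prior evidence, rather than leaving those assumptions implicit.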
FDA should ensure that it has adequate expertise in Bayesian approaches, in combination with expertise in relevant frequentist and causal inference methods, to assess the probability that observed associations reflect actual causal effects, to incorporate multiple sources of uncertainty into the decision-making process, and to evaluate the sensitivity of those conclusions to different representations of external evidence. To facilitate the use of Bayesian approaches, FDA should develop a guidance document for the use of Bayesian methods for assessing a drug’s benefits, risks, and benefit–risk profile.
Traditionally, the main criteria for evaluating a study are ones that contribute to its internal validity. A well-conducted RCT typically has higher internal validity than a well-conducted observational study. Results of observational studies, however, can have greater transportability if their participants are more similar
to the target clinical population than to the participants in a clinical trial. In some circumstances, such as an evaluation of the association between a drug and an uncommon unexpected adverse event, observational studies may produce estimates closer to the actual risk in the general population than can be achieved in clinical trials. In assessing the relevance of study findings to a public health question, the transportability of the study results is as important as the determinants of its internal validity.
In assessing the benefits and risks associated with a drug in the postmarketing context, FDA should develop guidance and review processes that ensure that observational studies with high internal validity are given appropriate weight in the evaluation of drug harms and that transportability is given emphasis similar to that given bias and other errors in assessing the weight of evidence that a study provides to inform a public health question.
The principles of reproducible research are important for ensuring the integrity of postmarketing research used by FDA. Those principles include providing information on the provenance of data (from measurement to analytic dataset) and, when possible, making available properly annotated analytic datasets, study protocols (including statistical analysis plan) and their amendments, and statistical codes.
All analyses, whether conducted independently of FDA or by FDA staff, whose results are relied on for postmarketing regulatory decisions should use the principles of reproducible research when possible, subject to legal constraints. To that end, FDA should present data and analyses in a fashion that allows independent analysts either to reproduce the findings or to understand how FDA generated the results in sufficient detail to understand the strengths, weaknesses, and assumptions of the relevant analyses.
The ability of researchers in and outside FDA to analyze new information about the benefits and risks associated with a marketed drug and to design appropriate postmarketing research—including conducting individual-patient meta-analyses—is enhanced by access to data and analyses from all studies of the drug and others in the same drug class that were reported in the preapproval process. Although disclosure of such information is likely to advance the public’s health, such disclosures raise concerns about the privacy of participants in the research
that generated the information and may threaten industry interest in maintaining proprietary information, which is deemed important for innovation. New approaches to resolving this tension are needed.
FDA should establish and coordinate a working group, including industry and patient and consumer representatives, to find ways that appropriately balance public health, privacy, and proprietary interests to facilitate disclosure of data for trials and studies relevant to postmarketing research decisions.
The elements of the benefit–risk profile of a drug are best estimated by using all the available high-quality data, and meta-analysis is a useful tool for summarizing such data and evaluating heterogeneity. However, because the reporting of harms in published RCTs and observational studies is often poor or inconsistent and because there is often substantial publication bias in studies of drug risk, steps are needed to improve both the reporting of harms and the design of studies of harm. That can be done through prospective planning for selected meta-analyses and by monitoring compliance with the FDAAA requirement that summary trial results for all primary and secondary outcomes be published at ClinicalTrials.gov.
For drugs that are likely to have required postmarketing observational studies or trials, FDA should use the BRAMP to specify potential public health questions of interest as early as possible; should prospectively recommend standards for uniform definition of key variables and complete ascertainment of events among studies or convene researchers in the field to suggest such standards and promote data-sharing; should prospectively plan meta-analyses of the data with reference to specified exposures, outcomes, comparators, and covariates; should conduct the meta-analyses of the data; and should make appropriate regulatory decisions in a timely fashion. FDA can also improve the validity of meta-analyses by monitoring and encouraging compliance with FDAAA requirements for reporting to ClinicalTrials.gov.
FDA produced a high-quality guidance document on the use of the noninferiority design for the study of efficacy. Increasingly, FDA is using the noninferiority design to evaluate drug-safety endpoints as the primary outcomes in randomized trials. The use of noninferiority analyses to establish the acceptability of the benefit–risk profile of a drug can take the decision about how to balance the risks and benefits of two drugs out of the hands of regulators. Noninferiority trials also
have the disadvantage of being biased toward equivalence when trial design or conduct is suboptimal; this is of particular concern when such trials are used to estimate risks.
FDA should develop a guidance document on the design and conduct of noninferiority postmarketing trials for the study of safety of a drug. The guidance should include discussion of criteria for choosing the standard therapy to be used in the active-treatment control arm; of methods for selecting a noninferiority margin in safety trials and ensuring high-quality trial conduct; of the optimal analytic methods, including Bayesian approaches; and of the interpretation of the findings in terms of the drug’s benefit–risk profile.
FDA should closely scrutinize the design and conduct of any noninferiority safety studies for aspects that may inappropriately make the arms appear similar. FDA should use the observed-effect estimate and confidence interval as a basis for decision-making, not the binary noninferiority verdict.
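The difference between the binary noninferiority verdict and the full estimate can be illustrated with a short sketch (all numbers hypothetical):

```python
import math

# Hypothetical noninferiority safety trial: hazard-ratio estimate,
# its standard error on the log scale, and a prespecified margin
log_hr, se = math.log(1.15), 0.10
margin = 1.3  # noninferiority margin on the hazard-ratio scale

lo = math.exp(log_hr - 1.96 * se)
hi = math.exp(log_hr + 1.96 * se)

# The binary verdict: noninferior only if the CI upper bound is below the margin
noninferior = hi < margin

# The more informative summary is the estimate and interval themselves
print(f"HR {math.exp(log_hr):.2f} (95% CI {lo:.2f}-{hi:.2f}); "
      f"noninferior at margin {margin}: {noninferior}")
```

Here the point estimate suggests only a modest excess risk, but the interval is wide enough to cross the margin; reporting the estimate and interval, rather than the verdict alone, conveys both facts to decision-makers.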
AHRQ (Agency for Healthcare Research and Quality). 2008. U.S. Preventive Services Task Force procedure manual. Washington, DC: Department of Health and Human Services.
Baggerly, K. 2010. Disclose all data in publications. Nature 467(7314):401.
Baigent, C., A. Keech, P. M. Kearney, and L. Blackwell. 2005. Efficacy and safety of cholesterol-lowering treatment: Prospective meta-analysis of data from 90,056 participants in 14 randomised trials of statins. Lancet 366(9493):1267-1278.
Barton, M. B., T. Miller, T. Wolff, D. Petitti, M. LeFevre, G. Sawaya, B. Yawn, J. Guirguis-Blake, N. Calonge, R. Harris, and U.S. Preventive Services Task Force. 2007. How to read the new recommendation statement: Methods update from the U.S. Preventive Services Task Force. Annals of Internal Medicine 147(2):123-127.
Becker, M. C., T. H. Wang, L. Wisniewski, K. Wolski, P. Libby, T. F. Lüscher, J. S. Borer, A. M. Mascette, M. E. Husni, D. H. Solomon, D. Y. Graham, N. D. Yeomans, H. Krum, F. Ruschitzka, A. M. Lincoff, and S. E. Nissen. 2009. Rationale, design, and governance of Prospective Randomized Evaluation of Celecoxib Integrated Safety versus Ibuprofen or Naproxen (PRECISION), a cardiovascular end point trial of nonsteroidal antiinflammatory agents in patients with arthritis. American Heart Journal 157(4):606-612.
Bent, S., A. Padula, and A. L. Avins. 2006. Brief communication: Better ways to question patients about adverse medical events: A randomized, controlled trial. Annals of Internal Medicine 144(4):257-261.
Berry, D. A. 2006. Bayesian clinical trials. Nature Reviews Drug Discovery 5(1):27-36.
Berry, D. A., M. C. Wolff, and D. Sack. 1992. Public health decision making: A sequential vaccine trial. In Bayesian statistics, edited by J. Bernardo, J. Berger, A. Dawid and A. Smith. Oxford, UK: Oxford University Press. Pp. 79-96.
Camm, A. J., A. Capucci, S. H. Hohnloser, C. Torp-Pedersen, I. C. Van Gelder, B. Mangal, and G. Beatch. 2011. A randomized active-controlled study comparing the efficacy and safety of vernakalant to amiodarone in recent-onset atrial fibrillation. Journal of the American College of Cardiology 57(3):313-321.
Campbell, G. 2011. Bayesian statistics in medical devices: Innovation sparked by the FDA. Journal of Biopharmaceutical Statistics 21(5):871-887.
Carey, V. J., and V. Stodden. 2010. Reproducible research concepts and tools for cancer bioinformatics. In Biomedical informatics for cancer research, edited by M. F. Ochs, J. T. Casagrande and R. V. Davuluri. Springer US. Pp. 149-175.
Carpenter, D. 2010. Reputation and power institutionalized: Scientific networks, congressional hearings, and judicial affirmation, 1963-1986. In Reputation and power: Organizational image and pharmaceutical regulation at the FDA. Princeton, NJ: Princeton University Press. Pp. 298-392.
Carragee, E. J., E. L. Hurwitz, and B. K. Weiner. 2011. A critical review of recombinant human bone morphogenetic protein-2 trials in spinal surgery: Emerging safety concerns and lessons learned. Spine Journal 11(6):471-491.
Chaloner, K. 1996. Elicitation of prior distributions. In Bayesian biostatistics, edited by D. A. Berry and D. K. Stangl. New York: Marcel Dekker.
Chan, A.-W., A. Hróbjartsson, M. T. Haahr, P. C. Gøtzsche, and D. G. Altman. 2004. Empirical evidence for selective reporting of outcomes in randomized trials. JAMA 291(20):2457-2465.
Chowdhury, B. A., and G. Dal Pan. 2010. The FDA and safe use of long-acting beta-agonists in the treatment of asthma. New England Journal of Medicine 362(13):1169-1171.
Chowdhury, B. A., S. M. Seymour, and M. S. Levenson. 2011. Assessing the safety of adding LABAs to inhaled corticosteroids for treating asthma. New England Journal of Medicine 364(26):2473-2475.
Claxton, K., J. T. Cohen, and P. J. Neumann. 2005. When is evidence sufficient? Health Affairs 24(1):93-101.
Cooper, H., and E. A. Patall. 2009. The relative benefits of meta-analysis conducted with individual participant data versus aggregated data. Psychological Methods 14(2):165-176.
Dal Pan, G. J. 2010. Memorandum from Gerald Dal Pan to Janet Woodcock (dated September 12, 2010). Re: Recommendations for regulatory action for rosiglitazone and rosiglitazone-containing products (NDA 21-071, supplement 035, incoming submission dated August 25, 2009). Washington, DC: Department of Health and Human Services.
Darby, S., P. McGale, C. Correa, C. Taylor, R. Arriagada, M. Clarke, D. Cutter, C. Davies, M. Ewertz, J. Godwin, R. Gray, L. Pierce, T. Whelan, Y. Wang, and R. Peto. 2011. Effect of radiotherapy after breast-conserving surgery on 10-year recurrence and 15-year breast cancer death: Meta-analysis of individual patient data for 10,801 women in 17 randomised trials. Lancet 378(9804):1707-1716.
Davies, C., J. Godwin, R. Gray, M. Clarke, D. Cutter, S. Darby, P. McGale, H. C. Pan, C. Taylor, Y. C. Wang, M. Dowsett, J. Ingle, and R. Peto. 2011. Relevance of breast cancer hormone receptors and other factors to the efficacy of adjuvant tamoxifen: Patient-level meta-analysis of randomised trials. Lancet 378(9793):771-784.
Emerson, S. S., J. M. Kittelson, and D. L. Gillen. 2007. Bayesian evaluation of group sequential clinical trial designs. Statistics in Medicine 26(7):1431-1449.
Eraker, S. A., J. P. Kirscht, and M. H. Becker. 1984. Understanding and improving patient compliance. Annals of Internal Medicine 100(2):258.
Christensen, E. 2007. Methodology of superiority vs. equivalence trials and non-inferiority trials. Journal of Hepatology 46(5):947-954.
Etzioni, R. D., and J. B. Kadane. 1995. Bayesian statistical methods in public health and medicine. Annual Review of Public Health 16(1):23-41.
FDA (US Food and Drug Administration). 2008. Guidance for industry. Diabetes mellitus—evaluating cardiovascular risk in new antidiabetic therapies to treat type 2 diabetes. Washington, DC: Department of Health and Human Services.
FDA. 2010a. Guidance for industry and FDA staff: Guidance for the use of Bayesian statistics in medical device clinical trials. Rockville, MD: Department of Health and Human Services.
FDA. 2010b. Guidance for industry: Non-inferiority clinical trials, draft guidance. Washington, DC: Department of Health and Human Services.
FDA. 2010c. FDA briefing document. Advisory committee meeting for NDA 21071: Avandia (rosiglitazone maleate tablet). Silver Spring, MD: Department of Health and Human Services.
FDA. 2012. Classifying significant postmarketing drug safety issues: Draft guidance. Washington, DC: Department of Health and Human Services.
Fisher, D. J., A. J. Copas, J. F. Tierney, and M. K. B. Parmar. 2011. A critical review of methods for the assessment of patient-level interactions in individual participant data meta-analysis of randomized trials, and guidance for practitioners. Journal of Clinical Epidemiology 64(9):949-967.
Fisher, L. D. 1999. Carvedilol and the Food and Drug Administration (FDA) approval process: The FDA paradigm and reflections on hypothesis testing. Controlled Clinical Trials 20(1):16-39.
Fleming, T. R. 2008. Current issues in non-inferiority trials. Statistics in Medicine 27(3):317-332.
Fleming, T.R., K. Odem-Davis, M. Rothmann, and Y. Li Shen. 2011. Some essential considerations in the design and conduct of non-inferiority trials. Clinical Trials 8:432-439.
Frank, E., G. B. Cassano, P. Rucci, A. Fagiolini, L. Maggi, H. C. Kraemer, D. J. Kupfer, B. Pollock, R. Bies, V. Nimgaonkar, P. Pilkonis, M. K. Shear, W. K. Thompson, V. J. Grochocinski, P. Scocco, J. Buttenfield, and R. N. Forgione. 2008. Addressing the challenges of a cross-national investigation: Lessons from the Pittsburgh-PISA study of treatment-relevant phenotypes of unipolar depression. Clinical Trials 5(3):253-261.
Furberg, C. D., and B. Pitt. 2001. Commentary: Withdrawal of cerivastatin from the world market. Current Controlled Trials in Cardiovascular Medicine 2(5):205-207.
GAO (Government Accountability Office). 2010a. Drug safety: FDA has conducted more foreign inspections and begun to improve its information on foreign establishments, but more progress is needed. Washington, DC: Government Accountability Office.
GAO. 2010b. Food and Drug Administration: Overseas offices have taken steps to help ensure import safety, but more long-term planning is needed. Washington, DC: Government Accountability Office.
GAO. 2010c. New drug approval: FDA’s consideration of evidence from certain clinical trials. Washington, DC: Government Accountability Office.
Garrison, L. P., Jr., P. J. Neumann, P. Radensky, and S. D. Walcoff. 2010. A flexible approach to evidentiary standards for comparative effectiveness research. Health Affairs 29(10):1812-1817.
Gelfand, A. E., and B. K. Mallick. 1995. Bayesian analysis of proportional hazards models built from monotone functions. Biometrics 51(3):843-852.
Golder, S., Y. K. Loke, and M. Bland. 2011. Meta-analyses of adverse effects data derived from randomised controlled trials as compared to observational studies: Methodological overview. PLoS Med 8(5):e1001026.
Good, I. J. 1950. Probability and the weighting of evidence. London, UK: Charles Griffin & Co.
Goodman, S. N. 1999. Toward evidence-based medical statistics. 2: The Bayes factor. Annals of Internal Medicine 130(12):1005-1013.
Goodman, S. N. 2001. Of P-values and Bayes: A modest proposal. Epidemiology 12(3):295-297.
Gordis, L. 2004. Epidemiology. Third ed. Philadelphia, PA: Elsevier Inc.
Graham, D. J., and K. Gelperin. 2010a. Memorandum to Mary Parks regarding comments on RECORD, TIDE, and the benefit-risk assessment of rosiglitazone vs. pioglitazone. In FDA Briefing Document Advisory Committee Meeting for NDA 21071: Avandia (rosiglitazone maleate) tablet: July 13 and 14, 2010. Washington, DC: Department of Health and Human Services.
Graham, D. J., and K. Gelperin. 2010b. TIDE and benefit-risk considerations. http://www.fda.gov/downloads/AdvisoryCommittees/CommitteesMeetingMaterials/Drugs/EndocrinologicandMetabolicDrugsAdvisoryCommittee/UCM224732.pdf (accessed October 11, 2011).
Greene, B. M., A. M. Geiger, E. L. Harris, A. Altschuler, L. Nekhlyudov, M. B. Barton, S. J. Rolnick, J. G. Elmore, and S. Fletcher. 2006. Impact of IRB requirements on a multicenter survey of prophylactic mastectomy outcomes. Annals of Epidemiology 16(4):275-278.
Greenhouse, J. B., and L. Wasserman. 1995. Robust Bayesian methods for monitoring clinical trials. Statistics in Medicine 14(12):1379-1391.
Guyatt, G. H., A. D. Oxman, G. E. Vist, R. Kunz, Y. Falck-Ytter, P. Alonso-Coello, and H. J. Schunemann. 2008. GRADE: An emerging consensus on rating quality of evidence and strength of recommendations. BMJ 336(7650):924-926.
Hamburg, M. A. 2011. Commentary: The growing role of epidemiology in drug safety regulation. Epidemiology 22(5):622-624.
Hammad, T. A., S. P. Pinheiro, and G. A. Neyarapally. 2011b. Secondary use of randomized controlled trials to evaluate drug safety: A review of methodological considerations. Clinical Trials 8(5):559-570.
Hernán, M. A., and S. Hernandez-Diaz. 2012. Beyond the intention-to-treat in comparative effectiveness research. Clinical Trials 9(1):48-55.
Hernán, M. A., and J. M. Robins. 2012. Causal inference. New York: Chapman & Hall/CRC.
Hernán, M. A., and J. M. Robins. 2006. Instruments for causal inference: An epidemiologist’s dream? Epidemiology 17(4):360-372.
Ioannidis, J. P. A., and J. Lau. 2001. Completeness of safety reporting in randomized trials: An evaluation of 7 medical areas. JAMA 285(4):437-443.
Ioannidis, J. P. A., S. J. W. Evans, P. C. Gøtzsche, R. T. O’Neill, D. G. Altman, K. Schulz, and D. Moher. 2004. Better reporting of harms in randomized trials: An extension of the CONSORT statement. Annals of Internal Medicine 141(10):781-788.
Ioannidis, J. P., C. D. Mulrow, and S. N. Goodman. 2006. Adverse events: The more you search, the more you find. Annals of Internal Medicine 144(4):298-300.
IOM (Institute of Medicine). 2008. Improving the presumptive disability decision-making process for veterans. Washington, DC: The National Academies Press.
Ives, D. G., A. L. Fitzpatrick, D. E. Bild, B. M. Psaty, L. H. Kuller, P. M. Crowley, R. G. Cruise, and S. Theroux. 1995. Surveillance and ascertainment of cardiovascular events: The Cardiovascular Health Study. Annals of Epidemiology 5(4):278-285.
Ives, D. G., P. Samuel, B. M. Psaty, and L. H. Kuller. 2009. Agreement between nosologist and cardiovascular health study review of deaths: Implications of coding differences. Journal of the American Geriatrics Society 57(1):133-139.
Jencks, S. F., D. K. Williams, and T. L. Kay. 1988. Assessing hospital-associated deaths from discharge data. JAMA 260(15):2240-2246.
Jenkins, J. K. 2010. Memorandum from John Jenkins to Janet Woodcock (dated September 2010). Re: Recommendations for regulatory actions—Rosiglitazone. Washington, DC: US Food and Drug Administration.
Jones, A. P., R. D. Riley, P. R. Williamson, and A. Whitehead. 2009. Meta-analysis of individual patient data versus aggregate data from longitudinal clinical trials. Clinical Trials 6(1):16-27.
Juurlink, D. N. 2010. Rosiglitazone and the case for safety over certainty. JAMA 304(4):469-471.
Kadane, J. B. 2005. Bayesian methods for health-related decision making. Statistics in Medicine 24(4):563-567.
Kadane, J., and L. J. Wolfson. 1998. Experiences in elicitation. Journal of the Royal Statistical Society: Series D (The Statistician) 47(1):3-19.
Kaizar, E. E., J. B. Greenhouse, H. Seltman, and K. Kelleher. 2006. Do antidepressants cause suicidality in children? A Bayesian meta-analysis. Clinical Trials 3(2):73-90; discussion 91-98.
Kass, R. E., and A. E. Raftery. 1995. Bayes factors. Journal of the American Statistical Association 90(430):773-795.
Kaul, S., and G. A. Diamond. 2006. Good enough: A primer on the analysis and interpretation of noninferiority trials. Annals of Internal Medicine 145(1):62-69.
Kaul, S., and G. A. Diamond. 2007. Making sense of noninferiority: A clinical and statistical perspective on its application to cardiovascular clinical trials. Progress in Cardiovascular Diseases 49(4):284-299.
Kufner, S., A. de Waha, F. Tomai, S.-W. Park, S.-W. Lee, D.-S. Lim, M. H. Kim, A. M. Galloe, M. Maeng, C. Briguori, A. Dibra, A. Schömig, and A. Kastrati. 2011. A meta-analysis of specifically designed randomized trials of sirolimus-eluting versus paclitaxel-eluting stents in diabetic patients with coronary artery disease. American Heart Journal 162(4):740-747.
Laine, C., S. N. Goodman, M. E. Griswold, and H. C. Sox. 2007. Reproducible research: Moving toward research the public can really trust. Annals of Internal Medicine 146(6):450-453.
Lanctot, K. L., and C. A. Naranjo. 1995. Comparison of the Bayesian approach and a simple algorithm for assessment of adverse drug events. Clinical Pharmacology & Therapeutics 58(6):692-698.
Lau, H. S., A. de Boer, K. S. Beuning, and A. Porsius. 1997. Validation of pharmacy records in drug exposure assessment. Journal of Clinical Epidemiology 50(5):619-625.
Laughren, T. P. 2006. Overview for December 13 meeting of psychopharmacologic drugs advisory committee (PDAC).
Law, M. R., Y. Kawasumi, and S. G. Morgan. 2011. Despite law, fewer than one in eight completed studies of drugs and biologics are reported on time on ClinicalTrials.gov. Health Affairs 30(12):2338-2345.
Lee, K., P. Bacchetti, and I. Sim. 2008. Publication of clinical trials supporting successful new drug applications: A literature analysis. PLoS Medicine 5(9):e191.
Lesaffre, E. 2008. Superiority, equivalence, and non-inferiority trials. Bulletin of the NYU Hospital for Joint Diseases 66(2):150-154.
Levenson, M., and C. Holland. 2006. Slide presentation: Antidepressants and suicidality in adults: Statistical evaluation. http://www.fda.gov/ohrms/dockets/ac/06/slides/2006-4272s1-04-FDA_files/frame.htm (accessed April 6, 2012).
Lilford, R. J., M. A. Mohammed, D. Braunholtz, and T. P. Hofer. 2003. The measurement of active errors: Methodological issues. Quality and Safety in Health Care 12(Suppl 2):ii8-ii12.
Lin, D. Y., and D. Zeng. 2010. Meta-analysis of genome-wide association studies: No efficiency gain in using individual participant data. Genetic Epidemiology 34(1):60-66.
Madigan, D., P. Ryan, S. E. Simpson, and I. Zorych. 2010. Bayesian methods in pharmacovigilance. In Bayesian statistics 9, edited by J. M. Bernardo, M. J. Bayarri, J. O. Berger, A. P. Dawid, D. Heckerman, A. F. M. Smith and M. West. Oxford, UK: Oxford University Press.
Manion, F., R. Robbins, W. Weems, and R. Crowley. 2009. Security and privacy requirements for a multi-institutional cancer research data grid: An interview-based study. BMC Medical Informatics and Decision Making 9(1):31.
Marciniak, T. A. 2010. Memorandum from Thomas Marciniak to Jena Weber (dated June 14, 2010) regarding cardiovascular events in RECORD, NDA 21-071/s-035. In FDA Briefing Document Advisory Committee Meeting for NDA 21071: Avandia (rosiglitazone maleate) tablet: July 13 and 14, 2010. Washington, DC: Department of Health and Human Services.
McEvoy, B., R. R. Nandy, and R. C. Tiwari. 2012. Applications of Bayesian model selection criteria for clinical safety data (abstract, ASA joint statistical meetings). http://www.amstat.org/meetings/jsm/2012/onlineprogram/AbstractDetails.cfm?abstractid=305627 (accessed April 5, 2012).
Melander, H., J. Ahlqvist-Rastad, G. Meijer, and B. Beermann. 2003. Evidence b(i)ased medicine—selective reporting from studies sponsored by pharmaceutical industry: Review of studies in new drug applications. BMJ 326(7400):1171-1173.
Miller, J. D. 2010. Registering clinical trial results: The next step. JAMA 303(8):773-774.
Misbin, R. I. 2007. Lessons from the Avandia controversy: A new paradigm for the development of drugs to treat type 2 diabetes. Diabetes Care 30(12):3141-3144.
Nissen, S. E., and K. Wolski. 2007. Effect of rosiglitazone on the risk of myocardial infarction and death from cardiovascular causes. New England Journal of Medicine 356(24):2457-2471.
NRC (National Research Council). 2010. The prevention and treatment of missing data in clinical trials. Panel on handling missing data in clinical trials. Washington, DC: The National Academies Press.
Owens, D. K., K. N. Lohr, D. Atkins, J. R. Treadwell, J. T. Reston, E. B. Bass, S. Chang, and M. Helfand. 2010. AHRQ series paper 5: Grading the strength of a body of evidence when comparing medical interventions—Agency for Healthcare Research and Quality and the Effective Health-Care Program. Journal of Clinical Epidemiology 63(5):513-523.
Oxford Dictionaries. 2011. Oxford English Dictionary online. Oxford University Press.
Parks, M. H. 2010. Memorandum from Mary Parks to Curtis Rosebraugh (dated August 19, 2010). Re: Recommendations on marketing status of Avandia (rosiglitazone maleate) and the required post-marketing trial, Thiazolidinedione Intervention and Vitamin D Evaluation (TIDE) following the July 13 and 14, 2010 Public Advisory Committee Meeting. Silver Spring, MD: US Food and Drug Administration.
Parmigiani, G. 2002. Modeling in medical decision making: A Bayesian approach (statistics in practice). New York: Wiley.
Peng, R. D., F. Dominici, and S. L. Zeger. 2006. Reproducible epidemiologic research. American Journal of Epidemiology 163(9):783-789.
PMA (Cochrane Prospective Meta-Analysis Methods Group). 2010. Welcome: The prospective meta-analysis methods group. http://pma.cochrane.org/ (accessed December 12, 2011).
Psaty, B. M., and D. S. Siscovick. 2010. Minimizing bias due to confounding by indication in comparative effectiveness research: The importance of restriction. JAMA 304(8):897-898.
Psaty, B. M., R. Boineau, L. H. Kuller, and R. V. Luepker. 1999. The potential costs of upcoding for heart failure in the United States. The American Journal of Cardiology 84(1):108-109.
Psaty, B. M., C. J. O’Donnell, V. Gudnason, K. L. Lunetta, A. R. Folsom, J. I. Rotter, A. G. Uitterlinden, T. B. Harris, J. C. M. Witteman, E. Boerwinkle, and (on behalf of the CHARGE Consortium). 2009. Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) consortium. Circulation: Cardiovascular Genetics 2(1):73-80.
Reade, M., A. Delaney, M. Bailey, D. Harrison, D. Yealy, P. Jones, K. Rowan, R. Bellomo, and D. Angus. 2010. Prospective meta-analysis using individual patient data in intensive care medicine. Intensive Care Medicine 36(1):11-21.
Royall, R. M. 1997. Statistical evidence: A likelihood paradigm. London, UK: Chapman & Hall.
Saunders, K., K. Dunn, J. Merrill, M. Sullivan, C. Weisner, J. Braden, B. Psaty, and M. Von Korff. 2010. Relationship of opioid use and dosage levels to fractures in older chronic pain patients. Journal of General Internal Medicine 25(4):310-315.
Staffa, J. A., J. Chang, and L. Green. 2002. Cerivastatin and reports of fatal rhabdomyolysis. New England Journal of Medicine 346(7):539-540.
Talbot, J., and P. Waller. 2004. Stephens’ detection of new adverse drug reactions. 5th ed. West Sussex, England: John Wiley & Sons Ltd.
Temple, R., and S. S. Ellenberg. 2000. Placebo-controlled trials and active-control trials in the evaluation of new treatments. Part 1: Ethical and scientific issues. Annals of Internal Medicine 133(6):455-463.
Ten Have, T. R., S. L. Normand, S. M. Marcus, C. H. Brown, P. Lavori, and N. Duan. 2008. Intent-to-treat vs. non-intent-to-treat analyses under treatment non-adherence in mental health randomized trials. Psychiatric Annals 38(12):772-783.
Thomas, E. J., and L. A. Petersen. 2003. Measuring errors and adverse events in health care. Journal of General Internal Medicine 18(1):61-67.
Thompson, S. G., and S. J. Sharp. 1999. Explaining heterogeneity in meta-analysis: A comparison of methods. Statistics in Medicine 18:2693-2708.
Toh, S., and M. A. Hernán. 2008. Causal inference from longitudinal studies with baseline randomization. International Journal of Biostatistics 4(1):Article 22.
Turner, E. H., A. M. Matthews, E. Linardatos, R. A. Tell, and R. Rosenthal. 2008. Selective publication of antidepressant trials and its influence on apparent efficacy. New England Journal of Medicine 358(3):252-260.
Unger, E. 2010. Memorandum to the file regarding NDA: 21-071; suppl 35, 36, 37 Avandia (rosiglitazone). In FDA Briefing Document: Advisory Committee Meeting for NDA 21071: Avandia (rosiglitazone maleate) tablet: July 13 and 14, 2010. Washington, DC: Department of Health and Human Services.
Vandenbroucke, J. P. 2006. What is the best evidence for determining harms of medical treatment? Canadian Medical Association Journal 174(5):645-646.
Vandenbroucke, J. P., and B. M. Psaty. 2008. Benefits and risks of drug treatments: How to combine the best evidence on benefits with the best data about adverse effects. JAMA 300(20):2417-2419.
Vedula, S. S., L. Bero, R. W. Scherer, and K. Dickersin. 2009. Outcome reporting in industry-sponsored trials of gabapentin for off-label use. New England Journal of Medicine 361(20):1963-1971.
Weiss, N. S., T. D. Koepsell, and B. M. Psaty. 2008. Generalizability of the results of randomized trials. Archives of Internal Medicine 168(2):133-135.
Wood, A. J. J. 2009. Progress and deficiencies in the registration of clinical trials. New England Journal of Medicine 360(8):824-830.
Yap, J. S. 2010. Statistical review and evaluation: Clinical studies NDA 21-071/35 and 21-073. In FDA Briefing Document Advisory Committee Meeting for NDA 21071: Avandia (rosiglitazone maleate) tablet: July 13 and 14, 2010. Washington, DC: Department of Health and Human Services.