Epidemiology begins with a question such as "Does exposure X cause disease Y?" Its premise is so deceptively simple that it can be described in two sentences:
- Scientists compare two groups of people that are alike in all ways except that one group was exposed to X and the other group was not.
- If more people in the exposed group than in the other group have the disease, Y, scientists have an epidemiologic clue that exposure X may be harmful. (Note: We have not proven that X causes Y; we have shown that in this sample X and Y occur together more often than we would have expected them to by chance.)
What, however, takes scores of technical textbooks and fuels ongoing debates are the "how to" and "what if," the "buts," "on the other hands," and "howevers" that make all the difference between error-laden, error-tinged, and accurate study results. In the next few pages, we describe several known pitfalls and techniques for avoiding them. That should provide a basic background to enable non-technically oriented readers to dig into this report.
Epidemiology is the study of the distribution and determinants of disease and its effects (e.g., death) in human populations. By examining data, rather than people (as in clinical research) or animals or chemicals (as in laboratory research), epidemiologic analyses seek to understand causation. Epidemiology attempts to tease out the relationships between factors—be they characteristics of people (e.g., age, race, sex), or their work (tension-filled or relaxed, indoors or outdoors) or home (sufficient or insufficient food, shelter, and social support)
environments; characteristics of potentially harmful factors (viruses, poverty, metabolic disturbances, high cholesterol, or radiation) or beneficial factors (including new medication, surgery, medical devices, health education, income, and housing); or measures of health status (mortality rates, cholesterol levels, or disease incidence). Notice that one factor can be at once a characteristic, risk factor, and outcome. A key distinction between epidemiologic and experimental data is that epidemiologic studies usually are not designed experiments with purebred animal subjects randomized to be exposed or not exposed. Rather, one makes use of exposure situations that have occurred for various reasons to learn what one can. This is essential in situations such as the study of CROSSROADS participation where a randomized design is impossible retrospectively.
It is important to understand that while epidemiology seeks to understand causal pathways, it cannot prove causation. Epidemiology uses judgment, statistics, and skepticism to reach descriptions and interpretations of relationships and associations. It is both a practical technique and an intellectual framework for considering the possibilities of causal relationships. It is the approach we have taken in this study.
Epidemiologists compare groups. The key to making sound comparisons is in choosing groups that are alike in all ways except for the matter being studied. This selection of comparison groups is where the science, mathematics, and art of good epidemiology are blended. For example, because age and sex are associated with health risks and conditions, data regarding age and sex are collected, making it possible in the analysis to either compare like age distributions and sexes or statistically adjust the data to account for known differences.
Choice of Comparison Group
In studying CROSSROADS participants, comparison group options include the development of a specific control group, internal comparisons by level of exposure, and use of national statistics. Each carries useful and restrictive elements.
If, for example, one wants to study the effect of something on lung cancer, knowing what we do about cigarette smoking and lung cancer, we would want to pick two groups to compare that do not differ in smoking practices, for that difference could mask the true causal relationship we are looking to explore. In studies of military participants, it helps to use a reference group that is also military. After checking age and sex, we rest a bit more comfortably that the two groups are rather likely to be similar on a host of unmeasured characteristics—such as smoking behavior. If, however, we chanced to compare the woodwind section of the Navy band (good breathers) with an average group of smokers, we could encounter differences attributable to smoking behavior.
Closer to the concerns of this study, we would not want to compare a group exposed to nuclear test radiation with a group drawn from radiation workers. (Although if there were a few radiation workers in a much greater number of comparison group members, any possible confounding would be very diluted.)
Study results hinge on differences between the two (or more) groups compared in the study. So, choice of comparison group(s) is an extremely important task, one that has both conceptual and practical aspects. Consistent findings over hundreds of different disease-exposure inquiries demonstrate what we refer to as a "healthy worker effect." With no hypothesized harmful exposure, a cohort of workers or soldiers is expected to be healthier, as reflected in mortality and morbidity rates, than a general cohort. To be included in the soldier or worker cohort, the individual has to be mentally and physically functioning at or above whatever level is required for the duties of that cohort. In the extreme, those on their "deathbeds" are not hired or recruited. Furthermore, individuals are excluded from military service if they are not "fit," according to clinical and laboratory findings. Numerous studies have confirmed that this healthy worker effect is most pronounced in measurements taken close to the time of hiring (or entry into military service) but continues for decades.
Using a military comparison group addresses and avoids the healthy soldier effect but does carry other drawbacks. While government and other groups routinely gather statistics (including demographic, health, and employment descriptors) on general populations, such as U.S. males aged 45–65, data are not readily available for more finely (or even grossly) honed comparison groups in the military or elsewhere. Using a specifically designed comparison group, therefore, adds expense and time to a study. Furthermore, it increases the opportunity to introduce confounding information that could bias the findings.
Many of these difficulties can be overcome with meticulous attention to technique, innovative study designs and analytic plans, and a balanced view of what statistics do and do not say. These options are difficult to weigh for practiced scientists and no less difficult to explain to and discuss with nontechnically trained readers; misunderstanding between scientist and public often occurs.
One option is to compare the group in question (for example, military personnel who participated in nuclear tests) with more than one comparison group, aiming to tease out relationships between exposure and outcome by seeing similarities and differences in those comparisons. The current CROSSROADS study is structured around a military comparison group, chosen to match on age, rank, time period, and military occupation—all available characteristics—but specifically not CROSSROADS test participants. Secondarily, we included statistical comparisons with the general U.S. male population.
Fine Tuning of Exposed Group
Although "participant" vs. "nonparticipant" is an intuitively reasonable place to start analysis in this study, there are intricate details to consider. Foremost, not all "participants" received the same amount of exposure (or potential exposure, measured exposure, expected exposure, or type of exposure) as all the other participants.
We look, therefore, for some way(s) of measuring the amount of exposure and then characterizing individuals in relation to their known (or expected or hypothesized) dose (amount of exposure). Otherwise, if only a few of the participants were exposed, any effect (on cancer mortality, for example) would be diluted, because most of the "exposed" were actually not exposed (or minimally exposed) and would not reflect the exposure-disease association. No difference would be observed, and we would not know whether that meant there was indeed no difference or whether the comparison groups were defined in ways that made a real difference unobservable.
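A bit of hypothetical arithmetic shows how severe this dilution can be. Suppose the true rate ratio among the genuinely exposed is 2.0, but only 20 percent of the nominal "exposed" group was actually exposed; both numbers are invented purely for illustration.

```python
# Illustrative arithmetic for the dilution problem: all numbers hypothetical.
true_rr = 2.0                     # assumed rate ratio among the truly exposed
fraction_actually_exposed = 0.20  # assumed share of the "exposed" group

# The observed rate in the "exposed" group is a mixture of the exposed rate
# and the unexposed (baseline) rate, pulling the observed ratio toward 1.0.
observed_rr = (fraction_actually_exposed * true_rr
               + (1 - fraction_actually_exposed) * 1.0)
print(observed_rr)  # 1.2
```

A doubled risk among the truly exposed thus appears as only a 20 percent excess when the exposure classification is this imprecise.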
Because adequate direct exposure measurements are not always available, researchers attempt to develop surrogate measures of exposure. In this study we pursued data from actual dosimetry measurements made at the time of the nuclear tests, recalculations done to address the known incompleteness of those measures, self-reports of participants, and coherent assumptions based on knowledge of radiation physics, troop logistics, on-site reportage, logs, and documents as well as logic.
It will come as no surprise that some characteristics—such as age and sex—are associated with numerous measures of health status. They are, also, associated with military experience in general and CROSSROADS participation in particular. These are likely confounders (things that confuse a straightforward comparison), because they are characteristics associated with both the outcome and the putative causative element under study. While a military comparison group based on broad categories of age, sex, similar unit assignment, and military rank provides some assurance of comparability, differences are still likely to exist. When we know what the confounders are and we can measure them, we can take them into account in the statistical analysis. Careful choice of comparison groups can help to limit the effect of unknown confounders. Chapters 10 and 11 of this report describe the design and analytic steps we took to control for potential confounding.
Examples of characteristics that frequently confound exposure-disease associations include age, race, sex, socioeconomic status, occupation, and various behaviors, such as alcohol and tobacco use. In specific studies
investigators may hypothesize potential confounders such as ethnicity; military service-related exposures, including sunlight, altitude, preventive and therapeutic attention to infectious disease as well as the diseases themselves; and other risks based on lifestyle, geography, and postmilitary careers.
Once researchers have chosen the groups to study, avoiding the pitfalls (or at least recognizing and measuring them as well as possible for later adjustment), they face a new set of problems during the planning and conduct of data collection. If you plan to get information directly from subjects, you need to do all you can to find all subjects, regardless of whether they are in the case/participant or control/comparison group and regardless of the outcome under study. If you are getting information from records, you need to get records for all subjects, again regardless of group and regardless of the outcome under study.
For example, if you are attempting to get information from subjects themselves and want to find out mortality rates and gather information by phone, you will not find anyone to be dead. Conversely, if you look only at death certificates, you will not find anyone alive. These somewhat tongue-in-cheek extremes are easy to avoid; the shades of gray around and between them, however, are often stumbling blocks in data collection and then in analysis and interpretation. The reason is that record systems contain biases: not all records have an equal likelihood of being retrieved. For example, in looking at hospital records, specific cases involved in lawsuits may be in the general counsel's office and not in the clinic's file, where they would normally be found. There are also mundane reasons for all data not being equally available: records can be lost or destroyed, intentionally or unintentionally, by flood or fire, as in the case of veterans' records at the National Personnel Records Center in St. Louis (see Chapter 7). Note that bias does not necessarily mean prejudicial treatment; it includes any process that systematically treats one set of records differently than another.
To minimize possible biases, a number of general rules and protocols have evolved to guide researchers—regardless of participant or comparison group and regardless of likely outcome. These protocols include developing an understanding of all data sources and how they may be expected to affect data distributions and establishing clear decision rules. A summary list of rules could include:
- ensuring that there is an equal likelihood of finding records of people in each group; if a source of data is available for only one group, do not use it.
- being aware of biases built into record systems. There are potentially many of these: people with illness are more likely to seek care; veterans with lower incomes or service-connected disabilities are more likely to seek VA care; care-seeking behavior varies over time (for example, as VA benefits change); medical record technologies change; whether patients or family members have concerns about benefits or suspicions of causation could influence whether they notify the recordkeeping agency; data may be missing due to circumstances beyond human control, such as a fire destroying paper files; and data accuracy is associated with level of ascertainment, such as completeness of fact-of-death, date-of-death, or cause-of-death information.
- using a firm cut-off date for the follow-up period. It is necessary to treat participants and comparisons equally when it comes to data collection, follow-up, and maintenance. The decisions made should be defensible. Researchers should examine—according to biologic, logistical, and cost implications—choices involving latency periods, cohort age, or pending compensation questions. Once cut-offs are chosen, it is best to recognize and honor the choice (although it may seem arbitrary in practice).
- recognizing that raw numbers offer different information than do rates or proportions. The latter include a context for interpreting the importance of the raw number. While reporting the number of people dead is often informative, it is insufficient to use percentages without first identifying a conceptually acceptable denominator and then using the entire denominator in any calculation. When examining constructs such as "average age at death," one should account for the amount of time available for observation, since the average will change over time as larger proportions of the sample die. For example, follow the mortality experience of a hypothetical sixth-grade class of 25 students in 1923. Looking at them in 1925, after one 13-year-old died in a motor vehicle accident, we would see an average age at death of 13 years. If no one else in that class were to die over the next 15 years, then, in 1940, the average age at death would still be 13, because all members of the cohort who had died (in this case, one person) did so at age 13. By 1975 (the original children would now be about 61 years old), perhaps another 10 would have died; the average age at death would be higher than 13 but necessarily lower than 61, depending on when the deaths occurred within that period. The average age at death calculated at any point in time is the average of the ages at death for all members deceased by that point; it changes as more deaths enter the calculation, and it does not reflect the total mortality experience of the group until all members have died. Statistical techniques have been developed to even out such things, so that numbers can be compared meaningfully.
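The hypothetical class example can be written out as a short computation. The single 1925 death comes from the text; the ages of the ten later deaths are invented solely to fill in the illustration.

```python
# Hypothetical data for the sixth-grade class of 25 students.
deaths_by_1940 = [13]                      # ages at death observed by 1940
later_deaths = [45, 50, 52, 55, 57, 58,    # assumed ages at death of the
                59, 60, 60, 61]            # ten classmates dead by 1975
deaths_by_1975 = deaths_by_1940 + later_deaths

def average_age_at_death(ages):
    """Average age at death among those deceased so far --
    not the life expectancy of the whole cohort."""
    return sum(ages) / len(ages)

print(average_age_at_death(deaths_by_1940))  # 13.0
print(average_age_at_death(deaths_by_1975))  # above 13 but below 61
```

Whatever ages are assumed for the later deaths, the 1975 average necessarily lies between 13 and 61, just as the text describes.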
These comments show the bridges among data collection, reporting, and analysis. In the following sections, we continue with analysis issues.
Interpreting Data Findings
Let us say that comparison groups were chosen appropriately, unbiased data collected, and one group has more disease than the other. Epidemiology provides for the use of judgment in considering whether a numerical relationship might reflect a causal one. The criteria of causal judgment—which have been stated in many contexts—involve two broad considerations: Are the exposure and the outcome associated? Does that association make sense, based on biological as well as other physical, historical, and study design factors?
Epidemiology studies are designed to describe numerical associations between factors (risks, treatments, outcomes). In interpreting the results we look at characteristics of those associations. Evidence supporting a causal association mounts if the association is consistent (observed in a variety of studies addressing the same type of exposure in different circumstances), strong (e.g., with high relative risk ratios), and specific. Statistics serve as a tool to quantify the strength of associations relative to random background fluctuations, which are more likely to be observed the smaller the sample considered. Through mathematical theory and centuries of data analysis, statisticians have derived (and continue to derive) methods to deal with multiple comparisons, effects of misclassification, inferences from samples, and combining data from diverse (but not too diverse) studies.
Vital to the epidemiologist's examination of data are the issues of statistical measures and variability. Starting with a sample of people, we generate statistical measures (or statistics, for short) that summarize some important information collected on them (e.g., death rates). Variability enters the picture when we take a particular sample, because the statistics we generate for that particular sample will be specific to that sample; a different sample would generate different statistics because the individuals in one sample are not the same as in the other. Yet, if a sample has been selected essentially at random and something is known or assumed about the distribution of the statistics generated from that particular sample, then we can make some general statements about the variability of those statistics.
Typically, we characterize a particular statistical measure's variability by quantifying how much it would vary just by taking different samples and recalculating that same statistic. In general, it turns out that the larger the sample, the smaller the variability. It is customary to calculate two limits, called the lower and upper 95 percent confidence limits, that have the property that if we repeatedly drew samples and recalculated the statistic, these different values would lie between the upper and lower confidence limits 95 times out of 100. The interval between the upper and lower confidence limits is thus called a 95 percent confidence interval. The wider the confidence interval, the more variability there is in the statistic.
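The link between sample size and variability can be illustrated with a short simulation. In this hedged sketch, the population, its true death rate of 10 percent, and both sample sizes are hypothetical; the confidence limits use the standard normal approximation for a proportion.

```python
import random

random.seed(1)      # fixed seed so the illustration is reproducible
TRUE_RATE = 0.10    # assumed true death rate in the hypothetical population

def sample_rate_and_ci(n):
    """Draw a sample of size n, return the observed death rate and its
    normal-approximation 95 percent confidence interval."""
    deaths = sum(random.random() < TRUE_RATE for _ in range(n))
    p = deaths / n
    se = (p * (1 - p) / n) ** 0.5
    return p, (p - 1.96 * se, p + 1.96 * se)

# The larger the sample, the narrower (less variable) the interval.
_, ci_small = sample_rate_and_ci(100)
_, ci_large = sample_rate_and_ci(10_000)
print(ci_small[1] - ci_small[0])  # wide interval from a small sample
print(ci_large[1] - ci_large[0])  # narrow interval from a large sample
```

Running the sketch shows the interval shrinking by roughly a factor of ten as the sample grows a hundredfold, reflecting the square-root relationship between sample size and variability.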
It is frequently of interest to know what the variability of a statistic is because it affects its interpretation. If the mortality rates of participants and controls are equal, for example, then the ratio of these two rates (the rate ratio) should be 1.0. However, there is inherent variability in this rate ratio statistic, so that we want to calculate its 95 percent confidence interval. If the ratio is only slightly more than or less than 1.0, for example, by an amount that lies within the confidence interval, we customarily conclude that this small deviation from 1.0 could be attributed to inherent variability (chance), such as that which comes from selecting different samples. On the other hand, if the confidence interval for the rate ratio does not include 1.0, its value is not attributed to chance and it is considered statistically significant.
Another way to determine whether a particular statistic (let us stick to rate ratios) is bigger or smaller than 1.0 is to perform a statistical test. A statistical test is a more formal statistical procedure that computes a statistic under the assumption that some null hypothesis is true. A typical null hypothesis might be: there is no difference in mortality rate between group A and group B (in other words, the rate ratio is equal to 1.0). If the statistic is "unusual," then the null hypothesis is rejected. The measure of "unusual" is called a p-value. Customarily, a p-value of less than 0.05 is considered "unusual." For example, take the above null hypothesis of no difference between mortality rates in groups A and B; i.e., the rate ratio is 1.0. If observed data yield an actual rate ratio of 1.5, for instance, and an associated test statistic with a p-value less than 0.05, then we reject the null hypothesis and conclude that such a high rate ratio is unlikely (fewer than 5 times out of 100) to be due to chance.
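As an illustration, the following sketch computes a rate ratio and its 95 percent confidence interval on the log scale, a standard normal approximation for Poisson death counts. The death counts and person-years are invented for the example.

```python
import math

# Invented counts for two hypothetical groups followed for equal person-time.
deaths_a, person_years_a = 150, 100_000   # "exposed" group
deaths_b, person_years_b = 100, 100_000   # comparison group

rate_ratio = (deaths_a / person_years_a) / (deaths_b / person_years_b)

# Standard error of the log rate ratio for independent Poisson death counts
se_log_rr = math.sqrt(1 / deaths_a + 1 / deaths_b)
lo = math.exp(math.log(rate_ratio) - 1.96 * se_log_rr)
hi = math.exp(math.log(rate_ratio) + 1.96 * se_log_rr)

print(round(rate_ratio, 2))        # 1.5
print(round(lo, 2), round(hi, 2))  # interval lies entirely above 1.0
```

Because the whole interval lies above 1.0, this hypothetical ratio of 1.5 would be labeled statistically significant; had the interval straddled 1.0, the excess would be attributed to chance.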
Finally, we need to examine a little more what "unlikely to be due to chance" means in a larger context. By custom, a value is called statistically significant if the operation of chance will produce such a value only about 5 times in 100. However, just as with repeated samples, in repeated analyses of different data (for example, death rates due to cancer, heart disease, respiratory disease, etc.), every statistical test carries an individual 5 percent risk of labeling a statistic significant when its increased or decreased value was actually due to chance.
Moreover, if we do many such analyses, that 5 percent risk for each one mounts up. For example, if one does 20 statistical tests of rate ratios, it is quite likely that there will be at least one rate ratio labeled statistically significant just by the operation of chance. This analytic problem is known as the multiple comparisons problem.
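The arithmetic behind this problem is simple enough to compute directly. Assuming 20 independent tests, each with a 5 percent false-positive risk and no real effects anywhere:

```python
# Probability that at least one of 20 independent tests at the 0.05 level
# comes out "statistically significant" purely by chance.
p_at_least_one = 1 - (1 - 0.05) ** 20
print(round(p_at_least_one, 2))  # 0.64
```

With 20 tests, the chance of at least one spurious "significant" finding is roughly 64 percent, far from the nominal 5 percent.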
Because the greater the number of statistical tests, the more findings are labeled statistically significant due to chance, efforts are made to limit the number of statistical tests. This is usually done by specifying in advance a relatively small number of tests, directed at a limited number of research questions. Nevertheless, there are also times—for example, when one is interested in completely describing all the data, say, looking at a complete list of
causes of death, whether or not one suspects that any of these rates are elevated—when many independent tests are made. In these situations, it is especially important to keep in mind the possibility that statistically significant rate ratios may be labeled so merely due to chance.
At the same time, one must consider that a true association may fail to test as statistically significant by chance or because of lack of statistical power. The power of a study to detect a real association (if there were one) depends on sample size, the incidence of the outcome in the absence of exposure, and the strength of association between the exposure and the outcome.
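The dependence of power on sample size can be illustrated by simulation. In this hedged sketch, every number (a 2 percent baseline death risk, a true rate ratio of 1.5, the group sizes, and the number of simulated studies) is hypothetical, and significance is judged with a simple log-rate-ratio z-test.

```python
import math
import random

random.seed(2)  # fixed seed so the illustration is reproducible

def simulated_power(n_per_group, base_rate=0.02, rate_ratio=1.5, trials=200):
    """Fraction of simulated studies in which the z-test rejects the null
    hypothesis of no difference. Illustrative only."""
    rejections = 0
    for _ in range(trials):
        d_exp = sum(random.random() < base_rate * rate_ratio
                    for _ in range(n_per_group))
        d_cmp = sum(random.random() < base_rate
                    for _ in range(n_per_group))
        if d_exp == 0 or d_cmp == 0:
            continue  # cannot form the test statistic; skip this trial
        z = math.log(d_exp / d_cmp) / math.sqrt(1 / d_exp + 1 / d_cmp)
        if abs(z) > 1.96:
            rejections += 1
    return rejections / trials

print(simulated_power(500))    # low power: the real effect is usually missed
print(simulated_power(5000))   # much higher power with the larger groups
```

Even though the underlying rate ratio of 1.5 is real in every simulated study, the small-sample studies detect it only a minority of the time, which is exactly the failure mode described above.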
In considering whether an observed association makes sense causally, epidemiologists consider the temporal relationship between the factors (e.g., if described appropriately, an outcome cannot precede a cause), the biologic plausibility of the association, and its coherence with a range of other related knowledge (radiation biology, for example). No one of these factors is necessarily sufficient to prove causation. In fact, causation cannot actually be proven; it can only be supported (weakly or strongly) or contradicted (weakly or strongly).
Epidemiology uses numbers, going to extreme lengths at times to "split hairs" and "search under rocks," yet relies on judgment for interpretation. It is hoped that the considered judgments of epidemiologists will be useful to the judgment of clinicians in making treatment decisions and of policymakers in making legislation and regulatory and procedural decisions.
Epidemiology Summary Related to this Study
This is a report of a retrospective cohort study comparing military participants in CROSSROADS with military nonparticipants who are similar in age, rank-rating, military occupation, time frame of service, and sex. To more accurately measure exposure, we developed and used criteria for those participants most likely to have been more highly exposed. The study design calls for tight controls on the selection process for assignment to participant or comparison groups, data access, and data follow-up.
The endpoints considered are mortality rates. Specific causes of death were chosen based on understanding of disease process and a priori expectations based on knowledge and suspicion of radiation effects.
This study will not say whether Private Rogers, Rodriguez, or Rosenthal died of cancer because of Operation CROSSROADS. It may be able to say that the rate of cancer among all CROSSROADS participants was—or was not—different from the rate of cancer among comparable nonparticipants. Whether associations are reported with relative surety or uncertainty depends on the data themselves and on statistical techniques for sifting the wheat from the chaff. If
this were easy, we would not still be studying and arguing about radiation effects.
The Medical Follow-up Agency of the Institute of Medicine, National Academy of Sciences, conducted the study, relying, as necessary, on records maintained by government and private groups. MFUA is itself "disinterested" in that it stands to neither lose nor gain from its findings in this study: it will neither receive nor be denied compensation, nor will it be held fiscally or programmatically responsible for such compensation or related care. Because this study (not unlike many other studies of human suffering and possible blame and responsibility) has an historical overlay of tremendous emotion and distrust, we must be especially careful to follow generally accepted ground rules for valid studies and to describe openly our rationale for various decisions throughout.