Modeling Incidence and Mortality Data in an Ecologic Study
A starting point for ecologic modeling of cancer rates is Poisson regression for rates and counts. In classic Poisson regression, a count, Ni, of some data item (e.g., a count of childhood leukemias) is modeled as a Poisson random variable, with a probability mass function equal to:
$$P(N_i = n) = \frac{e^{-\mu_i}\,\mu_i^{n}}{n!}, \qquad n = 0, 1, 2, \ldots \qquad (1)$$
Here μi is the expected value of Ni (i.e., the number of cancer incident cases or deaths in a particular geographic unit expected from broad population rates, typically cross-classified by other variables such as age, gender, and race/ethnicity, with i as the identifying index). In Poisson regression the mean, μi, is unknown but assumed to be a function of known covariates. For example, in generalized linear regression (McCullagh and Nelder, 1989) a model for the mean involves a covariate vector Xi = (Xi1,Xi2,…,Xip)T observed for each i. These Xi may be either continuous variables, such as dose, or indicator variables, indicating levels taken by categorical variables. The generalized linear model for μi is of the form:
$$g(\mu_i) = X_i^{T}\alpha = \alpha_1 X_{i1} + \alpha_2 X_{i2} + \cdots + \alpha_p X_{ip} \qquad (2)$$
Here α = (α1,α2,…,αp)T and α1 is the regression coefficient relating covariate value Xi1 to the mean μi, α2 relates Xi2 to μi, etc. Here g is a link function; for example, when (as is often the case) g is the log function, the model is equivalent to:
$$\mu_i = \exp\left(X_i^{T}\alpha\right) \qquad (3)$$
When Ni counts the number of events observed over a period of time, ti (years), for a known number of individuals, ki, then the person-years of observation, pyi, defined as the product ti ki, are made part of the model as:
$$\mu_i = py_i \exp\left(X_i^{T}\alpha\right) \qquad (4)$$
so that the mean of the counts is proportional to the person-years of observation multiplied by the effect of covariates.
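As a minimal sketch of the log-link model with a person-years multiplier, the following computes the expected count for each table cell as person-years times exp(XiTα) and evaluates the Poisson log-likelihood. The cell values, the coefficient values, and the function names are hypothetical and chosen only for illustration; they are not taken from any registry data.

```python
import math

def expected_count(person_years, covariates, alpha):
    """Mean of N_i: person-years multiplied by exp(X_i^T alpha)."""
    linear_predictor = sum(x * a for x, a in zip(covariates, alpha))
    return person_years * math.exp(linear_predictor)

def poisson_log_pmf(n, mu):
    """log P(N = n) for a Poisson random variable with mean mu."""
    return n * math.log(mu) - mu - math.lgamma(n + 1)

# Hypothetical cells: (observed count, person-years, covariate vector).
# The first covariate is an intercept; the second is a 0/1 indicator.
cells = [
    (3, 12000.0, [1.0, 0.0]),
    (7, 15000.0, [1.0, 1.0]),
]

# Illustrative coefficients (not estimated): a baseline rate of 2 per
# 10,000 person-years and a rate ratio of 1.5 for the indicator.
alpha = [math.log(2.0e-4), math.log(1.5)]

means = [expected_count(py, x, alpha) for _, py, x in cells]
log_lik = sum(poisson_log_pmf(n, mu) for (n, _, _), mu in zip(cells, means))
```

In practice α would be estimated by maximizing this log-likelihood over all cells; the sketch only evaluates it at fixed values.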
In the setting described here Ni would correspond to a single entry in a cross-tabulation of events (death due to or incidence of a particular cancer) by each geographical unit, and by gender, race, age, calendar time, and any other relevant variable known (from the cancer registry) about the cases. For each cell in the table, the number of events and the person-years at risk, pyi, must be calculated (see discussion below). In addition, the variable of interest, dose Di, and other covariates available for each geographical unit (e.g., indices of socioeconomic status) are required for each table entry i.
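A variation on the person-years model above, known as the linear excess relative risk (ERR) model, is commonly used in radiation epidemiology. The linear ERR model incorporates dose in the model for μi as:
$$\mu_i = py_i \exp\left(X_i^{T}\alpha\right)\left(1 + \beta D_i\right) \qquad (5)$$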
Here exp(XiTα) is the background rate of disease (for unexposed cells), which is multiplied by the person-years at risk, pyi, and the ERR parameter β is the excess relative risk associated with dose or dose surrogate Di. Much more complex models can be considered, and software for generalized Poisson regression is available (Epicure, Hirosoft Software, Seattle, Washington). The background rate of disease is allowed to vary depending on race, gender, age, and calendar time (to allow for disease rates to differ by age and for age-specific rates to vary by calendar year, for example). Covariates in ecologic models are not individual covariates, but instead are summaries obtained for each geographical unit, although these can also vary in time; for example, we may have information about some socioeconomic variables at the level of census tract, and these variables may change with time over the period of interest. Such variables are incorporated by including (categories of) calendar time as a cross-classification variable.
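The linear ERR model can be illustrated with a short sketch, assuming the usual form in which the expected count is the background count, pyi exp(XiTα), multiplied by (1 + βDi). The data below are hypothetical, the background parameters are held fixed for simplicity, and β is located by a coarse grid search rather than by the joint maximum likelihood fit that dedicated software would perform.

```python
import math

def err_mean(person_years, covariates, alpha, beta, dose):
    """Expected count: background count times (1 + beta * dose)."""
    background = person_years * math.exp(
        sum(x * a for x, a in zip(covariates, alpha))
    )
    return background * (1.0 + beta * dose)

def log_likelihood(cells, alpha, beta):
    """Poisson log-likelihood summed over all table cells."""
    total = 0.0
    for n, py, x, dose in cells:
        mu = err_mean(py, x, alpha, beta, dose)
        total += n * math.log(mu) - mu - math.lgamma(n + 1)
    return total

# Hypothetical cells: (count, person-years, covariates, dose surrogate).
cells = [
    (10, 50000.0, [1.0], 0.0),
    (15, 50000.0, [1.0], 0.5),
    (20, 50000.0, [1.0], 1.0),
]
alpha = [math.log(2.0e-4)]  # assumed known here: 10 background cases per cell

# Grid search over beta in [0, 3] in steps of 0.01 (illustration only).
grid = [k / 100.0 for k in range(0, 301)]
beta_hat = max(grid, key=lambda b: log_likelihood(cells, alpha, b))
```

With these made-up counts the grid search returns an ERR estimate of 1.0 per unit dose, since the observed counts rise from 10 to 15 to 20 as the dose goes from 0 to 0.5 to 1.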
J.1 DOSE AND DOSE SURROGATES
The presumed effect on risk of the dose or dose surrogate variable, Di, in model (5) is much simpler (involving only the ERR parameter, β) than the model for the background risk (involving many additional parameters α); however, Di will also vary in time. For example, if Di is cumulative dose from a particular nearby plant for representative individuals, then Di for all census tracts near that plant would be zero until the start of operations of that plant and would accumulate in time during operation. Even treatment of much simpler dose surrogates (exposed or not exposed according to distance) should reflect startup times of each plant or facility.
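The time dependence of Di can be sketched as follows, assuming (purely for illustration) a constant annual dose to a representative individual while the plant operates. The function names, the startup year, the annual dose, and the distance cutoff are all hypothetical.

```python
def cumulative_dose(year, start_year, annual_dose, end_year=None):
    """Cumulative dose through the end of `year`.

    Zero before the plant starts operating; afterward one annual
    increment is added per operating year (constant-rate assumption).
    """
    if year < start_year:
        return 0.0
    last_year = year if end_year is None else min(year, end_year)
    return annual_dose * (last_year - start_year + 1)

def exposed_indicator(year, start_year, distance_km, radius_km):
    """Simple distance-based surrogate that still respects startup time."""
    return 1 if (year >= start_year and distance_km <= radius_km) else 0

# Hypothetical plant starting in 1975, 0.02 dose units per year.
dose_1970 = cumulative_dose(1970, 1975, 0.02)   # before startup
dose_1984 = cumulative_dose(1984, 1975, 0.02)   # ten operating years
flag_1970 = exposed_indicator(1970, 1975, distance_km=3.0, radius_km=5.0)
flag_1980 = exposed_indicator(1980, 1975, distance_km=3.0, radius_km=5.0)
```

Each (census tract, calendar period) cell would receive its own value of Di computed in this way before the table is passed to the regression.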
Other factors may also need to be considered in the calculation of Di; for example, if it is known that a population around a particular plant or facility has been highly mobile over the period of exposure then it would be desirable to incorporate that mobility into the calculation of Di in order to approximate the average cumulative dose to the individuals in each census tract for each time period considered. If distance is to be used as a dose surrogate then time-weighted distance could also be considered.
J.2 PERSON-YEAR CALCULATIONS
Another key issue in Poisson modeling is to adequately approximate the person-years of exposure to some hazard, pyi, as well as to count the number of events Ni. For each cell in the tabulation of events cross-classified by geographical unit, race, age, and calendar time, census data are required in order to determine the population size for each table entry, i.e., the whole population must be classified according to these same variables. Data from each decennial census must be interpolated to the out years (the years between censuses). The accuracy of the person-year approximations affects the modeling of Ni using Poisson regression, and inaccuracy in the estimation of person-years is one (among many) reason to assume that the Poisson model may not adequately capture the variability of the observed counts Ni.
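One simple way to approximate person-years for a table cell is to interpolate linearly between decennial census counts and sum the annual estimates. The sketch below assumes linear interpolation, which is only one of several possible choices, and uses hypothetical census counts for a single cell.

```python
def interpolate_population(year, census_counts):
    """Linearly interpolate a cell's population between census years."""
    years = sorted(census_counts)
    if not years[0] <= year <= years[-1]:
        raise ValueError("year lies outside the available censuses")
    for lo, hi in zip(years, years[1:]):
        if lo <= year <= hi:
            frac = (year - lo) / (hi - lo)
            return census_counts[lo] + frac * (
                census_counts[hi] - census_counts[lo]
            )
    return float(census_counts[years[0]])  # only one census was supplied

def person_years(first_year, last_year, census_counts):
    """Sum of annual population estimates; each year contributes
    (estimated population) x (1 year)."""
    return sum(
        interpolate_population(y, census_counts)
        for y in range(first_year, last_year + 1)
    )

# Hypothetical counts for one tract/age/race/gender cell.
census_counts = {1990: 4000, 2000: 5000}
pop_1995 = interpolate_population(1995, census_counts)
py_1990s = person_years(1990, 1999, census_counts)
```

Because the out-year populations are estimates, the resulting person-years carry error that the Poisson model does not account for.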
J.3 OVERDISPERSION
It is likely that the observed counts Ni will depart from the Poisson distribution in a way that must be adequately accommodated when fitting regression models such as (5). If a random variable Ni is distributed according to the Poisson distribution with mean μi, then the variance of Ni is also equal to μi. However, there are good reasons to expect that the actual variability of Ni will be greater than that predicted by the Poisson distribution. For example, as mentioned above, for the out years at least, the population size and hence the person-years will not be known exactly. Even more important, however, is that other known and unknown risk factors that influence disease occurrence are not accounted for by the variables used in the ecologic regression. Even if those risk factors are completely independent of distance or dose from a plant or facility, they will still increase the dispersion of Ni while leaving the model for the mean unaffected. Ignoring overdispersion will lead to underestimation of the standard errors of the estimates of the regression parameters, including those of most interest (i.e., β). The treatment of overdispersion in Poisson regression models has been considered by a number of authors (Liu and Pierce, 1993; McCullagh and Nelder, 1989; Moore, 1986). A simple and usually effective approach (McCullagh and Nelder, 1989) to solving this problem is to fit the means model using Poisson regression but then to estimate an overdispersion term σ² with σ² ≥ 1, so that the variance of Ni is estimated to be equal to σ²μi. Inference about the significance of the parameters of interest (i.e., β) is performed after adjusting the usual standard error estimates (those that assume the Poisson model), inflating them by the factor σ. A method-of-moments approach for fitting this and similar models is described by Moore (1986). More generally, the "sandwich estimator" of Zeger and Liang (1986) can be used to compute variances of the parameter estimates that adequately reflect the variability of the counts.
The overall approach described above relates observed disease rates to distance or other dose surrogates in a systemic way, i.e., addressing the question of whether or not disease risk appears to be associated with proximity to a nuclear facility, or to other dose surrogates, averaging over all the facilities. For some common cancers it will be possible to consider site-specific analyses, i.e., whether proximity to a specific facility or plant is associated with risk. Such analyses are subject to concerns about multiple comparisons (as described in the main text) but may also be particularly sensitive to the problem of overdispersion described above.
If one uses an uncorrected test, i.e., a test based upon the assumption that the Poisson distribution holds exactly, then it is very likely that there will be some sites where for some cancers proximity is "significantly" associated with risk, but for which the inference differs greatly depending upon whether or not purely Poisson variation of counts is assumed. The estimation of overdispersion terms σ² ≥ 1 (or providing other treatment of overdispersion, as in a random effects analysis) is crucial in order to avoid overinterpretation of random fluctuations that are simply greater in magnitude (due to unmeasured characteristics affecting disease risk) than expected under the Poisson model. These problems appear in many different kinds of settings and have been described by a number of different authors (e.g., Efron, 1992). Modeling of both the mean (as in equation (5) of this appendix) and the variance of the counts will be essential in ensuring that unrealistic inference from fitting these models is avoided; this is true both for the overall analysis of risk in relation to plant proximity and especially for site-specific analyses.
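A minimal sketch of the overdispersion adjustment follows. It assumes the Pearson-type moment estimate of σ², namely the sum of (Ni − μi)²/μi divided by the residual degrees of freedom, which is one common choice and not necessarily the estimator used in any particular analysis. The observed counts, fitted means, number of parameters, and Poisson-based standard error are hypothetical.

```python
import math

def pearson_dispersion(observed, fitted, n_params):
    """Moment estimate of sigma^2 from Pearson residuals, floored at 1
    so that standard errors are never shrunk below their Poisson values."""
    chi_square = sum((n - mu) ** 2 / mu for n, mu in zip(observed, fitted))
    degrees_of_freedom = len(observed) - n_params
    return max(1.0, chi_square / degrees_of_freedom)

def adjusted_standard_error(poisson_se, sigma2):
    """Inflate a Poisson-based standard error by sqrt(sigma^2)."""
    return poisson_se * math.sqrt(sigma2)

# Hypothetical observed counts and fitted means from a 2-parameter model.
observed = [12, 5, 22, 9, 30, 14]
fitted = [10.0, 8.0, 16.0, 12.0, 25.0, 21.0]

sigma2 = pearson_dispersion(observed, fitted, n_params=2)

# Hypothetical estimate of beta and its Poisson-based standard error.
beta_hat, poisson_se = 0.30, 0.14
se_adj = adjusted_standard_error(poisson_se, sigma2)
z_poisson = beta_hat / poisson_se   # test statistic ignoring overdispersion
z_adjusted = beta_hat / se_adj      # test statistic after the adjustment
```

In this made-up example the unadjusted statistic exceeds 1.96 while the adjusted one does not, which is the kind of change in inference the text warns about.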
REFERENCES
Efron, B. (1992). Poisson overdispersion estimates based on the method of asymmetric maximum likelihood. Journal of the American Statistical Association 87.
Liu, Q., and D. A. Pierce (1993). Heterogeneity in Mantel-Haenszel-type models. Biometrika 80(3):543-556.
McCullagh, P., and J. Nelder (1989). Generalized linear models, 2nd edition. Boca Raton, FL: CRC Press.
Moore, D. F. (1986). Asymptotic properties of moment estimates for overdispersed counts and proportions. Biometrika 73(3):583-588.
Zeger, S., and K. Liang (1986). Longitudinal data analysis for discrete and continuous outcomes. Biometrics 42:121-130.