**Suggested Citation:** "Appendix J: Modeling Incidence and Mortality Data in an Ecologic Study." National Research Council. 2012. *Analysis of Cancer Risks in Populations Near Nuclear Facilities: Phase 1*. Washington, DC: The National Academies Press. doi: 10.17226/13388.

Modeling Incidence and Mortality Data in an Ecologic Study

A starting point for ecologic modeling of cancer rates is Poisson regression for rates and counts. In classic Poisson regression, a count $N_i$ of some data item (e.g., a count of childhood leukemias) is modeled as a Poisson random variable with probability mass function

$$P(N_i = n) = \frac{\mu_i^{n} e^{-\mu_i}}{n!}, \qquad n = 0, 1, 2, \ldots \tag{1}$$

Here $\mu_i$ is the expected value of $N_i$ (i.e., the number of incident cancer cases or deaths in a particular geographic unit expected from broad population rates, typically cross-classified by other variables such as age, gender, and race/ethnicity, with $i$ as the identifying index). In Poisson regression the mean, $\mu_i$, is unknown but assumed to be a function of known covariates. For example, in generalized linear regression (McCullagh and Nelder, 1989) a model for the mean involves a covariate vector $X_i = (X_{i1}, X_{i2}, \ldots, X_{ip})^T$ observed for each $i$. These $X_i$ may be either continuous variables, such as dose, or indicator variables, indicating levels taken by categorical variables. The generalized linear model for $\mu_i$ is of the form

$$g(\mu_i) = X_i^T \alpha \tag{2}$$

Here $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_p)^T$, and $\alpha_1$ is the regression coefficient relating covariate value $X_{i1}$ to the mean $\mu_i$, $\alpha_2$ relates $X_{i2}$ to $\mu_i$, etc. Here $g$ is a link function; for example, when (as is often the case) $g$ is the log function, the model is equivalent to

$$\mu_i = \exp(X_i^T \alpha) \tag{3}$$

When $N_i$ counts the number of events observed over a period of time, $t_i$ (years), for a known number of individuals, $k_i$, then the person-years of observation, $py_i$, defined as $t_i k_i$, are made part of the model as

$$\mu_i = py_i \exp(X_i^T \alpha) \tag{4}$$

so that the mean of the counts is proportional to the person-years of observation multiplied by the effect of covariates.
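As a rough illustration of the Poisson probability function and the log-link mean with person-years described above, the following minimal sketch evaluates both for one cell. The function names, the baseline rate of 4 per 100,000 person-years, and the coefficient 0.3 are hypothetical values chosen only for illustration; they do not come from the report.

```python
import math

def poisson_pmf(n: int, mu: float) -> float:
    """P(N = n) for a Poisson variable with mean mu, computed on the log scale."""
    return math.exp(n * math.log(mu) - mu - math.lgamma(n + 1))

def expected_count(person_years: float, x: list, alpha: list) -> float:
    """Mean count under the log link with person-years: py * exp(x . alpha)."""
    linear_predictor = sum(xj * aj for xj, aj in zip(x, alpha))
    return person_years * math.exp(linear_predictor)

# Hypothetical cell: 50,000 person-years, an intercept plus one indicator covariate.
alpha = [math.log(4e-5), 0.3]  # assumed baseline rate and assumed covariate effect
mu = expected_count(50_000.0, [1.0, 1.0], alpha)

print(f"expected count: {mu:.3f}")
print(f"P(N = 3): {poisson_pmf(3, mu):.4f}")
```

Because person-years enter multiplicatively, doubling the person-years of a cell doubles its expected count while leaving the covariate effects unchanged.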

In the setting described here, $N_i$ would correspond to a single entry in a cross-tabulation of events (death due to or incidence of a particular cancer) by each geographical unit, and by gender, race, age, calendar time, and any other relevant variable known (from the cancer registry) about the cases. For each cell in the table, the number of events and the person-years at risk, $py_i$, must be calculated (see discussion below). In addition, the variable of interest, dose $D_i$, and other covariates available for each geographical unit (e.g., indices of socioeconomic status) are required for each table entry $i$.
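The cross-tabulation described above can be sketched as a simple grouping of case records into cells. This is only an illustration: the record fields (`tract`, `sex`, `age`, `year`), the 5-year age bands, and the decade-wide calendar periods are assumptions, and race is omitted to keep the example short.

```python
from collections import Counter

def age_group(age: int, width: int = 5) -> str:
    """Map an age in years to a band label such as '0-4' or '5-9'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"

def tabulate_events(cases: list) -> Counter:
    """Count events per cell; the cell key plays the role of the index i."""
    counts = Counter()
    for case in cases:
        period = case["year"] // 10 * 10  # assumed decade-wide calendar periods
        cell = (case["tract"], case["sex"], age_group(case["age"]), period)
        counts[cell] += 1
    return counts

# Hypothetical registry records (all values invented for illustration).
cases = [
    {"tract": "A", "sex": "F", "age": 3, "year": 1994},
    {"tract": "A", "sex": "F", "age": 4, "year": 1998},
    {"tract": "A", "sex": "M", "age": 7, "year": 2001},
    {"tract": "B", "sex": "F", "age": 3, "year": 1994},
]
events = tabulate_events(cases)
for cell, n in sorted(events.items()):
    print(cell, n)
```

In a real analysis, cells with zero events must also appear in the table, since they still contribute person-years at risk.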

A variation on the person-years model above, known as the linear excess relative risk (ERR) model, is commonly used in radiation epidemiology. The linear ERR model incorporates dose in the model for $\mu_i$ as

$$\mu_i = py_i \exp(X_i^T \alpha)\,(1 + \beta D_i) \tag{5}$$

Here $py_i \exp(X_i^T \alpha)$ is the background rate of disease (for unexposed cells) multiplied by person-years at risk, and the ERR parameter $\beta$ is the excess relative risk associated with dose or dose surrogate $D_i$. Much more complex models can be considered, and software for generalized Poisson regression is available (Epicure, Hirosoft Software, Seattle, Washington). The background rate of disease is allowed to vary depending on race, gender, age, and calendar time (to allow for disease rates to differ by age and for age-specific rates to vary by calendar year, for example). Covariates in ecologic models are not individual covariates but summaries obtained for each geographical unit, although these can also vary in time; for example, we may have information about some socioeconomic variables at the level of census tract, and these variables may change over the period of interest. Such variables are incorporated by including (categories of) calendar time as a cross-classification variable.
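To make the role of the ERR parameter concrete, the sketch below maximizes the Poisson likelihood over $\beta$ alone by Newton-Raphson. It rests on a strong simplifying assumption that is not part of the report: the background expected counts $py_i \exp(X_i^T \alpha)$ are treated as known, whereas in practice $\alpha$ and $\beta$ would be estimated jointly with dedicated software. All names and numbers are hypothetical.

```python
def err_mean(background: float, beta: float, dose: float) -> float:
    """Expected count under the linear ERR model: background * (1 + beta * dose)."""
    return background * (1.0 + beta * dose)

def fit_err_beta(counts, backgrounds, doses, beta=0.0, tol=1e-10, max_iter=100):
    """Newton-Raphson for beta, with background expected counts treated as known."""
    for _ in range(max_iter):
        # Score: derivative of the Poisson log-likelihood with respect to beta.
        score = sum(n * d / (1.0 + beta * d) - b * d
                    for n, b, d in zip(counts, backgrounds, doses))
        # Observed information: minus the second derivative.
        information = sum(n * d * d / (1.0 + beta * d) ** 2
                          for n, d in zip(counts, doses))
        step = score / information
        beta += step
        if abs(step) < tol:
            break
    return beta

# Hypothetical cells: background expected counts and doses in arbitrary units.
backgrounds = [10.0, 12.0, 8.0, 15.0]
doses = [0.0, 0.5, 1.0, 2.0]
true_beta = 0.2
# Noise-free "counts" set equal to their means, so the fit should recover true_beta.
counts = [err_mean(b, true_beta, d) for b, d in zip(backgrounds, doses)]

beta_hat = fit_err_beta(counts, backgrounds, doses)
print(f"estimated ERR per unit dose: {beta_hat:.6f}")
```

Cells with zero dose contribute nothing to the estimate of $\beta$ here; they matter only through the background model, which this sketch takes as given.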

**J.1 DOSE AND DOSE SURROGATES**

The presumed effect on risk of the dose or dose surrogate variable, $D_i$, in the linear ERR model is much simpler (involving only the ERR parameter, $\beta$) than the model for the background risk (involving many additional parameters $\alpha$); however, $D_i$ will also vary in time. For example, if $D_i$ is cumulative dose from a particular nearby plant for representative individuals, then $D_i$ for all census tracts near that plant would be zero until the start of operations of that plant and would accumulate in time during operation. Even the treatment of much simpler dose surrogates (exposed or not exposed according to distance) should reflect the startup times of each plant or facility.

Other factors may also need to be considered in the calculation of $D_i$; for example, if it is known that a population around a particular plant or facility has been highly mobile over the period of exposure, then it would be desirable to incorporate that mobility into the calculation of $D_i$ in order to approximate the average cumulative dose to the individuals in each census tract for each time period considered. If distance is to be used as a dose surrogate, then time-weighted distance could also be considered.
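The time dependence of $D_i$ can be illustrated with a small sketch. It assumes, purely for illustration, a constant annual dose increment once a plant starts operating and residence histories given as (distance, years) pairs; the function names, units, and values are hypothetical and are not taken from the report.

```python
def cumulative_dose(year: int, startup_year: int, annual_dose: float) -> float:
    """Cumulative dose at the end of `year`: zero before startup, then one
    annual increment per operating year (constant annual dose assumed)."""
    years_operating = max(0, year - startup_year + 1)
    return annual_dose * years_operating

def time_weighted_distance(residences: list) -> float:
    """Average distance weighted by the years spent at each distance.

    `residences` is a list of (distance_km, years) pairs.
    """
    total_years = sum(years for _, years in residences)
    return sum(dist * years for dist, years in residences) / total_years

# Hypothetical plant starting in 1970 with an assumed annual dose of 0.01 units.
for year in (1965, 1970, 1979):
    print(year, cumulative_dose(year, startup_year=1970, annual_dose=0.01))

# Hypothetical residence history: 6 years at 5 km, then 4 years at 20 km.
print(time_weighted_distance([(5.0, 6.0), (20.0, 4.0)]))
```

The same startup-year logic applies to a binary surrogate: a tract within some distance cutoff would be coded as exposed only for calendar periods after the facility began operating.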

**J.2 PERSON-YEAR CALCULATIONS**

Another key issue in Poisson modeling is to adequately approximate the person-years of exposure to some hazard, $py_i$, as well as to count the number of events $N_i$. For each cell in the tabulation of events cross-classified by geographical unit, race, age, and calendar time, census data are required in order to determine the population size for each table entry, i.e., the whole population must be classified according to these same variables. Data from each decennial census must be interpolated to the out years (the years between censuses). The accuracy of the person-year approximations affects the modeling of $N_i$ using Poisson regression, and inaccuracy in the estimation of person-years is one (among many) reason to expect that the Poisson model may not adequately capture the variability of the observed counts $N_i$.
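One simple way to approximate person-years from decennial data is linear interpolation between censuses, sketched below. Linear interpolation and the convention of counting each year's interpolated population as one year at risk per person are assumptions made for illustration; the report does not prescribe a particular interpolation method, and the population figures are invented.

```python
def interpolate_population(census: dict, year: int) -> float:
    """Linearly interpolate a cell's population between two census years."""
    years = sorted(census)
    if not years[0] <= year <= years[-1]:
        raise ValueError("year lies outside the available censuses")
    for lo, hi in zip(years, years[1:]):
        if lo <= year <= hi:
            weight = (year - lo) / (hi - lo)
            return (1.0 - weight) * census[lo] + weight * census[hi]
    return float(census[years[0]])  # only reached when a single census is supplied

def person_years(census: dict, first_year: int, last_year: int) -> float:
    """Sum interpolated populations over an inclusive range of calendar years."""
    return sum(interpolate_population(census, y)
               for y in range(first_year, last_year + 1))

# Hypothetical cell (one tract, sex, and age band) counted in two censuses.
census = {1990: 1000.0, 2000: 1200.0}
print(interpolate_population(census, 1995))
print(person_years(census, 1990, 1999))
```

Any error in the interpolated populations carries straight into $py_i$ and therefore into the expected counts, which is one source of the extra-Poisson variation discussed in the next section.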

**J.3 OVERDISPERSION**

It is likely that observed counts $N_i$ will depart from the Poisson distribution in a way that must be adequately accommodated when fitting regression models such as (5). If a random variable $N_i$ is distributed according to the Poisson distribution with mean $\mu_i$, then the variance of $N_i$ is also equal to $\mu_i$. However, there are good reasons to expect that the actual variability of $N_i$ will be greater than that predicted by the Poisson distribution. For example, as mentioned above, for the out years at least, the population size and hence person-years will not be known exactly. Even more important, other known and unknown risk factors that influence disease occurrence are not accounted for by the variables used in the ecologic regression. Even if those risk factors are completely independent of distance or dose from a plant or facility, they will still increase the dispersion of $N_i$ while leaving the model for the mean unaffected. Ignoring overdispersion will lead to underestimation of the standard errors of the estimates of the regression parameters, including those of most interest (i.e., $\beta$).

The treatment of overdispersion in Poisson regression models has been considered by a number of authors (Liu and Pierce, 1993; McCullagh and Nelder, 1989; Moore, 1986). A simple and usually effective approach (McCullagh and Nelder, 1989) is to fit the means model using Poisson regression but then to estimate an overdispersion term $\sigma^2$, with $\sigma^2 \geq 1$, so that the variance of $N_i$ is estimated to be equal to $\sigma^2 \mu_i$. Inference about the significance of the parameters of interest (i.e., $\beta$) is performed after adjusting the usual standard error estimates (which assume the Poisson model). A method-of-moments approach for fitting this and similar models is described by Moore (1986). More generally, the "sandwich estimator" of Zeger and Liang (1986) can be used to compute variances of the parameter estimates that adequately reflect the variability of the counts.

The overall approach described above relates observed disease rates to distance or other dose surrogates in a systemic way, i.e., it addresses the question of whether or not disease risk appears to be associated with proximity to a nuclear facility, or with other dose surrogates, averaging over all the facilities. For some common cancers it will be possible to consider site-specific analyses, i.e., whether proximity to a specific facility or plant is associated with risk. Such analyses are subject to concerns about multiple comparisons (as described in the main text) but may also be particularly sensitive to the problem of overdispersion described above. If one uses an uncorrected test, i.e., a test based upon the assumption that the Poisson distribution holds exactly, then it is very likely that there will be some sites where, for some cancers, proximity is "significantly" associated with risk, but for which the inference differs greatly depending upon whether or not purely Poisson variation of counts is assumed. The estimation of overdispersion terms $\sigma^2 \geq 1$ (or some other treatment of overdispersion, as in a random effects analysis) is crucial in order to avoid overinterpretation of random fluctuations that are simply greater in magnitude (due to unmeasured characteristics affecting disease risk) than expected under the Poisson model. These problems appear in many different kinds of settings and have been described by a number of different authors (e.g., Efron, 1992). Modeling of both the mean (as in equation (5) of the appendix) and the variance of counts will be essential in ensuring that unrealistic inference from fitting these models is avoided; this is true both for the overall analysis of risk in relation to plant proximity and especially for site-specific analyses.
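The adjustment of standard errors for overdispersion can be sketched as follows. The sketch uses the familiar Pearson chi-square statistic divided by its residual degrees of freedom as the estimate of the overdispersion term, floored at 1, and inflates a Poisson-model standard error by its square root. This is one common moment-type estimate, shown under the assumption that fitted means are already available; it is not necessarily the exact estimator of any of the cited authors, and the counts, fitted means, and standard error below are invented.

```python
import math

def pearson_dispersion(counts, fitted_means, n_params: int) -> float:
    """Pearson chi-square over residual degrees of freedom, floored at 1."""
    chi_square = sum((n - mu) ** 2 / mu for n, mu in zip(counts, fitted_means))
    sigma2 = chi_square / (len(counts) - n_params)
    return max(1.0, sigma2)

def adjusted_standard_error(poisson_se: float, sigma2: float) -> float:
    """Inflate a standard error computed under the Poisson model by sqrt(sigma2)."""
    return poisson_se * math.sqrt(sigma2)

# Hypothetical observed counts and fitted means from a 2-parameter Poisson fit.
counts = [3, 9, 1, 12, 4, 0]
fitted = [5.0, 5.0, 4.0, 6.0, 5.0, 4.0]
sigma2 = pearson_dispersion(counts, fitted, n_params=2)

poisson_se = 0.10  # assumed standard error of the ERR estimate under pure Poisson variation
print(f"estimated overdispersion: {sigma2:.4f}")
print(f"adjusted standard error: {adjusted_standard_error(poisson_se, sigma2):.4f}")
```

With an estimated overdispersion well above 1, as in this invented example, a result that looks "significant" under the pure Poisson assumption may no longer be so after the adjustment, which is the concern raised for site-specific analyses.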


**REFERENCES**

Efron, B. (1992). Poisson overdispersion estimates based on the method of asymmetric maximum likelihood. *Journal of the American Statistical Association* 87.

Liu, Q., and D. A. Pierce (1993). Heterogeneity in Mantel-Haenszel-type models. *Biometrika* 80(3):543-556.

McCullagh, P., and J. Nelder (1989). *Generalized linear models*, 2nd edition. Boca Raton, FL: CRC Press.

Moore, D. F. (1986). Asymptotic properties of moment estimates for overdispersed counts and proportions. *Biometrika* 73(3):583-588.

Zeger, S., and K. Liang (1986). Longitudinal data analysis for discrete and continuous outcomes. *Biometrics* 42:121-130.
