Read "Understanding and Communicating Reliability of Crash Prediction Models" at NAP.edu



Suggested Citation: "Appendix A: The Development of Procedures for Quantifying the Reliability of Crash Prediction Model Estimates with a Focus on Mismatch Between CMFs and SPF Base Conditions." National Academies of Sciences, Engineering, and Medicine. 2021. Understanding and Communicating Reliability of Crash Prediction Models. Washington, DC: The National Academies Press. doi: 10.17226/26440.



Appendix A: The Development of Procedures for Quantifying the Reliability of Crash Prediction Model Estimates with a Focus on Mismatch Between CMFs and SPF Base Conditions

CONTENTS

BACKGROUND
  Objective and Scope
LITERATURE REVIEW
  Methods for Developing Base SPFs
  Variance of Predicted Value
MODEL DEVELOPMENT
  Case A. CMF from Part D used with SPF (CMF is consistent with SPF base conditions)
  Case B. CMFs Do Not Have a Corresponding Base Condition in the SPF
  Case C. CMF Not Used in CPM But Base Condition Accommodated in the SPF
EXPERIMENTAL DESIGN AND ANALYSIS RESULTS
  Site-Level Database
  Data Collection Process
  Evaluation Database
  Analysis Results
REFERENCES FOR APPENDIX A

Background

The CPMs in Part C of the Highway Safety Manual (HSM) are used to estimate the predicted average crash frequency of a site with specific geometric design elements and traffic control features. Each CPM has the following general form:

Equation 1: $N_p = C \times N_{SPF} \times CMF_1 \times \ldots \times CMF_n$

where
Np = predicted average crash frequency, crashes/yr;
C = local calibration factor;
NSPF = predicted crash frequency for site with base conditions, crashes/yr;
CMFi = HSM Part C crash modification factor for geometric design element or traffic control feature i (i = 1 to n); and
n = total number of HSM Part C CMFs.

Each CPM includes a safety performance function (SPF), one or more crash modification factors (CMFs), and a local calibration factor (C). The SPF is used to predict the crash frequency NSPF for a site having characteristics that match a specified set of "base conditions" that describe its design elements and control features (e.g., 12 ft lane width). The set of CMFs is used to adjust NSPF such that the CPM can provide reliable estimates of the predicted crash frequency Np for sites with a wide range of characteristics.

One base condition value is specified for each variable represented in the collective set of CMFs in the CPM. These variables are referred to herein as "base variables." The set of specified values is referred to as the "base condition values" for the SPF. Each Part C chapter lists the base variables and the base condition values associated with the CPMs in that chapter. The SPF is developed to include average annual daily traffic (AADT) volume and segment length as variables. Thus, AADT and segment length are not considered (or used to define) SPF base conditions.

The CPM can be used to evaluate any given site having known values for the geometric design elements and traffic control features associated with a base variable.
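As a rough illustration of Equation 1, the sketch below multiplies a base-condition SPF prediction by a set of CMFs and a calibration factor. The function name and all numeric inputs are hypothetical and chosen only for illustration; they do not come from the HSM.

```python
from math import prod


def predicted_crash_frequency(n_spf, cmfs, calibration=1.0):
    """Equation 1: Np = C * N_SPF * CMF_1 * ... * CMF_n."""
    # prod() of an empty list is 1, so a site that matches every base
    # condition (no CMF adjustments) simply returns C * N_SPF.
    return calibration * n_spf * prod(cmfs)


# Hypothetical site: N_SPF = 2.0 crashes/yr, two CMFs, C = 1.1.
n_p = predicted_crash_frequency(n_spf=2.0, cmfs=[1.05, 0.90], calibration=1.1)
print(round(n_p, 3))  # 2.079
```

A CMF of 1.0 leaves the prediction unchanged, which mirrors the statement that a site feature equal to its base condition value has a CMF of 1.0.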
When the site of interest has an element or feature whose value equals the base condition value, the corresponding CMF has a value of 1.0. When an element or feature has a value that is different from the base condition value, the corresponding CMF has a value that is different from 1.0.

Typically, there is one CMF for each base variable. If the variable is continuous (e.g., lane width), the associated CMF includes the variable. If the variable is discrete (e.g., add lighting), then the associated CMF is a constant (e.g., 0.90 for add lighting) and the base condition is inferred from the CMF description (e.g., base condition is "no lighting present").

Part C (Section C.7) of the HSM describes four methods for estimating the average crash frequency for a site. The HSM indicates that these methods provide different levels of predictive reliability. However, the HSM does not quantify the reduction in reliability associated with each of the four methods, so analysts do not have the information they need to make an informed choice among the methods. These methods are identified in the following list in order of predictive reliability, with the most reliable method listed first.

- Method 1: Apply the Part C CPM to evaluate the existing and proposed conditions.
- Method 2: Apply the Part C CPM to evaluate the existing condition. Use a Part D CMF with the Part C CPM to evaluate the proposed condition.
- Method 3: Apply a jurisdiction-specific SPF to evaluate the existing condition. Use a Part D CMF with this SPF to evaluate the proposed condition.
- Method 4: Use observed crash frequency to evaluate the existing condition. Use a Part D CMF with the observed crash frequency to evaluate the proposed condition.

Method 1 relates to the use of CPMs wherein there is a "balance" between the crash modification factors (CMFs) and the base conditions associated with the safety performance function (SPF). In this regard, a balanced CPM application occurs when the CMFs used collectively match all of the SPF's base conditions.

Method 2 presents the situation where there is a lack of balance between the CMFs used and the SPF. In this situation, the Part C SPF base conditions do not include those associated with the Part D CMF. For example, consider the case where the analyst wishes to evaluate the proposed addition of a flashing beacon at an intersection, a treatment for which there is a CMF in Part D. However, the Part C CPM used does not include a base condition relating to flashing beacon presence and, as a result, there is a small reduction in the reliability of the predicted crash frequency.

Method 3 can be implemented in one of two ways. In the first way, there is a lack of balance between the CMF obtained from Part D and the jurisdiction-specific SPF (similar to that described for Method 2). This variation of Method 3 also corresponds to Factor 2 of Table 3 in Chapter 3 of the main body of the report. In the second way, there is a balance between the CMF from Part D and the jurisdiction-specific SPF. This variation of Method 3 corresponds to Factor 1 of Table 3 in Chapter 3.

There is a variation of Methods 1 or 2 that can sometimes occur in application. For this variation, one or more CMFs that are part of the Part C CPM are not used. This situation may occur when the analyst is interested in using the CPM to evaluate a site but does not have ready access to the data needed for one or more of the other CMFs in the CPM. For example, consider an analyst using the Rural Two-Lane Two-Way Road CPM in HSM Chapter 10 to evaluate the safety benefit associated with various shoulder width alternatives for a 30-mile highway section.
The analyst does not have access to curve radius, curve length, and spiral presence data for the section, so he or she decides not to use the CMF for horizontal curvature in the CPM.

Method 4 is not based on the use of an SPF. Rather, the observed crash frequency for the site of interest is used to estimate the expected crash frequency. Thus, the issue of CMF-SPF balance does not apply. For this reason, the reliability of Method 4 is not addressed in this Appendix.

Objective and Scope

The previous section examined HSM Analysis Methods 1, 2, and 3. It also identified how the implementation of any one of these methods could degrade the reliability of the results. The findings from this discussion are summarized in Table 1 in the form of "application cases." The objectives of this Appendix are to (1) examine the three cases listed in Table 1 and quantify their influence on reliability, and (2) develop equations for quantifying this influence.

Table 1. Summary of applications associated with possible bias in predicted value.

| Case | Description | Associated HSM Method | CMF-SPF Balance |
|---|---|---|---|
| A | CMF from Part D used with jurisdiction-specific SPF (CMF is consistent with SPF base conditions) | 3 | Yes |
| B | One or more CMFs used in the CPM do not have a corresponding base condition in the SPF | 2, 3 | No |
| C | One or more CMFs are not used in the CPM yet the corresponding base condition exists in the SPF | 1, 2 | No |

Literature Review

The reliability of the prediction from a CPM can be described in terms of bias, variance, and repeatability. Bias represents the difference between the CPM estimate and the true value. Variance describes the extent of uncertainty in the estimate due to unexplained or random influences. Repeatability describes the extent to which multiple analysts using the same CPM with the same training, data sources, and site of interest obtain the same results (as measured by the number of significant figures showing agreement among results). A more reliable estimate has little bias, a smaller variance, and is likely to have results that show several significant figures in agreement (should there be repeated independent applications).

The effect of bias and variance can be mathematically combined to compute the mean square error of an estimate (iTrans 2006). The equation for this calculation is provided below.

Equation 2: $\text{Mean Square Error} = \text{Variance} + \text{Bias}^2$

This equation indicates that a CPM estimate that has a small mean square error has a small variance and a small bias. In other words, a small mean square error implies that the estimate is more reliable because it is both precise and accurate.

Methods for Developing Base SPFs

The HSM Appendix to Part C describes two methods for developing SPFs for use with the CPMs described in Part C. One method is based on the use of a "base-condition database." This database includes only the AADT and segment length for sites whose geometric design elements and traffic control features match the base condition values for the specified base variables. The SPF coefficients are estimated using regression analysis with the base-condition database. This SPF is referred to herein as a "base-condition-database SPF."

The second method is based on the use of a "multiple-variable database." This database includes the AADT and segment length as well as all base variables.
Initially, a regression model is developed to include all significant database variables. Next, the SPF is made applicable to the base conditions by (1) substituting values in the regression model variables that correspond to the base condition values and (2) reducing the model to include only AADT and segment length. This SPF is referred to herein as a "multiple-variable-database SPF."

Variance of Predicted Value

Equations are available in the literature for estimating the variance of the predicted estimate from a CPM (Wood, 2005; Lord, 2008). Table 2 lists the equations used with a CPM that has a log-linear model form and that is derived using a negative-binomial distribution for the observed crash data.

The equations associated with Condition 1 are based on the assumption that there is no variation in the coefficient estimates (i.e., the coefficients in the SPF and the CMFs). The variance in the predicted average crash frequency is attributed to the random variation in the mean crash frequency among sites. The overdispersion parameter k is used to describe this variation. It will typically decrease in value as additional base condition variables are added to the CPM (and these variables are found to be statistically significant) (Miaou, 1996). As indicated by the equations, a decrease in this parameter will decrease the uncertainty of the predicted mean crash frequency.
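The Condition 1 behavior described above can be sketched numerically. The sketch assumes the standard negative binomial (Poisson-gamma) relationships, in which the variance of the site mean is k N^2 and the variance of a new observation adds the Poisson term N. These forms, the function name, and the input values are assumptions for illustration and should be checked against Table 2 in the authoritative report.

```python
def variance_condition_1(n_p, k):
    """Return (variance of the mean, variance of a new observation).

    Assumes the standard negative binomial forms (an assumption, not
    quoted from Table 2):
      variance of the mean        = k * N^2
      variance of new observation = N + k * N^2
    """
    var_mean = k * n_p ** 2
    var_new_obs = n_p + var_mean
    return var_mean, var_new_obs


# Hypothetical prediction of 2.0 crashes/yr with two overdispersion values.
for k in (0.67, 0.42):
    var_mean, var_new = variance_condition_1(2.0, k)
    print(k, round(var_mean, 2), round(var_new, 2))
```

Under these assumed forms, the smaller overdispersion parameter gives the smaller variance, which matches the trend stated in the text.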

Table 2. Variance estimates for various assumed conditions.

| Assumed Condition | Variance of the Predicted Average Crash Frequency | Variance of the Prediction for a New Observation |
|---|---|---|
| 1. No variation in coefficient estimates. | Vm,1: [equation not legible in the extracted text] | Vy,1: [equation not legible in the extracted text] |
| 2. Variation in coefficient estimates and correlation among independent variables considered. (Wood, 2005) | Vm,2: [equation not legible in the extracted text] | Vy,2: [equation not legible in the extracted text] |
| 3. Variation in coefficient estimates considered and it occurs independently (CMFs shown are "external" to CPM). (Lord, 2008) | Vm,3: [equation not legible in the extracted text] | Vy,3: [equation not legible in the extracted text] |

where:
Vm,i = variance of the predicted mean crash frequency based on condition i (i = 1, 2, 3);
Vy,i = variance of the prediction for a new observation based on condition i (i = 1, 2, 3);
Np,f = predicted average crash frequency (full model), crashes/yr;
kf = overdispersion parameter associated with the CPM producing Np,f;
$\sigma_f^2$ = variance of the linear terms (β X) in the log-linear CPM producing Np,f;
Np,r = predicted average crash frequency (reduced model and external CMFs), crashes/yr;
kr = overdispersion parameter associated with the CPM producing Np,r;
$\sigma_r^2$ = variance of the linear terms (β X) in the reduced CPM producing Np,r;
No = observed crash frequency, crashes/yr;
I = variance-covariance matrix of the estimated coefficients;
MSE = mean square error of residuals;
m = number of observations;
X = array of independent variables (see note); and
Xh = array of independent variable values for the site for which the prediction variance is sought.

Note: The relationships by Wood (2005) are based on the use of a log-linear model. Thus, the elements of the array of independent variables X correspond to their representation in a log-linear model.
For example, consider a CPM with the form Np = exp[b0 + b1 Ln(AADT) + b2 (Wl - Wb) + b3 (Ilighting - Ibase)], where Wl = lane width; Wb = base lane width; Ilighting = 1 if lighting present, 0 otherwise; and Ibase = proportion of sites with lighting. The X array for this model is: [ 1.0, Ln(AADT), (Wl - Wb), (Ilighting - Ibase) ].

The equations associated with Condition 2 are based on the likelihood that there is some uncertainty in the coefficient estimates, which translates into some additional uncertainty in the predicted mean crash frequency (Maher and Summersgill, 1996). To quantify this added uncertainty, the equations include the variance-covariance matrix of the estimated coefficients. This matrix captures any tendency for the values of a pair of variables to be related, such that they do not vary independently. In general, an increase in the correlation among independent variables will increase the uncertainty of the predicted mean crash frequency.

As with Condition 1, the overdispersion parameter k will typically decrease as additional base condition variables are added to the CPM (and these variables are found to be statistically significant). As noted previously, a decrease in this parameter will decrease the uncertainty of the predicted mean crash frequency.
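To make the role of the X array and the variance-covariance matrix more concrete, the sketch below builds the X array from the note's example model and evaluates the variance of the linear terms as the quadratic form Xh (covariance matrix) Xh^T, which is the general expression for the variance of a linear combination of estimated coefficients. The function names, the covariance values, and the site values are hypothetical; this is not a reproduction of the Table 2 equations.

```python
from math import log


def x_array(aadt, lane_width, base_lane_width, lighting, base_lighting):
    """X array for the example log-linear model in the note."""
    return [1.0, log(aadt), lane_width - base_lane_width, lighting - base_lighting]


def linear_predictor_variance(x_h, cov):
    """Quadratic form x_h * cov * x_h^T (variance of the linear terms)."""
    n = len(x_h)
    return sum(x_h[i] * cov[i][j] * x_h[j] for i in range(n) for j in range(n))


# Hypothetical site and a hypothetical (diagonal) covariance matrix.
x_h = x_array(aadt=5000, lane_width=11.0, base_lane_width=12.0,
              lighting=1, base_lighting=0.25)
cov = [
    [0.0400, 0.0, 0.0, 0.0],
    [0.0, 0.0009, 0.0, 0.0],
    [0.0, 0.0, 0.0025, 0.0],
    [0.0, 0.0, 0.0, 0.0100],
]
print(round(linear_predictor_variance(x_h, cov), 4))
```

Non-zero off-diagonal terms in the covariance matrix would add cross-product contributions to this variance, which is how correlation among the independent variables enters the calculation.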

However, there is also a practical reality that the correlation among independent variables will increase as more variables are added to the CPM. As a result, there may be an "optimum" number of variables that yields the smallest uncertainty in the predicted mean crash frequency. By inspection of the equations in Table 2, it can be seen that the variance obtained using the Condition 2 equations will always exceed that obtained from the Condition 1 equations.

The equations associated with Condition 3 are based on the assumptions that there is some uncertainty in the coefficient estimates and that this uncertainty varies independently among coefficients. In other words, this condition is based on the assumption that there is no correlation among the independent variables in the CMFs (or between these variables and those in the SPF). These assumptions result in the equations predicting a relatively large variance for the predicted mean crash frequency. For a given model estimation database, the predicted variance will equal or exceed that obtained from the equations for Condition 2.

Using the equations for Conditions 2 or 3 will require estimation of the variance-covariance matrix of the coefficients (in addition to the predicted value Np and the overdispersion parameter k). The variance-covariance matrix of the coefficients can be obtained from a statistical software package capable of conducting a regression analysis using the CPM and a multiple-variable database. The variance-covariance matrix is an output from the software. Alternatively, the variance-covariance matrix can be computed directly using a spreadsheet.

Comparison of Variances

Lord et al. (2010) computed the variance of the predicted mean for two types of CPMs. One CPM was developed using a multiple-variable database and regression analysis. They referred to this CPM as a "full model" because regression coefficients were included for all variables in the database.
Specifically, the following variables were included in the CPM: AADT, lane width, shoulder width, and horizontal curve density. They did not convert the model into a CPM form (i.e., a multiple-variable-database SPF and inferred CMFs for lane width, shoulder width, and curve density); however, this conversion could have been undertaken without having any effect on their findings or conclusions. They used the equations associated with Condition 2 to estimate the variance of the predicted mean crash frequency for the full model.

Lord et al. (2010) then used the same database to develop a base-condition-database SPF. They referred to this SPF as a "baseline model" because the database included only sites having geometric design elements that matched the base condition values. They coupled this SPF with CMFs for lane width, shoulder width, and curve density. These CMFs were obtained from a research report; they were not derived from the database used to develop the aforementioned full-model CPM. They used the equations associated with Condition 3 to estimate the variance of the predicted mean crash frequency for the baseline model.

Lord et al. (2010) compared the estimated variance associated with the two models. They found that the variance for the "full model" CPM was much less than that for the "baseline model SPF-with-CMFs." This finding is consistent with the observations provided in the previous section.

Overdispersion Parameter

As indicated in Table 2, the overdispersion parameter has an important influence on the magnitude of the prediction variance. The overdispersion parameter for a given CPM is established when the CPM coefficients are estimated using regression analysis. This parameter will typically get smaller as the model fit to the data improves (Miaou, 1996). One means of improving the fit is to add an independent variable to the model. Evidence of the influence of additional model variables on the overdispersion parameter is shown in Table 3.
Overdispersion parameters for 16 CPMs are listed in the last column. The parameters for any one combination of source, legs, control, and severity are obtained from a common database.

Table 3. Overdispersion parameter for several crash prediction models.

| Source | Number of Legs | Traffic Control | Crash Severity | Number of Added Variables (1) | Overdispersion Parameter |
|---|---|---|---|---|---|
| Lord et al. (2008) | Four | Two-way stop | All severities | 0 | 0.6762 |
| | | | | 1 | 0.6441 |
| | | | | 2 | 0.6205 |
| | | | | 3 | 0.6238 |
| | Three | All-way stop | All severities | 0 | 0.6659 |
| | | | | 1 | 0.6278 |
| | | | | 2 | 0.6023 |
| Vogt (1999) | Three | Two-way stop | Injury | 0 | 0.5649 |
| | | | | 1 | 0.3787 |
| | | | | 3 | 0.2588 |
| | Four | Two-way stop | All severities | 0 | 0.6144 |
| | | | | 2 | 0.4820 |
| | | | | 3 | 0.4183 |
| | | | | 4 | 0.3682 |

Note: (1) Number of independent variables in CPM, but excluding those in the SPF (i.e., intercept, AADT).

Lord et al. (2008) estimated four models using a database containing 267 intersections. The overdispersion parameters for these four models are listed in the first four rows. One model included only an intercept, a coefficient for major road AADT, and a coefficient for minor road AADT. It did not include coefficients for any other variables. The overdispersion parameter for this model is listed in the first row of Table 3. The model associated with the second row of the table included one variable for one geometric design element. A comparison of the overdispersion parameter in these two rows indicates that the additional variable (beyond those in the SPF) reduced the overdispersion parameter from 0.6762 to 0.6441. A similar pattern of "reduction in parameter value with an increase in variables" is shown in the table for the other models associated with each database.

Both Lord et al. (2008) and Vogt (1999) developed many models in addition to those listed in Table 3. The overdispersion parameters show a consistent pattern of reduction with an increase in model variables.
The trend was found to have the following empirical relationship:

Equation 3: [equation not legible in the extracted text; it gives kp as a function of k0, p, b0, and Δ0]

with

Equation 4: [equation not legible in the extracted text; it defines Δ0 using the constant 0.10 and b0]

where
kp = overdispersion parameter for model with p variables (excluding those in the SPF);
k0 = overdispersion parameter for the model with 0 additional variables;
p = number of variables in the model (excluding those in the SPF for intercept and AADT);
Δ0 = adjustment factor for b0; and
b0 = base reduction factor for AADT-only model (= 0.10 for Vogt; 0.030 for Lord et al.).

Figure 1a shows the fit of Equation 3 to the data for 26 models (representing nine databases) that were collectively developed by Lord et al. (2008) and Vogt (1999). Each data point corresponds to one model. The form of Equation 3 indicates that each additional variable that is added to a model typically reduces the overdispersion parameter. However, the amount of reduction declines with each added variable.

[Figure 1. Predicted overdispersion parameter as a function of number of variables. Panel a: Model fit to data. Panel b: Predicted parameter value.]

Equation 3 was used to compute the predicted parameter value for a hypothetical database with k0 = 0.67. The results are shown in Figure 1b. The trend lines show that the overdispersion parameter decreases in value as additional variables are added to the model. This finding is consistent with that of Miaou (1996). A steep decrease is likely a reflection of the addition of a variable that explains a large amount of the variability in the dependent variable. The rate of decrease is shown to become smaller with each additional variable, and the trend line flattens out (i.e., becomes horizontal) when there are about five variables added. This gradual flattening is likely the result of correlation among the independent variables included in the database.

This trend is not intended to suggest that it is impossible to have CPMs with more than five meaningful variables. Rather, it suggests that further reduction in the trend lines will likely require either (1) the introduction of additional independent variables to the database that are not correlated with those in the model or (2) the restructuring of the database (i.e., strategically removing some existing sites, adding new sites) so that the existing correlation among independent variables is minimized.

Model Development

This section describes the development of models for predicting the reliability associated with each of the three cases listed in Table 1. For this purpose, reliability is described by both the amount of bias and the increased uncertainty in the predicted crash frequency, relative to that found in a CPM for which all CMFs (associated with the CPM) are used in a "Method 1" application. When used in this manner, the CPM's predicted crash frequency for a Method 1 application is assumed to have no bias.
The insights obtained from the model development described in this section were used to develop an experimental design that supported the refinement and validation of the models. The experimental plan is described in the next main section.

Case A. CMF from Part D Used with SPF (CMF is consistent with SPF base conditions)

For this CPM application, a CMF from HSM Part D is used with a jurisdiction-specific SPF. The CMF of interest could also be obtained from another source (e.g., FHWA CMF Clearinghouse); however, this CMF will be referred to as the "HSM Part D CMF" for consistency with HSM Method 3. For this application, there is a "balance" between the CMF and the base conditions associated with the SPF. An example of this application is when an agency develops a jurisdiction-specific SPF for two-lane

highway segments with 12-ft lanes. It then uses the CMF from Part D for "modify lane width" with this SPF to estimate the predicted crash frequency at sites with 11-ft lanes.

Bias in the Predicted Value

Case A is described by the use of a jurisdiction-specific SPF and a Part D CMF. They are used in the following equation to estimate the predicted crash frequency for a site. The SPF has base conditions that match those of the Part D CMF.

Equation 5: $N_A = N_{SPF,j\text{-}s} \times CMF_D$

where
NA = predicted average crash frequency from a Case-A application, crashes/yr;
NSPF,j-s = predicted crash frequency for site with base conditions that are in balance with the CMF (jurisdiction-specific); and
CMFD = HSM Part D crash modification factor for geometric design element or traffic control feature of interest.

To facilitate the calculation of bias in NA, Equation 5 is converted into the following model form:

Equation 6: $N_A = N_{SPF,j\text{-}s} \times e^{\,b\,(X - X_{base})}$

with

Equation 7: $b = \dfrac{\ln(CMF_D)}{X - X_{base}}$

where
b = estimation coefficient;
X = independent variable associated with CMF of interest; and
Xbase = independent variable associated with the base condition for the CMF of interest.

To illustrate Equation 7, consider the case where the CMF for 11-ft lanes is 1.05, relative to a base lane width of 12 ft. This equation indicates that b can be estimated as -0.0488 for this case [= Ln(1.05) / (11 - 12)]. If the CMF is 0.9 for a discrete treatment (e.g., "add beacon"), then b is estimated as -0.105, where X = 1.0 and Xbase = 0.0.

The method of statistical differentials can be used to determine the unbiased estimate of predicted crash frequency based on Equation 6. Details of this method can be found in Benjamin and Cornell (1970, p. 180).
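The two illustrations of Equation 7 can be checked with a short calculation. The sketch below takes Equation 7 as b = Ln(CMF) / (X - Xbase); the function name is hypothetical. With that form, the lane-width case yields a negative coefficient because X - Xbase = 11 - 12 = -1.

```python
from math import exp, log


def estimation_coefficient(cmf, x, x_base):
    """Equation 7: b = Ln(CMF) / (X - X_base)."""
    return log(cmf) / (x - x_base)


# Lane width: CMF = 1.05 for 11-ft lanes relative to a 12-ft base.
b_lane = estimation_coefficient(1.05, x=11.0, x_base=12.0)
# Discrete treatment ("add beacon"): CMF = 0.9, X = 1.0, X_base = 0.0.
b_beacon = estimation_coefficient(0.90, x=1.0, x_base=0.0)

print(round(b_lane, 4))    # -0.0488
print(round(b_beacon, 4))  # -0.1054

# Substituting b back into the exponential term of Equation 6 recovers the CMF.
print(round(exp(b_lane * (11.0 - 12.0)), 2))  # 1.05
```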
Application of this method indicates that the unbiased predicted crash frequency is estimated using the following equation:

Equation 8: $N_p = N_A / f_A$

with

Equation 9: $f_A = 1 + 0.5\, b^2 \left(\sigma_{X,s}^2 - \sigma_{X,b}^2\right)$

where
Np = predicted average crash frequency, crashes/yr;
fA = bias adjustment factor for Case A;
$\sigma_{X,s}^2$ = variance of the independent variable at sites of interest; and

$\sigma_{X,b}^2$ = variance of the independent variable at sites used to establish the base condition (i.e., those sites used to estimate the jurisdiction-specific SPF).

The database used to estimate $\sigma_{X,s}^2$ is the database representing the sites of interest. The database used to estimate $\sigma_{X,b}^2$ is the database used to develop the jurisdiction-specific SPF. This value may be difficult to quantify if the SPF was developed in a previous time period such that the database is not available or it does not contain the variable of interest X. In this situation, it may be possible to identify the regions from which the data were obtained to develop the SPF and obtain values of the variable X at a representative set of sites. The value $\sigma_{X,b}^2$ would then be computed using this representative set of sites.

When the adjustment factor is computed for discrete treatments (e.g., "add beacon"), an indicator variable is used to indicate treatment presence (where 1 corresponds to treatment present, and 0 corresponds to treatment not present). The variance of this independent variable is then computed using the 1, 0 values for the collective set of study sites.

The bias that is introduced by a Case A application can be computed using the following equation for a representative set of sites.

Equation 10: $Bias = 100 \times \dfrac{\sum N_A - \sum N_p}{\sum N_p} = 100 \times (f_A - 1)$

This bias can be removed if Equation 8 is used for a Case A application (i.e., the result from Equation 5 is adjusted using Equation 9, as shown in Equation 8).

Values of the predicted bias are listed in Table 4 for typical values of standard deviation and estimation coefficient b. A standard deviation of less than 1.0 is often found for lane width and discrete treatments (e.g., "add beacon"). A standard deviation of 1.0 to 2.0 is often found for shoulder width, and a standard deviation of 2.0 or more is often found for median width. A non-zero bias value indicates conditions where the estimate from Equation 5 is biased (as a result of a Case A application).
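The Case A bias calculation can be sketched as follows, taking Equation 9 as fA = 1 + 0.5 b^2 (variance at sites of interest minus variance at sites establishing the base condition) and Equation 10 as Bias = 100 (fA - 1). The function names and the 3.0 crashes/yr input are hypothetical; the two bias cases printed correspond to entries listed in Table 4 (8.0 and -7.5 percent).

```python
def bias_factor_case_a(b, sd_sites, sd_base):
    """Equation 9: f_A = 1 + 0.5 * b^2 * (sd_sites^2 - sd_base^2)."""
    return 1.0 + 0.5 * b ** 2 * (sd_sites ** 2 - sd_base ** 2)


def percent_bias(f):
    """Equation 10: Bias = 100 * (f_A - 1), in percent."""
    return 100.0 * (f - 1.0)


def unbiased_prediction(n_a, f_a):
    """Equation 8: N_p = N_A / f_A."""
    return n_a / f_a


# b = -0.2; standard deviation of 2 at sites of interest, 0 at base sites.
print(round(percent_bias(bias_factor_case_a(-0.2, 2.0, 0.0)), 1))  # 8.0
# b = -0.2; standard deviation of 0.5 at sites of interest, 2 at base sites.
print(round(percent_bias(bias_factor_case_a(-0.2, 0.5, 2.0)), 1))  # -7.5

# Hypothetical Case A estimate of 3.0 crashes/yr corrected with f_A = 1.08.
print(round(unbiased_prediction(3.0, bias_factor_case_a(-0.2, 2.0, 0.0)), 3))
```

When the two standard deviations are equal, the factor is 1.0 and no correction is needed.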
The bias percentages in Table 4 tend to move further away from 0.0 with an increase in the absolute value of the estimation coefficient b. Thus, CMF values that are significantly different from 1.0 are more likely to be associated with a larger bias. The bias tends to be larger when the standard deviation of the independent variable at the sites used to estimate the SPF ($\sigma_{X,b}$) is 0.0. Notably, a standard deviation of 0.0 is typically found for base-condition-database SPFs, which implies that this method for developing SPFs will produce SPFs with larger bias (if uncorrected) than those produced using a multiple-variable database.

Case B. CMFs Do Not Have a Corresponding Base Condition in the SPF

For this CPM application, one or more CMFs used with a CPM do not have a corresponding base condition in the SPF. These CMFs are called herein "external CMFs" because their variables were not considered when the SPF was developed and its base conditions were established. An example of this application is when a Part D CMF is used with a Part C CPM (and the Part D CMF's variables are not included in the CPM's base conditions). The likely source of the CPM is HSM Part C; however, this procedure is sufficiently general that it can be applied to CPMs from other sources (e.g., a CPM developed for a specific jurisdiction).

Table 4. Predicted bias for Case A.

Percentage bias by estimation coefficient b (illustrative CMF value in parentheses), percent.

| Std. Dev. of Independent Variable at Sites of Interest, $\sigma_{X,s}$ | Std. Dev. of Independent Variable at Sites Establishing Base Condition, $\sigma_{X,b}$ | b = -0.05 (0.95) | b = -0.1 (0.90) | b = -0.15 (0.86) | b = -0.2 (0.82) |
|---|---|---|---|---|---|
| 0.5 | 0 | 0.0 | 0.1 | 0.3 | 0.5 |
| 0.5 | 0.5 | 0.0 | 0.0 | 0.0 | 0.0 |
| 0.5 | 1 | -0.1 | -0.4 | -0.8 | -1.5 |
| 0.5 | 2 | -0.5 | -1.9 | -4.2 | -7.5 |
| 1 | 0 | 0.1 | 0.5 | 1.1 | 2.0 |
| 1 | 0.5 | 0.1 | 0.4 | 0.8 | 1.5 |
| 1 | 1 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 2 | -0.4 | -1.5 | -3.4 | -6.0 |
| 2 | 0 | 0.5 | 2.0 | 4.5 | 8.0 |
| 2 | 0.5 | 0.5 | 1.9 | 4.2 | 7.5 |
| 2 | 1 | 0.4 | 1.5 | 3.4 | 6.0 |
| 2 | 2 | 0.0 | 0.0 | 0.0 | 0.0 |

Note: A positive percentage bias indicates that the estimate from Equation 5 is higher than the true value by the percentage indicated.

Bias in the Predicted Value

Case B is described in this section by the use of a CPM with one external CMF. The CPM and the external CMF are used in the following equation to estimate the predicted crash frequency for a site whose characteristics (with one exception) are accounted for using the CMFs associated with the CPM. The exception characteristic is accounted for by the external CMF.

Equation 11: $N_B = C \times N_{SPF} \times CMF_1 \times \ldots \times CMF_n \times CMF_{ex}$

where
NB = predicted average crash frequency from a Case-B application, crashes/yr; and
CMFex = external crash modification factor (i.e., not associated with the SPF's base conditions).

There are two potential sources of bias in the predicted value. The first source is due to the variability of the independent variable associated with the external CMF. This source is described in the next subsection. The second source of bias is due to the possible differences between the average value of the CMF's independent variable in the database used to calibrate the SPF and the average value of this variable at the sites of interest. This source of bias is described in the second subsection.

Bias due to the Variability of the Independent Variable

The bias due to the variability of the independent variable associated with the external CMF is computed using the same methods as used for Case A.
Specifically, the method of statistical differentials was used to determine the unbiased estimate of predicted crash frequency. Its application indicates that the unbiased predicted crash frequency is estimated using the following equation:

Equation 12: $N_p = N_B / f_B$

with

Equation 13: $f_B = 1 + 0.5\, b^2 \sigma_{X,s}^2$

Equation 14: $b = \ln(CMF) / (X - X_{base})$

where $N_p$ = predicted average crash frequency, crashes/yr; $f_B$ = bias adjustment factor for Case B; $\sigma_{X,s}^2$ = variance of the independent variable at sites of interest; b = estimation coefficient; X = independent variable associated with CMF; and $X_{base}$ = independent variable associated with the base condition for the CMF of interest.

Additional discussion on the use and meaning of Equation 14 is provided in the previous section associated with Case A.

Bias due to the Difference Between the Independent Variable's Average Value in Two Databases

This section discusses the bias that is due to the possible differences between the average value of the CMF's independent variable in the database used to calibrate the SPF and the average value of this variable at the sites of interest. The derivation of the equation for estimating this bias is based on the following equation, as applied to a representative set of sites.

Equation 15: $Bias = 100 \times \left(\sum N_B - \sum N_p\right) / \sum N_p$

The CPM that predicts $N_B$ has been fit to a database without including the variables associated with the external CMF. As a result, it provides an unbiased estimate of the predicted crash frequency for the set of sites in the database when the external CMF is not used. This estimate is inferred to represent the average condition for each variable in the external CMF. Thus, the use of the external CMF with this CPM will introduce a bias whenever the sites of interest collectively have a non-average value for one or more of the external CMF variables.

For example, consider a CPM that was calibrated using a database that included a mixture of sites with and without horizontal curves, but the CPM does not include a CMF for horizontal curvature (where curvature = 1/radius). The predicted average crash frequency from this CPM will represent a site having a radius equal to the average radius of the sites in the database.
If an analyst obtains a CMF for curvature from an external source and uses it with the CPM at a specific site, the estimate $N_B$ will be larger than the true value if the radius is smaller than the average radius and vice versa if the radius is larger than average.

Equation 15 reduces to the following equation for estimating the average percent bias when an external CMF is used with a CPM:

Equation 16: $Bias_B = 100 \times \left[ \dfrac{\sum_i w_i\, CMF_{ex}(X_i) / \sum_i w_i}{\sum_j w_j\, CMF_{ex}(X_{j,CPM}) / \sum_j w_j} - 1 \right]$

where $CMF_{ex}(X_i)$ = external CMF value associated with variable $X_i$ for site i in the database of sites of interest; $CMF_{ex}(X_{j,CPM})$ = external CMF value associated with the variable $X_j$ for site j in the database used to develop the CPM; $w_i$ = weight associated with site i; and $w_j$ = weight associated with site j.

The average percent bias for a group of sites of interest is obtained by applying the CMF to each site and computing a weighted average of the values, where the weight used is the predicted average crash frequency based on the SPF. However, experience indicates that a reasonable estimate is obtained by using a weight equal to site exposure (i.e., entering volume for intersections and vehicle-miles for segments).

If the external CMF is a function of variable X, an approximate estimate of the bias can be obtained using the average value of the independent variable X when estimating the CMF value. This approximation is shown using the following equation.

Equation 17: $Bias = 100 \times \left[ \dfrac{CMF_{ex}(\bar{X})}{CMF_{ex}(\bar{X}_{CPM})} - 1 \right]$

where $\bar{X}$ = average value of $X_i$ for all sites of interest; and $\bar{X}_{CPM}$ = average value of $X_i$ for all sites used to develop the CPM.

The database used to estimate $\bar{X}$ is that database representing the sites of interest. The database used to estimate $\bar{X}_{CPM}$ is that database used to develop the CPM. This latter value may be difficult to quantify if the CPM was developed in a previous time period such that the database is not available or it does not contain the variable of interest X. In this situation, it may be possible to identify the regions from which the data were obtained to develop the CPM and obtain values of the variable $X_i$ at a representative set of sites. The value $\bar{X}_{CPM}$ would then be computed using this representative set of sites.

The use of Equation 17 can be demonstrated by an example. Consider the situation where the analyst is investigating the effect of reducing lane width on a roadway network. The analyst is planning to use a CPM that does not include lane width as a base variable. However, he or she has obtained an external CMF for lane width. This CMF has the following form.

Equation 18: $CMF_{LW} = \exp\left(-0.03\,(W_L - 12)\right)$

where $CMF_{LW}$ = CMF for lane width; and $W_L$ = lane width, ft.
The average lane width at the sites in the road network is 10.9 ft. Examination of the original database used to develop the CPM indicates that the sites therein have an average lane width of 11.9 ft. The average percent bias in the estimates obtained from the CPM is computed by combining Equation 17 and Equation 18 as follows:

Equation 19: $Bias = 100 \times \left[ \dfrac{\exp(-0.03\,(10.9 - 12))}{\exp(-0.03\,(11.9 - 12))} - 1 \right] = 3.05\%$

The sign associated with the bias estimate is positive. It indicates that the predicted average crash frequency obtained from the CPM (when used with the external CMF and applied to the subject road network) will be 3.05 percent high.

As a second example, consider the situation where the analyst is investigating the effect of adding lighting to a roadway network. The analyst is planning to use a CPM that does not include "lighting present" as a base variable. However, he or she has obtained an external CMF for "add lighting." This CMF has a constant value of 0.90.

None of the sites in the roadway network of interest has lighting. Examination of the original database used to develop the CPM indicates that one-half of the sites therein have lighting and the other one-half have no lighting. The average percent bias in the estimates obtained from the CPM is computed using Equation 16. For simplicity, the weight used in the calculation is the same for each site. The calculation of average bias is shown in the following equation.

Equation 20: $Bias = 100 \times \left[ \dfrac{1.00}{0.5 \times 0.90 + 0.5 \times 1.0} - 1 \right] = 5.3\%$

The sign associated with the bias estimate is positive. It indicates that the predicted average crash frequency obtained from the CPM (when used with the external CMF and applied to the subject road network) will be 5.3 percent high. It should be noted that if all of the sites in the original database had lighting present, the bias would be 11.1%.

Combined Sources of Bias

The two sources of bias can be combined using the following equation.

Equation 21: $Bias = 100 \times \left[ f_B \times \dfrac{CMF_{ex}(\bar{X})}{CMF_{ex}(\bar{X}_{CPM})} - 1 \right]$

If the external CMF has an exponential form (or if it can be converted to an equivalent exponential form) then the bias can be computed using the following equation.

Equation 22: $Bias = 100 \times \left[ f_B \times \exp\left(b\,(\bar{X} - \bar{X}_{CPM})\right) - 1 \right]$

If the treatment is discrete (e.g., "add beacon") then the associated CMF value can be converted to an approximately equivalent exponential form using Equation 14.

This bias can be removed if the following equation is used for a Case B application (i.e., the result from Equation 11 is divided by the adjustment factors associated with the two sources of bias).

Equation 23: $N_p = N_B / \left[ f_B \times \exp\left(b\,(\bar{X} - \bar{X}_{CPM})\right) \right]$

To illustrate the use of Equation 22, consider the previous example roadway network for which lighting is of interest. The original database used to develop the CPM was examined and a variable for lighting presence was added ($X_{CPM}$). A "1" was recorded for this variable if the site had lighting and a "0" was recorded if lighting was not present.
The average value of this variable $\bar{X}_{CPM}$ was computed as 0.5. The CMF obtained from the literature for "add lighting" is 0.90, which implies that the base condition is "no lighting present." This information was used in Equation 14 to estimate b as -0.105 (= Ln[0.90]/[1 - 0]). None of the sites of interest has lighting, so the lighting-present variable equals 0 for each site. The average value of this variable $\bar{X}$ is 0.0 and its standard deviation $\sigma_{X,s}$ is also 0. The bias adjustment factor $f_B$ in Equation 13 is computed as 1.0 (= 1 + 0.5[(-0.105)^2 × 0.0^2]). With these values, Equation 22 is used to compute the average bias as 5.4 percent. This value is very similar to the 5.3 percent obtained using the CMF directly in Equation 20.

Values of the predicted bias are listed in Table 5 for typical values of standard deviation and estimation coefficient b. A standard deviation of less than 1.0 is often found for lane width and discrete treatments (e.g., "add beacon"). A standard deviation of 1.0 to 2.0 is often found for shoulder width and a standard deviation of 2.0 or more is often found for median width. A non-zero bias value indicates conditions where the estimate from Equation 11 is biased (as a result of a Case B application).
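The Case B worked examples above can be checked numerically. The following is a minimal sketch, assuming Equations 13, 14, 16, 17, and 22 read as described in the surrounding text (the function and variable names are illustrative and are not taken from the report). It recomputes the lane-width example and both versions of the lighting example.

```python
import math


def estimation_coefficient(cmf, x, x_base):
    """Equation 14: b = Ln(CMF) / (X - X_base)."""
    return math.log(cmf) / (x - x_base)


def bias_adjustment_factor(b, sigma_x):
    """Equation 13: f_B = 1 + 0.5 * b^2 * sigma^2 (sigma at the sites of interest)."""
    return 1.0 + 0.5 * b ** 2 * sigma_x ** 2


def weighted_mean(values, weights):
    """Weighted average of per-site CMF values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def bias_from_site_cmfs(cmf_interest, w_interest, cmf_cpm, w_cpm):
    """Equation 16: percent bias from per-site external CMF values in both databases."""
    ratio = weighted_mean(cmf_interest, w_interest) / weighted_mean(cmf_cpm, w_cpm)
    return 100.0 * (ratio - 1.0)


def bias_exponential(b, sigma_x, x_bar, x_bar_cpm):
    """Equation 22: combined percent bias for an exponential-form external CMF."""
    f_b = bias_adjustment_factor(b, sigma_x)
    return 100.0 * (f_b * math.exp(b * (x_bar - x_bar_cpm)) - 1.0)


def cmf_lane_width(w_l):
    """Equation 18: lane-width CMF with a 12-ft base condition."""
    return math.exp(-0.03 * (w_l - 12.0))


# Lane-width example (Equation 17 with Equation 18): averages of 10.9 ft and 11.9 ft.
lane_width_bias = 100.0 * (cmf_lane_width(10.9) / cmf_lane_width(11.9) - 1.0)

# Lighting example (Equation 16): no lighting at the sites of interest (CMF = 1.0),
# half of the CPM database sites with lighting (CMF = 0.90), equal weights.
lighting_bias = bias_from_site_cmfs([1.0, 1.0], [1.0, 1.0], [0.90, 1.0], [1.0, 1.0])

# Lighting example again, using the exponential form (Equations 14 and 22).
b_lighting = estimation_coefficient(0.90, 1.0, 0.0)
lighting_bias_exp = bias_exponential(b_lighting, 0.0, 0.0, 0.5)

print(f"lane width bias:        {lane_width_bias:.2f} percent")
print(f"lighting bias (Eq 16):  {lighting_bias:.1f} percent")
print(f"lighting bias (Eq 22):  {lighting_bias_exp:.1f} percent")
```

The same `bias_exponential` function can be called with a standard deviation, a difference in averages, and a coefficient b to spot-check individual entries of Table 5.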

Table 5. Predicted bias for Case B.

Entries are the percentage bias by estimation coefficient b (illustrative CMF value in parentheses), percent. $\sigma_{X,s}$ is the standard deviation of the independent variable at the sites of interest; $\bar{X} - \bar{X}_{CPM}$ is the difference between the average of the independent variable in the two databases.

$\sigma_{X,s}$ | $\bar{X} - \bar{X}_{CPM}$ | b = -0.05 (0.95) | b = -0.1 (0.90) | b = -0.15 (0.86) | b = -0.2 (0.82)
0.5 | 0   |  0.0 |   0.1 |   0.3 |   0.5
0.5 | 0.5 | -2.4 |  -4.8 |  -7.0 |  -9.1
0.5 | 1   | -4.8 |  -9.4 | -13.7 | -17.7
0.5 | 2   | -9.5 | -18.0 | -25.7 | -32.6
1   | 0   |  0.1 |   0.5 |   1.1 |   2.0
1   | 0.5 | -2.3 |  -4.4 |  -6.2 |  -7.7
1   | 1   | -4.8 |  -9.1 | -13.0 | -16.5
1   | 2   | -9.4 | -17.7 | -25.1 | -31.6
2   | 0   |  0.5 |   2.0 |   4.5 |   8.0
2   | 0.5 | -2.0 |  -3.0 |  -3.1 |  -2.3
2   | 1   | -4.4 |  -7.7 | -10.1 | -11.6
2   | 2   | -9.1 | -16.5 | -22.6 | -27.6

Note: a positive percentage bias indicates that the estimate from Equation 11 is higher than the true value by the percentage indicated.

The bias percentages in Table 5 tend to move further away from 0.0 with an increase in the absolute value of the estimation coefficient b. Thus, CMF values that are significantly different from 1.0 are more likely to be associated with a larger bias. The bias tends to be larger when there is an increase in the difference between the average of the independent variable in the SPF estimation database and that at the subject sites.

Bias in the Overdispersion Parameter

In terms of the effect on the overdispersion parameter, the use of an external CMF with a CPM is effectively equivalent to adding a variable to the CPM (where the variable added is that associated with the external CMF). This action has parallels to regression analysis where the CPM is considered the "reduced" model and the CPM-plus-external-CMF is the "full" model. The full model will demonstrate improved fit to the data, relative to the reduced model, if the variable added is statistically significant. As indicated by Equation 3 and the trends in Figure 1, this improved fit is indicated by a reduction in the overdispersion parameter.
In practice, when there is a Case B application (e.g., HSM Method 2), the overdispersion parameter reported for the CPM is not adjusted by the analyst to reflect the presence of the external CMF. As a result, the overdispersion parameter in a Case B application is biased to a larger value than the true value. As indicated by the coefficients of Equation 3, a Case B application could result in the overdispersion parameter being biased high by up to 10 percent.

To illustrate the implications of the trend in Figure 1b on a Case B application, consider the situation where the analyst is investigating the effect of reducing lane width on a roadway network. The analyst is planning to use a CPM that does not include lane width as a base variable. However, he or she has obtained an external CMF for lane width. The trend line labeled "Vogt (1999)" in Figure 1b is used for this illustration to describe the database used to estimate the CPM. The CPM has two base condition variables and a reported overdispersion parameter of 0.512. This parameter value is shown to lie on the trend line in Figure 1b corresponding to a model with two variables (one variable for each base condition).

The CPM is used by the analyst with the external CMF. The analyst computes the variance of the predicted mean using the Condition 1 equations in Table 2 (i.e., $V_{m,1} = k\,N_p^2$) using k = 0.512. He or she also computes the expected average crash frequency using the empirical Bayes method. Again, the analyst uses the parameter 0.512 (associated with the CPM) when computing the expected average crash frequency. However, the trend in Figure 1b suggests that the true overdispersion parameter value for the CPM and external CMF combined is likely to be about 0.463 (i.e., a bias of 10 percent). That is, if the CPM were re-estimated to include the lane width CMF, the most likely estimate of the resulting overdispersion parameter is 0.463 (not 0.512). As a result of this bias associated with the Case B application, the computed variance and expected average crash frequency are also biased.

The method of statistical differentials was used to determine the unbiased estimate of the overdispersion parameter for the full model (i.e., Equation 11), given that this parameter is known for the reduced model (i.e., Equation 1). This derivation is based on a log-linear representation of the full model. The method of statistical differentials indicates that the unbiased overdispersion parameter is estimated using the following equation:

Equation 24: $k_{p,f} = k_r - b^2 \sigma_{X,b}^2 + \text{correlation correction}$

where $k_{p,f}$ = predicted overdispersion parameter for full model; $k_r$ = overdispersion parameter for the reduced model (i.e., the k reported for the CPM); and $\sigma_{X,b}^2$ = variance of the independent variable (in the external CMF) at sites used to establish the CPM base conditions (i.e., those sites used to estimate the CPM regression coefficients).

The predicted overdispersion parameter $k_{p,f}$ is an estimate of the overdispersion parameter that would be obtained if the CPM was developed as a fully-specified regression model that included the variable(s) in the external CMF.
The correlation correction term in Equation 24 represents the correlation between the variable in the external CMF and the variables in the CPM. In the preceding discussion associated with Equation 3, it was noted that the correlation correction in two databases assembled for CPM development was fairly consistent and could be described using an empirically-based correction factor. This correction factor was shown in parentheses in Equation 4. It can be mathematically added to Equation 24 to produce the following equation.

Equation 25: $k_{p,f} = k_r - b^2 \sigma_{X,b}^2\, h$

with

Equation 26: $h = 1 - 0.10 \times \left(2 \times \min(5, p) - 1\right)$

Equation 27: $b = \ln(CMF) / (X - X_{base})$

where p is the number of empirically derived constants in the CMFs associated with the CPM plus those in the external CMF (i.e., exclude those in the SPF for intercept, AADT, and segment length); and all other variables as previously defined. If a CMF is associated with a discrete treatment (e.g., "add beacon"), then it is considered to have one variable (i.e., an indicator for treatment presence) when computing the number of variables p.

To illustrate Equation 27, consider the case where the CMF for 11-ft lanes is 1.05, relative to a base lane width of 12 ft. This equation indicates that b is estimated as -0.0488 for this case (= Ln[1.05]/[11 - 12]). If the CMF is 0.9 for a discrete treatment (e.g., "add beacon"), then b is estimated as -0.105, where X = 1.0 and $X_{base}$ = 0.0.
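The two illustrations of Equation 27 can be reproduced with a few lines of code. The sketch below is not part of the report; the function names are illustrative. It assumes Equation 26 has the form h = 1 - 0.10 × (2 × min(5, p) - 1), a reading that is consistent with the values listed in Table 6. Note that for the lane-width case the denominator X - X_base is 11 - 12 = -1, so the computed coefficient is negative.

```python
import math


def estimation_coefficient(cmf, x, x_base):
    """Equation 27: b = Ln(CMF) / (X - X_base)."""
    return math.log(cmf) / (x - x_base)


def correlation_correction_factor(p):
    """Equation 26 (as read here): h = 1 - 0.10 * (2 * min(5, p) - 1)."""
    return 1.0 - 0.10 * (2 * min(5, p) - 1)


# CMF of 1.05 for 11-ft lanes relative to a 12-ft base lane width.
b_lane_width = estimation_coefficient(1.05, 11.0, 12.0)

# CMF of 0.9 for a discrete treatment: X = 1 (present), X_base = 0 (not present).
b_beacon = estimation_coefficient(0.90, 1.0, 0.0)

print(f"b (lane width): {b_lane_width:.4f}")
print(f"b (add beacon): {b_beacon:.3f}")

# The correction factor shrinks as more CMF variables are counted, and is
# constant once p reaches 5 because of the min(5, p) term.
for p in (1, 2, 3, 5, 8):
    print(f"p = {p}: h = {correlation_correction_factor(p):.1f}")
```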

The database used to estimate $\sigma_{X,b}^2$ is that database used to develop the CPM. This value may be difficult to quantify if the CPM was developed in a previous time period such that the database is not available or it does not contain the variable of interest X. In this situation, it may be possible to identify the regions from which the data were obtained to develop the CPM and obtain values of the variable $X_i$ at a representative set of sites. The value $\sigma_{X,b}^2$ would then be computed using this representative set of sites.

When the adjustment factor is computed for discrete treatments (e.g., "add beacon"), an indicator variable is used to indicate treatment presence (where 1 corresponds to treatment present, and 0 corresponds to treatment not present). The variance of this independent variable is then computed using the 1, 0 values for the collective set of study sites.

The preceding discussion describes the bias due to a Case B application where the overdispersion parameter is not adjusted to reflect the presence of the external CMF. The following equation is offered for estimating this bias.

Equation 28: $Bias_{k,B} = 100 \times \left(k_r - k_{p,f}\right) / k_{p,f}$

This bias can be removed if Equation 25 is used for Case B applications to predict $k_{p,f}$. This predicted value can then be used to estimate the variance of the predicted crash frequency. It can also be used with the EB method to estimate the expected crash frequency.

Values of the bias in the overdispersion parameter for a Case B application are listed in Table 6 for typical values of number-of-variables, standard deviation, and estimation coefficient b. A non-zero bias value indicates conditions where the reported overdispersion parameter $k_r$ is biased (as a result of a Case B application).

The bias percentages in Table 6 tend to increase with an increase in the absolute value of the estimation coefficient b. Thus, external CMF values that are significantly different from 1.0 are more likely to be associated with a larger bias.
The bias tends to be larger with an increase in the standard deviation of the independent variable (in the external CMF) at the sites used to estimate the CPM base conditions ($\sigma_{X,b}$). The bias decreases when there are more CMFs (and associated variables) in the CPM.
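The trends described above can be explored with a short script. This is a sketch under stated assumptions, not the report's own implementation: it takes Equation 25 as k_pf = k_r - b^2 × sigma^2 × h, Equation 26 as h = 1 - 0.10 × (2 × min(5, p) - 1), and Equation 28 as 100 × (k_r - k_pf) / k_pf, with k_r = 0.50 as stated in the note to Table 6. All function names are hypothetical.

```python
def correlation_correction_factor(p):
    """Equation 26 (as read here): h = 1 - 0.10 * (2 * min(5, p) - 1)."""
    return 1.0 - 0.10 * (2 * min(5, p) - 1)


def predicted_full_model_k(k_r, b, sigma_x_b, p):
    """Equation 25: overdispersion parameter predicted for the full model."""
    return k_r - b ** 2 * sigma_x_b ** 2 * correlation_correction_factor(p)


def overdispersion_bias_percent(k_r, b, sigma_x_b, p):
    """Equation 28: percent bias in the reported (reduced-model) k."""
    k_pf = predicted_full_model_k(k_r, b, sigma_x_b, p)
    return 100.0 * (k_r - k_pf) / k_pf


K_REPORTED = 0.50                        # k_r assumed in the note to Table 6
COEFFICIENTS = (-0.05, -0.10, -0.15, -0.20)
STD_DEVIATIONS = (0.0, 0.5, 1.0, 2.0)    # sigma at the base-condition sites

# Print one row per (p, sigma) combination, one column per coefficient b.
for p in (1, 2, 3):
    for sigma in STD_DEVIATIONS:
        row = [overdispersion_bias_percent(K_REPORTED, b, sigma, p)
               for b in COEFFICIENTS]
        cells = "  ".join(f"{value:5.1f}" for value in row)
        print(f"p = {p}  sigma = {sigma:3.1f}  {cells}")
```

Changing `K_REPORTED` shows how the percentages shift for CPMs with a different reported overdispersion parameter, which the table note indicates will occur.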

Table 6. Overdispersion parameter bias for Case B.

Entries are the percentage bias by estimation coefficient b (illustrative CMF value in parentheses), percent. p is the number of variables in the CMFs; $\sigma_{X,b}$ is the standard deviation of the independent variable at the sites establishing the base condition.

p | $\sigma_{X,b}$ | b = -0.05 (0.95) | b = -0.1 (0.90) | b = -0.15 (0.86) | b = -0.2 (0.82)
1 | 0   | 0.0 | 0.0 |  0.0 |  0.0
1 | 0.5 | 0.1 | 0.5 |  1.0 |  1.8
1 | 1   | 0.5 | 1.8 |  4.2 |  7.8
1 | 2   | 1.8 | 7.8 | 19.3 | 40.4
2 | 0   | 0.0 | 0.0 |  0.0 |  0.0
2 | 0.5 | 0.1 | 0.4 |  0.8 |  1.4
2 | 1   | 0.4 | 1.4 |  3.3 |  5.9
2 | 2   | 1.4 | 5.9 | 14.4 | 28.9
3 | 0   | 0.0 | 0.0 |  0.0 |  0.0
3 | 0.5 | 0.1 | 0.3 |  0.6 |  1.0
3 | 1   | 0.3 | 1.0 |  2.3 |  4.2
3 | 2   | 1.0 | 4.2 |  9.9 | 19.0

Note: a positive percentage bias indicates that the estimate from Equation 25 is lower than the reported value by the percentage indicated. Percentages listed are based on $k_r$ = 0.50; other percentages will apply for other values of $k_r$.

Variance of the Predicted Value

Table 2 describes a series of equations that can be used to compute the variance of the predicted crash frequency. Three different sets of equations are described and correspond to the stated assumptions in the first column of the table. Column 2 of the table provides equations for computing the variance of the predicted average crash frequency. These equations are discussed in this section.

For a Case B application, an external CMF is used with a CPM (which means that the independent variable in the external CMF is not included in the SPF's specified base conditions). The CPM in this application is referred to herein as the "reduced" model because it lacks the independent variable associated with the external CMF. If the database used to develop the CPM could be expanded to include this independent variable, and the CPM coefficients re-estimated using regression analysis, the resulting model would be considered a "full" model for this discussion.
The equations for computing the variance of the predicted average crash frequency associated with Assumed Condition 3 are appropriate for a Case B application because they consider the case of a reduced model combined with an external CMF. These equations are based on the foundational assumption (for Case B) that the correlation between the independent variable in the external CMF and those in the CPM is not known or quantifiable.

The equations for computing the variance of the predicted average crash frequency associated with Assumed Condition 2 are considered to provide the most reliable estimates because they account for variation in the regression coefficient estimates and correlation among the independent variables. These equations can be used with the aforementioned full model to estimate the "true" variance, which can then be compared with the estimates from Assumed Condition 3 using the reduced-model-plus-external-CMF.

An examination of the equations associated with Conditions 2 and 3 indicates that the coefficient of variation (CV = standard deviation of predicted average crash frequency divided by the predicted average crash frequency) for both the reduced-model-plus-external-CMF and the full model is relatively constant for a wide range of independent variable values within a given database. It follows that the ratio of these two CV values is also relatively constant for a range of variable values. The examination indicated that this ratio can be approximated by the following equation.

Equation 29: $CVR_B = \dfrac{CV_{r+CMF}}{CV_f} = \left(\dfrac{k_r}{k_f}\right)^{0.5}$

where $CVR_B$ = coefficient of variation ratio for Case B; $CV_{r+CMF}$ = coefficient of variation for equations associated with Assumed Condition 3 using reduced model and external CMF (= $[V_{m,3}]^{0.5} / [N_{p,r} \times CMF_{ex}]$); $CV_f$ = coefficient of variation for equations associated with Assumed Condition 2 using full model (= $[V_{m,2}]^{0.5} / N_p$); $k_f$ = overdispersion parameter for full model; and $k_r$ = overdispersion parameter for the reduced model (i.e., the k reported for the CPM).

Equation 29 can be combined with Equation 25 to estimate the CVR for typical conditions found in roadway safety data. The values obtained from Equation 29 are listed in Table 7. The ratio values are shown to equal or exceed 1.0. This result implies that the standard deviation of the predicted mean crash frequency (and associated confidence intervals) is larger for a Case B application than it would be if the external CMF was incorporated into the CPM to produce a full model (i.e., the SPF was developed to include the external CMF variables as a base condition). This finding is consistent with that of Lord et al. (2010).

Table 7. Coefficient of variation ratio for Case B.

Entries are the coefficient of variation ratio by estimation coefficient b (illustrative CMF value in parentheses). p is the number of variables in the CMFs; $\sigma_{X,b}$ is the standard deviation of the independent variable at the sites establishing the base condition.

p | $\sigma_{X,b}$ | b = -0.05 (0.95) | b = -0.1 (0.90) | b = -0.15 (0.86) | b = -0.2 (0.82)
1 | 0   | 1.00 | 1.00 | 1.00 | 1.00
1 | 0.5 | 1.00 | 1.00 | 1.01 | 1.01
1 | 1   | 1.00 | 1.01 | 1.02 | 1.04
1 | 2   | 1.01 | 1.04 | 1.09 | 1.19
2 | 0   | 1.00 | 1.00 | 1.00 | 1.00
2 | 0.5 | 1.00 | 1.00 | 1.00 | 1.01
2 | 1   | 1.00 | 1.01 | 1.02 | 1.03
2 | 2   | 1.01 | 1.03 | 1.07 | 1.14
3 | 0   | 1.00 | 1.00 | 1.00 | 1.00
3 | 0.5 | 1.00 | 1.00 | 1.00 | 1.01
3 | 1   | 1.00 | 1.01 | 1.01 | 1.02
3 | 2   | 1.01 | 1.02 | 1.05 | 1.09

Note: Ratios listed are based on $k_r$ = 0.50; larger ratios occur for smaller values of $k_r$.

As shown in Table 7, the ratio values increase with an increase in the absolute value of the estimation coefficient b. Thus, external CMF values that are significantly different from 1.0 are more likely to be associated with a larger ratio. The ratio also tends to be larger with an increase in the standard deviation of the independent variable (in the external CMF) at the sites used to estimate the CPM base conditions ($\sigma_{X,b}$). The ratio tends to decrease when there are more CMFs (and associated variables) in the CPM.

The trend shown in Table 7 suggests that the variance of the predicted crash frequency for a Case B application will typically be larger than that obtained from a full model (i.e., a model where the external CMF has been incorporated into the CPM). However, the simplifying assumption associated with the Assumed Condition 3 equations (i.e., that the independent variable associated with the external CMF is not correlated with any of the independent variables in the CPM) precludes the examination of the impact of this correlation. It is possible that, if this correlation is large, the variance (and CV) associated with the full model may increase such that the CVR decreases. In cases of extremely large correlation, the CVR could be less than 1, which would imply that the equations associated with Assumed Condition 3 underestimate the variance of the predicted crash frequency.

Case C. CMF Not Used in CPM But Base Condition Accommodated in the SPF

For this CPM application, the analyst chooses not to use one or more of the CMFs that were included in the CPM when it was developed. All these CMFs (whether used or not used) have a corresponding base condition in the SPF. The unused CMFs are called herein "omitted CMFs." An example of this application is when the analyst does not have ready access to the data needed for a CMF and, as a result, chooses not to use the CMF when evaluating one or more sites. The likely source of the CPM is HSM Part C; however, this procedure is sufficiently general that it can be applied to CPMs from other sources (e.g., a CPM developed for a specific jurisdiction).

Bias in the Predicted Value

Case C is described by the use of a CPM with one omitted CMF.
The following equation is used to estimate the predicted crash frequency for a site whose characteristics (with one exception) are accounted for using the CMFs associated with the CPM. The exception characteristic is not considered because of the omitted CMF.

Equation 30: $N_C = C \times N_{SPF} \times CMF_1 \times \ldots \times CMF_n$; $CMF_i \neq CMF_{om}$

where $N_C$ = predicted average crash frequency from a Case-C application, crashes/yr; and $CMF_{om}$ = omitted crash modification factor (i.e., associated with the SPF's base conditions but excluded from CPM).

There are two potential sources of bias in the predicted value. The first source is due to the variability of the independent variable associated with the omitted CMF. This source is described in the next subsection. The second source of bias is due to the possible differences between the average value of the CMF's independent variable in the database used to calibrate the SPF and the average value of this variable at the sites of interest. This source of bias is described in the second subsection.

Bias due to the Variability of the Independent Variable

The bias due to the variability of the independent variable associated with the omitted CMF is computed using the same methods as used for Case A. Specifically, the method of statistical differentials was used to determine the unbiased estimate of predicted crash frequency. Its application indicates that the unbiased predicted crash frequency is estimated using the following equation:

Equation 31: $N_p = N_C \times f_C$

with

Equation 32: $f_C = 1 + 0.5\, b^2 \sigma_{X,s}^2$

Equation 33: $b = \ln(CMF) / (X - X_{base})$

where $N_p$ = predicted average crash frequency, crashes/yr; $f_C$ = bias adjustment factor for Case C; $\sigma_{X,s}^2$ = variance of the independent variable at sites of interest; b = estimation coefficient; X = independent variable associated with CMF; and $X_{base}$ = independent variable associated with the base condition for the CMF of interest.

Additional discussion on the use and meaning of Equation 33 is provided in a previous section associated with Case A.

Bias due to the Difference Between the Independent Variable's Average Value in Two Databases

This section discusses the bias that is due to the possible differences between the average value of the CMF's independent variable in the database used to calibrate the SPF and the average value of this variable at the sites of interest. The derivation of the equation for estimating this bias is based on the following equation, as applied to a representative set of sites.

Equation 34: $Bias = 100 \times \left(\sum N_C - \sum N_p\right) / \sum N_p$

Equation 34 reduces to the following equation for estimating the average percent bias when an omitted CMF exists.

Equation 35: $Bias_C = 100 \times \left[ \dfrac{\sum_j w_j\, CMF_{om}(X_{j,CPM}) / \sum_j w_j}{\sum_i w_i\, CMF_{om}(X_i) / \sum_i w_i} - 1 \right]$

where $CMF_{om}(X_i)$ = omitted CMF value associated with variable $X_i$ for site i in the database of sites of interest; $CMF_{om}(X_{j,CPM})$ = omitted CMF value associated with the variable $X_j$ for site j in the database used to develop the CPM; $w_i$ = weight associated with site i; and $w_j$ = weight associated with site j.

The average percent bias for a group of sites of interest is obtained by applying the CMF to each site and computing a weighted average of the values, where the weight used is the predicted average crash frequency based on the SPF. However, experience indicates that a reasonable estimate is obtained by using a weight equal to site exposure (i.e., entering volume for intersections and vehicle-miles for segments).
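The weighted-average calculation described above can be sketched in a few lines. The example assumes Equation 35 is the ratio of the exposure-weighted mean of the omitted CMF over the CPM database sites to the same mean over the sites of interest, minus one. The CMF form, site values, and weights below are hypothetical and are used only to show the mechanics.

```python
import math


def weighted_mean(values, weights):
    """Weighted average, e.g., with site exposure as the weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def bias_case_c(cmf_om, x_interest, w_interest, x_cpm, w_cpm):
    """Equation 35 (as read here): percent bias from leaving out cmf_om.

    cmf_om     -- function returning the omitted CMF value for a site's X
    x_interest -- X values at the sites of interest, with weights w_interest
    x_cpm      -- X values at the sites used to develop the CPM, with weights w_cpm
    """
    mean_cpm = weighted_mean([cmf_om(x) for x in x_cpm], w_cpm)
    mean_interest = weighted_mean([cmf_om(x) for x in x_interest], w_interest)
    return 100.0 * (mean_cpm / mean_interest - 1.0)


def hypothetical_cmf(x):
    """Hypothetical exponential CMF with a base condition of X = 6."""
    return math.exp(-0.05 * (x - 6.0))


# Hypothetical data: X values and exposure weights (million vehicle-miles).
x_sites = [4.0, 5.0, 6.0]
w_sites = [1.2, 0.8, 2.0]
x_database = [5.0, 6.0, 7.0, 8.0]
w_database = [1.0, 1.5, 1.0, 0.5]

bias = bias_case_c(hypothetical_cmf, x_sites, w_sites, x_database, w_database)
print(f"Average percent bias from omitting the CMF: {bias:.1f}")
```

In this hypothetical data set the sites of interest have smaller X values than the CPM database, so their CMF values are above average and the estimate that omits the CMF comes out low (a negative bias).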
If the omitted CMF is a function of variable X, an approximate estimate of the bias can be obtained using the average value of the independent variable X when estimating the CMF value. This approximation is shown using the following equation.

Equation 36: $Bias = 100 \times \left[ \dfrac{CMF_{om}(\bar{X}_{CPM})}{CMF_{om}(\bar{X})} - 1 \right]$

The use of Equation 36 can be demonstrated by an example. Consider the situation where the analyst is investigating the effect of reducing lane width on a roadway network. The analyst is planning to use a CPM that includes lane width and shoulder width as base variables. However, information about shoulder width in the network is not readily available so he or she decides not to use the shoulder width CMF in the CPM. The CMF for shoulder width has the following form.

Equation 37: $CMF_{SW} = \exp\left(-0.032\,(W_S - 6)\right)$

where $CMF_{SW}$ = CMF for shoulder width; and $W_S$ = shoulder width, ft.

Using a sample of sites in the roadway network of interest, the average shoulder width is estimated to be 5.0 ft. Examination of the original database used to develop the CPM indicates that the sites therein have an average shoulder width of 6.0 ft. The average percent bias in the estimates obtained from the CPM (by excluding the shoulder width CMF) is computed by combining Equation 36 and Equation 37 as follows.

Equation 38: $Bias = 100 \times \left[ \dfrac{\exp(-0.032\,(6.0 - 6))}{\exp(-0.032\,(5.0 - 6))} - 1 \right] = -3.14\%$

The sign associated with the bias estimate is negative. It indicates that the predicted average crash frequency obtained from the CPM (when shoulder width CMF is omitted and applied to the subject road network) will be 3.14 percent low.

Combined Sources of Bias

The two sources of bias can be combined using the following equation.

Equation 39: $Bias = 100 \times \left[ \dfrac{CMF_{om}(\bar{X}_{CPM})}{f_C \times CMF_{om}(\bar{X})} - 1 \right]$

If the omitted CMF has an exponential form (or if it can be converted to an equivalent exponential form) then the bias can be computed using the following equation.

Equation 40: $Bias = 100 \times \left[ \exp\left(b\,(\bar{X}_{CPM} - \bar{X})\right) / f_C - 1 \right]$

If the treatment is discrete (e.g., "add beacon") then the associated CMF value can be converted to an approximately equivalent exponential form using Equation 33.
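The shoulder-width example and the exponential form of the combined bias can be checked with a short script. This is a sketch, not the report's code: it assumes the shoulder-width CMF has a negative coefficient (so that wider shoulders reduce predicted crashes), Equation 32 reads f_C = 1 + 0.5 × b^2 × sigma^2, and Equation 40 divides the exponential term by f_C. Function names are illustrative.

```python
import math


def cmf_shoulder_width(w_s):
    """Equation 37: shoulder-width CMF with a 6-ft base condition."""
    return math.exp(-0.032 * (w_s - 6.0))


def bias_omitted_average(cmf_om, x_bar, x_bar_cpm, f_c=1.0):
    """Equations 36 and 39: percent bias using average X values.

    With f_c = 1.0 this is Equation 36; with f_c from Equation 32 it is Equation 39.
    """
    return 100.0 * (cmf_om(x_bar_cpm) / (f_c * cmf_om(x_bar)) - 1.0)


def bias_omitted_exponential(b, sigma_x, x_bar, x_bar_cpm):
    """Equation 40: combined percent bias for an exponential-form omitted CMF."""
    f_c = 1.0 + 0.5 * b ** 2 * sigma_x ** 2          # Equation 32
    return 100.0 * (math.exp(b * (x_bar_cpm - x_bar)) / f_c - 1.0)


# Shoulder-width example: 5.0 ft at the sites of interest, 6.0 ft in the CPM database.
shoulder_bias = bias_omitted_average(cmf_shoulder_width, 5.0, 6.0)

# Same case through the exponential form, with b = -0.032 and no variability.
shoulder_bias_exp = bias_omitted_exponential(-0.032, 0.0, 5.0, 6.0)

print(f"shoulder width bias (Eq 36): {shoulder_bias:.2f} percent")
print(f"shoulder width bias (Eq 40): {shoulder_bias_exp:.2f} percent")
```

Both routes give the same negative bias of roughly 3 percent for this example, close to the value reported in Equation 38. Passing a non-zero `sigma_x` shows how variability at the sites of interest makes the bias more negative.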
The combined bias can be removed if the following equation is used for a Case C application (i.e., the result from Equation 30 is multiplied by the adjustment factors associated with the two sources of bias).

Equation 41: $N_p = N_C \times f_C \times \exp\left(b\,(\bar{X} - \bar{X}_{CPM})\right)$

To illustrate the use of Equation 40, consider a roadway network for which lighting is of interest. The original database used to develop the CPM was examined and a variable for lighting presence was identified ($X_{CPM}$). A "1" was recorded for this variable if the site had lighting and a "0" was recorded if lighting was not present. The average value of this variable $\bar{X}_{CPM}$ was computed as 0.5. In the CPM, the CMF for "add lighting" is 0.90 and the base condition is "no lighting present." This information was used in Equation 33 to estimate b as -0.105 (= Ln[0.90]/[1 - 0]). None of the sites of interest has lighting, so the lighting-present variable equals 0 for each site. The average value of this variable $\bar{X}$ is 0.0 and its standard deviation $\sigma_{X,s}$ is also 0. The bias adjustment factor $f_C$ in Equation 32 is computed as 1.0 (= 1 + 0.5[(-0.105)^2 × 0.0^2]). With these values, Equation 40 is used to compute the average bias as -5.1 percent. When using the CPM without the "lighting" CMF, the predicted value is lower than the true value by 5.1 percent.

Values of the predicted bias are listed in Table 8 for typical values of standard deviation and estimation coefficient b. A standard deviation of less than 1.0 is often found for lane width and discrete treatments (e.g., "add beacon"). A standard deviation of 1.0 to 2.0 is often found for shoulder width and a standard deviation of 2.0 or more is often found for median width. A non-zero bias value indicates conditions where the estimate from Equation 30 is biased (as a result of a Case C application).

Table 8. Predicted bias for Case C.

Entries are the percentage bias by estimation coefficient b (illustrative CMF value in parentheses), percent. $\sigma_{X,s}$ is the standard deviation of the independent variable at the sites of interest; $\bar{X}_{CPM} - \bar{X}$ is the difference between the average of the independent variable in the two databases.

$\sigma_{X,s}$ | $\bar{X}_{CPM} - \bar{X}$ | b = -0.05 (0.95) | b = -0.1 (0.90) | b = -0.15 (0.86) | b = -0.2 (0.82)
0.5 | 0   |   0.0 |  -0.1 |  -0.3 |  -0.5
0.5 | 0.5 |  -2.5 |  -5.0 |  -7.5 | -10.0
0.5 | 1   |  -4.9 |  -9.6 | -14.2 | -18.5
0.5 | 2   |  -9.5 | -18.2 | -26.1 | -33.3
1   | 0   |  -0.1 |  -0.5 |  -1.1 |  -2.0
1   | 0.5 |  -2.6 |  -5.4 |  -8.3 | -11.3
1   | 1   |  -5.0 | -10.0 | -14.9 | -19.7
1   | 2   |  -9.6 | -18.5 | -26.7 | -34.3
2   | 0   |  -0.5 |  -2.0 |  -4.3 |  -7.4
2   | 0.5 |  -3.0 |  -6.7 | -11.2 | -16.2
2   | 1   |  -5.4 | -11.3 | -17.6 | -24.2
2   | 2   | -10.0 | -19.7 | -29.1 | -37.9

Note: a negative percentage bias indicates that the estimate from Equation 30 is lower than the true value by the percentage indicated.

The bias percentages in Table 8 tend to move further away from 0.0 with an increase in the absolute value of the estimation coefficient b.
Thus, CMF values that are substantially different from 1.0 are more likely to be associated with a larger bias. The absolute value of the bias tends to be larger when there is an increase in the difference between the average of the independent variable in the SPF estimation database and that at the subject sites.

Bias in the Overdispersion Parameter

In terms of the effect on the overdispersion parameter, the omission of a CMF from a CPM is effectively equivalent to subtracting a variable from the CPM (where the variable subtracted is that associated with the omitted CMF). This action has parallels to regression analysis where the CPM is considered the "full" model and the CPM-with-omitted-CMF is the "reduced" model. The full model will demonstrate improved fit to the data, relative to the reduced model, if the variable added is statistically significant. As indicated by Equation 3 and the trends in Figure 1, this improved fit is indicated by a reduction in the overdispersion parameter.
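For the crash-frequency bias discussed with Table 8, the lighting example and the table entries can be checked with a short calculation. The Python sketch below assumes that Equation 32 has the form f_C = 1 + 0.5 b^2 σ_X^2 (as the arithmetic in the worked example implies) and that the Case C bias equals 100 × (exp[b(X̄_CPM − X̄)]/f_C − 1). The latter form is inferred from the worked example and the Table 8 values, not copied from Equation 40, and all function and argument names are illustrative.

```python
import math


def estimation_coefficient(cmf, x, x_base):
    """Equation 33 as used in the example: b = Ln(CMF) / (X - X_base)."""
    return math.log(cmf) / (x - x_base)


def bias_adjustment_factor(b, sd_x_sites):
    """Assumed form of Equation 32: f_C = 1 + 0.5 * b^2 * sigma_X^2."""
    return 1.0 + 0.5 * b**2 * sd_x_sites**2


def case_c_bias_percent(b, mean_x_cpm, mean_x_sites, sd_x_sites):
    """Inferred Case C bias (percent) of the estimate that omits the CMF."""
    f_c = bias_adjustment_factor(b, sd_x_sites)
    return 100.0 * (math.exp(b * (mean_x_cpm - mean_x_sites)) / f_c - 1.0)


# Lighting example: CMF = 0.90, base condition X = 0, treated condition X = 1.
b_lighting = estimation_coefficient(0.90, 1.0, 0.0)
bias_lighting = case_c_bias_percent(b_lighting, 0.5, 0.0, 0.0)
print(round(b_lighting, 3), round(bias_lighting, 1))  # -0.105 -5.1

# One Table 8 entry: standard deviation 2, difference 2, b = -0.2.
print(round(case_c_bias_percent(-0.2, 2.0, 0.0, 2.0), 1))  # -37.9
```

A negative result means the estimate that omits the CMF is lower than the true value, which matches the sign convention in the note to Table 8.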

In practice, when there is a Case C application, the overdispersion parameter reported for the CPM is not adjusted by the analyst to reflect the omission of the CMF. As a result, the overdispersion parameter in a Case C application is biased to a smaller value than the true value. As indicated by the coefficients of Equation 3, a Case C application could result in the overdispersion parameter being biased low by up to 10 percent.

The method of statistical differentials was used to determine the unbiased estimate of the overdispersion parameter for the full model. This derivation is based on a log-linear representation of the full model. The method of statistical differentials indicates that the unbiased overdispersion parameter is estimated using the following equation:

Equation 42
$$k_{p,r} = k_f + b^2 \, \sigma_{X,CPM}^2 \quad \text{(adjusted by a correlation correction term)}$$

where $k_{p,r}$ = predicted overdispersion parameter for reduced model; $k_f$ = overdispersion parameter for the full model (i.e., the k reported for the CPM); and $\sigma_{X,CPM}^2$ = variance of the independent variable (in the omitted CMF) at sites used to establish the CPM base conditions (i.e., those sites used to estimate the CPM regression coefficients).

The predicted overdispersion parameter $k_{p,r}$ is an estimate of the overdispersion parameter that would be obtained if the CPM was developed as a regression model that excluded the variable(s) in the omitted CMF.

The correlation correction term in Equation 42 represents the correlation between the variable in the omitted CMF and the variables in the CPM. In the preceding discussion associated with Equation 3, it was noted that the correlation correction in two databases assembled for CPM development was fairly consistent and could be described using an empirically based correction factor. This correction factor was shown in parentheses in Equation 4. It can be incorporated into Equation 42 to produce the following equation.
Equation 43
$$k_{p,r} = k_f + b^2 \, \sigma_{X,CPM}^2 \, \Delta$$

with

Equation 44
$$\Delta = 1 - 0.10 \left(2 \, \min[5, p] - 1\right)$$

Equation 45
$$b = \frac{\mathrm{Ln}(CMF)}{X - X_{base}}$$

where p is the number of empirically derived constants in the CMFs associated with the CPM plus those in the omitted CMF (i.e., exclude those in the SPF for intercept, AADT, and segment length); and all other variables as previously defined.

Additional discussion on the use and meaning of these three equations is provided in the previous section associated with Case B (see discussion following Equation 25).

The preceding discussion describes the bias due to a Case C application where the overdispersion parameter is not adjusted to reflect the presence of the external CMF. The following equation is offered for estimating this bias.

Equation 46
$$Bias_{k,C} = 100 \, \frac{k_f - k_{p,r}}{k_{p,r}}$$

This bias can be removed if Equation 43 is used for Case C applications to predict $k_{p,r}$. This predicted value can then be used to estimate the variance of the predicted crash frequency. It can also be used with the EB method to estimate the expected crash frequency.

Values of the bias in the overdispersion parameter for a Case C application are listed in Table 9 for typical values of number-of-variables, standard deviation, and estimation coefficient b. A non-zero bias value indicates conditions where the reported overdispersion parameter $k_f$ is biased (when used in a Case C application).

Table 9. Overdispersion parameter bias for Case C.

p = number of variables in the CMFs.
Std. dev. = standard deviation of the independent variable at sites establishing the base condition, $\sigma_{X,CPM}$.
Table entries = percentage bias by estimation coefficient b (illustrative CMF value in parentheses), percent.

p    Std. dev.   b = -0.05   b = -0.1   b = -0.15   b = -0.2
                 (0.95)      (0.90)     (0.86)      (0.82)
1    0             0.0         0.0        0.0         0.0
1    0.5          -0.1        -0.4       -1.0        -1.8
1    1            -0.4        -1.8       -3.9        -6.7
1    2            -1.8        -6.7      -13.9       -22.4
2    0             0.0         0.0        0.0         0.0
2    0.5          -0.1        -0.3       -0.8        -1.4
2    1            -0.3        -1.4       -3.1        -5.3
2    2            -1.4        -5.3      -11.2       -18.3
3    0             0.0         0.0        0.0         0.0
3    0.5          -0.1        -0.2       -0.6        -1.0
3    1            -0.2        -1.0       -2.2        -3.8
3    2            -1.0        -3.8       -8.3       -13.8

Note: a negative percentage bias indicates that the estimate from Equation 43 is larger than the reported value by the percentage indicated. Percentages listed are based on $k_f$ = 0.50; other percentages will apply for other values of $k_f$.

The absolute value of the bias percentages in Table 9 tends to increase with an increase in the absolute value of the estimation coefficient b. Thus, omitted CMF values that are substantially different from 1.0 are more likely to be associated with a larger bias. The absolute value of the bias tends to be larger with an increase in the standard deviation of the independent variable (in the omitted CMF) at the sites used to estimate the CPM base conditions ($\sigma_{X,CPM}$).
The absolute value of the bias decreases when there are more CMFs (and associated variables) in the CPM.

Variance of the Predicted Value

Table 2 describes a series of equations that can be used to compute the variance of the predicted crash frequency. Three different sets of equations are described and correspond to the stated assumptions in the first column of the table. Column 2 of the table provides equations for computing the variance of the predicted average crash frequency. These equations are discussed in this section.

For a Case C application, the CPM is considered the "full" model and the CPM-with-omitted-CMF is the "reduced" model. If the database used to develop the CPM could be acquired, the independent variable associated with the omitted CMF removed, and only the CPM overdispersion parameter re-estimated using regression analysis, the resulting model would be considered the "reduced" model for this discussion.

The equations for computing the variance of the predicted average crash frequency associated with Assumed Condition 2 are considered to provide the most reliable estimates because they account for variation in the regression coefficient estimates and correlation among the independent variables. These equations can be used with the aforementioned full model to estimate the "true" variance of the predicted crash frequency. These equations can also be used with the reduced model to estimate the associated variance of the predicted crash frequency for a Case C application.

An examination of the equations associated with Condition 2 indicates that the coefficient of variation (CV = standard deviation of predicted average crash frequency divided by the predicted average crash frequency) for both the reduced model and the full model is relatively constant for a wide range of independent variable values within a given database. It follows that the ratio of these two CV values is also relatively constant for a range of variable values. The examination indicated that this ratio can be approximated by the following equation.

Equation 47
$$CVR_C = \frac{CV_r}{CV_f} = \left(\frac{k_r}{k_f}\right)^{0.5}$$

where $CVR_C$ = coefficient of variation ratio for Case C; $CV_r$ = coefficient of variation for equations associated with Assumed Condition 2 using reduced model (i.e., CPM with variable associated with one CMF removed) (= $[V_{m,2}]^{0.5}/N_{p,r}$); $CV_f$ = coefficient of variation for equations associated with Assumed Condition 2 using full model (= $[V_{m,2}]^{0.5}/N_{p,f}$); $k_f$ = overdispersion parameter for full model (i.e., the k reported for the CPM); and $k_r$ = overdispersion parameter for the reduced model.

Equation 47 can be combined with Equation 43 to estimate the CVR for typical conditions found in roadway safety data. The values obtained from Equation 47 are listed in Table 10. The ratio values are shown to equal or exceed 1.0. This result implies that the standard deviation of the predicted mean crash frequency (and associated confidence intervals) is larger for a Case C application than it would be if the CMF was not omitted.
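The overdispersion calculations for Case C can be sketched in a few lines of Python. The sketch assumes the forms k_p,r = k_f + b^2 σ^2 Δ (Equation 43), Δ = 1 − 0.10 (2 min[5, p] − 1) (Equation 44), bias = 100 (k_f − k_p,r)/k_p,r (Equation 46), and CVR = (k_r/k_f)^0.5 (Equation 47). The form of Δ and the exponent of 0.5 were inferred by matching the tabulated values for k_f = 0.50, so they should be read as a reconstruction; the function names are illustrative.

```python
def correlation_factor(p):
    """Assumed Equation 44: Delta = 1 - 0.10 * (2 * min(5, p) - 1)."""
    return 1.0 - 0.10 * (2 * min(5, p) - 1)


def predicted_k_reduced(k_full, b, sd_x_cpm, p):
    """Assumed Equation 43: k_p,r = k_f + b^2 * sigma^2 * Delta."""
    return k_full + b**2 * sd_x_cpm**2 * correlation_factor(p)


def overdispersion_bias_percent(k_full, k_pred_reduced):
    """Assumed Equation 46: percent bias of the reported k_f."""
    return 100.0 * (k_full - k_pred_reduced) / k_pred_reduced


def cv_ratio(k_full, k_reduced):
    """Assumed Equation 47: CVR_C = (k_r / k_f)^0.5."""
    return (k_reduced / k_full) ** 0.5


# Example: k_f = 0.50, one CMF variable, standard deviation 2, b = -0.2.
k_f = 0.50
k_pr = predicted_k_reduced(k_f, b=-0.2, sd_x_cpm=2.0, p=1)
print(round(k_pr, 3))                                    # 0.644
print(round(overdispersion_bias_percent(k_f, k_pr), 1))  # -22.4
print(round(cv_ratio(k_f, k_pr), 2))                     # 1.13
```

Under these assumptions, the reported k_f understates the reduced-model value by about 22 percent in this example, and the coefficient of variation is about 13 percent larger than it would be with the CMF retained.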

Table 10. Coefficient of variation ratio for Case C.

p = number of variables in the CMFs.
Std. dev. = standard deviation of the independent variable at sites establishing the base condition, $\sigma_{X,CPM}$.
Table entries = coefficient of variation ratio by estimation coefficient b (illustrative CMF value in parentheses).

p    Std. dev.   b = -0.05   b = -0.1   b = -0.15   b = -0.2
                 (0.95)      (0.90)     (0.86)      (0.82)
1    0            1.00        1.00       1.00        1.00
1    0.5          1.00        1.00       1.01        1.01
1    1            1.00        1.01       1.02        1.04
1    2            1.01        1.04       1.08        1.13
2    0            1.00        1.00       1.00        1.00
2    0.5          1.00        1.00       1.00        1.01
2    1            1.00        1.01       1.02        1.03
2    2            1.01        1.03       1.06        1.11
3    0            1.00        1.00       1.00        1.00
3    0.5          1.00        1.00       1.00        1.00
3    1            1.00        1.00       1.01        1.02
3    2            1.00        1.02       1.04        1.08

Note: Ratios listed are based on $k_f$ = 0.50; larger ratios occur for smaller values of $k_f$.

As shown in Table 10, the ratio values increase with an increase in the absolute value of the estimation coefficient b. Thus, external CMFs that are substantially different from 1.0 are more likely to be associated with a larger ratio. The ratio also tends to be larger with an increase in the standard deviation of the independent variable (in the omitted CMF) at the sites used to estimate the CPM base conditions ($\sigma_{X,CPM}$). The ratio tends to decrease when there are more CMFs (and associated variables) in the CPM.

Experimental Design and Analysis Results

This section describes the development and execution of an experimental design that has as its objective the validation of the bias prediction equations described in the previous section. These equations are summarized in the list below.
• Bias adjustment factor for Case A, $f_A$ (Equation 9)
• Bias adjustment factor for Case B, $f_B$ (Equation 13)
• Bias for Case B, $Bias_B$ (Equation 22)
• Predicted overdispersion parameter for full model in Case B, $k_{p,f}$ (Equation 25)
• Coefficient of variation ratio in Case B, $CVR_B$ (Equation 29)
• Bias adjustment factor for Case C, $f_C$ (Equation 32)
• Bias for Case C, $Bias_C$ (Equation 40)
• Predicted overdispersion parameter for reduced model in Case C, $k_{p,r}$ (Equation 43)
• Coefficient of variation ratio in Case C, $CVR_C$ (Equation 47)

It was judged that simulated crash data (using a Monte Carlo process) would be the most feasible means by which the objective of the experimental design could be achieved.

This section consists of four subsections. The experimental design is described in the next three subsections. The analysis results are provided in the fourth subsection.

Site-Level Database

This section describes the elements of two databases that were created to represent a crash history for each of a large number of sites. One database was prepared for CPM development. The second database was prepared to quantify the potential bias associated with CPM Application Cases A, B, or C (see Table 1). The elements associated with each database are described in the following two subsections.

CPM Development Database

This section describes the "CPM development" database that was used to estimate the CPM (full or reduced model) using regression analysis. From the perspective of CPM development, this database is described as a "multiple-variable database." The data in this database were created using Monte Carlo simulation where the true relationship between crash frequency and site characteristics was specified using a log-linear model form. A log-linear model with the following form was used for this purpose.

Equation 48
$$N_{p,f} = \exp\left[b_0 + b_1 \, \mathrm{Ln}(AADT/1000) + b_2 \left(X - X_{base}\right)\right]$$

where $N_{p,f}$ = predicted average crash frequency from full model, crashes/yr; AADT = annual average daily traffic volume, veh/d; X = independent variable associated with a CMF; $X_{base}$ = base value for the independent variable; and $b_i$ = estimation coefficient i.

This equation is considered the full model for this investigation. Its first two terms are considered to represent the SPF (i.e., $N_{SPF} = \exp[b_0 + b_1 \, \mathrm{Ln}(AADT/1000)]$). The third term is considered to represent the CMF of interest (i.e., $CMF = \exp[b_2 (X - X_{base})]$). The independent variable X was removed from the model to produce the following reduced model.

Equation 49
$$N_{p,r} = \exp\left[b_0 + b_1 \, \mathrm{Ln}(AADT/1000)\right]$$

where $N_{p,r}$ = predicted average crash frequency from reduced model, crashes/yr.
The full model was used (with specified values for the coefficients $b_i$) to estimate the mean crash frequency for a range of independent variable values (i.e., AADT and X), where each pair of independent variables was envisioned to represent one "site". The mean produced in this manner was then used (with a specified overdispersion parameter value of 0.5) in a Monte Carlo manner to produce a second mean crash frequency for each site that included a Gamma-distributed random element. Finally, this second mean was then used in a Monte Carlo manner to produce an annual crash count that included a Poisson-distributed random element.

The AADT variable was specified to range uniformly from 1,000 to 15,000 veh/d. The independent variable X was specified to have an overall mean and standard deviation for the collection of sites. The value of X for a given site was computed in a Monte Carlo manner based on a Normal distribution (with the aforementioned overall mean and standard deviation). An Excel spreadsheet was used to automate these calculations and produce data for 2000 sites.
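The three-stage simulation described above (full-model mean, Gamma-distributed site mean, Poisson-distributed count) can be sketched in Python as a stand-in for the spreadsheet. This is a minimal illustration, not the procedure used in the study: the Gamma multiplier is assumed to have mean 1 and variance equal to the overdispersion parameter, the Poisson draw uses a simple multiplication method, and the values chosen for b0 and X_base, as well as all names, are hypothetical.

```python
import math
import random


def poisson_draw(rng, lam):
    """Draw a Poisson count by the multiplication method (modest means only)."""
    limit = math.exp(-lam)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def simulate_sites(n_sites, b0, b1, b2, x_mean, x_sd, x_base, k=0.5, seed=1):
    """Simulate one year of crash data for n_sites using Equation 48."""
    rng = random.Random(seed)
    sites = []
    for _ in range(n_sites):
        aadt = rng.uniform(1000.0, 15000.0)   # uniform AADT, veh/d
        x = rng.gauss(x_mean, x_sd)           # Normal independent variable
        mean_full = math.exp(b0 + b1 * math.log(aadt / 1000.0)
                             + b2 * (x - x_base))
        # Gamma multiplier with shape 1/k and scale k: mean 1, variance k.
        site_mean = mean_full * rng.gammavariate(1.0 / k, k)
        crashes = poisson_draw(rng, site_mean)
        sites.append({"aadt": aadt, "x": x, "mean": mean_full,
                      "crashes": crashes})
    return sites


# Hypothetical inputs: b0 chosen so the mean is about 4 crashes/yr at
# 8,000 veh/d when X equals its base value.
sites = simulate_sites(2000, b0=math.log(0.5), b1=1.0, b2=-0.1,
                       x_mean=10.0, x_sd=2.0, x_base=10.0)
average_count = sum(s["crashes"] for s in sites) / len(sites)
print(len(sites), round(average_count, 2))
```

Fixing the seed makes the simulated database reproducible, which is convenient when the same sites are reused to fit both the full and reduced models.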

Local-Jurisdiction Database

To replicate the conditions associated with Application Cases A, B, and C, data for a second set of sites were also generated. These sites represent locations in the analyst's jurisdiction to which the estimated CPM was applied. For these sites, the AADT variable was specified to range uniformly from 1,000 to 15,000 veh/d. The independent variable X was specified to have an overall mean and standard deviation for the collection of sites. The value of X for a given site was computed in a Monte Carlo manner based on a Normal distribution (with the aforementioned overall mean and standard deviation).

Data Collection Process

This section describes the process used to create the CPM development database and the local-jurisdiction database. The data in each of these databases describe 2000 sites for which (1) one set of "true" estimation coefficients was established and (2) the independent variable X had one specified mean and standard deviation. For each database, the estimation coefficients, their standard error, and the overdispersion parameter were computed using regression analysis. Also quantified for each database were the average values of the CPM predicted crash frequency and coefficient of variation.

The process used to create the site-level data varied for each application case. The following subsections describe this process for each case.

Case A. CMF from Part D used with SPF (CMF is consistent with SPF base conditions)

The process used to create the data for Case A is described as follows.

1. Create CPM Development Data
A mean and standard deviation for the independent variable $X_{dd}$ is specified. These parameters are used to create a value of X for each site in a Monte Carlo manner. Data for 2000 sites are created in this manner. Values of $b_0$, $b_1$, and $b_2$ are specified and used with Equation 48 to estimate a true mean crash frequency for each site. Using the Monte Carlo process described in the previous section, an annual crash count is computed for each site.

2. Create Local-Jurisdiction Data
A mean and standard deviation for the independent variable $X_{ljd}$ is specified. These parameters are used to create a value of X for each site in a Monte Carlo manner. Data for 2000 sites are created in this manner.

3. Estimate Full Model
Regression analysis based on a maximum-likelihood criterion (using a negative-binomial distribution for the dependent variable) is used to estimate coefficients in Equation 48 (as well as the overdispersion parameter, standard error of each coefficient, and variance-covariance matrix of the estimated coefficients). The CPM development data is used for this purpose. This equation is then used to compute the predicted crash frequency $N_p$ for each site in the CPM development database. The estimated coefficient $b_2$ is used to define the CMF corresponding to the independent variable X. This CMF is designated as the "CMF obtained from HSM Part D."

4. Define Reduced Model
The estimated coefficients $b_0$ and $b_1$ from Step 3 are inserted in Equation 49 to obtain the reduced model. This equation is then used with the CMF from Step 3 to compute the predicted crash frequency $N_A$ for each site in the local-jurisdiction database. This process replicates the use of a CMF from HSM Part D with an SPF that includes the base condition associated with the CMF. The standard error of both coefficients and the variance-covariance matrix of the coefficients are computed.

5. Compute the Desired Results
The following values are computed for the sites in each site-specific database. All values represent the collective set of 2000 sites in the database. The phrase "overall" is intended to convey this representation when necessary.
▪ Overall average of the predicted crash frequency in the CPM development database, $N_p$
▪ Variance of the independent variable X at sites in the CPM development database, $s_{dd}^2$
▪ Estimation coefficient associated with variable X, b
▪ Overall average of the predicted crash frequency in the local-jurisdiction database, $N_A$
▪ Variance of the independent variable X at sites in the local-jurisdiction database, $s_{ljd}^2$
▪ Overall average bias adjustment factor, $f_A$

Case B. CMFs Do Not Have a Corresponding Base Condition in the SPF

The process used to create the data for Case B is described as follows.

1. Create CPM Development Data
The process is the same as for Case A.

2. Create Local-Jurisdiction Data
The process is the same as for Case A.

3. Estimate Full Model
The process is the same as for Case A. The CMF derived from this model is considered the "external" CMF in subsequent steps.

4. Compute the Variance of the Predicted Crash Frequency
For each site in the CPM development database, compute the variance of the predicted average crash frequency using the equations associated with Assumed Condition 2 in Table 2. Use this value and the predicted average crash frequency to compute the coefficient of variation (CV) for each site.

5. Estimate Reduced Model
Regression analysis based on a maximum-likelihood criterion (using a negative-binomial distribution for the dependent variable) is used to estimate the coefficients in Equation 49 (as well as the overdispersion parameter, standard error of each coefficient, and variance-covariance matrix of the estimated coefficients). The CPM development data is used for this purpose. This equation is then used with the external CMF from Step 3 to compute the predicted crash frequency $N_B$ for each site in the local-jurisdiction database. This process replicates the use of an external CMF with an SPF (i.e., the independent variable in the CMF is not represented as a base condition of the SPF).

6. Compute the Variance of the Predicted Crash Frequency
For each site in the local-jurisdiction database, compute the variance of the predicted average crash frequency using the equations associated with Assumed Condition 3 in Table 2. The regression results from Step 5 are used for this purpose. Use the computed variance and the predicted average crash frequency to compute the coefficient of variation (CV) for each site. Also compute the CV ratio for each site (= CV from this step divided by CV from Step 4).

7. Compute the Desired Results
The following values are computed for the sites in each database. All values represent the collective set of 2000 sites in the database. The phrase "overall" is intended to convey this representation when necessary.

▪ Overall average of the predicted crash frequency in the CPM development database, $N_p$
▪ Variance of the independent variable X at sites in the CPM development database, $s_{dd}^2$
▪ Estimation coefficient associated with variable X in full model, b
▪ Overall average of the independent variable X at sites in the CPM development database, $\bar{X}_{dd}$
▪ Overdispersion parameter associated with full model, $k_f$
▪ Overall average of the predicted crash frequency in the local-jurisdiction database, $N_B$
▪ Variance of the independent variable X at sites in the local-jurisdiction database, $s_{ljd}^2$
▪ Overall average of the independent variable X at sites in the local-jurisdiction database, $\bar{X}_{ljd}$
▪ Overdispersion parameter associated with reduced model, $k_r$
▪ Overall average bias adjustment factor, $f_B$
▪ Overall average CV ratio, CVR

Case C. CMF Not Used in CPM But Base Condition Accommodated in the SPF

The process used to create the data for Case C is described as follows.

1. Create CPM Development Data
The process is the same as for Case A.

2. Create Local-Jurisdiction Data
The process is the same as for Case A.

3. Estimate Full Model
The process is the same as for Case A. The CMF derived from this model is considered the "omitted" CMF in subsequent steps. Unlike the other two cases, $N_p$ is not obtained in this step (see Step 5).

4. Compute the Variance of the Predicted Crash Frequency
For each site in the CPM development database, compute the variance of the predicted average crash frequency using the equations associated with Assumed Condition 2 in Table 2. Use this value and the predicted average crash frequency to compute the coefficient of variation (CV) for each site.

5. Define Reduced Model and Estimate Overdispersion Parameter
The estimated coefficients $b_0$ and $b_1$ from Step 3 are inserted in Equation 49 to obtain the reduced model. Regression analysis based on a maximum-likelihood criterion (using a negative-binomial distribution for the dependent variable) is used to estimate the overdispersion parameter for the reduced model (as well as the standard error of each coefficient and variance-covariance matrix of the estimated coefficients). The CPM development data is used for this purpose. The reduced model is then used with the omitted CMF to compute the predicted crash frequency $N_p$ for each site in the local-jurisdiction database. The reduced model is then used alone to compute the predicted crash frequency $N_C$ for each site in the local-jurisdiction database. This process replicates the omission of a CMF from a CPM that includes the base condition associated with the CMF.

6. Compute the Variance of the Predicted Crash Frequency
For each site in the local-jurisdiction database, compute the variance of the predicted average crash frequency using the equations associated with Assumed Condition 2 in Table 2. The regression results from Step 5 based on the reduced model alone are used for this purpose. Use the computed variance and the predicted average crash frequency to compute the coefficient of variation (CV) for each site. Also compute the CV ratio for each site (= CV from this step divided by CV from Step 4).

7. Compute the Desired Results
The following values are computed for the sites in each database. All values represent the collective set of 2000 sites in the database. The phrase "overall" is intended to convey this representation when necessary.
▪ Overall average of the predicted crash frequency in the CPM development database, $N_p$
▪ Variance of the independent variable X at sites in the CPM development database, $s_{dd}^2$
▪ Estimation coefficient associated with variable X in full model, b
▪ Overall average of the independent variable X at sites in the CPM development database, $\bar{X}_{dd}$
▪ Overdispersion parameter associated with full model, $k_f$
▪ Overall average of the predicted crash frequency in the local-jurisdiction database, $N_C$
▪ Variance of the independent variable X at sites in the local-jurisdiction database, $s_{ljd}^2$
▪ Overall average of the independent variable X at sites in the local-jurisdiction database, $\bar{X}_{ljd}$
▪ Overdispersion parameter associated with reduced model, $k_r$
▪ Overall average bias adjustment factor, $f_C$
▪ Overall average CV ratio, CVR

Evaluation Database

This section describes the creation of an "evaluation" database wherein the model estimation results and overall average values for each site-level database are recorded. Each row of the evaluation database represents an "observation" and corresponds to the results and values for one unique combination of CPM development and local-jurisdiction database (i.e., each observation represents the results for 2000 similar sites). Each combination corresponds to one set of true estimation coefficients, a mean and standard deviation for the independent variable X in the CPM database, and a mean and standard deviation for the independent variable X in the local jurisdiction database.

The true values of the estimation coefficients, independent variable mean, and independent variable standard deviation that were varied to create the various CPM and local jurisdiction databases are listed in Table 11.
Two sets of values are established to represent both CMF functions (i.e., a CMF having a value that is a function of a continuous variable, e.g., lane width) and discrete CMFs (i.e., a CMF having one value that corresponds to treatment presence, e.g., add beacon). For discrete CMFs, the database includes an independent variable X that equals "1" if the treatment is present and "0" otherwise. The mean value of this variable is an input and its standard deviation is defined by the Bernoulli distribution.

Table 11. Variables used to create the evaluation database.

Category / Variable                     Values by CMF Category
                                        Continuous                          Discrete
Estimation coefficients
  Mean crash frequency, m               2, 4, 6 crashes/yr                  2, 4, 6 crashes/yr
  AADT coefficient, b1                  0.8, 1.0, 1.2                       0.8, 1.0, 1.2
  X coefficient, b2                     -0.1, -0.01, 0.1                    -0.1, -0.01, 0.1
CPM development database
  Mean value of X, Xdd                  10, 20, 30                          0.3, 0.5, 0.7
  Standard deviation of X, sdd          1, 2, 3                             [Xdd (1 - Xdd)]^0.5
Local jurisdiction database
  Mean value of X, Xljd                 Case A: = Xdd                       0.0, 0.5, 1.0
                                        Cases B, C: Xdd - 1, Xdd, Xdd + 1
  Standard deviation of X, sljd         Case A: 1, 3, 5                     [Xljd (1 - Xljd)]^0.5
                                        Cases B, C: = sdd

Note: The mean crash frequency m was used with Equation 48 to compute the coefficient b0, based on specified values of b1, b2, average AADT, and average X for the 2000 sites in the CPM development database.

The last two columns of Table 11 show the different values used for each variable, with most variables having three unique values. In some instances, the value of one variable is defined to be a function of another variable. One unique set of variable values corresponds to one observation (i.e., row) in the evaluation database. Each additional observation was created by changing one variable value. In this manner, a factorial design was used to create the observations in the evaluation database. A total of 243 unique combinations were created for the continuous CMF case and 243 unique combinations were created for the discrete CMF case. In this manner, the evaluation database contained 486 observations.

The last two rows of Table 11 indicate that the variables for the local jurisdiction data were altered depending on the application case of interest. This refinement required the creation of an additional 243 observations for the analysis of Case A. Thus, the evaluation database ultimately contained 729 observations (= 486 + 243).
Analysis Results This section describes the findings from the evaluation of the bias prediction equations described in the previous section titled Model Development. The evaluation database described in the previous subsection was used for this purpose. The evaluation process included an initial comparison of the original prediction equation estimates with the ground-truth estimates in the evaluation database. If this comparison indicated that there was some bias between the estimate pairs, then a modification was made to the prediction equation to remove the bias. The equations shown in this section provide unbiased predictions and are recommended for application. Case A. CMF from Part D used with SPF (CMF is consistent with SPF base conditions) The evaluation data were used to examine the bias prediction adjustment factor for Case A. Equation 9 was previously provided for computing this factor. A comparison of the predicted values of this factor with the ground-truth values indicated a small bias. This bias was removed by including a correction term in Equation 9. The revised version of this equation with the correction term is shown below. Equation 50 ð 1 0.5 ð ð , ð , ð with

135 Equation 51 ð 1.120 ðð ð 0; 0.880 ðð¡âððð¤ðð ð where fA = bias adjustment factor for Case A; ct = correction term; and all other terms are as previously defined. The correction factor values were determined using a search routine that sought to minimize the sum of the squared error in the individual site observations. A comparison of the estimates from Equation 50 with the ground-truth values is shown in Figure 2. The trend line shown represents a regression line that was fit to the data. The regression coefficients and R2 are also shown in the figure. The slope coefficient is very near 1.0 and the intercept coefficient is near 0.0 which indicates that there is no significant bias in the predicted factor. The R2 is near 1.0, which indicates good agreement between the predicted and true values. Figure 2. Predicted versus true values of the bias adjustment factor for Case A. Case B. CMFs Do Not Have a Corresponding Base Condition in the SPF The evaluation data were used to examine the four equations identified in the following list. Bias adjustment factor for Case B, fB ( ï· Equation 13) ï· Bias for Case B, BiasB (Equation 22) ï· Predicted overdispersion parameter for full model in Case B, kp,f (Equation 25) ï· Coefficient of variation ratio in Case B, CVRB (Equation 29) The bias prediction adjustment factor for Case B was previously shown as Equation 13. A comparison of the predicted values of this factor with the ground-truth values indicated a small bias. This bias was removed by including a correction term in Equation 13. The revised version of this equation with the correction term is shown below. Equation 52 ð 1 0.5 ð ð , ð

136 where all variables are as previously defined and the correction term ct is obtained from Equation 51. A comparison of the estimates from Equation 52 with the ground-truth values is shown in Figure 3a. The trend line shown represents a regression line that was fit to the data. The regression coefficients and R2 are also shown in the figure. The slope coefficient is very near 1.0 and the intercept coefficient is near 0.0 which indicates that there is no significant bias in the predicted factor. The R2 is near 1.0, which indicates good agreement between the predicted and true values. a. Bias adjustment factor. b. Bias in predicted crash frequency. c. Overdispersion parameter for full model. d. Coefficient of variation ratio. Figure 3. Predicted versus true values for Case B. Figure 3b shows the comparison of the predicted and true bias in the predicted crash frequency. The predicted bias was calculated with Equation 22, where Equation 52 was used to estimate fB. The trend line and regression statistics shown indicate good agreement between the predicted and true values. The equation for estimating the overdispersion parameter associated with a Case B application was previously shown as Equation 25. A comparison of the predicted values of this factor with the ground-truth values indicated a small bias. This bias was removed by including a correction constant of â1.13â in the second term of Equation 25. The revised version of this equation with the correction constant is shown below. Equation 53 ð , ð 1.13 ð ð , â

where all variables are as previously defined.

Figure 3c shows the comparison of the predicted and true values of the overdispersion parameter. The predicted parameter was calculated with Equation 53. The trend line and regression statistics shown indicate reasonably good agreement between the predicted and true values. There is a tendency for smaller predicted parameter values to slightly underestimate the true value.

Figure 3d shows the comparison of the predicted and true ratios of the coefficient of variation. The predicted ratio was calculated with Equation 29. The trend line and regression statistics shown indicate good agreement between the predicted and true values.

Case C. CMF Not Used in CPM But Base Condition Accommodated in the SPF

The evaluation data were used to examine the four equations identified in the following list.

• Bias adjustment factor for Case C, $f_C$ (Equation 32)
• Bias for Case C, $Bias_C$ (Equation 40)
• Predicted overdispersion parameter for reduced model in Case C, $k_{p,r}$ (Equation 43)
• Coefficient of variation ratio in Case C, $CVR_C$ (Equation 47)

The bias prediction adjustment factor for Case C was previously shown as Equation 32. A comparison of the predicted values of this factor with the ground-truth values indicated a small bias. This bias was removed by including a correction term in Equation 32. The revised version of this equation with the correction term is shown below.

Equation 54

$$f_C = 1\ [\ldots]\ 0.5\ [\ldots]$$

where all variables are as previously defined and the correction term $c_t$ is obtained from Equation 51.

A comparison of the estimates from Equation 54 with the ground-truth values is shown in Figure 4a. The trend line shown represents a regression line that was fit to the data. The regression coefficients and $R^2$ are also shown in the figure. The slope coefficient is very near 1.0 and the intercept coefficient is near 0.0, which indicates that there is no significant bias in the predicted factor. The $R^2$ is near 1.0, which indicates good agreement between the predicted and true values.

Figure 4b shows the comparison of the predicted and true bias in the predicted crash frequency. The predicted bias was calculated with Equation 40, where Equation 54 was used to estimate $f_C$. The trend line and regression statistics shown indicate good agreement between the predicted and true values.

The equation for estimating the overdispersion parameter associated with a Case C application was previously shown as Equation 43. A comparison of the predicted values of this factor with the ground-truth values indicated a small bias. This bias was removed by including a correction constant of "1.16" in the second term of Equation 43. The revised version of this equation with the correction constant is shown below.

Equation 55

$$k_{p,r} = [\ldots]\ 1.16\ [\ldots]$$

where all variables are as previously defined.

Figure 4c shows the comparison of the predicted and true values of the overdispersion parameter. The predicted parameter was calculated with Equation 55. The trend line and regression statistics shown indicate good agreement between the predicted and true values.
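The two checks used throughout this evaluation, fitting a correction constant by minimizing squared error and then inspecting the slope, intercept, and R2 of a predicted-versus-true trend line, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the data are synthetic, the target constant of 1.15 is arbitrary, the function names are hypothetical, and a golden-section search stands in for the report's search routine, which the text does not name.

```python
import random


def fit_trend_line(true_vals, pred_vals):
    """Least-squares line pred = intercept + slope * true, plus R^2."""
    n = len(true_vals)
    mean_x = sum(true_vals) / n
    mean_y = sum(pred_vals) / n
    sxx = sum((x - mean_x) ** 2 for x in true_vals)
    syy = sum((y - mean_y) ** 2 for y in pred_vals)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(true_vals, pred_vals))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


def sum_squared_error(constant, raw_terms, true_vals):
    """Sum of squared errors over individual observations."""
    return sum((constant * r - t) ** 2
               for r, t in zip(raw_terms, true_vals))


def search_constant(raw_terms, true_vals, lo=0.5, hi=2.0, tol=1e-6):
    """Golden-section search for the constant that minimizes the SSE.

    The SSE is quadratic in the constant, so it has a single minimum
    and the interval [lo, hi] can be narrowed safely.
    """
    inv_phi = (5 ** 0.5 - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if (sum_squared_error(c, raw_terms, true_vals)
                < sum_squared_error(d, raw_terms, true_vals)):
            b = d
        else:
            a = c
    return (a + b) / 2


# Synthetic stand-in for the evaluation data (NOT the report's data):
# the uncorrected term understates the true value by a factor of 1.15.
random.seed(0)
raw_terms = [random.uniform(0.1, 1.0) for _ in range(200)]
true_vals = [1.15 * r + random.gauss(0.0, 0.01) for r in raw_terms]

constant = search_constant(raw_terms, true_vals)
corrected = [constant * r for r in raw_terms]
slope, intercept, r_squared = fit_trend_line(true_vals, corrected)

print(f"constant={constant:.3f} slope={slope:.3f} "
      f"intercept={intercept:.3f} R2={r_squared:.3f}")
```

In this sketch, a slope near 1.0 with an intercept near 0.0 plays the same diagnostic role as in the figures: it suggests the corrected predictions carry no systematic bias relative to the true values, while R2 summarizes how tightly they agree.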

Figure 4d shows the comparison of the predicted and true ratios of the coefficient of variation. The predicted ratio was calculated with Equation 47. The trend line and regression statistics shown indicate good agreement between the predicted and true values.

Figure 4. Predicted versus true values for Case C: (a) bias adjustment factor; (b) bias in predicted crash frequency; (c) overdispersion parameter for reduced model; (d) coefficient of variation ratio.

References for Appendix A

Benjamin, J., and C. Cornell. (1970). Probability, Statistics, and Decision for Civil Engineers. McGraw-Hill, New York.

iTrans. (2006). Task 8A: Develop Decision Rule – AMF Acceptance Criteria. Working paper from NCHRP Project 17-27, Parts I and II of the Highway Safety Manual. iTrans Consulting Ltd.

Lord, D. (2008). Methodology for Estimating the Variance and Confidence Intervals for the Estimate of the Product of Baseline Models and AMFs. Accident Analysis and Prevention, Vol. 40, pp. 1013-1017.

Lord, D., S. Geedipally, B. Persaud, S. Washington, I. van Schalkwyk, J. Ivan, C. Lyon, and T. Jonsson. (2008). NCHRP Web-Only Document 126: Methodology to Predict the Safety Performance of Rural Multilane Highways. Transportation Research Board of the National Academies, Washington, D.C.

Lord, D., P. Kuo, and S. Geedipally. (2010). Comparison of Application of Product of Baseline Models and Accident-Modification Factors and Models with Covariates: Predicted Mean Values and Variance. Transportation Research Record: Journal of the Transportation Research Board, No. 2147, pp. 113-122.

Maher, M., and I. Summersgill. (1996). A Comprehensive Methodology for the Fitting of Predictive Accident Models. Accident Analysis and Prevention, Vol. 28, No. 3, pp. 281-296.

Miaou, S.P. (1996). Measuring the Goodness-of-Fit of Accident Prediction Models. FHWA-RD-96-040. Federal Highway Administration, Washington, D.C.

Vogt, A. (1999). Crash Models for Rural Intersections: Four-Lane by Two-Lane Stop-Controlled and Two-Lane by Two-Lane Signalized. Report No. FHWA-RD-99-128. Federal Highway Administration, Washington, D.C.

Wood, G.R. (2005). Confidence and Prediction Intervals for Generalized Linear Accident Models. Accident Analysis and Prevention, Vol. 37, No. 2, pp. 267–273.

