National Academies Press: OpenBook

Evaluation of Data Needs, Crash Surrogates, and Analysis Methods to Address Lane Departure Research Questions Using Naturalistic Driving Study Data (2011)

Chapter: Chapter 6 - Analytical Tools and Initial Analysis of Lane Departure Research Questions

Suggested Citation: "Chapter 6 - Analytical Tools and Initial Analysis of Lane Departure Research Questions." National Academies of Sciences, Engineering, and Medicine. 2011. Evaluation of Data Needs, Crash Surrogates, and Analysis Methods to Address Lane Departure Research Questions Using Naturalistic Driving Study Data. Washington, DC: The National Academies Press. doi: 10.17226/22848.


CHAPTER 6

Analytical Tools and Initial Analysis of Lane Departure Research Questions

This chapter outlines several exploratory analytical approaches that were used to evaluate the existing naturalistic driving study data and that may be appropriate for analyzing the data from the full-scale naturalistic driving study to answer a variety of lane departure research questions. The first is a data mining approach (classification and regression tree analysis). The second uses odds ratios and logistic regression. The third describes how the exploratory method used in the second can be expanded for the full-scale study to account for repeated measurements. The fourth is a time series analysis. Each approach uses data sampled in a different way. Each is described in a separate section, with the following information provided for each section:

• Background information that describes the general methodology;
• Details on how the approach was used to conduct an initial analysis of existing naturalistic driving studies;
• Results from the initial analysis of existing data;
• Considerations for the full-scale study, with a particular focus on data reduction and sampling;
• Limitations of the method using the existing data; and
• Expected limitations and advantages for the full-scale study.

Information common to all methods, such as data sampling approaches and data reduction, is provided in separate sections.

Objective

The objective of this analysis plan was to develop and explore methodologies to answer research questions relating to lane departure crashes. The focus was to identify which roadway, driver, vehicle, and environmental factors are the best explanatory variables in predicting an increased likelihood of lane departures and lane departure crashes.

Improved data about actual events that lead to a lane departure crash or a noncrash incident will be extremely valuable in developing a better understanding of what negative factors lead to crashes and near misses, as well as of the factors that result in more positive subsequent events and outcomes. Understanding why crashes did not occur yields as much useful information as evaluating why they did occur. In both cases, factors that cause a vehicle to initially leave the roadway and the relationship between road, environment, vehicle, and human factors and subsequent events and outcomes can be studied. Dingus et al. (2006) reported that analysis of near crashes from the VTTI naturalistic driving study was valuable, as it demonstrated drivers successfully performing evasive maneuvers.

The intent of answering lane departure research questions is to provide roadway agencies and other practitioners with information about which factors positively or negatively influence the likelihood of a lane departure. A better understanding of roadway factors will allow agencies to better address safety in roadway design and assess the benefits of various countermeasures, such as rumble strips, flattening or better delineating curves, mandating paved shoulders on reconstruction and rehabilitation projects, and policy. A better understanding of driver factors related to lane departures will allow agencies to make better policy decisions, such as addressing younger driver training and licensing. A better understanding of environmental factors will enable agencies, for instance, to make informed winter maintenance decisions and determine trade-offs in application of street lighting.
Audience

The primary audiences who can use the information obtained from answering lane departure research questions in the full-scale study are state, county, and local transportation agencies and policy makers. Consequently, the information obtained should be in a format that can be used to make informed decisions about improved highway design during initial design and during reconstruction and rehabilitation. The information can also be used to select appropriate roadway countermeasures and guide policy decisions. Hence, the outcome of this lane departure analysis should provide quantitative relationships between lane departure crash likelihood and explanatory factors so that agencies can estimate the benefits and costs of implementing countermeasures.

Given the likely audience, results presented in the form of “rumble strips reduce lane departures by 20%” or “drivers are four times more likely to be involved in a fatal/major injury crash on a two-lane roadway with 12-ft lanes and 6-ft gravel shoulders than on a two-lane roadway with 12-ft lanes and 2-ft paved shoulders” would be the most useful for comparing alternatives.

Consequently, analysis methods that provide crash reduction factors or odds ratios may be the most beneficial for providing specific information that can be used in assessing the costs and benefits of different designs or countermeasures. Highway engineers and policy makers have some familiarity with these types of analyses, and the results can be communicated to the general public. However, it will only be possible to create crash reduction factors if sufficient crashes are available in the full-scale study.

The information in this report is geared toward those who will conduct or review lane departure analyses using the naturalistic driving study.

Data Availability in Full-Scale Naturalistic Driving Study to Answer Lane Departure Research Questions

This section provides a brief discussion of the data expected to be available in the full-scale naturalistic driving study to answer lane departure research questions as they relate to the discussion on analytical methods in this section. Chapter 4 provides an in-depth review of the most recently available information and discusses the availability of data in the full-scale study to answer lane departure research questions. The accuracy, frequency of collection, and resolution that are expected to be necessary to address lane departure research questions are presented, and comments regarding the adequacy of the expected data collection are provided.

Dynamic driver and vehicle data are expected to be collected by the vehicle instrumentation DAS at 10 Hz (0.1-s intervals). Data will be reported at this level of resolution. Some data elements may be collected at a higher resolution (at a rate higher than 10 Hz) and will be aggregated to the 10-Hz level. Other data will be collected at a lower frequency or resolution and will be reported at 10 Hz.

Sensors available in the DAS that will monitor drivers include left-side, right-side, and head position driver video and a passive alcohol sensor. There has been some discussion about a “head position tracking system” being provided in the DAS. It is unknown at this time whether this will be available or whether head position tracking will be completed for all data or only for a subset of the data.

Dynamic vehicle factors from the DAS include forward/side radar; collection of vehicle kinematics (e.g., speed, acceleration, side acceleration); vehicle spatial position; and forward, side, and back video.

The final data set available to researchers is expected to consist of a spatial data set that contains individual vehicle/driver activity data at 10 Hz (1 row or frame per 0.1 s). Information from other sources is expected to be linked to that database and reported at the same level, even if those data are not collected at the same frequency. Static driver and vehicle variables may be either linked to the data set or provided in a relational database that can be joined to the spatial driver/vehicle data set.
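The linking described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that outside temperature arrives once per second and that static driver attributes sit in a separate lookup table; the function name `carry_forward`, the field names, the driver ID, and all values are hypothetical and are not taken from the study's database design.

```python
from bisect import bisect_right


def carry_forward(frame_ticks, sample_ticks, sample_values):
    """Report a lower-rate measurement on every frame by carrying the latest value forward.

    Ticks are integer tenths of a second (one tick per 10-Hz frame), and
    sample_ticks must be sorted. Frames before the first sample get None.
    """
    linked = []
    for tick in frame_ticks:
        i = bisect_right(sample_ticks, tick) - 1  # latest sample at or before this frame
        linked.append(sample_values[i] if i >= 0 else None)
    return linked


# Hypothetical static driver table, keyed by driver ID (assumed layout).
drivers = {17: {"age": 34, "gender": 2}}

# Three seconds of 10-Hz frames and an assumed 1-Hz temperature signal.
frame_ticks = list(range(30))
temp_ticks = [0, 10, 20]
temp_values = [4.5, 4.6, 4.4]
temps = carry_forward(frame_ticks, temp_ticks, temp_values)

# Join the static fields and the carried-forward value onto every frame.
frames = [
    {"driver": 17, "tick": tick, "outside_temp_c": temp, **drivers[17]}
    for tick, temp in zip(frame_ticks, temps)
]
```

Each 0.1-s row then carries the most recent lower-rate value, so the value stays constant between samples rather than being interpolated.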
Roadway data will be collected by the mobile mapping system (SHRP 2 Safety Project S04B) or will be available from existing state databases. The mobile mapping system will only collect data from a sample of roadways in a given study area. As a result, the same roadway data and the same data accuracy and resolution may not be available for all roadways. If the source of data is state databases, some differences will result across study areas. Roadway data, when available, are expected to be linked to the vehicle data in the full-scale study using spatial overlay.

Most of the roadway data will be collected at a lower rate than 10 Hz but will be reported with the final data set at that level. For instance, shoulder width may only be measured once per mile. If linked with the vehicle data, shoulder width will be included as a field reported for each 0.1 s, but the value would be the same for all 0.1-s observations within each 1-mi sample interval. Roadway variables that are not provided will need to be extracted from the outside vehicle imagery, aerial images, or other sources.

Two dynamic environmental factors will be included in the DAS: time and outside temperature. All other environmental factors will need to be extracted from the outside vehicle imagery or other sources.

Certain static characteristics, such as driver age, driver gender, vehicle type, and vehicle track width, will also be available. If they are not included as data fields in the continuous data, they can be linked and included as data fields in the continuous data.

No dynamic driver factors will be provided with the final data set except for readings from a passive alcohol sensor. As indicated, some head tracking information may be available. All other driver factors will have to be reduced from the driver video (e.g., distractions). Reduction of driver data at the 0.1-s interval would require a tremendous amount of resources. Therefore, data applications using continuous data would likely need to reduce driver data at a lower resolution (e.g., once per minute). This information can then be linked to the continuous data. Some automation can be used, but applications would need to be developed for this.

Driver factors relevant to driver distraction that would need to be reduced include the following: head position, which serves as a measure of eyeglance location; distractions (e.g., cell phone use, talking to passengers); hand position on steering wheel or other location; and measures of fatigue, such as head drooping or yawning.

Data Segmentation Approaches

Modeling relies on obtaining the necessary data at the appropriate level of accuracy, frequency, and resolution. Data can be extracted in different ways depending on the application. Researchers for SHRP 2 Safety Project S02, Integration of Analysis Methods and Development of Analysis Plan (Boyle et al., 2010), developed a model segmentation approach that can be applied to answering research questions for the full-scale naturalistic driving study data, as described below. This approach is included in the analysis plan because it was judged to be a useful structure for presenting the ways data were collected for the four analyses described in the following sections. The data segmentation is as follows:

• Continuous (frame): At this level, data are modeled at the rate at which they were collected, resulting in very large sample sizes. This is similar to the raw data set that would result from the instrumented vehicle DAS. The data will be quality assured, and some review of the data will be necessary. The instrumented vehicle data are expected to be collected at 10 Hz (0.1-s intervals). The term “continuous” is used, although in reality the data are discrete because they represent data aggregated to a set amount of time (i.e., data are aggregated to 0.1-s intervals). However, for all intents and purposes, the data can be considered as continuous.
• Sequential blocks: At this level, data are sampled and aggregated to blocks or epochs in which they are summarized over consecutive time periods. For instance, a 5-min sampling rate would indicate that data over each 5-min period are summarized into one observation. Data from different data fields can be aggregated over the block of time in different ways. For instance, the data for a particular field could be averaged, it could be summed, minimum and maximum values could be provided, or the number of times a particular value occurs could be reported. Data can be aggregated for any time period up to the trip level.
• Sample based: Data at this level are sampled at regular time intervals but are not aggregated. For instance, driver head pose may be sampled and reduced by the researchers every 2 min. Data at this level represent a snapshot in time.
• Event: Data at this level are aggregated for an incident (e.g., lane departure) or some other event of interest. Event data are aggregated for a set amount of time around an “event” to one observation per event (e.g., 30 s before the event start to 30 s after). An incident could be a crash, near crash, lane departure, and so forth. An example of an event of interest is vehicle activity in the vicinity of signalized intersections where one or more approaches have a posted speed limit of 50 mph or higher (high-speed signalized intersections). An event differs from a block in that it contains data only when an incident or event of interest has occurred.

Examples of data at the continuous, sequential block, and sample-based levels are shown in Figures 6.1 to 6.3. In each figure, the variable speed is shown for a hypothetical database.
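The segmentation levels listed above can also be sketched in code. The following is a minimal illustration on a synthetic 10-Hz speed trace, not the S02 implementation; the function names, the choice of summary statistics, and the speed values are assumptions made for this example.

```python
from statistics import mean

FRAME_RATE_HZ = 10  # one frame per 0.1 s, as in the continuous data


def sequential_blocks(speeds, block_s):
    """Summarize consecutive blocks of frames into one observation per block."""
    n = block_s * FRAME_RATE_HZ
    blocks = []
    for start in range(0, len(speeds), n):
        chunk = speeds[start:start + n]
        blocks.append({"start_frame": start, "n_frames": len(chunk),
                       "mean": mean(chunk), "min": min(chunk), "max": max(chunk)})
    return blocks


def sample_based(speeds, interval_s):
    """Take one unaggregated frame (a snapshot) at each regular interval."""
    n = interval_s * FRAME_RATE_HZ
    return [(i, speeds[i]) for i in range(0, len(speeds), n)]


def event_window(speeds, event_frame, before_s, after_s):
    """Summarize a fixed window around an event into a single observation."""
    lo = max(0, event_frame - before_s * FRAME_RATE_HZ)
    hi = min(len(speeds), event_frame + after_s * FRAME_RATE_HZ + 1)
    chunk = speeds[lo:hi]
    return {"start_frame": lo, "end_frame": hi - 1, "n_frames": len(chunk),
            "mean": mean(chunk), "min": min(chunk), "max": max(chunk)}


# Synthetic trace: one minute at 25 m/s followed by one minute at 27 m/s.
speeds = [25.0] * 600 + [27.0] * 600

blocks = sequential_blocks(speeds, block_s=60)          # one row per minute
samples = sample_based(speeds, interval_s=60)           # one snapshot per minute
event = event_window(speeds, event_frame=600, before_s=30, after_s=30)
```

The continuous level is simply the `speeds` list itself. The block and event summaries discard within-window variation, whereas the sample-based rows keep individual 0.1-s values but ignore everything between samples.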
As shown in Figure 6.1, data at the continuous level are used at the rate at which they are reported in the naturalistic driving study. As a result, one observation (one row) is present for each 0.1 s of driving. Figure 6.2 shows data collected at the sequential block level. A 1-min sampling interval was selected, and data were aggregated over each 1-min period. Each minute of data would provide one observation. An example of the sample-based approach is shown in Figure 6.3 for the same data set. Data are sampled at 1-min intervals. As a result, one row of data is extracted for each 1-min sample period. Speed would be reported for the 0.1-s interval extracted. One observation would be present for each 1-min period, but it would reflect only the sampled 0.1-s interval.

General Information About Data Reduction

This section provides general information about how the existing naturalistic driving study data were reduced for the analyses described in this chapter. Data for rural driving were requested from UMTRI for the road departure crash warning (RDCW) field operational test (FOT). A description of the data request and a detailed description of the data received and other data sets used are provided in Chapter 3. A detailed description of the data reduction process is provided in Chapter 4.

Data were provided for subjects during the period when the RDCW system was functioning and recording data but not providing feedback to drivers. UMTRI provided data for rural roadways for 44 drivers. Vehicle activity data were provided in a Microsoft Access database as continuous data. Each row of data represented 0.1 s of vehicle driving for one driver during a trip. Forward imagery was provided in most cases at 2 Hz (two rows per second, or one image per five rows of vehicle trace data). During times when the RDCW system alert reported that a lane departure may have occurred, forward imagery was provided at 10 Hz (10 rows per second, or one image per row of vehicle trace data) for the 4 s before and 4 s after an alert was recorded.

Figure 6.1. Example of data set showing data segmented at the continuous level.

Several variables used in the analysis were provided with the data set: driver number, trip number, time since start of trip, driver age, driver gender, vehicle spatial position, heading, brake on or off, cruise control on or off, vehicle offset from center of lane, lane width, vehicle track width, speed, lateral speed, lateral acceleration, side acceleration, yaw rate, roll rate, pitch rate, wiper status, headlamp status, road type, posted speed limit, advisory speed, AADT, and number of through lanes. In some cases, advisory speed and posted speed limit were not included and had to be obtained from the forward imagery.

A number of variables that were not provided in the UMTRI data could be extracted or created from either the UMTRI data or from other available data sources. Other data sources included aerial imagery, a roadway database, and a crash database for Michigan. (The databases are described in Chapter 3; extraction of data elements is discussed in Appendix A.)

Because large amounts of data were provided and data reduction became a time-consuming task, it was decided to focus on rural, two-lane, paved roadways. Only paved roadways were considered because the lane tracking system did not function well on unpaved surfaces, such as gravel.

In order to determine what other variables should be extracted, the team conducted a comprehensive literature review and compiled a list of potential variables that have been shown to affect the likelihood and severity of lane departure crashes (see Chapter 2).

Figure 6.2. Example of data set showing data segmented at the sequential block level.

All of the data elements that the team determined were important from the literature and could be obtained from one of the available databases (vehicle data, aerial imagery, roadway data, forward imagery, and crash database) were extracted. In several cases, data were obtained from the merging of two or more databases. For instance, curve radius and direction were determined by overlaying the vehicle database with aerial imagery and determining the start and end point in the vehicle data that corresponded to each curve, while curve radius was measured using the aerial imagery.

The original continuous vehicle activity data from UMTRI were provided in a database with each row representing 0.1 s of activity for a particular driver/vehicle. When other variables were extracted from the various data sets, they were linked to the continuous data even if they were extracted at a lower resolution. For instance, shoulder width was determined for a homogeneous roadway section. All vehicle activity along that section would have been selected, and a data field “ShldWidth” would be populated with the single measurement for shoulder width.

A summary of the variables used in the different analyses is provided in Tables 6.1 and 6.2. A number of other variables were extracted, such as type of curve advisory signing and visibility, but were not included in the analyses because of low sample sizes.

Lane departures were determined by calculating vehicle wheel path using vehicle offset, lane width, and track width, as described in Appendix A. A lane departure was defined as a vehicle wheel path crossing over the right (right-side lane departure) or left (left-side lane departure) lane line and encroaching upon either the shoulder or the adjacent lane by 0.1 m or more. The 0.1-m threshold was used as a buffer because there is some uncertainty in the estimation of wheel path.
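The lane departure definition above can be sketched as follows. The actual wheel path calculation is in Appendix A and is not reproduced here; this sketch assumes that offset is the lateral position of the vehicle centerline relative to the lane center in meters, positive to the right, and that the outer wheel edge lies half a track width from the centerline. The function names, sign convention, and sample values are hypothetical.

```python
from itertools import groupby

BUFFER_M = 0.1  # encroachment threshold from the definition above


def classify_frame(offset_m, lane_width_m, track_width_m, buffer_m=BUFFER_M):
    """Return 'right', 'left', or None for one 0.1-s frame.

    Assumes offset_m is measured from the lane center, positive to the right.
    """
    half_lane = lane_width_m / 2
    half_track = track_width_m / 2
    right_encroachment = offset_m + half_track - half_lane
    left_encroachment = -offset_m + half_track - half_lane
    if right_encroachment >= buffer_m:
        return "right"
    if left_encroachment >= buffer_m:
        return "left"
    return None


def find_departures(offsets, lane_width_m, track_width_m):
    """Group consecutive flagged frames into (side, first_frame, last_frame) tuples."""
    labels = [classify_frame(o, lane_width_m, track_width_m) for o in offsets]
    departures = []
    frame = 0
    for side, run in groupby(labels):
        length = len(list(run))
        if side is not None:
            departures.append((side, frame, frame + length - 1))
        frame += length
    return departures


# Hypothetical offsets (m) for a 3.6-m lane and a 1.6-m track width.
offsets = [0.0, 0.5, 1.2, 1.3, 1.2, 0.4, -1.3, -1.4, 0.0]
departures = find_departures(offsets, lane_width_m=3.6, track_width_m=1.6)
```

This flags only the frames in which the wheel path is beyond the lane line by at least the buffer; it does not capture the time spent drifting toward the line or returning to the lane.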
In all cases, the vehicle departed the lane and then returned to the initial lane of travel without losing

81Figure 6.3. Example of data set showing data segmented at the sample level.control or making sudden evasive maneuvers. This type of lane departure was referred to as an encroachment in the discussion on crash surrogates in Chapter 5. The UMTRI data set did not provide any near crash or crashes. It should be noted that some of the left-side lane departures may have been cases of drivers intentionally “cutting the curve.” It may be possible to ascertain this from the driver’s face video and from the driver’s hand position on the steering wheel. However, the team did not have access to this type of information. The data reduction resulted in a total of 22 right-side lane departure and 51 left-side lane departure events for two-lane rural roads. It also resulted in over 113,000 observations (0.1-s data frames) of normal driving. Data for which lane departure incidents occurred were modeled as continuous data in the data mining and time series analysis approaches and were summarized by event for the approach using an odds ratio. In this case, data for a block of time around left- or right-side lane departures were summa- rized as an “event.” The start point for each lane departure wasdetermined by identifying the point at which the vehicle began deviating from its path toward the edge of the lane, as shown in Figure 6.4. The end point of the event was the point after the vehicle returned to the roadway and corrected its path. The start and end times were noted at those points, and the contin- uous data for each event were extracted. A lane departure event included time spent drifting from the roadway or lane, time off the roadway or lane, and time returning to the original lane of travel. The average lane departure was approximately 8.0 s (80 instances of 0.1-s observations). Depending on the amount of time included, each event was weighted accordingly.Data for which no lane departure had occurred were used to represent normal driving data. Shankar et al. 
(2008) referred to exposure measures as “controls.” Events are situations of interest (crash, near crash), and controls are situations where the outcome is absent (normal driving). Risk can be determined by dividing the number of events by total exposure (“control”) for a cohort. Normal driving data were used as continuous data for the data mining and time series analyses and were summarized into epochs, which are similar to events for the odds ratio analysis. Epochs were selected by driver and trip when roadway and environmental conditions were consistent. When a change in roadway occurred, a new epoch was created. For instance, data for a driver traveling along a specific roadway during a particular trip would be partitioned each time roadway conditions changed. Data along a tangent section would be marked as one epoch if the roadway cross section did not change. When the vehicle encountered a curve, a new epoch would be created that contained all of the vehicle activity on the curve. At the end of the curve, a new epoch would be created for the next tangent section.

Table 6.1. Description of Driver, Vehicle, and Environmental Variables

| Variable | Source | Description | Variable Type |
|---|---|---|---|
| Driver Variables | | | |
| Age | Provided with data set | Age of driver | Numeric |
| Gender | Provided with data set | 1 = male, 2 = female | Categorical |
| OvrSpd5 | Calculated from speed and posted speed limit | Fraction of time driver exceeds the posted speed limit by 5 mph on rural, two-lane roads | Numeric |
| OvrSpd10 | Calculated from speed and posted speed limit | Fraction of time driver exceeds the posted speed limit by 10 mph on rural, two-lane roads | Numeric |
| OvrAdvSpd5 | Calculated from speed and posted speed limit | Fraction of time driver exceeds the advisory curve speed by 5 mph on rural, two-lane roads | Numeric |
| OvrAdvSpd10 | Calculated from speed and posted speed limit | Fraction of time driver exceeds the advisory curve speed by 10 mph on rural, two-lane roads | Numeric |
| Vehicle Variables | | | |
| Spd | Provided with data set | Vehicle forward speed (m/s) | Numeric |
| LatSpd | Provided with data set | Vehicle side speed (m/s) | Numeric |
| Ax | Provided with data set | Forward acceleration (m/s²) | Numeric |
| Ay | Provided with data set | Side acceleration (m/s²) | Numeric |
| RollRate | Provided with data set | Rate of roll (deg/s) | Numeric |
| PitchRate | Provided with data set | Pitch rate (deg/s) | Numeric |
| YawRate | Provided with data set | Rate of yaw (deg/s) | Numeric |
| Following | Extracted from forward video | Subjective measure of vehicle following. 0: Not following; 1: Following; 2: Following closely | Categorical |
| Environmental Variables | | | |
| Time | Extracted from forward video and time | Indicates time of day. 0: Day; 1: Dawn/dusk/night. There was no overhead lighting on any of the roadways, so all nighttime driving was dark/unlighted. | Categorical |
| EnvCond | Extracted from forward video | Prevailing atmospheric conditions. 0: Clear (no precipitation); 1: Light to moderate rain; 2: Heavy rain; 3: Light to moderate snow; 4: Heavy snow; 5: Fog | Categorical |
| RoadSurf | Extracted from forward video and wiper status | 0: Dry; 1: Wet. There was no snow on any of the roadways. | Categorical |

Table 6.2. Description of Roadway and Other Variables

| Variable | Description | Variable Type |
|---|---|---|
| Roadway Variables | | |
| Radius | Curve radius in m | Numeric |
| CurveType | Direction of curve from perspective of driver. 0: No curve; 1: Right curve; 2: Left curve | Categorical |
| LaneWidth | Lane width in m | Numeric |
| ShldWidth | Shoulder width in m | Numeric |
| ShldType | Type of shoulder present. 1: Paved; 3: Gravel; 4: Earth; 6: No shoulder; 7: Partially paved | Categorical |
| PvMCond | Pavement marking condition. 0: Highly visible; 1: Visible; 2: Obscure | Categorical |
| DwyDen | Density of driveways to the right (driveways/m) | Numeric |
| Other Variables | | |
| AADT | Annual average daily traffic for roadway segment in vehicles per day | Numeric |
| OnDen | Density of oncoming vehicles (vehicles/m) | Numeric |
| Conflict | Indicates type of vehicle event. 11: Normal driving; 21: Right-side lane departure; 31: Left-side lane departure | Categorical |
| Angle | Angle that vehicle left roadway during departure | Numeric |
| MaxOff | Maximum distance vehicle encroached into adjacent lane or shoulder during lane departure | Numeric |
| CrshDen | Density of lane departure crashes along roadway segment (crashes/m) | Numeric |

Source entries for Table 6.2, in table order (some entries span several adjacent rows): Extracted from aerial imagery; Extracted from forward video; Provided; Extracted from forward video; Provided; Extracted from forward video; Extracted from forward video and vehicle data; Extracted from vehicle data; Extracted from Michigan crash database and aerial imagery.

Data could not be partitioned by driver characteristics because dynamic driver characteristics were not available and static driver variables such as age and gender did not change. In most cases, environmental conditions were consistent across a roadway section, so it was not necessary to

consider changes in environmental conditions. Data were summarized for each epoch. The length of each epoch was different because drivers spent different amounts of time driving on a particular type of roadway. The number of 0.1-s intervals for each epoch was included as a weighting factor.

Figure 6.4. Begin and end point for event. Source: UMTRI RDCW data set.

Information about normal driving is useful because it can be used to represent exposure. One of the strengths of naturalistic driving studies is that a substantial amount of normal driving will be available, which can be used to determine a driver’s exposure for a particular set of circumstances. Currently, there is no realistic method to obtain exposure data for an individual driver, and it is even more difficult to obtain detailed exposure for a cohort of drivers. The most common measure of exposure for a driver cohort is the number of licensed drivers partitioned by age or some other characteristic. However, the use of number of licensed drivers assumes that all drivers drive an equal number of miles and may overestimate or underestimate involvement if the driver group has different travel trends. For instance, older drivers may drive substantially less than drivers in other age groups. VMT by age group is a better measure because it reflects actual exposure, but it is difficult to obtain on a local or even state level. National studies, such as the Nationwide Personal Transportation Survey (NPTS), have developed VMT fractions by age group, but national statistics may not be representative of state and local areas.

Actual VMT by age group can be extracted from naturalistic driving study data for a given set of conditions. This will provide a unique opportunity to study risk by driver subpopulation groups.
Using the naturalistic driving study, the amount of driving a driver or group of drivers engages in on a particular roadway type can be used as a measure of exposure.

Analysis Approach 1: Data Mining

Three different analysis approaches were used to model the UMTRI data. The first was a data mining approach, described in this section.

Description

Data mining is the process of analyzing data to uncover patterns and establish relationships. Data mining processes may include the following (Search SQL Server, 2009):

• Association, which involves looking for patterns consisting of events that are connected to each other;
• Sequencing, which involves looking for patterns consisting of events where one event leads to another;
• Classification, which involves looking for new patterns;
• Clustering, which involves organizing groups of facts; and
• Forecasting, which involves looking for patterns that can be used to make predictions.

Data mining is the exploration and analysis of large amounts of data to discover meaningful patterns and rules that are not otherwise evident. The process can be automated or semiautomated (Collier et al., 1998). The discovery of patterns leads to additional knowledge. Data mining is useful for large data sets where patterns cannot easily be uncovered by human analysts. It also allows analysis of data that may never have been analyzed using other techniques. It can be used for both prediction and description (Tan et al., 2006).

Sampling Approach

A sample-based approach using a sampling interval of 0.1 s was used to model the data. As a result, every 10th observation (0.1-s frame) was selected. The sample included both normal driving data and left- and right-side lane departures.

Response Variables

Two models were developed, one with a response variable for right-side lane departures and the other with a response variable for left-side lane departures.

Explanatory Variables

All of the driver, roadway, and environmental variables in Tables 6.1 and 6.2 were evaluated in different combinations. Variables that were expected to be correlated were not evaluated at the same time.

Modeling Approach and Results

Description of Classification and Regression Tree Model

A classification and regression tree model was the data mining modeling approach selected. Classification methods assign objects to predefined categories (Tan et al., 2006). Tree-based models are used for both classification and regression. A tree-based analysis uses a response variable (Y) that can be either quantitative or qualitative and a set of classification or predictor variables (Xi) that may be a mixture of ordinal or nominal variables. For classification trees the response is categorical, and for regression trees the response is quantitative (Nagpual, 2009). Classification and regression trees use algorithms to determine a set of if-then logical split conditions that divide the data into subsets. One of the advantages of regression tree analysis over traditional regression analysis is that it is a nonparametric method that does not require assumptions of a particular distribution and is more resistant to the effects of outliers; splits usually occur at nonoutlier values. Tree models are nonlinear, meaning that no assumption is made about the form of the underlying relationships between the response and explanatory variables. In addition, independent variables do not have to be specified in advance.
A regression tree selects only the most important independent variables and the values of those variables that result in the maximum reduction in deviance. Another advantage is that results are invariant with respect to monotone transformations of the independent variables. Thus, the researcher does not have to test a number of transformations to find the “best” fit (Roberts et al., 1999). The regression tree also allows relationships between variables to be uncovered that may not be determined using other methods (StatSoft, 2008). For instance, shoulder width may be relevant in determining whether a right-side lane departure results in a lane departure crash on curves of a certain radius but may not be relevant for tangent sections or curves with larger radii.

S-PLUS Statistical Software’s (Version 8.0.4) classification and regression tree analysis was used to evaluate the data. Regression tree rules are determined by a procedure known as recursive partitioning, which iteratively generates a tree structure by splitting the sample data set into two subsets according to two rules. First, the independent variable that produces the maximum reduction in variability is identified. Next, the value of the variable that results in the maximum reduction in variability is selected (Wolf et al., 1998).

Figure 6.5 shows an example of a classification and regression tree analysis that was used to determine the factors related to high accelerations at intersections in a model to predict vehicle activity for emissions modeling (Hallmark et al., 2002). As indicated, the tree split on the variables queue position, approach grade, and distance to the nearest downstream signalized intersection. A vehicle that was on a segment with a downstream distance of less than 902 ft, that was in a queue position less than 2 (first in queue), and that was on an approach with a grade of −1% or lower had an acceleration of 10.85 ft/s² starting up from the intersection stop line.
As shown, grade was only relevant for vehicles in queue positions 1, 2, and 3, and distance to the nearest downstream signalized intersection was only relevant for vehicles in queue position 1.

Figure 6.5. Example of a classification and regression tree used to model vehicle acceleration for emissions modeling.

In growing a regression tree, the binary partitioning algorithm recursively splits the data in each node until the node is homogeneous or until a minimum criterion such as number of observations is met. If left unconstrained, a regression tree can grow until it results in a complex model with a single observation at each terminal node that explains all the deviance.
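The split rule described above (pick the variable, and the value of that variable, that gives the maximum reduction in deviance) can be illustrated with a minimal Python sketch. This is not the S-PLUS implementation: the function names, the `min_node_size` parameter, and the toy records are hypothetical, deviance is taken here to be the sum of squared deviations from the node mean, and the variable and cut point are searched jointly rather than in two separate steps.

```python
from statistics import mean


def deviance(values):
    """Sum of squared deviations from the node mean."""
    if not values:
        return 0.0
    center = mean(values)
    return sum((v - center) ** 2 for v in values)


def best_split(rows, response, predictors, min_node_size=1):
    """Return (predictor, cut point, deviance reduction) for the best binary split."""
    parent = deviance([row[response] for row in rows])
    best = (None, None, 0.0)
    for name in predictors:
        levels = sorted({row[name] for row in rows})
        # Candidate cut points are midpoints between adjacent observed values.
        for low, high in zip(levels, levels[1:]):
            cut = (low + high) / 2
            left = [row[response] for row in rows if row[name] < cut]
            right = [row[response] for row in rows if row[name] >= cut]
            # Skip splits that would leave a child node too small.
            if len(left) < min_node_size or len(right) < min_node_size:
                continue
            reduction = parent - deviance(left) - deviance(right)
            if reduction > best[2]:
                best = (name, cut, reduction)
    return best


# Hypothetical toy records (not UMTRI data): y = 1 marks a departure frame.
rows = [
    {"radius": 150, "age": 25, "y": 1.0},
    {"radius": 180, "age": 60, "y": 1.0},
    {"radius": 200, "age": 40, "y": 1.0},
    {"radius": 800, "age": 30, "y": 0.0},
    {"radius": 900, "age": 55, "y": 0.0},
    {"radius": 1000, "age": 45, "y": 0.0},
]
split = best_split(rows, "y", ["radius", "age"])
print(split)
```

Applying `best_split` again to each of the two resulting subsets would give the recursive partitioning described in the text; raising `min_node_size` is one way to stop the tree before it reaches single-observation terminal nodes.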

However, for application purposes, it is desirable to create an end product that balances the model’s ability to explain the maximum amount of deviance with a simpler model that is easy to interpret and apply. To simplify the final model, the user can set values such as the minimum number of observations present before a split occurs or minimum deviance allowed at each node. Default values may also be used. Three other functions in S-PLUS can be used to simplify the tree, as described below.

Pruning reduces the nodes on a tree by successively snipping off the least important splits. The equation to determine the importance of a subtree using a cost-complexity measure is as follows (Insightful Corporation, 2007):

$$D_k(T') = D(T') + k \cdot \mathrm{size}(T') \qquad (6.1)$$

where
$D(T')$ is the deviance of the subtree $T'$,
$k$ is the cost-complexity parameter, and
$\mathrm{size}(T')$ is the number of terminal nodes of $T'$.

Cost-complexity pruning selects the subtree $T'$ that minimizes $D_k(T')$ over all subtrees.

The second function that can be used to simplify the model is shrinking, which reduces the number of effective nodes. This is accomplished by shrinking the fitted value of each node toward its parent node using the following algorithm (Insightful Corporation, 2007):

$$\hat{y}(\text{node}) = k \cdot \bar{y}(\text{node}) + (1 - k) \cdot \hat{y}(\text{parent}) \qquad (6.2)$$

where
$k$ is the shrinking parameter ($k$ may be a scalar or a vector, $0 < k < 1$),
$\bar{y}(\text{node})$ is the usual fitted value for a node, and
$\hat{y}(\text{parent})$ is the shrunken fitted value for the node’s parent.

Snipping allows the user to interactively remove nodes and try various modifications to the original model.

The effects of using any of the procedures (pruning, shrinking, snipping, modifying the minimum number of observations, modifying the minimum node size, or modifying the minimum node deviance) can be evaluated by observing normal probability plots of the residuals for the tree object, comparing residual mean deviance for different models, or inspecting a plot of the reduction in deviance with the addition of nodes. The residual mean deviance (rmd) is an indicator of regression tree fit and is the statistic reported rather than the traditional r² value in linear regression analysis. The rmd is the mean deviance of the data samples in the terminal nodes of an estimated tree model. A lower value for rmd indicates a better fit (Roberts et al., 1999).

Analysis Approach

Classification and regression tree analysis methods were used to identify variables with the most explanatory power in influencing the occurrence of a right- or left-side lane departure. A separate model was created for right- and left-side departures. For each model, a variety of explanatory variables were evaluated in different combinations. Variables that did not appear in one of the main branches of the regression tree being evaluated were removed, and other combinations of variables were evaluated.

Initial models resulted in complex trees. For example, the initial left-side lane departure model is shown in Figure 6.6. S-PLUS plots the tree structure so that the more important the parent split, the farther the children node pairs are from the parents. This information and a plot of the deviance as a function of the number of nodes and cost-complexity parameters were used to evaluate the most relevant splits. The snip and prune tree functions in S-PLUS were used to develop the final models shown in Figures 6.7 and 6.8.

Results

As shown in Figure 6.7, the most relevant explanatory variables for left-side lane departures were radius of curve, driver age, and shoulder type. The values at the end of the tree nodes indicate the type of lane departure. The value 11 was used for normal driving, and the value 31 was used to indicate that a left-side lane departure had occurred.
The numbers show trends only: the higher the node value, the more likely a left-side lane departure would occur. The values do not correspond to an actual probability and are an artifact of the model used to develop the regression trees. With significantly more data, the model could have been developed so that the probability of lane departure was the node value. Because there was not a substantial amount of data, the tree should only be interpreted to show a general pattern and break points for variables where relationships are emerging. For instance, for the left-side lane departure, age was relevant when curve radius was less than 1,081 ft but was not relevant for curve radii greater than this.

As indicated in Figure 6.8, the most relevant explanatory variables for right-side lane departures were also radius of curve, driver age, and shoulder type. The values at the ends of the tree nodes indicate the type of lane departure. The value 11 was used for normal driving, and the value 21 was used to indicate that a right-side lane departure had occurred. The higher the node value, the more likely a right-side lane departure would occur.

Estimating Sample Size for the Full-Scale Study

As indicated, only limited data were available to evaluate the lane departure research questions. In this section, sample size

for the full-scale study is addressed using classification and regression trees.

Figure 6.6. Initial tree model for left-side lane departures.

Figure 6.7. Final tree model for left-side lane departures.

Appropriate sample size for classification and regression trees is not easily determined. Sample size depends on factors such as number of variables, deviance at each node, complexity of the model, and minimum specified node size. Most modeling packages set some minimum default node size. In S-PLUS, the default node size is five observations.

Morgan et al. (2009) evaluated methods to test sample size for decision-tree analysis. The authors indicate that when data sets are too large, decision trees may overfit. They also found that accuracy decreases as sample size increases. From the data they evaluated, they found that relatively stable patterns emerged between 8,000 and 16,000 samples with models that had a large number of variables to evaluate. The authors also describe other work to evaluate sample size.

Application to Full-Scale Study

A sample-based approach is expected to be the best data sampling method for classification and regression tree analysis. Use of continuous data would require reduction of a

large amount of data, which would be extremely resource-intensive.

Figure 6.8. Final tree model for right-side lane departures.

The main advantage of data mining is that it will be useful in uncovering relationships in the data that may not be found using other methods. Data mining can also evaluate a large amount of data using an automated process. One disadvantage is that this modeling approach is not common among practitioners. It will be necessary to interpret results so that practitioners can incorporate the information into decision-making models, such as comparing the costs and benefits of a particular countermeasure.

Analysis Approach 2: Odds Ratio and Logistic Regression

The second analysis approach was to calculate odds ratios, as described in this section.

Description

In this approach, both a simple odds ratio test and logistic regression were used to identify factors related to lane departures. The odds of an event compare the probability of the event happening with the probability of the same event not happening; an odds ratio compares those odds between two conditions. Logistic regression evaluates the association between a binary response and explanatory variables. The natural logarithm of the odds is related to explanatory variables using a linear model.

The difference between the approaches used here and the case-control approach used later is in the assumptions that are made for modeling. The odds ratio and the logistic regression approaches used here assume random independent sampling from a specific event (either left-lane-departure or right-lane-departure) and random independent sampling of normal driving epochs. Each epoch was treated as an individual observation. Since the same driver may have been represented in more than one epoch, correlations between epochs may have existed. However, since this was an exploratory analysis, these two approaches used all available epochs and made the assumption of independence for simplicity.
In future analyses, when larger data sets are available, researchers could test whether the correlation is zero. If the correlation is not zero, several adjustments could be considered. For example, researchers could model the correlation structure or use the paired case-control approach.

Data Sampling Approach

Data were reduced as described in the previous section on data mining. Each lane departure event or normal driving epoch was modeled as one observation. However, the models were weighted by the number of 0.1-s intervals for each event or epoch.

Response Variables

Occurrence of a lane departure was the response variable. Right-side and left-side lane departures were modeled separately.

Explanatory Variables

A number of explanatory variables were available, as shown in Tables 6.1 and 6.2. When variables were highly likely to

be correlated, only one variable was evaluated. For instance, ambient conditions and roadway surface condition are highly correlated. Road surface condition was used because it is more likely to have an impact on whether a driver has a lane departure. Variables were not included in the simple odds ratio when there were not enough observations to calculate an odds ratio.

Simple Odds Ratio Modeling Approach and Results

Simple odds ratios were calculated using Equation 6.3:

$$OR = \frac{RD_j / RD_k}{ND_j / ND_k} \qquad (6.3)$$

where
$OR$ = odds ratio,
$RD_j$ = number of observations for situation j where lane departure occurs,
$RD_k$ = number of observations for situation k where lane departure occurs,
$ND_j$ = number of observations for situation j where no lane departure occurs, and
$ND_k$ = number of observations for situation k where no lane departure occurs.

The 95% confidence interval was calculated using Equations 6.4, 6.5, and 6.6:

$$\text{CI of } OR \text{ is } \exp\left(\log(OR) \pm 1.96\, sd\right) \qquad (6.4)$$

$$\text{standard deviation of log odds ratio} = \left(\frac{1}{A} + \frac{1}{B} + \frac{1}{C} + \frac{1}{D}\right)^{0.5} \qquad (6.5)$$

and, in the notation of Equation 6.3,

$$\text{standard deviation of log odds ratio} = \left(\frac{1}{RD_j} + \frac{1}{RD_k} + \frac{1}{ND_j} + \frac{1}{ND_k}\right)^{0.5} \qquad (6.6)$$

The simple odds ratio only allows for two responses within a variable (e.g., rumble strips present or not). Therefore, when a variable had several responses, an odds ratio was calculated for each response if there were sufficient values. For instance, curve type had three responses: no curve, left-hand curve, and right-hand curve. As a result, presence of left-hand and right-hand curves was compared against tangent sections.

The results of this approach are presented in Table 6.3. Categories were created for numeric variables such as radius, as shown in Table 6.3, to create two responses. When numeric variables could not easily be combined into categories, they were not included.

Table 6.3. Results of Simple Odds Ratio

| Variable | Left-Side Departure vs. Normal: Odds Ratio | Left-Side Departure vs. Normal: Confidence Interval | Right-Side Departure vs. Normal: Odds Ratio | Right-Side Departure vs. Normal: Confidence Interval |
|---|---|---|---|---|
| Radius < 200 m vs. tangent | 10.9 | (9.7, 12.3) | 29.2 | (25.4, 33.5) |
| 400 m > radius ≥ 200 m vs. tangent | 32.8 | (30.1, 35.9) | 10.9 | (9.6, 12.4) |
| 600 m > radius ≥ 400 m vs. tangent | 19.7 | (17.8, 21.8) | 22.1 | (18.8, 25.9) |
| Radius ≥ 600 m vs. tangent | 20.4 | (18.4, 22.7) | 13.6 | (11.3, 16.4) |
| Left-hand curve vs. tangent | 5.1 | (4.78, 5.54) | 3.84 | (3.37, 4.38) |
| Right-hand curve vs. tangent | 2.95 | (2.71, 3.20) | 6.60 | (5.93, 7.34) |
| Wet vs. dry roadway | 0.97 | (0.9, 1.1) | Not enough samples | |
| Day vs. night/dusk | 1.8 | (1.7, 1.9) | 0.38 | (0.3, 0.4) |
| Male vs. female | 1.28 | (1.19, 1.38) | 1.14 | (1.02, 1.28) |
| Gravel vs. paved/partially paved | 9.83 | (8.37, 11.54) | 0.16 | (0.14, 0.18) |
| Earth vs. paved/partially paved | 5.36 | (4.55, 6.32) | 0.07 | (0.06, 0.08) |

As indicated, radius of curvature was highly relevant in the occurrence of right- and left-side departures. Left-side and right-side lane departures were 10.9 times and 29.2 times more likely, respectively, to occur on curves with a very small radius (less than 200 m) than on a tangent section. Lane departures were also much more likely to occur on other curve radii as shown.

Curve direction from the perspective of the driver (left curve vs. right curve) was also relevant in determining the occurrence of both right- and left-side lane departures. The odds of a left-side lane departure on a left-hand curve were 5.1 times greater than on a tangent section, and for a right-hand curve the odds were 2.95 times greater. The odds of having a right-side lane departure

were 3.8 times greater for left-hand curves and 6.6 times greater for right-hand curves than for tangent sections. Weather and time of day did not appear to be relevant because the odds ratio was close to 1.0. Men were slightly more likely than women to be involved in both types of lane departures (odds ratios of 1.3 for left-side lane departures and 1.1 for right-side lane departures). Shoulder type appeared to be relevant for left-side but not right-side lane departures. It should be noted that a left-side lane departure on a curve can be intentional (cutting the curve). It was not possible to distinguish between intentional and unintentional lane departures, except for events like a vehicle changing lanes to avoid a parked car or object in the roadway. These lane departures were removed, but others, as indicated, could not be identified.

Logistic Regression Modeling Approach and Results

Multivariate logistic regression was used to examine factors associated with the risk of both left- and right-side lane departures using data summarized from the UMTRI data set, as described in the previous section on data mining. In each model, the records for lane departures were used as the cases, while records without lane departures (normal driving) were used as the controls. Separate models were created for left- and right-side lane departures. Both models were created using the LOGISTIC procedure in the SAS/STAT 9.2 software package. The response variable was presence of a lane departure, Z, given as 0 if there is no lane departure (normal driving) and 1 if a lane departure occurred.

The models for right- and left-side lane departures were created using the following logic. Occurrence of a lane departure Z is the response variable. Z is a Bernoulli variable with p = P(Z = 1) as the probability of occurrence of a lane departure. Therefore, p/(1 − p) is the odds of a lane departure happening. In order to link the odds of a lane departure to the explanatory variables investigated (X’s), the logit link function was used.
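The relationship among the probability p, the odds p/(1 − p), and the logit can be illustrated with a short Python sketch. The function names are illustrative only; they are not taken from the report or from SAS.

```python
import math


def odds(p):
    """Odds of an event with probability p, for 0 < p < 1."""
    return p / (1.0 - p)


def logit(p):
    """Natural logarithm of the odds (the logit link)."""
    return math.log(odds(p))


def inverse_logit(eta):
    """Map a linear predictor (log odds) back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


# A probability of 0.2 corresponds to odds of 1 to 4.
p = 0.2
print(odds(p), logit(p), inverse_logit(logit(p)))
```

Because the logit maps probabilities between 0 and 1 onto the whole real line, a linear combination of explanatory variables can take any value and still correspond to a valid probability through `inverse_logit`.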
Hence, a connection between the probability of a lane departure and the linear combination of predictor variables (X’s) was established using Equation 6.7:

$$\text{logit}(p) = \log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k \qquad (6.7)$$

Stepwise selection was used to determine which variables were relevant and should be included in the model. For each step, a covariate was added to the model if the significance level for entry was met (0.1 was used). Then the chi-square statistic was computed. If the covariate satisfied the significance level (0.1), it was included in the model. The Akaike information criterion (AIC) and Schwarz criterion (SC) were used to compare models and determine which variables to include in the final model.

Only a small sample of left- and right-side lane departures was available. As a result, it was not possible to evaluate the significance of all variables and test correlations between variables. In order to build a model that best represented the data, the decision to remove variables from the model was based on whether correlation among input variables was expected. The maximum likelihood (ML) method was used to calculate the coefficient estimates, and the Wald statistic was used to test the significance of covariates.

The variable Observation was used as the frequency variable in the model, which indicated the frequency of occurrence of each observation. This variable was used to weight the model. Odds ratios were used to assess whether a specific condition was more or less likely to result in a lane departure. An odds ratio greater than 1 indicated that the odds of a lane departure occurring were higher, and an odds ratio less than 1 indicated lower odds.

Left-Side Lane Departures

Equation 6.8 describes the final model for left-side lane departures. The estimated log (odds) is given by

$$\begin{aligned}\text{Log(odds)} = {} & 1.6107 - 0.0105\,\text{AGE} + 0.1682\, I[\text{GENDER} = 1\ (\text{male})] - 0.00025\,\text{Radius} \\ & - 0.7823\,\text{LaneWidth} - 0.3067\, I[\text{TimeOfDay} = 0\ (\text{day})] - 20.9528\,\text{CrashDensity}\end{aligned} \qquad (6.8)$$

Model statistics are provided in Table 6.4, and the odds ratio estimates are shown in Table 6.5.

The first variable in Equation 6.8, AGE, is driver age. As driver age increases, the odds of a left-side lane departure decrease, which indicates that involvement in left-side lane departures decreases with age. Results for GENDER show that the odds of a left-side lane departure for male drivers (condition 1) are 1.4 times those for female drivers (condition 2).

The variable Radius is the radius of a curve. A very large value of 9999 was used for tangent sections, and the variable was modeled as a continuous variable. Table 6.5 shows that as radius increases, the likelihood of a lane departure decreases.

The negative coefficient for LaneWidth suggests that as lane width increases, the odds of a left-side lane departure decrease.

The result for the variable TimeOfDay is a comparison of the odds of having a left-side lane departure during the day compared with at night. As shown, the odds ratio for a left-side lane departure during the day (condition 0) compared with at night (condition 1) is 0.542, indicating that a left-side lane departure is less likely to happen during the day. Alternatively, the odds of a lane departure at night compared with during the day are 1/0.542 = 1.85.
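As a rough check on how the odds ratios in Table 6.5 relate to the coefficient estimates, the following Python sketch exponentiates the estimates. For the numeric variables the tabulated odds ratio matches exp(estimate). For the two-level variables (Gender and TimeOfDay) it matches exp(2 × estimate), which is what would be expected if those variables were entered with +1/−1 effect coding. That coding is an assumption inferred from the tabulated numbers; the report does not state it, and the dictionary and function names below are illustrative.

```python
import math

# Estimates copied from Table 6.5 (left-side lane departure model).
estimates = {
    "Age": -0.0105,
    "Gender": 0.1682,
    "LaneWidth": -0.7823,
    "TimeOfDay": -0.3067,
}

# ASSUMPTION: two-level variables use +1/-1 effect coding, so the
# log odds differ by 2 * estimate between the two levels.
two_level = {"Gender", "TimeOfDay"}


def odds_ratio(name):
    """Odds ratio implied by the estimate for the named variable."""
    scale = 2.0 if name in two_level else 1.0
    return math.exp(scale * estimates[name])


for name in estimates:
    print(name, round(odds_ratio(name), 3))
```

Under this assumption the computed values agree with the OR Estimate column of Table 6.5 to within rounding of the published estimates.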

The last variable, CrashDensity, indicates the odds of a lane departure based on the density of lane departure crashes along the segment. As shown, the odds of having a left-side lane departure decrease as the density of lane departure crashes increases. This result is counter to what was expected: roadway sections with a high density of lane departure crashes were expected to be more likely to have lane departures.

Table 6.4. Model Fit Statistics for Left-Side Lane Departure

| Criterion | Intercept Only | Intercept and Covariates |
|---|---|---|
| AIC | 35815.581 | 29956.217 |
| SC | 35825.176 | 30023.384 |
| −2 Log L | 35813.581 | 29942.217 |

Association of Predicted Probabilities and Observed Responses

| Measure | Value | Measure | Value |
|---|---|---|---|
| Percent concordant | 80.0 | Somers’ D | 0.624 |
| Percent discordant | 17.6 | Gamma | 0.640 |
| Percent tied | 2.4 | Tau-a | 0.047 |
| Pairs | 442473680 | c | 0.812 |

R-Square

| R-square | Max-rescaled R-square |
|---|---|
| 0.0526 | 0.1873 |

Hosmer and Lemeshow Goodness-of-Fit Test

| Chi-square | DF | Pr > ChiSq |
|---|---|---|
| 445.9524 | 8 | <.0001 |

Table 6.5. Results for the Left-Side Lane Departure Model

| Variable | Condition | Estimate | Std Error | p-value | OR 95% Lower | OR Estimate | OR 95% Upper |
|---|---|---|---|---|---|---|---|
| Age | 1 | −0.0105 | 0.00123 | <.0001 | 0.987 | 0.990 | 0.992 |
| Gender | 1 vs. 2 | 0.1682 | 0.0204 | <.0001 | 1.292 | 1.400 | 1.517 |
| Radius | 1 | −0.00025 | 3.637E-6 | <.0001 | 1.000 | 1.000 | 1.000 |
| LaneWidth | 1 | −0.7823 | 0.0712 | <.0001 | 0.398 | 0.457 | 0.526 |
| TimeOfDay | 0 vs. 1 | −0.3067 | 0.0178 | <.0001 | 0.505 | 0.542 | 0.581 |
| CrashDensity | 1 | −20.9528 | 3.5571 | <.0001 | <0.001 | <0.001 | <0.001 |

Right-Side Lane Departures

Equation 6.9 describes the final model for right-side lane departure events. The estimated log (odds) of a right-side lane departure is given by

$$\begin{aligned}\text{Log(odds)} = {} & 0.1679 + 0.0427\,\text{AGE} - 0.00025\,\text{Radius} - 1.4042\,\text{LaneWidth} \\ & - 0.8994\, I(\text{ShldType} = 3) - 2.9345\, I(\text{ShldType} = 4) + 1.8799\, I(\text{ShldType} = 6) \\ & + 0.365\, I(\text{ShldType} = 7) - 0.2864\, I(\text{Time} = 0) + 81.9775\,\text{CrshDen} + 1.8348\,\text{OvrSpd10}\end{aligned} \qquad (6.9)$$

Model statistics are provided in Table 6.6, and the odds ratio estimates are shown in Table 6.7.

The positive estimate for the variable Age in Equation 6.9 indicates that as age increases, the odds of a right-side lane departure also increase. This is the opposite of the result for left-side lane departures.

The negative estimate for the variable Radius indicates that as radius increases, the odds of having a right-side lane departure decrease. As a result, the likelihood of having a right-side lane departure is greater on curves with smaller radii.

The coefficient for LaneWidth indicates that as the lane width increases, the odds of a right-side lane departure decrease.

The variable ShldType indicates that the type of shoulder is significant. Nonpaved shoulder types were compared with paved shoulders (condition 1). The odds of having a right-side lane departure when gravel shoulders (condition 3) were present compared with when paved shoulders were present are 0.08. The odds of having a right-side lane departure when earth shoulders (condition 4) or partially paved shoulders (condition 7) were present compared with when paved shoulders were present are 0.01 and 0.29, respectively. Hence, the odds of a right-side lane departure are greater on paved shoulders than on earth, gravel, or partially paved shoulders. The odds of having a right-side lane departure when very narrow shoulders (condition 6) were present compared with when paved shoulders were present are 1.34. However, the confidence interval contains 1, so the difference is not statistically significant. In addition, there were very few observations where no shoulder was present.

The variable Time represents time of day. The odds of a right-side lane departure during the day (condition 0) compared with at night (condition 1) are 0.564. Alternatively, the odds of a

right-side lane departure at night can be computed as 1/0.564 = 1.77. Hence, the odds of having a lane departure at night are 1.77 times the odds of having one during the day.

The variable CrshDen is the number of actual lane departure crashes per meter for the section of roadway where the vehicle activity took place. The result shows that as crash density increases, the odds of having a right-side lane departure increase dramatically. This is also the opposite of what was found for left-side lane departures.

The last variable, OvrSpd10, indicates the amount of time a driver spends traveling 10 mph or more over the speed limit. The results indicate that drivers who spend more time traveling 10 mph or more over the speed limit increase their odds of having a right-side lane departure.

Table 6.6. Model Fit Statistics for the Right-Side Lane Departure Model

| Criterion | Intercept Only | Intercept and Covariates |
|---|---|---|
| AIC | 17921.931 | 11629.157 |
| SC | 17931.503 | 11734.452 |
| −2 Log L | 17919.931 | 11607.157 |

Association of Predicted Probabilities and Observed Responses

| Measure | Value | Measure | Value |
|---|---|---|---|
| Percent concordant | 94.0 | Somers’ D | 0.884 |
| Percent discordant | 5.6 | Gamma | 0.887 |
| Percent tied | 0.3 | Tau-a | 0.029 |
| Pairs | 183668320 | c | 0.942 |

R-Square

| R-square | Max-rescaled R-square |
|---|---|
| 0.0578 | 0.3717 |

Hosmer and Lemeshow Goodness-of-Fit Test

| Chi-square | DF | Pr > ChiSq |
|---|---|---|
| 1510.8238 | 8 | <.0001 |

Table 6.7. Results for the Right-Side Lane Departure Model

| Variable | Condition | Estimate | Std Error | p-value | OR 95% Lower | OR Estimate | OR 95% Upper |
|---|---|---|---|---|---|---|---|
| Age | 1 | 0.0427 | 0.00286 | <.0001 | 1.038 | 1.044 | 1.050 |
| Radius | 1 | −0.00025 | 6.192E-6 | <.0001 | 1.000 | 1.000 | 1.000 |
| LaneWidth | 1 | −1.4042 | 0.1140 | <.0001 | 0.196 | 0.246 | 0.307 |
| ShldType | 3 vs. 1 | −0.8994 | 0.0647 | <.0001 | 0.069 | 0.083 | 0.101 |
| ShldType | 4 vs. 1 | −2.9345 | 0.1048 | <.0001 | 0.008 | 0.011 | 0.014 |
| ShldType | 6 vs. 1 | 1.8799 | 0.1220 | <.0001 | 0.967 | 1.338 | 1.850 |
| ShldType | 7 vs. 1 | 0.3650 | 0.0519 | <.0001 | 0.252 | 0.294 | 0.343 |
| Time | 0 vs. 1 | −0.2864 | 0.0402 | <.0001 | 0.482 | 0.564 | 0.660 |
| CrshDen | 1 | 81.9774 | 7.9325 | <.0001 | >999.999 | >999.999 | >999.999 |
| OvrSpd10 | 1 | 1.8347 | 0.0696 | <.0001 | 5.464 | 6.264 | 7.180 |
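In Table 6.7 the sign of a ShldType estimate does not always match the direction of its odds ratio (for example, the estimate for condition 7 is positive, but its odds ratio is below 1). One reading that reconciles the two, offered here as an assumption rather than as the authors' documented method, is that categorical variables were effect coded, so that the level effects sum to zero and the reference level (paved, condition 1) carries minus the sum of the listed estimates. The sketch below applies that assumption to the ShldType and Time estimates; the variable names are illustrative.

```python
import math

# ShldType estimates from Table 6.7 (levels 3, 4, 6, 7); level 1 (paved) is the reference.
estimates = {3: -0.8994, 4: -2.9345, 6: 1.8799, 7: 0.3650}

# Assumption: under effect coding the level effects sum to zero, so the
# reference level's implied effect is minus the sum of the listed estimates.
reference_effect = -sum(estimates.values())

# Odds ratio of each level against the reference level.
odds_ratios = {level: math.exp(est - reference_effect)
               for level, est in estimates.items()}
for level, ratio in odds_ratios.items():
    print(f"ShldType {level} vs. 1: OR = {ratio:.3f}")

# Two-level Time variable: day (0) vs. night (1) differs by twice the estimate.
time_or = math.exp(2 * -0.2864)
print(f"Time, day vs. night: OR = {time_or:.3f}; night vs. day: {1 / time_or:.2f}")
```

Under this assumption the computed ratios agree with the OR Estimate column of Table 6.7 to three decimal places.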
Sample Size for Full-Scale Study

As indicated, only limited data were available for evaluating the lane departure research questions. In this section, a method to estimate sample size for the full-scale study is presented for the logistic regression.

A literature review indicated that there are various schools of thought on determining sample size for logistic regression. References include Hosmer and Lemeshow (2000, 339–347), Agresti (2002, 242–243), and Hsieh et al. (1998, 1623–1634). Calculating sample size for logistic regression can be complicated because multiple logistic regression analysis is nonlinear. Hsieh et al. (1998) suggest a method for simplifying the sample-size calculation. Based on their method, the following describes an example calculation of sample size for the two logistic regression models presented in the previous section.

Left-Side Lane Departure

To calculate sample size for left-side lane departures, one explanatory variable of interest is first chosen (e.g., LaneWidth, denoted X). The sample size is calculated according to the following equations:

$$n = \frac{\left[z_{1-\alpha} + z_{1-\beta}\exp\left(-\tau^2/4\right)\right]^2 (1 + 2\pi\delta)}{\pi\tau^2\left(1-\rho^2\right)} \tag{6.10}$$

$$\delta = \frac{1 + \left(1+\tau^2\right)\exp\left(5\tau^2/4\right)}{1 + \exp\left(-\tau^2/4\right)} \tag{6.11}$$

where

π is the estimated probability when all the continuous variables are at their means, calculated as 0.03184.

τ is the effect of X at the mean level of the other predictors. For example, to determine the sample size needed to detect that a one standard deviation increase in lane width results in a 50% increase in the odds of a left-side lane departure, with all other continuous variables at their mean values, τ = log(1.5).

$z_{1-\alpha}$ and $z_{1-\beta}$ are the (1 − α) and (1 − β) standard normal quantiles, respectively.

α is the level of significance, which is 0.05 here.

1 − β is the power, which is 0.9 here.

ρ is the multiple correlation of X and the remaining covariates in the model. The R² in linear regression can be used to measure ρ, which is 0.1621 here.

Inserting all of the above values into Equations 6.10 and 6.11 provides an estimated sample size for the logistic regression, which is 1,663.

Right-Side Lane Departure

Similarly, for the right-side lane departure model, sample size is calculated using

π = 0.01634
ρ = 0.1505
τ = log(1.5)

The sample size is determined to be 3,109 using Equations 6.10 and 6.11.

Application to Full-Scale Study

To apply logistic regression to the full-scale naturalistic driving study, the following approach may be considered. A sequential block approach may be used to reduce the data. The first step would be to identify all lane departure crashes, near crashes, and encroachments that meet the requirements of the research question. For instance, only right-side lane departures on four-lane, rural, divided roadways may be included. A set amount of time, an epoch, would be determined based on the average length of a lane departure. For instance, the epoch could comprise 3 s before the lane departure and 3 s after, resulting in a 6-s interval. Normal driving data could be sampled at regular intervals (e.g., 5 min) and data aggregated for that epoch.
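The epoch construction described above can be sketched as follows. This is an illustration only: the function name, the time units (seconds from the start of a trip), and the two rules for edge cases are assumptions not stated in the report. Here, event epochs that would extend beyond the trip are dropped, and baseline epochs that overlap an event epoch are skipped.

```python
def build_epochs(trip_start, trip_end, event_times,
                 half_width=3.0, baseline_step=300.0):
    """Return (event_epochs, baseline_epochs) as lists of (start, end) in seconds.

    Event epochs span half_width seconds before and after each lane departure.
    Baseline epochs of the same length are sampled every baseline_step seconds.
    """
    length = 2 * half_width
    event_epochs = [(t - half_width, t + half_width) for t in event_times
                    if t - half_width >= trip_start and t + half_width <= trip_end]

    baseline_epochs = []
    start = trip_start
    while start + length <= trip_end:
        end = start + length
        # Assumed rule: skip a baseline epoch that overlaps any event epoch.
        overlaps = any(start < e_end and e_start < end
                       for e_start, e_end in event_epochs)
        if not overlaps:
            baseline_epochs.append((start, end))
        start += baseline_step
    return event_epochs, baseline_epochs

# A hypothetical 20-min trip with lane departures at 302 s and 750 s.
events, baselines = build_epochs(0.0, 1200.0, event_times=[302.0, 750.0])
print(events)
print(baselines)
```

Each resulting interval would then be reduced to one record by summarizing the driver, roadway, and environmental variables over it.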
For instance, if a 6-s epoch was selected, all lane departure events would be extracted, data for vehicle activity meeting the criteria would be sampled every 5 min, and 6 s of data would be reduced for that interval. Driver, roadway, and environmental conditions would need to be consistent across the epoch. For instance, if the driver were traveling on a tangent section at the beginning of the epoch and then encountered a curve after 2 s, the epoch would have to be adjusted to include just the tangent section or the curve.

Logistic regression analysis is well suited to the naturalistic driving study because normal driving data will be provided that can be used to account for exposure. Historically, it has been difficult to account for driver activity under a range of situations to determine whether one situation is overrepresented. For instance, it is commonly accepted that crashes are more likely during a winter weather event. However, it is very difficult to determine what fraction of time drivers spend driving on snowy or icy roads, so it is difficult to determine whether crashes under these conditions are overrepresented. Additionally, the results of logistic regression can be expressed as odds ratios, which can easily be explained to laypersons and used by transportation agencies.

Analysis Approach 3: Logistic Regression for Correlated Data

The previous section described an analysis approach using logistic regression to evaluate the odds of having a left- or right-side lane departure based on a small sample of available data. Because the sample size was small, it was difficult to address issues such as the correlation that arises when repeated samples are taken from the same situation (e.g., repeated samples for the same driver or the same trip). This section provides an alternate approach that uses logistic regression for correlated data.

Description

This approach considers matched control and event samples to avoid confounders.
For each selected case epoch, several matched control periods of the same length are sampled from the same driver and the same trip. Other covariates not included in the model (either because they were not recorded or because there were not enough data to obtain a good estimate) are assumed to be constant within the same trip of the same driver. For example, a sleepy driver is assumed to be sleepy for the whole trip, not just at the end of the trip. The effects of these covariates are thus eliminated in this matched case-control model. Further, the epochs are assumed to be separated (not adjacent), so independence is assumed within each matched case-control set.

The conditional logistic regression model focuses on estimating the differences within these matched sets. The goal is to understand the association between covariates (environmental, driver related, roadway related, or vehicle related) and the probability of an event. Each period includes a response variable that takes on the values Y = 1 (event) or Y = 0 (no event) and candidate explanatory variables whose distribution of values within the epoch can be summarized using, for example, the observed range (max − min) of values of the covariate.

Consider $Y_{ij} \mid X_{ij} \sim \text{Bernoulli}(p_\theta(X_{ij}))$, the response in the jth sample of the ith driver. Assume that j = 1 corresponds to the case. Let $Y_i = (Y_{i1}, Y_{i2}, \ldots, Y_{in_i})$ and $X_i = (X_{i1}, X_{i2}, \ldots, X_{in_i})$.

The likelihood function is given by

$$L_i(\theta \mid Y_i) = \frac{f(Y_i \mid X_i)}{\sum_{\text{permutations of } y_i} f(Y \mid X_i)} = \frac{\prod_{j \in \text{control}} \exp(-x_{ij}\beta)}{\sum_{\binom{n_i}{m_i}} \prod_{j \in \text{control}} \exp(-x_{ij}\beta)}$$

The following example combines left-lane and right-lane departures as events. The significant positive model coefficients suggest an association between an event and time periods in which a driver exhibits a large variation in lateral speed and lateral acceleration.

Data set:

| Driver | 6 | 8 | 12 | 24 | 48 | 51 | 60 | Total |
|---|---|---|---|---|---|---|---|---|
| Number of sampled periods (1 case, others control) | 11 | 3 | 21 | 15 | 16 | 6 | 16 | 88 |

If the variable LaneOffset is available:

| Variable | coef | exp(coef) | se(coef) | z | p |
|---|---|---|---|---|---|
| max(LaneOffset) − min(LaneOffset) | 8.37 | 4317 | 3.05 | 2.74 | 0.0061 |
| max(AY) − min(AY) | 6.63 | 755 | 3.01 | 2.20 | 0.0280 |

Likelihood ratio test = 19.4 on 2 df, p = 6.21e-05, n = 88.

If the variable LaneOffset is not available:

| Variable | coef | exp(coef) | se(coef) | z | p |
|---|---|---|---|---|---|
| max(LATERALSPEED) − min(LATERALSPEED) | 4.91 | 136.0 | 2.44 | 2.01 | 0.044 |
| max(AY) − min(AY) | 3.63 | 37.7 | 2.83 | 1.28 | 0.200 |

Likelihood ratio test = 14.2 on 2 df, p = 0.000817, n = 88.

Within-subject correlation is an important issue for longitudinal data. Ignoring this correlation can bias the estimates and, when the correlation is positive, cause the standard errors to be underestimated. The following sections describe another model, which assumes a hierarchical structure to deal with the correlation within trips nested in drivers. Whereas the matched case-control method assumes independent matched sets and tries to eliminate the correlation, the following method puts the correlation structure in the model.

Sample

Given the example above, suppose now that periods are selected from both run-off-road (ROR) and non-ROR events under some common fixed condition, such as the same curvature or the same weather conditions.
Samples are selected from all qualifying periods.

Difference from the Matched Case-Control (the Conditional Logistic Model)

For the periods in the preceding section, the data set is constructed by sampling from the population of cases and the population of controls, even though observations may be correlated (e.g., they could be sampled from the same trip).

Question of Interest

The question that may be answered with this approach is: What factors may be associated with the risk of ROR events?

Response Variable

The response variable is an ROR event.

Covariates

Any variable for which measurements are available can be included in the model as an independent variable. These might include, for example, driver characteristics, environmental conditions, and road conditions.

Model

The model is described by the following: Let $Y_{ij} = 1$ if the jth sample (period) for the ith driver has ROR = 1. Assume that the distribution of $Y_{ij}$ is Bernoulli with

$$\text{logit}\left(P(Y_{ij} = 1 \mid X_{ij})\right) = X_{ij}\beta + Z_{ij}\gamma_i.$$

Here, $X_{ij}$ denotes the covariates corresponding to the jth period for the ith driver, and $\gamma_i$ is a (multivariate) normally distributed random variable that is driver-specific. This random variable accounts for the correlation between observations from the same driver.

Model Example

For the example, continuous periods longer than 5 s (ROR event periods or nonevent periods) are selected. For each period, the middle 5 s are selected and the variables of interest within each period are summarized. Then, a random intercept for “driver” and a second random effect representing “trip nested within driver” are included in the mixed-effect model, and covariates are chosen by forward selection based on the AIC, as shown in the following:

Fixed effects (R: glmer)

| Term | Estimate | Std. Error | Z value | Pr(>z) |
|---|---|---|---|---|
| (Intercept) | −8.662 | 1.880 | −4.61 | 4.1e-06 |
| Mean of shoulder width | 0.995 | 0.400 | 2.48 | 0.013 |
| Max(LaneOffset) − min(LaneOffset) | 3.346 | 1.986 | 1.68 | 0.092 |
| Max(YawRate) − min(YawRate) | 0.592 | 0.330 | 1.80 | 0.073 |

Estimated using generalized estimating equations (GEE) (R: geepack: geeglm)

| Term | Estimate | Std. Error | Wald | Pr(>W) |
|---|---|---|---|---|
| (Intercept) | −7.721 | 2.112 | 13.36 | 0.00026 |
| Mean of shoulder width | 0.819 | 0.500 | 2.69 | 0.10119 |
| Max(LaneOffset) − min(LaneOffset) | 2.419 | 1.544 | 2.45 | 0.11728 |
| Max(YawRate) − min(YawRate) | 0.681 | 0.248 | 7.51 | 0.00614 |

The statistical software R was used to estimate the example, and the functions used are shown in parentheses.

Note that introducing the random effects into the model affects inferences about the association between lane offset and the probability of an event and between yaw rate and the probability of an event, which change from statistically significant to statistically insignificant.

Sample Size

Estimating sample size in generalized linear mixed models is, in general, not a straightforward endeavor. Dang et al. (2008) and Liu and Liang (1997) derived the exact form of the sample size estimator for the two-sample problem with correlated binary responses and exchangeable correlation structure by finding the approximating variance of the regression coefficient. Maas and Hox (2005) presented a simulation result for models with one random coefficient and one random slope at different sample sizes.

Sample Size Calculation Using a Generalized Estimating Equations Method and a Simple Example

The following provides an example sample size calculation. Consider an additional explanatory variable, OvrSpd5, that is associated with driver behavior (frequency of driving over the speed limit). A model was fit using GEE as described for the example above, and the regression coefficient associated with OvrSpd5 was not found to be significantly different from zero. Because the p-value for the hypothesis that the regression coefficient is equal to zero is 0.9338, the null hypothesis H0: β4 = 0 is not rejected. If the alternative hypothesis (H1: β4 = βa) happens to be true, then we would like to have enough power to reject the null hypothesis.
Assume that the estimated coefficient β4 = −0.0136 is correct and that the standard error of 0.1643 is correct under the current sample size. We want to increase the sample size to reduce the standard error enough that the Type I error is less than 0.05 when the true value of β4 is 0 and the power of the test is at least 0.8 when the true value of β4 is −0.0136. The coefficients for this example are as follows:

Coefficients (R: geepack: geeglm)

| Term | Estimate | Std. Error | Wald | Pr(>W) |
|---|---|---|---|---|
| (Intercept) | −7.3837 | 5.1027 | 2.09 | 0.1479 |
| Mean of shoulder width | 0.8218 | 0.4830 | 2.90 | 0.0888 |
| Max(LaneOffset) − min(LaneOffset) | 2.3889 | 1.6127 | 2.19 | 0.1385 |
| Max(YawRate) − min(YawRate) | 0.6792 | 0.2604 | 6.80 | 0.0091 |
| OvrSpd5 | −0.0136 | 0.1643 | 0.01 | 0.9338 |

To obtain an estimate of the appropriate sample size under those conditions, some assumptions need to be made. These are as follows:

1. There is a fixed number of drivers indexed by s = 1, 2, . . . , S.
2. Each driver has repeated measurements t = 1, 2, . . . , T.
3. The correlation structure is “exchangeable.” This means that every pair of samples in the same subgroup has the same correlation.
4. We are interested in testing the hypothesis H0: Hβ = h0 versus H1: Hβ ≠ h0. H is (0, 0, 0, 0, 1) and h0 in this example is 0.
5. We let b denote the point estimate of β and let $\text{cov}(b) = T^{-1}V_b$ (where the covariance matrix can either be model based or be estimated using robust methods). Then, the Wald test statistic $Q = T(Hb - h_0)'[HV_bH']^{-1}(Hb - h_0)$ is asymptotically distributed as a $\chi^2(p, \lambda)$ random variable with df = p and a noncentrality parameter λ = 0 under H0 and $\lambda = \lambda_{H_1}$ under H1.

The power of the test is $P_{H_1}\left(Q > \chi^2(p, 0)_{1-\alpha}\right)$, where α is the significance level. The sample size is calculated by finding the minimum n such that $P_{H_1}\left(Q > \chi^2(p, 0)_{1-\alpha}\right)$ achieves the desired power level.

Comparison of the Logistic Regression Models

Both methods are trying to deal with the correlation structure.
The matched case-control method uses fewer samples; the mixed-effect model can use more samples but requires assumptions about the correlation structure and the estimation of additional coefficients for the correlations. Currently, both models can extract information from this pilot data set. For a larger-scale study, in which researchers can afford to deal with the correlation structure, the mixed-effect model may provide more information.
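For the single-coefficient hypothesis in the GEE sample size example (p = 1), the Wald statistic Q is the square of a standard normal statistic, so its power can be evaluated with normal tail probabilities instead of a noncentral chi-square routine. The sketch below takes the OvrSpd5 estimate (−0.0136) and standard error (0.1643) as correct at the current sample size, as the text does. It adds one assumption that the report does not state: enlarging the sample by a factor k shrinks the standard error in proportion to 1/√k. The function names and the simple linear search are illustrative.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def wald_power(beta, std_error, alpha=0.05):
    """Power of a two-sided, 1-df Wald test when the true coefficient is beta."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = abs(beta) / std_error  # square root of the noncentrality parameter
    return (1 - STD_NORMAL.cdf(z_crit - shift)) + STD_NORMAL.cdf(-z_crit - shift)

def required_multiplier(beta, std_error, power=0.8, alpha=0.05):
    """Smallest integer k for which a k-fold larger sample reaches the target power.

    Assumption: the standard error scales as std_error / sqrt(k).
    """
    k = 1
    while wald_power(beta, std_error / k ** 0.5, alpha) < power:
        k += 1
    return k

k = required_multiplier(beta=-0.0136, std_error=0.1643)
print(f"Sample must grow by a factor of about {k}")
print(f"Power at the current sample size: {wald_power(-0.0136, 0.1643):.3f}")
```

Because the estimated effect is very small relative to its standard error, the multiplier is large; a larger hypothesized effect βa would require far fewer observations.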

Analysis Approach 4: Time Series Analysis

Unlike the more common case-control study, the naturalistic data provide more than counts of events. A dynamic model that considers each 0.1-s observation can be used both to model the pattern of driving and to provide information while driving is under way.

For example, lane offset is known to be correlated with lane departure events. When the time window is small, the vehicle is off the road during the event, so the measured lane offset differs from that in normal driving. When the time window is slightly larger, the averaged lane offset has not crossed the edge of the road, and there is no difference between “the driver feels comfortable staying close to the edge on this section” and “the driver is going to cross the edge in the next second.” The random component (distribution) of the model accounts for different outcomes that occur with the same explanatory variables.

The data can also be viewed from another perspective. At any one instant while driving, the actions that follow can be regarded as the result of the current status, the driver’s decisions and operation, environmental effects, and some randomness. Assume that the current status is fixed and observed and that the other factors change over time. If a model could be built, it could forecast a few seconds ahead. It might then be possible to identify some conflicts, for example, a driver who is in danger but has not reacted, or who is in danger and has reacted but not enough.

This example is rather simplified. In the larger study, this model needs several longer continuous, mechanically recorded data series from known, closely similar situations (e.g., similar lane type) to train the basic model, which would then be tested on the shorter manually collected data (e.g., from video, radius).

The fourth analysis approach used continuous data in a time series model, as described in this section.
Description

The main advantage of applying a time series analysis to naturalistic driving study data is that it allows relationships between variables across time to be incorporated into the model. As a result, relationships can be established between, for example, driver distraction in previous time periods and probability of a lane departure or crash in a subsequent time period. A time series model can also be used to model outcome. Current methods, which use crash data to analyze the impact of countermeasures on safety, have only accomplished their goals by waiting for the system to fail (i.e., a crash occurs). In contrast, a time series analysis allows positive outcomes to be evaluated and relationships between positive outcomes and roadway, driver, or environmental features to be determined.

Time series models are extensions of regression models, where the errors are assumed to be correlated; thus, selection of independent variables to be included in the model and the form of the association between independent and dependent variables in the model can be addressed in a standard fashion.

Sampling Approach

To demonstrate this analysis approach, data were modeled using continuous data (i.e., each observation represents 0.1 s of vehicle activity). All of the variables listed in Tables 6.2 and 6.3 were available but, because of the complexity of a time series model, only a few initial variables were included to demonstrate proof of concept.

Response Variables

The response variable is vector-valued. The first element of the vector is a variable associated with movement (e.g., lane offset, yaw rate), whereas the second element is a variable associated with the operation of the vehicle (e.g., acceleration). We use Y1(t) and Y2(t) to denote the first and second elements of the response vector.

Explanatory Variables

The model can include continuous, block-summarized, or static covariates.
It can also include smoothed functions of independent variables that may vary by periods. We use X(t) to denote the value of the vector of covariates at time t. There is no restriction on the number and type of covariates that can be included in the model. In particular, we can include driver, roadway, vehicle, or environmental variables and explore the association between them and the response variable while at the same time accounting for the correlation of consecutive observations obtained from the same process.

Modeling Approach and Results

We assume that E(Y1(t)) = g(Y1(<t), X(≤t), Y2(<t)). That is, the mean of the movement response variable at any given time t depends on movement at times preceding t, on the covariates up to and including their value at time t, and on the operation response variable at times prior to t. To complete the specification of the model, it is necessary to define the kernel of the model (i.e., the functional form for g in the equation), the size of the lag (the number of observation periods for which correlation coefficients will be estimated), and the structure of the error term in the model.

One widely used model that permits accounting for the autocorrelation in the observed data is the Autoregressive

Moving Average Model of order p and q (ARMA(p,q)). The standard version of the ARMA(p,q) model has a linear kernel and the general form

$$Y_1(t) = a_1 Y_1(t-1) + \cdots + a_p Y_1(t-p) + e(t) + b_1 e(t-1) + \cdots + b_q e(t-q) + X'(t)\beta \tag{6.12}$$

where Y1(t) might denote, for example, lane offset at time t; e(t) is a random error term, often distributed as a normal random variable; and X(t) is a vector of explanatory variables. The coefficients $a_i$, $b_j$, and β are unknown and must be estimated from the data. The model includes p lagged terms for the autoregressive part of the model and q lagged terms for the moving average portion of the model.

As an example, consider a specific driver from the available data set. An ARMA(3,2) model was fit to the continuous observations obtained over 102.5 s in Trip 2 of Driver 6. The orders p = 3 and q = 2 and the two covariates were chosen using the AIC.

Suppose that we wish to predict the location of the vehicle at time t + 1, given information about the location, lateral speed, and lateral acceleration of the vehicle at time t and earlier. In this example, we use only time-dependent covariates. The example that follows presents a more general model, where other types of covariates, such as shoulder surface and shoulder width, are also included.

The model was fit using the statistical software R. Table 6.8 shows the estimated model parameters and their standard errors. We note, for example, that lateral speed at time t is negatively and significantly associated with lane offset at time t + 1, whereas lateral acceleration is positively and significantly associated with it. These two explanatory variables appear to be good predictors of lane offset.

Table 6.8. Estimates of the Parameters in the ARMA(3,2) Model Based on the Data Collected During 102.5 s During the Second Trip of Driver 6 in the Data Set

| | ar1 | ar2 | ar3 | ma1 | ma2 | Intercept | LATERALSPE(t − 1) | AY(t − 1) |
|---|---|---|---|---|---|---|---|---|
| Coefficients | 2.8021 | 2.6402 | 0.8379 | −1.8103 | 0.9189 | 0.2742 | −1.1262 | 2.0016 |
| SE | 0.0263 | 0.0524 | 0.0262 | 0.0192 | 0.0167 | 0.2382 | 0.1481 | 0.8626 |

σ² estimated as 0.000323: log likelihood = 2660.57, AIC = −5303.15

Figure 6.9. Diagnostic plots for the ARMA(3,2) model fit to Driver 6. (Top panel: standardized residuals versus time. Bottom panel: ACF of residuals versus lag.)

Figure 6.9 shows two graphs. On the top panel, the standardized residuals over time computed from the model are

displayed. Because the residuals are standardized, we expect that about 99% of them will be within three standard deviations of their mean of zero. The plot suggests that very few residuals exceed 3 in absolute value, so we are comfortable concluding that there appear to be no outliers in this particular data set with respect to this model. Further, there seems to be no obvious pattern in the residuals, even though they are plotted in time order. This is consistent with the plot shown in the bottom panel of Figure 6.9, which shows the autocorrelation function (ACF) estimated from the estimated residuals. If the order of the model is correct, then we expect to see no significant autocorrelation among residuals. From these two diagnostic plots, we conclude that the model appears to fit the data reasonably well and that the autoregressive and moving average structure in the model accounts for the correlation between observations collected over time.

Because the goal was to predict the lane offset at a future time given information available now, we predict the lane offset for this driver during this trip for the 3 s that follow the end of the trip. The predicted lane offset and the 1 standard deviation bands are shown in Figure 6.10.

Figure 6.10. Observed lane offset for Driver 6 in Trip 2 (black solid curve), one-step-ahead prediction of the 3 s beginning at the end of the trip (middle red dashed curve), and the 1 standard deviation region around the prediction (top and bottom red dashed curves). (Panel title: Predict (red, dashed) up to 3 seconds. Axes: TIME versus LANOFFSET.)

However, to obtain lane offset predictions, we used a naive approach, in that we assumed that both lateral speed and lateral acceleration remained fixed at the values observed at the end of the trip. In reality, lateral speed and lateral acceleration also change during the prediction period.
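The naive forecast described above can be sketched by iterating Equation 6.12 forward, setting future error terms to their expected value of zero and holding the covariates at their last observed values. This is a minimal illustration under stated assumptions, not the code used in the report: the function name and arguments are hypothetical, any intercept is assumed to enter as a constant 1.0 in the covariate vector, and the coefficients in the usage example are made-up values rather than those of Table 6.8. It follows Equation 6.12 as written; some software (for example, R's `arima` with `xreg`) instead applies the ARMA structure to the regression errors, so results would differ.

```python
def arma_forecast(y_hist, e_hist, ar, ma, x_last, beta, steps):
    """Iterate Equation 6.12 forward for a number of steps.

    y_hist : observed responses, most recent last (at least len(ar) values)
    e_hist : fitted residuals, most recent last (at least len(ma) values)
    ar, ma : autoregressive (a_1..a_p) and moving average (b_1..b_q) coefficients
    x_last : covariate values at the end of the record, held fixed (naive assumption)
    beta   : regression coefficients matching x_last
    """
    y = list(y_hist)
    e = list(e_hist)
    regression = sum(b * x for b, x in zip(beta, x_last))
    forecasts = []
    for _ in range(steps):
        ar_part = sum(a * y[-i] for i, a in enumerate(ar, start=1))
        ma_part = sum(b * e[-j] for j, b in enumerate(ma, start=1))
        y_next = ar_part + ma_part + regression
        forecasts.append(y_next)
        y.append(y_next)
        e.append(0.0)  # expected value of a future error term
    return forecasts

# Hypothetical stationary ARMA(2,1) with two covariates, forecast 30 steps (3 s at 0.1 s).
path = arma_forecast(y_hist=[0.10, 0.12], e_hist=[0.01],
                     ar=[0.6, 0.2], ma=[0.3],
                     x_last=[0.05, -0.02], beta=[-1.0, 2.0], steps=30)
print(len(path), round(path[0], 4))
```

Letting the covariates evolve during the forecast period would require forecasting them as well, which is the extension discussed next.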
The simple prediction approach can be extended to allow the covariates to evolve over time, but doing so requires explicitly including lateral speed and lateral acceleration as elements of a vector-valued response and modeling each variable as a function of the other two.

For a second example, we use the data collected for Driver 51, whose trip included two curves. The association between lane offset and curve can be expected to depend not only on curve characteristics such as length and radius but also on the location of the vehicle within the curve. The original data set indicates whether the driver is entering a curve to the left or to the right, as well as the length of the curve. The data set includes a variable called CURVE, which takes the value 1 during all time periods in which the driver is negotiating a curve to the right, the value 2 during all time periods in which the driver is negotiating a curve to the left, and the value 0 whenever the driver’s vehicle is on a straight road.

Figure 6.11 shows the value of the variable CURVE for Driver 51. The labels have been changed to −1, 0, and 1 to denote left curve, no curve, and right curve, respectively. The three dashed curves colored red, blue, and green in the figure correspond to three different smooth functions that depend on the length of the curve. All three smooth representations (or summaries) of the curves improve the fit of the time series model relative to the model that includes the static −1, 0, 1 labels. The function drawn in red is best, at least in the AIC sense.

The function that smooths the effect of a curve over the period during which the driver is negotiating it can perhaps be improved by including the radius of the curve (in addition to the length) in the smoothing function.
Using the same ARMA(3,2) model, but now with an additional explanatory variable consisting of the smooth curve indicator, we fitted the model with the indicator corresponding to the red trajectory in Figure 6.11. Table 6.9 shows the estimated model parameters and their standard errors. Note that when the curve indicator is included in the model, lagged lateral acceleration (AY) is no longer statistically significant. Lateral speed continues to be negatively and significantly associated with lane offset.

Figure 6.11. Original curve indicator (black solid line) and three smooth functions of curve length. (Panel title: Some smooth functions for CURVE (Driver = 51). Axes: Time versus CURVE.)

Table 6.9. Estimates of the Parameters in the ARMA(3,2) Model Based on the Data Collected for Driver 51 in the Data Set

| | ar1 | ar2 | ar3 | ma1 | ma2 | Intercept | LATERALSPE(t − 1) | AY(t − 1) | Smooth(Curve) |
|---|---|---|---|---|---|---|---|---|---|
| Coefficients | 0.8558 | 0.9891 | −0.8541 | 0.2144 | −0.7747 | −0.2199 | −1.2364 | −0.5456 | 0.9797 |
| SE | 0.0662 | 0.0137 | 0.0649 | 0.0833 | 0.0832 | 0.0975 | 0.4287 | 3.2924 | 0.2394 |

σ² estimated as 0.00443: log likelihood = 1309.5, AIC = −2599

Figure 6.12. Diagnostic plots for the ARMA(3,2) model fit to Driver 51. (Top panel: standardized residuals versus time. Bottom panel: ACF of residuals versus lag.)

As in the earlier example, we can explore residual plots and autocorrelation plots to carry out model diagnostics. Figure 6.12 shows the time-ordered estimated residuals (top panel) and the autocorrelation function for the estimated residuals (bottom panel). We see from the top panel that the proportion of standardized residuals with very high or very low values is

negligible, and the autocorrelation function in the bottom panel suggests that the autoregressive and moving average structures in the model account for the residual time dependence.

Sample Size

Several methods were considered to estimate the sample size needs for conducting a time series analysis in the full-scale study. The methodology is rather complicated, and describing it was judged to be beyond the scope of this report.

Application to Full-Scale Study

For continuous online forecasting while driving in the full-scale study, the research team proposes fitting a normal dynamic linear model (DLM) that permits continuous updating of the forecast distributions when new observations become available. Each update can be made by optimizing some function of the observed data and the previous forecasts. Two such optimization approaches are the minimum mean square criterion and the Bayesian (posterior distribution) criterion. If the variance is known, Bayesian forecasting for the DLM is essentially equivalent to the Kalman filter used extensively in engineering control processes.

The univariate normal DLM is sometimes known as a state-space model and includes the following:

• Observation equation: $Y_t = F_t'\theta_t + \nu_t$, with $\nu_t \sim N_1(0, V_t)$
• State evolution equation: $\theta_t = G_t\theta_{t-1} + \omega_t$, with $\omega_t \sim N_p(0, W_t)$
• Initial prior: $(\theta_0 \mid D_0) \sim N(m_0, C_0)$, where $(m_0, C_0)$ are fixed and $D_t = \{Y_t, D_{t-1}\}$

The model states that the underlying “state” $\theta_t$ evolves smoothly over time as an autoregressive process and that the observation at time t is a smooth function of the state. Coefficients $F_t$ and $G_t$ are often assumed to be constant over time, but they can also be allowed to be time dependent.

When the state-space model is linear and the two random disturbance terms ν and ω are normally and independently distributed, forecasting consists essentially of the estimation of normal conditional means at each step.
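One forecast-and-update cycle for the model defined by the observation and state evolution equations above can be sketched as follows, using the standard Kalman-type recursions for a normal DLM. To keep the example short it assumes a scalar state (p = 1) and known variances; the function name and the numerical values are illustrative and do not come from the report.

```python
def dlm_step(m_prev, c_prev, y, f, g, v, w):
    """One forecast-and-update cycle of a univariate normal DLM with scalar state.

    m_prev, c_prev : posterior mean and variance of the state at time t - 1
    y              : observation at time t
    f, g           : observation and evolution coefficients (F_t, G_t)
    v, w           : observation and evolution variances (V_t, W_t)
    Returns (forecast mean, forecast variance, posterior mean, posterior variance).
    """
    prior_mean = g * m_prev            # mean of the state prior at time t
    r = g * c_prev * g + w             # variance of the state prior, R_t
    forecast = f * prior_mean          # one-step forecast mean for Y_t
    q = f * r * f + v                  # one-step forecast variance, Q_t
    gain = r * f / q                   # adaptive coefficient, A_t
    m = prior_mean + gain * (y - forecast)
    c = r - gain * gain * q
    return forecast, q, m, c

# Filter a short, made-up series of lane offsets with a local-level model (f = g = 1).
m, c = 0.0, 1.0
for y in [0.10, 0.14, 0.11, 0.18]:
    forecast, q, m, c = dlm_step(m, c, y, f=1.0, g=1.0, v=0.01, w=0.001)
    print(f"forecast = {forecast:.3f}, forecast variance = {q:.4f}, updated mean = {m:.3f}")
```

With a vector state, the same steps apply with matrix products in place of the scalar multiplications.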
The one-step forecast at each $t$ is then obtained as follows:

• Posterior at $t-1$: $(\theta_{t-1} \mid D_{t-1}) \sim N(m_{t-1}, C_{t-1})$.
• Prior at $t$: $(\theta_t \mid D_{t-1}) \sim N(G_t m_{t-1}, R_t)$, where $R_t = G_t C_{t-1} G_t' + W_t$.
• Forecast: $(Y_t \mid D_{t-1}) \sim N(F_t' G_t m_{t-1}, Q_t)$, where $Q_t = F_t' R_t F_t + V_t$.
• Posterior at $t$: $(\theta_t \mid D_t) \sim N(m_t, C_t)$, where $m_t = G_t m_{t-1} + A_t (Y_t - F_t' G_t m_{t-1})$, $C_t = R_t - A_t A_t' Q_t$, and $A_t = R_t F_t Q_t^{-1}$.

The following is an example application of the Kalman filter from Bar-Shalom et al. (2001). The example involves estimating the distance (range) between two vehicles and their relative speed (range rate). Consider $X(t) = (\text{range}(t), \text{range rate}(t))'$. Assuming a constant range rate, we have the following (in this example, $F$ plays the role of $G_t$ in the DLM and $H$ plays the role of $F_t'$):

• Original state equation: $x(k) = F x(k-1)$, where $F$ is the system matrix.
• Original measurement: $y(k) = H x(k)$, where $H$ is the measurement matrix.
• True state equation: $x(k) = F x(k-1) + G u(k-1)$, where $u(k-1)$ is acceleration.
• Observed measurement: $y(k) = H x(k) + w(k)$, where $w(k)$ is the measurement error.

In previous continuous-time models, the explanatory variables in the additive model explained the vehicle shifts (left ← or right →) related to the variables. Another option is to connect the explanatory variables to vehicle recovery time after model prediction. Let $t_0$ = the starting time where the 3-s prediction confidence (credible) region covers either edge of the lane. If the driver adjusts the vehicle before the ROR event happens, then $t = 0$. Let $t_1$ = the time it takes the whole vehicle to return to the lane. Assume the failure time $t = t_1 - t_0$ is exponentially distributed with the density function $f(t) = \lambda e^{-\lambda t}$. The log hazard function should be modeled as $h(t) = X\beta$.
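The one-step recursion (prior, forecast, posterior) and the range and range-rate example described earlier in this section can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the report's or Bar-Shalom et al.'s implementation: the sampling interval `T`, the choice to measure range only (H = [1, 0]), the treatment of acceleration as zero-mean noise with variance `accel_var`, the measurement variance `meas_var`, and all function names are hypothetical.

```python
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]

def kalman_step(m, C, y, T=0.1, accel_var=0.01, meas_var=0.25):
    """One update for the state (range, range rate)'.

    m is the 2x1 posterior mean and C the 2x2 posterior covariance at the
    previous step; y is the new range measurement. Assumed model:
    x(k) = F x(k-1) + g u(k-1),  y(k) = H x(k) + w(k).
    """
    F = [[1.0, T], [0.0, 1.0]]      # constant range-rate system matrix
    g = [[0.5 * T * T], [T]]        # how acceleration enters the state
    H = [[1.0, 0.0]]                # only range is measured (assumption)

    # Prior at k: mean a = F m, covariance R = F C F' + W, with W = accel_var * g g'
    a = matmul(F, m)
    R = matmul(matmul(F, C), transpose(F))
    R = [[R[i][j] + accel_var * g[i][0] * g[j][0] for j in range(2)] for i in range(2)]

    # Forecast of the measurement: mean f = H a, variance Q = H R H' + meas_var
    f = matmul(H, a)[0][0]
    Q = matmul(matmul(H, R), transpose(H))[0][0] + meas_var

    # Posterior: gain A = R H' / Q, mean a + A (y - f), covariance R - A A' Q
    A = [[row[0] / Q] for row in matmul(R, transpose(H))]
    m_new = [[a[i][0] + A[i][0] * (y - f)] for i in range(2)]
    C_new = [[R[i][j] - A[i][0] * A[j][0] * Q for j in range(2)] for i in range(2)]
    return m_new, C_new

# Synthetic example: a lead vehicle 50 m ahead, closing at 2 m/s.
rng = random.Random(1)
T = 0.1
true_range, true_rate = 50.0, -2.0
m, C = [[50.0], [0.0]], [[1.0, 0.0], [0.0, 100.0]]  # vague prior on range rate
for _ in range(200):
    true_range += T * true_rate
    y = true_range + rng.gauss(0.0, 0.5)            # noisy range measurement
    m, C = kalman_step(m, C, y, T=T)
print("estimated range and range rate:", m[0][0], m[1][0])
```

Because the measurement is a scalar, the forecast variance Q is a number and no matrix inversion is needed. A multivariate measurement would require inverting Q instead of dividing by it.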
Such a recovery-time model answers questions like “recovery time versus road condition” or “recovery time versus driver’s record.”

Summary and Conclusions

Several exploratory analysis methods were applied to data extracted from existing naturalistic driving studies to demonstrate ways in which lane departure research questions could be answered in the SHRP 2 full-scale study. The intent of the analyses was to demonstrate different methods that could be used to analyze the data that will result from the full-scale study.

A data sampling approach developed by the SHRP 2 Safety Project S02 researchers was described, and four analysis methods were presented. The four approaches included (1) a data mining approach using classification and regression tree analysis, (2) simple odds ratio and logistic regression, (3) logistic regression for correlated data that accounts for repeated sampling among observations (e.g., repeated sampling for the same driver, trip), and (4) a time series analysis.

Three of these methods were used to evaluate existing naturalistic driving study data, and one method expanded on a varied logistic regression approach that may be better suited to the data from the full-scale study. Data were available from the UMTRI road departure crash warning (RDCW) field operational test (FOT) that contained a number of nonconflict lane departures and samples of normal driving. Methods 1 and 2 ([1] classification and regression tree and [2] simple odds ratio and logistic regression) evaluated the likelihood of a left- or right-side lane departure. A sample-based approach was used in the classification and regression tree analysis, and an event-based approach was used for the logistic regression. Although available sample sizes were limited, both methods produced similar results. Both indicated that curve radius, driver age, and type of shoulder were relevant in explaining lane departures. The logistic regression also indicated that both left- and right-side lane departures were more likely to occur at night and were less likely to occur as lane width increased. The model for left-side lane departures indicated that male drivers were more likely than female drivers to be involved in a lane departure, and the model for right-side lane departures indicated that lane departures are more likely on roadway sections with a higher density of lane departure crashes and for drivers who spend more time traveling 10 mph or more over the posted speed limit.

The fourth method, time series analysis, used continuous data to develop a model to predict offset as a function of several vehicle kinematic variables. The method was developed and explained in such a way that it could be adapted to the full-scale study to include various explanatory variables, including driver behavior. This approach allows information, such as driver distraction in previous time periods, to be incorporated into the model.

As indicated, the analyses presented in this chapter were exploratory, with the intent to demonstrate different analysis methods that could be used to analyze the data that will result from the full-scale study. Because the amount of data was limited, the analyses in most cases yielded only preliminary results.
Selecting an appropriate model for the full-scale naturalistic study will depend on the research questions posed and the resources that can be used to reduce data. Each approach has its advantages and limitations in terms of the full-scale naturalistic study.

The main advantage of the classification and regression tree analysis is that it can be used to uncover patterns in the data that other methods may mask. The results may indicate that a variable is only relevant at a certain point (splitting value). For instance, there may only be a correlation between lane departures and curves with a radius of 500 ft or less, while no relation exists with larger curve radii. It is difficult to uncover this sort of structure using other models. Tree models are also adept at revealing complex interactions between variables. Each branch may have different combinations of variables, and the same variable can be present in more than one part of the tree. This complexity reveals dependencies between variables and the point at which the dependency exists (Hosmer and Lemeshow, 1986). However, several disadvantages exist for this method. A classification and regression tree may result in unstable decision trees if improper modifications are made. If data have a complex structure, a classification and regression tree may not correctly model the data structure (Timofeev, 2004). A classification and regression tree can also result in an overly complex tree structure and in models that are better for prediction than estimation (Hosmer and Lemeshow, 1986). Additionally, practitioners may not be as familiar with regression tree analysis as other methods, and incorporating the resulting information into decision making may be difficult.

Logistic regression analysis is ideal for the naturalistic driving study because normal driving data can be used to account for exposure.
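The way normal driving samples account for exposure can be illustrated with a simple odds ratio. The sketch below is a minimal example under stated assumptions: the counts are hypothetical and are not taken from the RDCW data, and the function names are illustrative. In a logistic regression with a single binary predictor, the exponentiated coefficient equals this odds ratio.

```python
import math

def odds_ratio(events_exposed, events_unexposed, normal_exposed, normal_unexposed):
    """Odds ratio from a 2x2 table that uses normal driving samples as the baseline.

    'Exposed' means the condition of interest was present (e.g., nighttime).
    """
    return (events_exposed * normal_unexposed) / (events_unexposed * normal_exposed)

def odds_ratio_ci(events_exposed, events_unexposed, normal_exposed, normal_unexposed, z=1.96):
    """Approximate confidence interval computed on the log odds ratio scale."""
    log_or = math.log(odds_ratio(events_exposed, events_unexposed,
                                 normal_exposed, normal_unexposed))
    se = math.sqrt(1 / events_exposed + 1 / events_unexposed
                   + 1 / normal_exposed + 1 / normal_unexposed)
    return math.exp(log_or - z * se), math.exp(log_or + z * se)

# Hypothetical counts: 30 lane departures at night and 20 in daytime,
# against 100 nighttime and 200 daytime samples of normal driving.
estimate = odds_ratio(30, 20, 100, 200)
lower, upper = odds_ratio_ci(30, 20, 100, 200)
print(f"odds ratio = {estimate:.2f}, approximate 95% CI = ({lower:.2f}, {upper:.2f})")
```

Without the normal driving samples, the raw counts alone (30 versus 20) would not show whether night driving is overrepresented, because the amount of driving under each condition would be unknown.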
Accounting for exposure is important because, historically, it has been difficult to account for driver activity under a range of situations in order to determine if one situation is overrepresented. While naturalistic data provide the volume of data needed to assess the representation of various types of situations, researchers face the challenge of constructing meaningful equivalence classes to define these situations. Results of logistic regression can be expressed as odds ratios, which can easily be explained to lay persons and used by transportation agencies.

One disadvantage is that researchers, when applying logistic regression appropriately to the full naturalistic driving study, need to specify and identify events of interest. As a result, some relationships may not be uncovered.

Time series models are highly appropriate for naturalistic driving study data because they can account for dependencies between driver behaviors and other factors in time intervals. The main advantage of applying a time series analysis to naturalistic driving study data is that the analysis allows relationships between variables across time to be incorporated into the model. As a result, relationships such as driver distraction in previous time periods and probability of a red-light-running crash in a subsequent time period can be established. The biggest drawback to time series models is that they require the use of continuous (raw) data. Reducing variables not already included in the data sets at this level of data segmentation can be tremendously resource-intensive. Additionally, in the case of the example model presented above, only a few variables and a small data set were used and the model was still rather complicated. The results of time series analyses are also not familiar to highway agencies, and consequently it will be difficult to present results in a manner that can easily be used in decision making.


TRB’s second Strategic Highway Research Program (SHRP 2) Report S2-S01E-RW-1: Evaluation of Data Needs, Crash Surrogates, and Analysis Methods to Address Lane Departure Research Questions Using Naturalistic Driving Study Data examines the statistical relationship between surrogate measures of collisions (conflicts, critical incidents, near collisions, or roadside encroachments) and actual collisions.

The primary objective of the work described in this report, as well as other projects conducted under the title, Development of Analysis Methods Using Recent Data, was to investigate the feasibility of using naturalistic driving study data to increase the understanding of lane departure crashes.

This publication is available only in electronic format.
