Application of Big Data Approaches for Traffic Incident Management (2023)

Chapter: Chapter 3 - Datasets and Data Quality

Suggested Citation:"Chapter 3 - Datasets and Data Quality." National Academies of Sciences, Engineering, and Medicine. 2023. Application of Big Data Approaches for Traffic Incident Management. Washington, DC: The National Academies Press. doi: 10.17226/27300.


The existence of data alone does not ensure that accurate analyses can be performed or that the data can guide effective decisions. Quality data are essential to clearly identify operational behaviors and to make effective operational decisions. NCHRP Research Report 904 identifies a host of quality issues across the TIM-relevant data sources that were assessed. For example, ATMS data assessments (conducted as part of relevant previous efforts) found a variety of quality issues, some of which may be due to how operators use the system as opposed to how the systems were designed to be used. The result is that data need to be carefully assessed before they are used to report performance and other measures.

As part of NCHRP Project 03-138, the research team conducted a comprehensive assessment of the quality of data collected for the big data use cases and pipelines. In traditional data mining, data are cleaned prior to analysis, and "bad" data are often discarded. In the context of big data, data are often more voluminous, complex, unstructured, and "fuzzy," making it difficult to assess which data should be kept and which should be discarded. To circumvent this difficulty, data quality is handled differently in big data systems. Rather than removing "bad" data, as soon as the data are stored in the system, a dedicated data enrichment process considers multiple aspects of the data and tags each data element with quality labels and scores. This process allows data quality to be monitored over time and sudden changes in data quality to be identified. This approach treats bad or lesser-quality data, which are often discarded in traditional systems, as a source of insight into their veracity, and allows analysts to consider ways to improve the data rather than simply discarding them.

3.1 Approach

Data quality is typically assessed across six aspects or dimensions, which are illustrated in Figure 1.
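The enrichment step described above, which tags each data element with quality labels and scores instead of discarding it, can be sketched as follows. This is a minimal illustration only: the field names, required-field list, and scoring rule are assumptions, not the project's actual schema.

```python
# Sketch of a data-enrichment pass that tags records with quality labels
# and scores instead of discarding them. Field names and rules are
# illustrative assumptions, not the schema used by the project.

REQUIRED_FIELDS = ("timestamp", "latitude", "longitude", "incident_type")

def tag_quality(record):
    """Attach quality labels and a 0-1 completeness score to a record."""
    labels = []
    present = [f for f in REQUIRED_FIELDS if record.get(f) not in (None, "")]
    score = len(present) / len(REQUIRED_FIELDS)
    if score < 1.0:
        labels.append("incomplete")
    lat = record.get("latitude")
    if lat is not None and not -90 <= lat <= 90:
        labels.append("nonconforming_latitude")
    tagged = dict(record)
    tagged["quality_labels"] = labels
    tagged["quality_score"] = round(score, 2)
    return tagged  # the record is kept, never dropped
```

Because every record survives with its tags, quality can be aggregated per day or per feed to spot the sudden changes the text describes.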
This section includes a description of each data quality dimension, examples of how each dimension could be implemented, and sample questions to ask when assessing the data.

Completeness is defined as expected comprehensiveness. If the data meet the expectations of subsequent analyses, data can be complete even if some data are missing. An example of completeness would be crash data that contain accurate date, time, and geolocation for most crashes but are missing weather information for half of the crashes. While there are some missing data elements, the data can be considered complete because the missing weather data can be recreated by querying historical weather datasets, such as those of the National Weather Service (NWS).

Questions to ask when assessing the data:
• Is all the requisite information available?
• Do any data values have missing elements?

• How are the data distributed geographically?
• How are the data distributed across time?
• Can missing data be recreated easily?

Timeliness refers to whether information is available when it is expected and needed. Timeliness needs vary widely based on the intended use of the data, and they can vary from a few seconds to a few months. Timeliness of data is important for TIM when detecting traffic condition anomalies, dispatching responders, and communicating incident details to responding agencies.

Questions to ask when assessing the data:
• How often are data refreshed?
• How long does it take for the data to be published or communicated?
• How are the data communicated?
• How often should the data feed be checked?
• What data integrity processes are conducted before the data are published or communicated?

Consistency means that data are published the same way across their entire history and geographical coverage. System failures and limited IT resources can have unintended effects on data. It is not uncommon for datasets that were reported consistently at a five-minute interval to be suddenly reported at a more aggregate level (e.g., hourly) to preserve the resources of the system that provides the data. These changes may appear insignificant to the system administrator trying to keep the data feeds live, but they can drastically change the analytical potential of the data by reducing granularity and damping variation and patterns that could be essential to decision-making.

Questions to ask when assessing the data:
• Are data values the same across the datasets?
• Are there any distinct occurrences of the same data instances that provide conflicting information?

Conformity means the data follow the same set of standard data definitions (like type, size, and format) across their history and geographical coverage. Maintaining conformance to specific formats is critical when performing analyses at a state or national level.
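The consistency failure described earlier, a five-minute feed suddenly reported hourly, can be caught by monitoring the typical spacing between reports. A minimal sketch, using minutes-since-midnight integers for simplicity (a real check would use full timestamps):

```python
# Sketch: flag a sudden change in reporting granularity, such as a feed
# that silently drops from 5-minute to hourly records. Times here are
# minutes since midnight for simplicity.

def median_gap(times):
    """Median spacing between consecutive report times."""
    gaps = sorted(b - a for a, b in zip(times, times[1:]))
    return gaps[len(gaps) // 2]

def granularity_changed(old_window, new_window, tolerance=2.0):
    """True if the typical reporting interval grew by more than tolerance x."""
    return median_gap(new_window) > tolerance * median_gap(old_window)
```

Run daily against a sliding window, a check like this surfaces the aggregation changes that a system administrator might not think to announce.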
System updates or operator errors can create nonconforming data; it is not uncommon for some locations, times, or incident classification information to be expressed differently in some data than in the rest of the dataset. Software updates can sometimes swap latitude and longitude, thereby mislocating incidents. Similarly, the month and day associated with incident response timestamps may be switched, creating incident responses that appear to have taken days. Also, the lack of training can lead operators to enter custom incident descriptions where a known standard could be used, such as the Model Minimum Uniform Crash Criteria (MMUCC) (NHTSA, 2017).

Figure 1. Six dimensions of data quality (completeness, timeliness, consistency, conformity, accuracy, and integrability).
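The swapped latitude/longitude failure described above lends itself to a simple conformity check: a point that falls outside a plausible envelope but becomes plausible when the pair is flipped was likely entered in the wrong order. The bounding box below is a rough continental U.S. envelope, an assumption for illustration; a statewide dataset would use its own extent.

```python
# Sketch: conformity check for swapped latitude/longitude coordinates.
# The bounding box is a rough envelope for the continental United States
# (an illustrative assumption, not a value from the report).

def coords_conform(lat, lon):
    """True if (lat, lon) falls inside a plausible CONUS envelope."""
    return 24.0 <= lat <= 50.0 and -125.0 <= lon <= -66.0

def likely_swapped(lat, lon):
    """True if the pair is implausible as-is but plausible when flipped."""
    return not coords_conform(lat, lon) and coords_conform(lon, lat)
```

The same pattern (implausible as recorded, plausible when flipped) also applies to the month/day timestamp swap mentioned above.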

Questions to ask when assessing the data:
• Do any data values comply with the specified formats?
• If so, do all the data values comply with those formats?
• Which records fail to comply with the standard?
• Can the failing records be corrected?

Accuracy is the degree to which data correctly reflect the real-world events being described. A lack of accuracy can seriously impact operational and advanced analytics applications. Inaccuracies in TIM data are often a result of human-entered data, such as data entered by TMC operators into the ATMS and by law enforcement officers onto the crash report, as opposed to automatically generated data.

Questions to ask when assessing the data:
• Do data objects accurately represent the "real-world" values they are expected to model?
• Are there incorrect spellings or measures, or even untimely or noncurrent data?

Integrability is the ability of the data to be easily integrated with other datasets in a data environment (i.e., datasets that share a common data element). An inability to link data to related records may drastically reduce the value of the data. For example, incident and incident response data collected using different software across multiple agencies (e.g., transportation ATMS, law enforcement crash reporting systems, and law enforcement CAD systems) should allow incident data in one system to be easily linked to the same incident in a different system. This can be accomplished by using a common incident ID across each system and agency, but often such requirements are not considered and agency systems are designed independently. This creates a need for more complex geo-temporal analytics when attempting to merge disparate datasets in order to obtain more incident details, which often results in a significant loss of data.

Questions to ask when assessing the data:
• Are there any common data elements across related datasets?
• Are there any unique identifiers common to related datasets?
• Are similar data elements expressed using the same format or standards?
• Are the data elements expressed at a similar level of aggregation?
• Can intermediary datasets be used to connect disparate datasets?

The research team applied all six quality dimensions to each of the identified datasets to generate data quality tags for each of the records. Conventional descriptive statistics were also used in a big data context to explore aspects of each dataset, and summaries and visualizations were generated to represent the inherent characteristics of each dataset. Finally, the existing gaps, barriers, and limitations of the data were identified as they related to their potential use in the development of the big data use cases and associated data pipelines.

3.2 Results

This section presents the results of the assessment of datasets relevant to the identified use cases. The results are presented in terms of the six dimensions of data quality previously described. The following datasets and data sources were assessed:

1. Traffic incident data.
   – Traffic crash report.
   – ATMS/integrated ATMS-CAD.
   – CAD.
   – SSP.
   – Free navigation app.

2. Traffic data.
   – Department of transportation (DOT) intelligent transportation systems (ITS) fixed sensors.
   – Probe vehicle data.
3. Location reference data.
   – Roadway inventory.
   – LRS.
   – Third-party road network.
4. SharedStreets Referencing System/OpenStreetMap (Open Transport Partnership, 2020).
5. All Road Network of Linear Referenced Data (ARNOLD) (FHWA, 2014).
6. Weather data.
   – Meteorological Assimilation Data Ingest System (MADIS) (NOAA, 2018).
   – Road weather data/Weather Data Environment (WxDE) (FHWA, n.d.-b).
   – Third-party weather application programming interface (API).
7. Third-party CV data.

3.2.1 Traffic Incident Data

Traffic incident data included in this assessment consist of crash data, ATMS/integrated ATMS-CAD data, CAD data, SSP data, and free navigation app data. A summary of the assessment outcomes of each of these types of data is provided in the following subsections.

3.2.1.1 Crash Data Assessment

Overview of the Crash Data Collected and Assessed

Crash reports are completed by law enforcement officers and compiled, digitized, and maintained by various state agencies (including DOTs, departments of public safety, and others). Most states divide the data into information about the crashes themselves, information about the persons involved in the crashes, and information about the vehicles involved in the crashes. These datasets include information about various characteristics of each crash, as well as possible contributing factors. Recorded crash characteristics vary from state to state. Crash data are recorded manually, and some data elements are prone to error or sometimes missing altogether. Recognizing the need for secondary-crash data and timestamps on the TIM timeline for the use cases identified for this project, the team gathered crash data with these data elements from states. The team requested statewide crash data from eleven states and received data from nine of them.
Date ranges for the crash data varied by state but included at least three years of data. Figure 2 shows the number of crashes across the states and across years. Data from Florida and Colorado date back the farthest, followed by Tennessee, Arizona, and Ohio. Not as many crashes were available from Wyoming, Maine, Nevada, and Utah due to fewer years of data being shared and the rural nature of these states.

Crash Data Assessment—Summary

A summary of the data assessment, including challenges, limitations, and recommendations, is provided in Table 1. Detailed findings of the crash data assessment are described in Appendix A posted with this report. (Appendix A through Appendix P may be found by searching on the National Academies Press website, nap.nationalacademies.org, for NCHRP Research Report 1071.)

3.2.1.2 ATMS Data Assessment

Overview of the ATMS Data Collected and Assessed

Advanced traffic management systems (ATMSs) integrate and process real-time traffic data from ITS field devices (e.g., cameras, speed sensors) to help operators enhance vehicle traffic flow and safety through improved traffic incident detection and clearance, traveler information, and so on. Operators at a TMC and field personnel can also input information on incidents, maintenance, and work zones, among others, into the ATMS. Some transportation agencies have worked with their law enforcement partners to integrate their ATMS with public safety CAD systems to streamline incident detection, communications, and response. The data collected typically include incident location, timestamps, type and severity, number of lanes blocked, and other attributes.

To support the selected use cases, the team gathered and assessed ATMS data from several systems, including the Tennessee Department of Transportation's (TDOT's) Locate IM system, the Minnesota Department of Transportation's (MnDOT's) Intelligent Roadway Information System (IRIS), and the Utah Department of Transportation's (UDOT's) TransSuite system. Figure 3 shows the locations of ATMS incidents in the Minnesota and Tennessee datasets. The team was unable to map Utah's ATMS data because they were encoded using an unknown coordinate referencing system (CRS). The number of incidents available per year for Minnesota, Tennessee, and Utah is shown in Figure 4. There were many more incidents in TDOT's Locate IM than there were in UDOT's TransSuite, and the incidents from MnDOT's IRIS were only those that the team archived from the feed provided by MnDOT between February 9 and April 21, 2021 (i.e., no historical data were provided).

It is interesting to note the distribution of incident locations across Minnesota and Tennessee. MnDOT's ATMS is integrated with the Minnesota State Patrol, thus incident locations across the state are provided. TDOT's ATMS data show good coverage in the northeastern part of the state, but only along interstate highways throughout the rest of the state.
ATMS Data Assessment—Summary

A summary of the data assessment, including challenges, limitations, and recommendations, is provided in Table 2. Detailed findings of the ATMS data assessment can be found in Appendix B posted with this report.

Figure 2. Crash records per year per state (number of crashes in thousands, 2004-2020, for Arizona, Colorado, Florida, Maine, Nevada, Ohio, Tennessee, Utah, and Wyoming).

Challenges/limitations:
• Human data collection contributes to quality issues (e.g., missing data, erroneous data, commas within cells, free-text fields, proper nouns).
• Inconsistency of data across states makes data conflation challenging and leads to missing data.
• Timeliness of data can be an issue (i.e., data not made available for months/year after collection).
• Discrepancies within dataset (e.g., inconsistent formatting, varying levels of completeness between agencies, subjectivity).
• Discrepancies between datasets (e.g., different data elements and attributes, different data types between similar fields, different collection processes).
• Location coordinates can be inaccurate (e.g., inverted).

Consequences for use in TIM big data use cases:
• Not all crashes are reportable (e.g., missing minor crashes, bias in TIM performance measures).
• Overall small number of secondary crashes found in data. While this is the largest set of secondary crashes ever assembled for analysis, it does not qualify as "big data" from the standpoint of data volume.
  o Less than 40 percent of state crash forms include the MMUCC secondary-crash data element, which sometimes is not mandatory for law enforcement officers to complete.
  o Out of over 10 million total crashes, only about 50,000 (0.5 percent) were flagged as "secondary."
  o Even among the 50,000 crashes flagged as secondary, only 30 percent could be verified using a spatial-temporal analysis.
  o The subjectivity of the secondary-crash definition makes secondary-crash data challenging to collect, which can impact the accuracy/veracity of the data collected.
  o Many of the secondary crashes had the same time and GPS location as the primary crash, which limited the ability to analyze the distribution of (a) time and (b) distance between primary and secondary crashes.
• Quantity and quality of data on TIM clearance times is lacking (to support use cases associated with TIM performance):
  o Less than 30 percent of states include the roadway clearance time (RCT) MMUCC data element on crash forms.
  o The data gathered contained only about 2.3 million crashes with values for RCT and/or incident clearance time (ICT).
  o Rounding crash start time, roadway clearance time, and/or incident clearance time leads to inaccurate analyses of RCTs and ICTs.
  o A "0" RCT where there was never a lane blocked due to a crash artificially decreases the average RCT in the data, making it less accurate. In some states, there is no indication of whether a lane was blocked, which makes these records difficult to tease out.
• Missing data and data gaps (due to combining unstandardized data across states) limit the ability to perform big data analyses (e.g., cluster analysis, predictive analysis) on the data.
• Extensive preparation is needed to standardize the data across states.
• Despite a good sample size, there are sizable gaps in data and attributes; if provided, this information would enhance analysis or complete missing variables.

Recommendations:
• Work to standardize crash data (e.g., use the MMUCC guideline) to improve multistate crash analyses (e.g., secondary crashes).
• For easier processing, re-encode crash data files to UTF-8 (a variable-width character encoding in Unicode Transformation Format) as opposed to CP-1252 encoding (a single-byte character encoding used by default in the legacy components of Microsoft Windows).
• Fix formatting errors in comma-separated values (CSV) or pipe-delimited formats (e.g., extra quotation marks) so that software tools can ingest and process the data.
• Convert data into new formats and deploy them to a cloud environment for easier querying and analysis.
• Plot RCT and ICT distributions to determine where rounding issues could impact their analysis.
• Store crash data elements for each analysis period (e.g., year) to assist with multistate analyses.
• Validate and correct, where necessary, crash location coordinates.

Table 1. Summary of crash data assessment.
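The re-encoding recommended in Table 1 can be sketched as a byte-level conversion; how it is applied per file is left to the pipeline. The sample string is an invented example.

```python
# Sketch of the re-encoding recommended in Table 1: decode bytes that were
# written as CP-1252 and re-encode them as UTF-8 so that downstream tools
# can ingest the file without mangling accented characters.

def cp1252_to_utf8(raw: bytes) -> bytes:
    """Decode CP-1252 bytes and re-encode them as UTF-8."""
    return raw.decode("cp1252").encode("utf-8")
```

Characters such as "ñ" occupy one byte in CP-1252 and two in UTF-8, so the conversion must happen at the decode/encode boundary rather than by relabeling the file.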

Source: © 2021 Mapbox; © OpenStreetMap.
Figure 3. Geographic representation of ATMS incidents in Minnesota (top) and Tennessee (bottom).

Figure 4. ATMS data per state per year from MnDOT IRIS, TDOT Locate IM, and UDOT TransSuite (number of incidents per year, 2009-2021). Note: UDOT had 15 crashes in 2009, 4 crashes in 2010, and 8 crashes in 2011.

3.2.1.3 CAD Data Assessment

Overview of the CAD Data Collected and Assessed

State and local law enforcement agencies, fire departments, emergency medical services (EMS) agencies, and 911 centers rely on computer-aided dispatch (CAD) systems to log information regarding incidents. These systems are usually based on proprietary command center software. CAD systems allow operators to record and prioritize incident calls, dispatch responder personnel, and identify the status and location of responders in the field. The data stored in CAD systems are typically structured in the form of communication logs indexed by date and time, with fields for free text.

The team had access to public CAD data published by the California Highway Patrol (CHP) between July 1, 2018, and February 19, 2019. The data included 356,939 traffic-related events ranging from collisions to vehicle fires to spilled materials to animal hazards. Figure 5 shows a map of these incidents across California during this time (FHWA, 2019a). The color of each point indicates whether the incident contained sufficient data to allow the calculation of one TIM performance measure, both measures, or neither.

CAD Data Assessment—Summary

A summary of the data assessment, including challenges, limitations, and recommendations, is provided in Table 3. Detailed findings of the CAD data assessment can be found in Appendix C posted with this report.

3.2.1.4 SSP Data Assessment

Overview of the SSP Data Collected and Assessed

Data are collected by safety service patrol (SSP) program staff present at the scene of incidents.
Data collected include time and location of incident, type of incident, arrival and departure times, responder and response vehicle identification, supplies expended (e.g., gas, tire patch), and the type of assistance provided (e.g., refueling, repairing tire, calling tow vehicle) using either pre-established codes, keywords, or free text. Some SSP programs also request a response from the drivers/vehicles assisted in the form of a postcard survey or a request to complete an online survey with structured and unstructured data. Submitted data typically capture the quality and value of services provided.

The team requested data from three state SSP programs: the Maryland Department of Transportation's (MDOT's) Coordinated Highways Action Response Team (CHART), MnDOT's Freeway Incident Response Safety Team (FIRST), and the Washington State Department of Transportation's (WSDOT's) Incident Response (IR). The CHART data provided by MDOT were collected between January 1, 2020, and June 30, 2020; the Washington Incident Tracking System (WITS) data provided by WSDOT were collected between January 1, 2015, and December 31, 2020. UDOT's data could not be reviewed due to exporting issues. Figure 6 shows a map of WSDOT's WITS data.

SSP Data Assessment—Summary

A summary of the data assessment, including challenges, limitations, and recommendations, is provided in Table 4. Detailed findings of the SSP data assessment can be found in Appendix D posted with this report.

Source: FHWA, unpublished. Developed as part of the Using Data to Improve Traffic Incident Management initiative for FHWA Every Day Counts Round 4 (EDC-4).
Figure 5. California Highway Patrol incidents in CAD data feed (July 1, 2018, to February 19, 2019).

Table 2. Summary of ATMS data assessment.

Challenges/limitations:
• Areas of coverage typically limited to urban areas and/or major roadways.
• Most ATMSs are not set up to support real-time data analysis.
• Free-text field entries, which can include any type of text or numerical data, lead to inconsistencies.
• Matching incidents from the ATMS with historical crash data is difficult due to the lack of a common data element across systems.
• Actual versus intended use of the system can lead to inaccurate/erroneous data.
• Nonstandard/proprietary encoding (e.g., location CRS).

Consequences for use in TIM big data use cases:
• Little consistency (e.g., no unified standard for column names and data types, or no standardized cell values) between/within states. Thus, integrating data across ATMSs is challenging.
• Lack of TIM timestamps to calculate RCT and ICT.
• Most agencies can only provide historical data in batches on a by-request basis and lack real-time feeds to share their ATMS data.

Recommendations:
• Standardize data before combining data across systems/states.
• Add a common data element to the ATMS to tie to other data systems (e.g., crash, CAD).
• When common data elements do not exist, use alternative approaches, such as a temporal-spatial analysis, to deduplicate incidents.
• Do not use obscure/nonstandard (proprietary) location referencing systems.
• Move toward real-time ATMS data sharing.
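The temporal-spatial deduplication recommended for ATMS data (Table 2) can be sketched as a pairwise check: two records describe the same incident when they fall within a small distance and time window. The thresholds and field names below are illustrative assumptions, not values from the report; times are minutes since midnight for simplicity.

```python
import math

# Sketch of temporal-spatial deduplication for incident records that lack
# a common incident ID across systems. Thresholds and field names are
# illustrative assumptions.

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin((p2 - p1) / 2) ** 2
         + math.cos(p1) * math.cos(p2)
         * math.sin(math.radians(lon2 - lon1) / 2) ** 2)
    return 6371.0 * 2 * math.asin(math.sqrt(a))

def same_incident(a, b, max_km=0.5, max_minutes=15):
    """True when two records are close in both space and time."""
    close = haversine_km(a["lat"], a["lon"], b["lat"], b["lon"]) <= max_km
    near = abs(a["minutes"] - b["minutes"]) <= max_minutes
    return close and near
```

In practice the pairwise check would be restricted to candidate pairs (e.g., records bucketed by road segment and hour) to keep the comparison count manageable at big data volumes.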

Table 3. Summary of CAD data assessment.

Challenges/limitations:
• Data sharing is limited by potentially sensitive data.
• Different standards used across responder disciplines may or may not overlap and may be customized to the culture and habits of responder groups [e.g., the National Information Exchange Model (NIEM),1 the Global Justice XML Data Model (Global JXDM)2].
• Additional coding is required to integrate CAD data with modern systems.
• Commercial vendor datasets use different CRSs than most CAD systems, which complicates integration across systems. To integrate CAD data with these vendor datasets, the CAD data must first be re-projected to fit the referencing systems of the other datasets. This can be costly, especially when dealing with real-time systems.
• While the most common status updates are reported as 10-codes or 11-codes, less common status updates are reported using police lingo.

Consequences for use in TIM big data use cases:
• Timestamps that are important to TIM, particularly roadway clearance times, are missing.
• Additional text processing of the CHP CAD XML data was required before the data could be parsed by common XML tools and loaded into more easily managed formats, such as JavaScript Object Notation (JSON), that are usable by modern data analysis tools.

Recommendations:
• While many incidents were missing explicit TIM timestamps, a quick natural language analysis of the status-update text and the times at which updates were posted could be used to infer some of them. While this approach is not ideal for real-time data processing, it would allow more value to be extracted from CAD data.
• The CHP CAD data are published in XML format but follow a standard, most likely proprietary, that differs from the strict XML document standard. Using standard/common formats is recommended.
• Re-project CAD data to the World Geodetic System 1984 (WGS 84)3 to facilitate integration with other datasets.
• Before the text associated with CAD status updates can be parsed effectively, it is essential to understand the lingo. Knowing the 10- and 11-codes (police codes) alone is not sufficient.

1 NIEM is a common vocabulary that enables information exchange across diverse public and private organizations. See https://www.niem.gov/.
2 The Global Justice XML Data Model (Global JXDM) is an XML standard designed specifically for criminal justice information exchanges, providing law enforcement, public safety agencies, prosecutors, public defenders, and the judicial branch with a tool to effectively share data and information in a timely manner. See https://bja.ojp.gov/sites/g/files/xyckuh186/files/media/document/global_justice_xml_data_model_overview.pdf.
3 World Geodetic System 1984 (WGS 84) is a datum featuring coordinates that change with time. WGS 84 is defined and maintained by the U.S. National Geospatial-Intelligence Agency, https://earthinfo.nga.mil/index.php?dir=wgs84&action=wgs84.

3.2.1.5 Free Navigation App Data Assessment

Overview of Free Navigation App Data Collected and Assessed

Mobile navigation platforms allow users to report traffic-related events along their routes. By combining input from all users, these platforms create crowdsourced traffic information data feeds that far exceed the current ITS device coverage of transportation agencies, affording the ability to identify more incidents, along with their duration and impact, across a wider geographic expanse. Information submitted by mobile navigation app users includes real-time data on incidents such as crashes, construction/work zones, police presence, road hazards, traffic jams, and more. While most traffic alerts are generated by user inputs to the system, congestion alerts are derived automatically by comparing current and historical road condition/speed data. When traffic on a segment is traveling below the average historical speed for that time of day/week, the navigation platform classifies this as a slowdown and provides congestion alerts to users. Also captured is confirmation of this information by other users, through either a "thumbs-up" or "thumbs-down" response or detailed messages.
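The slowdown classification described above can be illustrated with a minimal sketch. The 0.6 threshold and all speed values are assumptions for illustration, not parameters disclosed by any navigation provider.

```python
# Sketch: flag a segment as a slowdown when its current speed falls below
# a fraction of the historical average for that day-of-week/hour.
# The 0.6 threshold and the speeds below are illustrative assumptions.
def is_slowdown(current_mph, historical_avg_mph, threshold=0.6):
    """True when current speed is below threshold x the historical average."""
    return current_mph < threshold * historical_avg_mph

# Historical averages keyed by (day_of_week, hour); values are illustrative.
historical = {("Mon", 8): 52.0, ("Mon", 12): 61.0}

print(is_slowdown(24.0, historical[("Mon", 8)]))   # → True  (24 < 0.6 * 52 = 31.2)
print(is_slowdown(55.0, historical[("Mon", 12)]))  # → False
```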

Table 4. Summary of SSP data assessment.

Challenges/limitations:
• Traditional, less rigorous data management (e.g., spreadsheet files stored in shared network folders and managed manually) may make content difficult to ingest and analyze. As data file formats evolve and improve, new formats may not be retroactively applied to previously created data files, leading to content that is nonuniform and difficult to analyze. In some cases, retrofitting a new data format is not possible because the historical data are less precise than the new format requires.
• There can be structure and content inconsistencies in what data were recorded for each incident (e.g., data split across multiple files; various free-text descriptions of incidents versus standardized data types versus a mix of standardized data types and free text). Variations of the same information lead to data inconsistencies.
• Data may not conform to industry best practices (e.g., data types; data file/export formats; guidance/automatic checks at the interface or database level to ensure consistency).
• Excel or CSV files lack metadata (e.g., the reference system used for latitudes/longitudes), which requires additional documentation to parse and ingest the data correctly.
• Each state creates its own taxonomy to categorize or classify incident response, which requires mapping categories before merging datasets from multiple states. The SSP data may also lack identifying information that would enable easy association with other datasets (such as crash reports), since the SSP data originate from entirely different systems.

Consequences for use in TIM big data use cases:
• Data collected from paper forms or radio communication lack precise location information and are of lower quality (e.g., misspelled words, nonexistent categories, non-standardized abbreviations, custom narratives). This lower quality requires complex analysis to correct content and attempt to standardize the "fuzzy" content; even then, the resulting content may be less precise and therefore less valuable.

Recommendations:
• Consider more modern ways to collect service patrol data (e.g., CAD, mobile phone/tablet apps). These tools are becoming more relevant and capture data in real time using a more structured and strict data collection process.

Figure 6. WSDOT WITS incident response from 2015 to 2020 by incident type. (Incident types shown: Abandoned Vehicle, Fatal Crash, Injury Crash, Non-Injury Crash, Debris, Disabled Vehicle, Other, Police Activity.) Source: © 2021 Mapbox; © OpenStreetMap.
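The taxonomy-mapping step noted in Table 4 (each state categorizes incident response differently) can be sketched as a lookup into a shared category set. All state labels and common category names below are hypothetical examples, not taken from any state's SSP data.

```python
# Sketch: map each state's SSP incident-type labels onto a shared taxonomy
# before merging datasets. All labels and categories are hypothetical.
COMMON = {
    # state label (lowercased) -> common category
    "abandoned vehicle": "abandoned_vehicle",
    "abnd veh": "abandoned_vehicle",
    "disabled vehicle": "disabled_vehicle",
    "stalled car": "disabled_vehicle",
    "debris": "debris",
    "road debris": "debris",
}

def normalize(state_label):
    """Return the common category, or 'other' for unmapped labels."""
    return COMMON.get(state_label.strip().lower(), "other")

merged = [normalize(x) for x in ["Abnd Veh", "Stalled Car", "Police Activity"]]
print(merged)  # → ['abandoned_vehicle', 'disabled_vehicle', 'other']
```

A real mapping table would be built jointly with each program and reviewed for "other" leakage before the datasets are merged.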

The team was able to obtain access to real-time navigation app data feeds for California, Massachusetts, Minnesota, and Utah to support the selected use cases. In addition, the team had access to a historical nationwide navigation app dataset ranging from August 1, 2012, to February 14, 2017, as well as access to U.S. DOT's national navigation app data archive, which contains data for more than 623 million alerts, starting in 2017.

Free Navigation App Data Assessment—Summary

A summary of the data assessment, including challenges, limitations, and recommendations, is provided in Table 5. Detailed findings of the free navigation app data assessment can be found in Appendix E, posted with this report.

Table 5. Summary of free navigation app data assessment.

Challenges/limitations:
• The times associated with historical or real-time traffic alert updates are not accurate; more than 99 percent of collected alerts have an "update time" identical to the original alert "start time." The 1 percent with an actual update time are mostly construction activities. Analyses that could lead to reliable traffic insights would require a reliable alert update time.
• Data geocoding is inconsistent and unreliable (e.g., blank or erroneous latitude/longitude). If traffic alerts are not mapped (latitude/longitude) with enough precision, they may be snapped to the opposite side of the roadway. This was found to be most prevalent where ARNOLD showed a divided roadway.
• The free navigation app data provider offers limited insight into how the quality metrics are calculated for each of its alert types, namely a "reliability measure," a "confidence measure," a "report rating," and counts of "thumbs-up" responses from users.
• The free navigation app data provider follows its own specification, which is simple and lacks the details found in other specifications (e.g., roadway name and type, as well as the confidence and reliability indices, are arbitrary).

Consequences for use in TIM big data use cases:
• The imprecise times and locations within the free navigation app data make it challenging to integrate these data with other incident datasets, such as crash report and ATMS data, because fuzzy matching resulted in multiple potential matches. Rules or algorithms need to be devised to ensure the correct event is identified for matching.

Recommendations:
• Snap the traffic alerts to an existing network reference system (e.g., a state LRS) using latitude/longitude instead of using the road location information from the data provider.
• Counts of thumbs-up responses are a more reliable indicator of alert veracity than the "reliability measure," "confidence measure," and "report rating."
• Deduplication of navigation app alerts is necessary, as incoming alerts are repeated in each delivery without direct tracking or established relationships between alerts.

3.2.2 Traffic Data

The team obtained a variety of traffic data to augment the incident data. Primarily, the traffic data included speed, volume, and occupancy data originating from DOT ITS fixed sensors, as well as probe vehicle data collected by third-party vendors. This section provides an overview of the data obtained by the team, where/how they were obtained, and the results of the data quality assessments.

3.2.2.1 DOT ITS Fixed Sensor Data Assessment

Overview of the DOT ITS Fixed Sensor Data Collected and Assessed

Fixed sensor data within DOT intelligent transportation systems (ITS) originate from a variety of sensor technologies, including inductive loop detectors, magnetic sensors and detectors, video image processors, microwave radar sensors, laser radars, and passive infrared and passive acoustic array sensors. Certain detectors give direct information concerning vehicle passage and presence, while other traffic flow parameters, such as density and speed, are inferred by algorithms that interpret or analyze the measured data. Data elements collected include date, time,
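The deduplication recommended in Table 5 must work across deliveries in which the same alert reappears with no stable identifier. One hedged approach is to derive a key from attributes that should not change between deliveries; the alert field names and the rounding precision below are assumptions for illustration, not the provider's schema.

```python
# Sketch: navigation app alerts arrive repeated in every delivery with no
# cross-delivery ID, so derive a stable key from attributes that do not
# change between deliveries. Field names are illustrative assumptions.
def alert_key(alert):
    """Stable key: type + start time + coordinates rounded to ~100 m."""
    return (alert["type"], alert["start_time"],
            round(alert["lat"], 3), round(alert["lon"], 3))

def merge_deliveries(deliveries):
    """Combine repeated deliveries, keeping the first sighting of each alert."""
    seen = {}
    for delivery in deliveries:
        for alert in delivery:
            seen.setdefault(alert_key(alert), alert)
    return list(seen.values())

d1 = [{"type": "ACCIDENT", "start_time": "2021-05-01T08:00Z", "lat": 44.9771, "lon": -93.2650}]
d2 = [{"type": "ACCIDENT", "start_time": "2021-05-01T08:00Z", "lat": 44.97712, "lon": -93.26502},
      {"type": "HAZARD", "start_time": "2021-05-01T08:02Z", "lat": 44.9800, "lon": -93.2700}]
print(len(merge_deliveries([d1, d2])))  # → 2
```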

sensor ID, roadway ID, direction, annual average daily traffic (AADT), truck AADT, volumes (vehicles per minute), speed, occupancy, and vehicle classification.

DOT ITS fixed sensor data were received from California's Performance Measurement System (PeMS) (California Department of Transportation, 2023); Florida's Regional Integrated Transportation Information System (RITIS) (University of Maryland Center for Advanced Transportation Technology, 2023); and the Ohio Department of Transportation (provided as a data dump). Maps of the fixed sensor locations in California (over 18,000 locations), Florida (over 14,000 locations), and Ohio (158 locations) are shown in Figure 7.

Figure 7. Fixed sensor data from California (top left), Ohio (top right), and Florida (bottom). Each map panel includes a legend of districts and the number of sensors per district. Note: FTE = Florida's Turnpike Enterprise; MDX = Miami-Dade Expressway Authority; and OOCEA = Orlando–Orange County Expressway Authority. Source: © OpenStreetMap—Basemap, used under the Creative Commons Attribution-ShareAlike 2.0 License (CC BY-SA 2.0), https://creativecommons.org/licenses/by-sa/2.0/legalcode (no changes made).

DOT ITS Fixed Sensor Data Assessment—Summary

A summary of the data assessment, including challenges, limitations, and recommendations, is provided in Table 6. Detailed findings of the DOT ITS fixed sensor data assessment can be found in Appendix F, posted with this report.

Table 6. Summary of DOT ITS fixed sensor data assessment.

Challenges/limitations:
• The density of sensor locations (across a state and along routes) affects the amount of data available.
• Sensors fail and are not always well maintained, which can lead to missing data.
• Data quality is inconsistent across states and within states (e.g., across districts).

Consequences for use in TIM big data use cases:
• The California Department of Transportation detector station data, provided by PeMS, are available in raw form for the previous day, which is not timely enough for applications that require real-time or near real-time data.
• The distance of the sensors from many of the crash/incident locations in all the states assessed reduces the usefulness of the data.
• Even though terabytes of California sensor data are available through the PeMS portal, downloading and processing the data is difficult. Files must be batch-downloaded and moved to a data lake or other "big data" environment to be used in analyses. The same is true for states like Ohio that provided a historical dump of the data.

Recommendations:
• Provide a live feed of the data to improve its use in data pipelines.
• Routinely check for data quality (e.g., whether sensors are working) and consistency issues.

3.2.2.2 Probe Vehicle Data Assessment

Overview of the Probe Vehicle Data Collected and Assessed

Probe vehicle data are generated by monitoring the position of individual vehicles (i.e., probes) over space and time instead of measuring the characteristics of vehicles (or groups of vehicles) at a specific place and time (e.g., with fixed roadway sensors).
Most probe vehicle data in use by transportation agencies are captured through GPS-enabled mobile devices (smartphone apps). These apps track vehicle movements based on the radio signals transmitted by cell phones and, more recently, based on data captured from CVs.

Probe vehicle data assessed for this project came from the National Performance Management Research Data Set (NPMRDS) (FHWA, 2022) via the RITIS platform (University of Maryland Center for Advanced Transportation Technology, 2023). The NPMRDS is provided via the RITIS platform, free of charge, to states, metropolitan planning organizations, and local agencies to support MAP-21 regulations and ongoing transportation system mobility performance measurement. The RITIS platform includes a data downloader tool within its suite of products, which allows users to manually filter and download probe vehicle data at a five-minute aggregation (the finest temporal resolution available) and spatially by Traffic Message Channel codes (about 1/2- to 1-mile roadway segments in urban/suburban areas and 5- to 10-mile segments in rural areas) across the National Highway System (NHS).

The team accessed the historical NPMRDS data through the RITIS platform for the states of interest. Figure 8 shows the road segments for the NPMRDS data based on a geographic information systems (GIS) shapefile from RITIS.

Probe Vehicle Data Assessment—Summary

A summary of the data assessment, including challenges, limitations, and recommendations, is provided in Table 7. Detailed findings of the probe vehicle data assessment can be found in Appendix G, posted with this report.
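The five-minute aggregation described above can be sketched by binning discrete probe observations per segment. The record layout and the Traffic Message Channel-style segment code below are illustrative assumptions, not the NPMRDS schema.

```python
# Sketch: aggregate discrete probe speed observations into the five-minute
# bins used by NPMRDS-style datasets. Record layout is an assumption.
from collections import defaultdict
from datetime import datetime

def five_min_bin(ts):
    """Floor a timestamp to the start of its five-minute bin."""
    return ts.replace(minute=ts.minute - ts.minute % 5, second=0, microsecond=0)

def aggregate(observations):
    """Mean speed per (segment, five-minute bin) from (seg, ts, mph) tuples."""
    sums = defaultdict(lambda: [0.0, 0])
    for seg, ts, mph in observations:
        key = (seg, five_min_bin(ts))
        sums[key][0] += mph
        sums[key][1] += 1
    return {k: s / n for k, (s, n) in sums.items()}

obs = [
    ("110+04588", datetime(2021, 6, 1, 7, 1), 58.0),
    ("110+04588", datetime(2021, 6, 1, 7, 4), 62.0),
    ("110+04588", datetime(2021, 6, 1, 7, 6), 40.0),
]
result = aggregate(obs)
print(result[("110+04588", datetime(2021, 6, 1, 7, 0))])  # → 60.0
```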

Source: University of Maryland Center for Advanced Transportation Technology (2023). © OpenStreetMap—Basemap, used under the Creative Commons Attribution-ShareAlike 2.0 License (CC BY-SA 2.0), https://creativecommons.org/licenses/by-sa/2.0/legalcode (no changes made).
Figure 8. NPMRDS road segments available for the contiguous United States.

Table 7. Summary of probe vehicle data assessment.

Challenges/limitations:
• While the Traffic Message Channel road network used by NPMRDS covers the nation, the number of road segments returned when querying the NPMRDS data varies (sometimes drastically) depending on the location or state and the time interval for which the data are requested.
• Many values are imputed (i.e., missing values are estimated and inserted into the dataset) because of limited data for some periods of time or locations.
• The Traffic Message Channel network is coarse compared to ARNOLD. Some Traffic Message Channel segments simplify the road network to the point of ignoring two or more intersections and crossroads, which makes the provided speed for a segment imprecise and makes it difficult to extrapolate the speed near a particular point or along the road. Traffic Message Channel road segments are simplified to straight lines between two intersections and can deviate several blocks from the actual road.

Consequences for use in TIM big data use cases:
• Some roadways/segments with crashes/incidents were not on the NHS and therefore could not be enriched with NPMRDS data.
• Neither NPMRDS nor the data provided through the third-party probe vehicle tool being assessed are readily available for big data analysis (neither offers massive quantities of discrete data). With the expected limitations of download times and recording intervals, real-time analysis is not possible.
• Timely automated processing of the data is not possible, since the process is manual and requires multiple requests to the system to obtain the necessary data.
• Due to the coarseness of the Traffic Message Channel network, it is challenging to connect some crashes to road segments in the network.

Recommendations:
• For some segments, snapping methods more sophisticated than a simple range search need to be employed to connect crash/incident locations to Traffic Message Channel segments. Some manual intervention may be required to connect some crashes/incidents to Traffic Message Channel road segments.
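The snapping of crash locations to road segments recommended in Table 7 starts from point-to-segment distance. The sketch below uses a flat x/y approximation (adequate only over short distances); the segment IDs and coordinates are invented for illustration.

```python
# Sketch: snap a crash location to the nearest road segment by
# point-to-segment distance on a flat plane. IDs and coordinates are
# illustrative; production code would work in a projected CRS.
def _closest_point_on_segment(px, py, ax, ay, bx, by):
    """Project (px, py) onto segment (a, b), clamped to the endpoints."""
    abx, aby = bx - ax, by - ay
    denom = abx * abx + aby * aby
    t = 0.0 if denom == 0 else max(0.0, min(1.0, ((px - ax) * abx + (py - ay) * aby) / denom))
    return ax + t * abx, ay + t * aby

def snap(point, segments):
    """Return (segment_id, squared_distance) of the nearest segment."""
    px, py = point
    best = None
    for seg_id, (a, b) in segments.items():
        cx, cy = _closest_point_on_segment(px, py, a[0], a[1], b[0], b[1])
        d2 = (px - cx) ** 2 + (py - cy) ** 2
        if best is None or d2 < best[1]:
            best = (seg_id, d2)
    return best

segments = {
    "TMC-A": ((0.0, 0.0), (1.0, 0.0)),  # east-west segment
    "TMC-B": ((0.0, 1.0), (1.0, 1.0)),  # parallel segment one unit north
}
print(snap((0.5, 0.2), segments)[0])  # → TMC-A
```

More sophisticated versions would add direction-of-travel checks and a maximum snap distance so points far from any segment are flagged for manual review.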

3.2.3 Location Reference Data

The team needed access to reference data to facilitate the integration of incident, traffic, and weather data. There are many diverse types of reference data. Some are maintained by the state DOTs, including LRS and roadway inventory data, while others are available via third parties. FHWA developed ARNOLD to overcome the lack of a nationally endorsed or industry-wide LRS standard. This section summarizes the assessments conducted on a range of location reference data for this project.

3.2.3.1 Roadway Inventory Assessment

Overview of the Roadway Inventory Data Collected and Assessed

Roadway inventory datasets contain extensive information about roadway segments, including characteristics such as physical curvature, lane types and widths, pavement types, connected access roads, roadside descriptors, and interchange and ramp descriptors, among others. More advanced data management practices maintain these data in an integrated GIS platform that allows latitude/longitude coordinates to be mapped to the appropriate roadway and milepost. The data range from tables to computer-aided design drawings to geospatial vector data.

A roadway inventory includes data elements that represent the administrative characteristics (e.g., names, route numbers, truck restrictions); roadway characteristics (e.g., signs, number and type of lanes, lane/shoulder width, pavement condition/roughness, curve information); traffic characteristics (e.g., volumes, posted speed limits); and other characteristics (e.g., adjacent land use, driveways, bridge width) of a roadway. For inventory purposes, each element must have two types of data: georeferenced geometry (i.e., the location of the element in space: latitude, longitude, and altitude) and descriptive data (i.e., length, width, height, and condition). Transportation agencies maintain and regularly update vast inventories of roadway elements.
States submit roadway inventory data annually to FHWA as part of the Highway Performance Monitoring System (HPMS) program (FHWA, 2021b), and most states make these data readily available in a variety of common GIS and tabular formats. The team obtained roadway inventory data from Colorado, Massachusetts, and Tennessee. Figure 9 shows locations in Tennessee that have a pavement roughness value in the roadway inventory data.

Figure 9. Location of "pavement roughness" records for Tennessee. Source: © OpenStreetMap—Basemap, used under the Creative Commons Attribution-ShareAlike 2.0 License (CC BY-SA 2.0), https://creativecommons.org/licenses/by-sa/2.0/legalcode (no changes made).

Roadway Inventory Data Assessment—Summary

A summary of the data assessment, including challenges, limitations, and recommendations, is provided in Table 8. Detailed findings of the roadway inventory data assessment can be found in Appendix H, posted with this report.

Table 8. Summary of roadway inventory data assessment.

Challenges/limitations:
• Data access requests can take several months to fulfill.
• Data quality, delivery, and accuracy vary widely across agencies. Even when agencies provide their GIS data via a website or online portal, those data often lack the granular location measurements necessary for use as a linear reference. Other agencies may provide basic inventory data online, with or without measurements, but do not publish detailed data.
• Roadway inventory data are often incomplete and lack consistent data management (e.g., in one state, several versions of the same roadway segments coexist without any reliable way to identify which is the most current).
• Road segments, and metadata for road segments, are often missing.
• Inconsistencies exist between highway/arterial metadata and local road metadata when the data come from different agencies using different standards, nomenclature, and precision requirements.

Consequences for use in TIM big data use cases:
• Delays in updating the data reduce their value; outdated road inventory data become difficult to integrate with other commercial or public data that include more up-to-date location information.

Recommendations:
• Conversions may be needed so that the data can be used with latitude/longitude coordinates in other datasets. While the data typically conform to best practices for data typing and formatting, it is important to check for odd date formats or geometries that are not in the most common CRS (e.g., WGS 84).
• Standardize roadway inventory data so they can be combined with data from other states. The Model Inventory of Roadway Elements (MIRE) (FHWA, n.d.-a) is a recommended listing of roadway inventory and traffic elements critical to safety management.

3.2.3.2 LRS Data Assessment

Overview of the LRS Data Collected and Assessed

Linear referencing is a method of spatial referencing in which the locations of physical features along a linear element are described in terms of measurements from a fixed point, such as a mile point along a road. Linear referencing system (LRS) data provided by state DOTs include the geometries and other metadata for routes throughout the state. These data are typically limited in scope to interstate, U.S., and state highways. An LRS provides a way to store and display spatial events along linear networks as tabular data; this allows disparate data to be stored and related without segmenting and subdividing the underlying centerline data or performing complex geospatial searches. All states use LRSs, and these data are available. However, the data are often not updated and are therefore sometimes inaccurate.

The team obtained LRS data from MnDOT and the Massachusetts Department of Transportation (MassDOT). Figure 10 shows routes included in the MnDOT LRS data.

Figure 10. Geographic representation of routes in MnDOT's LRS data.

LRS Data Assessment—Summary

A summary of the data assessment, including challenges, limitations, and recommendations, is provided in Table 9. Detailed findings of the LRS data assessment can be found in Appendix I, posted with this report.

Table 9. Summary of LRS data assessment.

Challenges/limitations:
• LRS data may be inaccurate or may not include local roads.
• LRS data may not be up to date.
• In some cases, the geometry of a route does not line up exactly with the road; this is more common, though not exclusively so, where there are curves or sharp changes in direction. This causes problems when trying to snap points within a certain range, as the distance of the LRS offset can exceed the snap range.
• There are some inconsistencies in how metadata are expressed across highways, arterials, and local roads.
• LRS data may use a different CRS than the typical WGS 84 system.

Consequences for use in TIM big data use cases:
• State LRS location data are difficult, and sometimes impossible, to match to data collected on new road segments that do not align with old ones.

Recommendations:
• Take care to ensure that the same CRS is used across datasets.

3.2.3.3 Third-Party Road Network API Assessment

Overview of the Third-Party Road Network API Assessed

The third-party application programming interface (API) that was assessed identifies the roads on which a vehicle travels and provides additional metadata about those roads, such as road type and speed limit. The road network API allows GPS coordinates to be mapped to the geometry of a road segment and an address range. This could allow traffic incidents to be associated with common road segments on the third-party map road network. The API offers two relevant services to perform this action: "Snap to Roads" and "Nearest Road." These services help associate latitude and longitude coordinates with road segments in the provider's ecosystem. These road segments are defined by place IDs, which can be used across the provider's other geographical tools.

While not an actual dataset, as the provider does not publish its proprietary road network, the API was assessed to evaluate whether it could provide data similar to a state roadway inventory and at what quality level. To assess the provider's data, the team performed several calls to the API using known crash locations and times; the returned data were then analyzed and compared to crash report information.
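The linear referencing concept described in Section 3.2.3.2 can be illustrated by interpolating a milepost along a route's measured vertices. The route geometry and measures below are invented for illustration, not taken from any state's LRS.

```python
# Sketch of linear referencing: locate an event given a route and a
# milepost by interpolating along the route's measured vertices.
# Route geometry and measures are illustrative assumptions.
def locate(route, milepost):
    """Interpolate (lat, lon) for a milepost along [(measure, lat, lon), ...]."""
    for (m0, lat0, lon0), (m1, lat1, lon1) in zip(route, route[1:]):
        if m0 <= milepost <= m1:
            t = (milepost - m0) / (m1 - m0)
            return (lat0 + t * (lat1 - lat0), lon0 + t * (lon1 - lon0))
    raise ValueError("milepost outside route measures")

# Hypothetical route: three vertices with milepost measures 0.0, 2.0, 5.0.
route = [(0.0, 44.95, -93.30), (2.0, 44.97, -93.28), (5.0, 45.00, -93.25)]
print(tuple(round(v, 4) for v in locate(route, 1.0)))  # → (44.96, -93.29)
```

The inverse operation (coordinate to route/milepost) is what snapping tools perform, and it is where the offset and CRS issues noted in Table 9 arise.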

Third-Party Road Network API Assessment—Summary

A summary of the data assessment, including challenges, limitations, and recommendations, is provided in Table 10. Detailed findings of the third-party road network API assessment can be found in Appendix J, posted with this report.

Table 10. Summary of third-party road network API assessment.

Challenges/limitations:
• Because the "Nearest Road" service does not accept a heading parameter, it may be difficult to attach a coordinate to the correct road segment in some places.
• The data are mostly useful within the provider's ecosystem, as the API returns a proprietary "place ID." The place ID does not translate well to other datasets, nor is it easy to map Traffic Message Channel codes or other road encodings to the provider's place ID.
• The team was not able to assess the completeness of the data, as queries to the API are limited for non-paying users.
• How long it takes for new road segments to be reflected in the data is unknown.
• The API can be expensive to use (starting at $0.01 per request).

Consequences for use in TIM big data use cases:
• The third-party road network API was deemed ill-suited for network conflation because the output is so closely tied to the provider's data and products. The proprietary nature of the data makes it difficult to use for conflating datasets by their locations.

Recommendations:
• Assessment of the data was limited by restricted access; analysts interested in the potential of this third-party road network API might consider paying for access in order to fully assess its potential to support big data use cases.

3.2.4 SharedStreets Referencing System/OpenStreetMap Assessment

3.2.4.1 Overview of the SharedStreets Referencing System

The SharedStreets Referencing System (Open Transport Partnership, 2020) is a global, nonproprietary system for describing streets based on the road network data available in the crowdsourcing project OpenStreetMap (OSM) (https://www.openstreetmap.org). OSM is a collaborative, crowdsourced effort to create an editable map of the world. OSM is free and open to use with proper credits/attributions to OSM and its contributors. The SharedStreets Referencing System is the foundation of the SharedStreets Toolkit and is used to connect a wide range of street-linked data. SharedStreets provides an abstraction layer on top of OSM "ways" and allows the data to be consumed as protocol buffer tiles or JSON.

3.2.4.2 SharedStreets Referencing System Assessment—Summary

A summary of the data assessment, including challenges, limitations, and recommendations, is provided in Table 11. Detailed findings of the SharedStreets Referencing System assessment can be found in Appendix K, posted with this report.

Table 11. Summary of SharedStreets Referencing System assessment.

Challenges/limitations:
• SharedStreets relies on well-defined bearings. When snapping or matching arbitrary points to SharedStreets points, the accuracy of the results depends heavily on the accuracy of the bearing provided with the original coordinates.
• The tools that SharedStreets provides do not appear to be regularly updated, particularly the "sharedstreets-js" package. The tool has some bugs that cause batched requests to fail completely.

Consequences for use in TIM big data use cases:
• SharedStreets has a low rate of success in snapping points and segments to the OSM network.

Recommendations:
• For SharedStreets to be used successfully, users must develop a plan to split up failed batch requests and retry them in parts, or patch the software themselves.
• SharedStreets may not be the best choice for datasets that do not have accurate bearings specified (e.g., most crash reports).
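The batch-splitting plan recommended in Table 11 can be sketched as a bisect-and-retry helper: when a batched request fails, split it and retry the halves so one bad item cannot sink the whole batch. `send` and `flaky_send` below are hypothetical stand-ins, not part of the SharedStreets toolkit.

```python
# Sketch: retry a failed batch request by bisection, isolating the items
# that cannot be processed. `send` is a hypothetical batch-call stand-in.
def submit_with_bisection(items, send):
    """Return results for every item `send` can process, skipping failures."""
    try:
        return send(items)
    except Exception:
        if len(items) == 1:
            return []  # irrecoverable single item: skip it
        mid = len(items) // 2
        return (submit_with_bisection(items[:mid], send)
                + submit_with_bisection(items[mid:], send))

def flaky_send(batch):
    """Toy backend that rejects any batch containing the item 'bad'."""
    if "bad" in batch:
        raise RuntimeError("batch failed")
    return [f"snapped:{x}" for x in batch]

print(submit_with_bisection(["a", "b", "bad", "c"], flaky_send))
# → ['snapped:a', 'snapped:b', 'snapped:c']
```

Logging the skipped items, rather than silently dropping them, would let an analyst review whether a software patch is warranted.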

3.2.5 ARNOLD Data Assessment

3.2.5.1 Overview of the ARNOLD Data Collected and Assessed

On August 7, 2012, FHWA expanded the requirement for state DOTs to include all public roads in their LRSs as part of the HPMS. As previously mentioned, this requirement is referred to as the All Road Network of Linear Referenced Data (ARNOLD) (FHWA, 2014). ARNOLD is a nationwide, all-roadway network provided by FHWA and derived from data collected by state DOTs. ARNOLD consists of the locations of all roads in the United States and a limited set of road segment attribute data. ARNOLD is intended to combine two concepts commonly found in road network representation: 1) a graph representation connecting individual intersections and road segments, used primarily to support routing, navigation, traffic flow analysis, and so on; and 2) an LRS representing routes as a measured set of contiguous road segments belonging to a unique route identifier (i.e., the system traditionally used by DOTs).

The types of information that ARNOLD is designed to provide for each of its road segments are
• Road centerline geometry,
• Basic road attributes (e.g., road names),
• Address ranges,
• LRS control, and
• Network topology (to allow routing).

ARNOLD is designed so that information from multiple directions can be found in its representation of road networks. The ARNOLD data are public and can be downloaded from the FHWA website in geodatabase format, with one geodatabase for each state and the District of Columbia. The team reviewed ARNOLD data for Colorado, Virginia, and Utah. As an example, ARNOLD data for Utah are shown in Figure 11.

3.2.5.2 ARNOLD Data Assessment—Summary

A summary of the data assessment, including challenges, limitations, and recommendations, is provided in Table 12. Detailed findings of the ARNOLD data assessment can be found in Appendix L, posted with this report.
3.2.6 Weather Data

The team considered several weather data sources for development of the use cases. Weather is an important aspect of traffic incidents; enriching the incident data could therefore help identify weather factors and patterns associated with traffic incidents and with response efficiency. Historical, real-time, and forecasted weather data are available from public and private sources. The National Oceanic and Atmospheric Administration's (NOAA's) Meteorological Assimilation Data Ingest System (MADIS) (NOAA, 2018) and FHWA's Weather Data Environment (WxDE) (FHWA, n.d.-b) provide access to weather data directly through files or APIs. Commercial weather data aggregators provide access to weather data through APIs, web pages, and mobile apps. The weather data available through these services vary widely, from detailed raw sensor or model data augmented with quality metrics to more practical—yet less precise—weather data composed of standardized measures (e.g., air temperature, precipitation type and intensity, wind speed and direction) delivered by commercial APIs. The team collected and reviewed weather data from MADIS; the WxDE, including road weather information system (RWIS) data; and a third-party weather API. The following subsections detail the assessments of these data sources.

Source: © OpenStreetMap—Basemap, used under the Creative Commons Attribution-ShareAlike 2.0 License (CC BY-SA 2.0), https://creativecommons.org/licenses/by-sa/2.0/legalcode (no changes made).
Figure 11. ARNOLD data for Utah.

3.2.6.1 MADIS Data Assessment

Overview of the MADIS Data

MADIS is maintained by the National Centers for Environmental Prediction (NCEP), part of NOAA's National Weather Service (NWS). MADIS is a meteorological observational database and data delivery system that provides integrated, quality-controlled datasets to the global meteorological community. Figure 12 shows the density of MADIS Mobile Platform Environmental Data (MoPED) from October 2014 (NOAA, 2017).

NCEP collects data for MADIS from NOAA data sources and non-NOAA providers, and MADIS ingests the raw observation data either in batches or in real time. The data arrive in many different formats and contain observations in a variety of units and time zones. MADIS ingests all raw data files, combines the observations from non-NOAA data providers, and integrates them with NOAA datasets by encoding them into a uniform format and converting all observations to standard units and timestamps. After the observation data are standardized in MADIS, they are augmented with metadata describing the sources from which they originate, down to the individual weather station and sensor IDs. Figure 13 shows a breakdown of the number of records per data source provided to MADIS for a sample of data sources available on September 3, 2009.
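The standardization step described above (converting heterogeneous provider observations to uniform units and UTC timestamps) can be sketched in a few lines. The record fields (`temp`, `temp_unit`, `time`) are illustrative, not MADIS's actual schema.

```python
from datetime import datetime, timezone

def fahrenheit_to_celsius(f):
    return (f - 32.0) * 5.0 / 9.0

def normalize_observation(obs):
    """Convert a raw provider observation into a uniform record: temperature
    in degrees C and timestamp in UTC. Field names are illustrative."""
    temp_c = obs["temp"] if obs["temp_unit"] == "C" else fahrenheit_to_celsius(obs["temp"])
    # The incoming timestamp is assumed to carry its own UTC offset.
    local = datetime.fromisoformat(obs["time"])
    return {
        "temp_c": round(temp_c, 2),
        "time_utc": local.astimezone(timezone.utc).isoformat(),
    }

raw = {"temp": 68.0, "temp_unit": "F", "time": "2014-10-01T08:00:00-06:00"}
# normalize_observation(raw)
#   -> {'temp_c': 20.0, 'time_utc': '2014-10-01T14:00:00+00:00'}
```

A production ingest would handle many more units and malformed timestamps, but the shape of the problem is the same: every provider-specific representation is mapped onto one canonical record before quality control runs.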

Table 12. Summary of ARNOLD data assessment.

Challenges/limitations
• The geometric accuracy of ARNOLD is not ideal (i.e., simplified geometries and missing segments). ARNOLD is only as good as the state data it depends on and is published several years after the data are collected, which affects its accuracy.
• The ARNOLD geometry is generally accurate enough to be conflated with other geo-datasets that have "close enough" road segments. However, integrating ARNOLD with datasets that have less accurate or more simplistic road geometries will be much more difficult, as it will require additional steps to match road segment metadata (e.g., common road name, directional/flow indicator, road type, mileage/road measure) that were missing from ARNOLD at the time of the assessment.
• Because ARNOLD is a compilation of LRS routes, topology as a measure of connected segments is completely missing.

Consequences for use in TIM big data use cases
• The team considered ARNOLD as the reference data for the use cases, as it was deemed the best source of network data available at the time for integrating the data for pipelines.
• Additional data processing was required to ensure consistency across states. The completeness and timeliness of ARNOLD data depend on state submissions. Not all fields are consistently populated across roadway types or locations, and the states update their ARNOLD data on different schedules and with varying frequency.

Recommendations
• ARNOLD would benefit from adding a flow direction/heading attribute to the road segments. This would avoid having to calculate the flow direction for each heading to be matched during integration efforts.
• Given that topology is not included in ARNOLD, any process that relies on connecting nodes is suspect. As such, fuzzy tolerances should be used when searching for subsequent segments.
• ARNOLD is on the right track and has potential, but it requires more frequent updates to remain relevant when commercial alternatives update roadway data worldwide every month or two.
• The ARNOLD road segment metadata would benefit from improved integrability with other datasets, especially upcoming datasets such as CV and Internet of Things data. These data will be based on road networks and base maps optimized for commercial purposes rather than those of government agencies.

Source: NOAA (2017).
Figure 12. MADIS Mobile Platform Environmental Data (MoPED) density in October 2014.
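The fuzzy-tolerance matching recommended in Table 12 for reconstructing topology can be sketched as endpoint clustering: two segment endpoints are treated as the same network node when they fall within a distance tolerance. The coordinates and tolerance below are illustrative, and the naive O(n²) scan would be replaced by a spatial index at ARNOLD's scale.

```python
from math import hypot

def build_topology(segments, tol=5.0):
    """Cluster segment endpoints into shared nodes using a distance
    tolerance `tol` (in the same units as the coordinates). Returns one
    (start_node, end_node) index pair per input segment."""
    nodes = []

    def node_for(pt):
        # Reuse an existing node if this endpoint is within tolerance of it.
        for i, n in enumerate(nodes):
            if hypot(pt[0] - n[0], pt[1] - n[1]) <= tol:
                return i
        nodes.append(pt)
        return len(nodes) - 1

    return [(node_for(a), node_for(b)) for a, b in segments]

# Two segments whose shared endpoint differs by ~1.4 units -- within tolerance,
# so they are joined at a common node:
segs = [((0, 0), (100, 0)), ((101, 1), (200, 0))]
# build_topology(segs) -> [(0, 1), (1, 2)]
```

Once segments share node indices, ordinary graph traversal (routing, finding the next segment along a route) becomes possible even though the source LRS data carried no explicit connectivity.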

Following the addition of traceability data to the MADIS data, the unified observations undergo a series of static and dynamic quality checks, which are added as flags to each MADIS observation to indicate its quality from a variety of perspectives (e.g., temporal consistency, spatial consistency). MADIS data are then stored in the MADIS database and made available either through a web map tool (shown in Figure 14) (NOAA, 2021) or through a download service that provides a bulk file and requires registration. The bulk file download uses Network Common Data Form (NetCDF), a set of software libraries and data file formats.

MADIS Data Assessment—Summary

A summary of the data assessment, including challenges, limitations, and recommendations, is provided in Table 13. Detailed findings of the MADIS data assessment can be found in Appendix M posted with this report.

3.2.6.2 Road Weather Data/WxDE Assessment

Overview of WxDE

Road weather data provide information about the safety and mobility impacts of weather events on the road. Road weather data are collected at roadway locations via environmental sensor stations (ESS) known as road weather information systems (RWISs), and they can include atmospheric conditions (e.g., air temperature, visibility distance, wind speed), pavement conditions (e.g., temperature, condition, chemical concentration), and water level conditions (e.g., lake levels near roads).

Clarus was the first attempt to standardize road weather data across states and regions as the basis for timely, accurate, and reliable weather and road condition information (https://www.its.dot.gov/research_archives/clarus/index.htm). Clarus has since become the RWIS branch of MADIS.
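The static and dynamic quality checks that MADIS attaches as flags can be illustrated with a minimal sketch: a static range check on each reading, plus a temporal-consistency check that flags implausible jumps from the last accepted value. The thresholds here are illustrative, not MADIS's actual QC parameters.

```python
def qc_flags(readings, lo=-60.0, hi=60.0, max_step=10.0):
    """Flag a time-ordered series of air-temperature readings (deg C).
    Returns one flag per reading: 'OK', 'RANGE' (outside static bounds),
    or 'STEP' (implausible change from the previous accepted value)."""
    flags, prev = [], None
    for v in readings:
        if not (lo <= v <= hi):
            flags.append("RANGE")           # static check failed; prev unchanged
        elif prev is not None and abs(v - prev) > max_step:
            flags.append("STEP")            # dynamic (temporal) check failed
        else:
            flags.append("OK")
            prev = v                        # only accepted values update the baseline
    return flags

# qc_flags([12.0, 12.5, 90.0, 13.0, 30.0])
#   -> ['OK', 'OK', 'RANGE', 'OK', 'STEP']
```

Real MADIS QC adds spatial checks against neighboring stations, but the pattern is the same: the observation is kept and the verdict travels with it as a flag, leaving the consumer to decide what to trust.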
Figure 13. Number of records per data source in MADIS data on September 3, 2009. (Sources shown include the Automatic Position Reporting System as a WX NETwork (APRSWXNET), MesoWest, the Hydrometeorological Automated Data System (HADS), Remote Automatic Weather Stations (RAWS), Ohio DOT, the National Water Level Observation Network (NWLON), Global Positioning System Meteorology (GPSMET), the West Texas Mesonet (WT-Meso), and Minnesota DOT, with per-source counts ranging from 421,601 down to 15,378 records.)

Following the integration of Clarus with MADIS, FHWA developed a new research platform, the Weather Data Environment (WxDE), which collects and shares transportation-related weather data, with a particular focus on weather data related to CV applications. The WxDE
• Incorporates much of the Clarus data and functionality and augments station data with CV data and applications;

Source: NOAA (2021).
Figure 14. MADIS surface data website.

Table 13. Summary of MADIS data assessment.

Challenges/limitations
• The NetCDF file format and tools, and the web services available to access the data, are challenging to use.
• The NetCDF format is supported by many libraries and tools created by the scientific community that make it easy to extract and filter data from each NetCDF file. However, these tools were mostly created for programming languages used by scientists (e.g., FORTRAN, C/C++), not for current data science and GIS toolkits, which are built around the Python and R programming languages. This makes data stored in NetCDF files challenging to extract and use with common data science and GIS tools.
• NOAA provides access to MADIS real-time and archived NetCDF files via file transfer protocol (FTP). FTP is considered a legacy file-sharing service and is not ideal for sharing large data files.
• While supported by scientific tools, some of the characters and data types used by MADIS are often incompatible with, or not supported by, common data analysis tools. The characters must be removed or replaced, and the data converted to other data types, before analysis. This is time-consuming and introduces the risk of data loss or corruption.
• The MADIS data are not meant to be used for real-time analysis. Among the many data sources, some provide data in near-real time (e.g., weather stations broadcasting every 5 to 30 minutes), while others (e.g., satellites) send data several hours later.

Consequences for use in TIM big data use cases
• Because of the vast differences between the programming languages and data structures of NetCDF and those of data science and GIS toolkits, the MADIS data require extensive learning and additional programming to convert NetCDF data into a format compatible with Python or R.

Recommendations
• NetCDF and the data contained in the MADIS dataset surpass the average needs of transportation agencies. It would be preferable for such data to be simplified and made available in a supported format.
• The Open-source Project for a Network Data Access Protocol, known as OPeNDAP (OPeNDAP, 2023), has client libraries in Python and R, yet it is still a scientific tool. While OPeNDAP may be easier to use than parsing NetCDF files, it still requires additional effort to integrate MADIS data with a common data store or system. OPeNDAP is better suited for real-time or event-based data processing than for historical data analysis.

Table 13. Summary of MADIS data assessment.
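The character and data-type cleanup noted in Table 13 typically means decoding fixed-width, null-padded character arrays and masking sentinel fill values before the data can enter ordinary analysis tools. A minimal pure-Python sketch of both steps is shown below; a real pipeline would read the arrays with a NetCDF library (e.g., netCDF4 or xarray), and the `-9999` fill value here is an assumed sentinel, not necessarily the one a given MADIS variable uses.

```python
def decode_char_array(raw: bytes) -> str:
    """Decode a fixed-width, null-padded identifier (as stored in NetCDF
    char arrays) into a plain Python string."""
    return raw.split(b"\x00", 1)[0].decode("ascii").strip()

def mask_fill(values, fill=-9999.0):
    """Replace sentinel fill values with None so the column can be loaded
    into common data analysis tools without skewing statistics."""
    return [None if v == fill else v for v in values]

# decode_char_array(b"KDEN\x00\x00\x00\x00") -> 'KDEN'
# mask_fill([12.5, -9999.0, 13.1]) -> [12.5, None, 13.1]
```

The risk the table mentions (data loss or corruption) enters exactly here: a wrong fill value or encoding silently drops or mangles observations, so these conversions deserve tests of their own.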

• Collects data in real time from fixed ESS and mobile sources;
• Computes value-added enhancements to these data, such as quality-check values for observed data and weather parameters inferred from CV data (e.g., inferring precipitation based on windshield wiper activation);
• Archives both collected and computed data; and
• Supports subscriptions for access to near-real-time data generated by individual weather-related CV projects.

The team attempted to collect and archive data from the WxDE but encountered internal server errors. Instead, the team built a smaller, historical dataset by collecting data from the WxDE real-time data feeds for four states (i.e., Utah, Ohio, Maine, and Arizona) between May 5, 2021, and May 10, 2021. A sample of the data collected is shown in Figure 15.

WxDE Data Assessment—Summary

A summary of the data assessment, including challenges, limitations, and recommendations, is provided in Table 14. Detailed findings of the WxDE data assessment can be found in Appendix N posted with this report.

3.2.6.3 Third-Party Weather API Assessment

Overview of the Third-Party Weather API

As mentioned in the MADIS assessment, a variety of third-party data providers repackage weather data and forecasts from government data sources such as NOAA's NWS and EUMETNET (https://www.eumetnet.eu/) into mobile apps, web APIs, and real-time data feeds that are less focused on scientific use and more in line with the standards used in data science. The team selected a third-party weather data provider for this review because of its low cost and the ease of access to historical weather data.
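The approach the team used for the WxDE (polling a real-time feed to assemble a historical dataset) hinges on de-duplicating overlapping feed responses. The accumulation logic can be sketched independently of the network call; the record fields and key choice here are illustrative assumptions.

```python
def accumulate(archive, seen, new_records):
    """Append feed records to `archive`, skipping any already collected.
    Each record is keyed by (station, timestamp); fields are illustrative.
    Returns the number of records actually added."""
    added = 0
    for rec in new_records:
        key = (rec["station"], rec["timestamp"])
        if key not in seen:
            seen.add(key)
            archive.append(rec)
            added += 1
    return added

archive, seen = [], set()
batch1 = [{"station": "UT01", "timestamp": "2021-05-05T00:00Z", "air_temp": 7.2}]
batch2 = batch1 + [{"station": "UT01", "timestamp": "2021-05-05T00:05Z", "air_temp": 7.0}]
accumulate(archive, seen, batch1)   # first poll: one new record
accumulate(archive, seen, batch2)   # second poll overlaps; only the new observation is kept
# len(archive) -> 2
```

In the real collector, each poll would fetch the feed over HTTP on a fixed interval and periodically flush `archive` to durable storage; keeping the de-duplication pure, as here, makes it easy to test around an unstable feed.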
Current, forecasted, and historical weather data were available through a mobile app and a web API, as well as through a hyperlocal forecasting model capable of providing users with granular weather conditions and up-to-the-minute forecasting at an exact GPS location.

To analyze the third-party weather data, the team could not use the same assessment approach that was applied to datasets with copious amounts of historical data. Because the data are provided as a service for specific points in space and time, the team would have had to collect data for a year or more at various locations across the country to recreate such a dataset. Instead, the team collected the data by using the API to enrich a dataset of more than eight million crashes that occurred between January 2004 and January 2021 and then analyzing the weather data returned for the crash dates, times, and locations.

Figure 15. WxDE data for state DOTs (observation counts over time for the Arizona, Maine, Ohio, and Utah DOTs).
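Point-in-space, point-in-time enrichment at the scale of eight million crashes is dominated by API calls, so a common pattern is to round locations and times onto a coarse grid so nearby crashes share one cached response. The sketch below assumes a hypothetical `fetch_weather` wrapper around the API, and the grid granularity (~0.01 degree, one-hour buckets) is a tunable assumption, not the team's documented method.

```python
def cache_key(lat, lon, iso_ts):
    """Round a crash location/time so nearby requests share a cache slot:
    ~0.01 deg (~1 km) spatially and one-hour buckets temporally."""
    return (round(lat, 2), round(lon, 2), iso_ts[:13])

def enrich(crashes, fetch_weather, cache=None):
    """Attach weather to each crash dict, calling `fetch_weather` (a
    hypothetical API wrapper) only once per unique cache key."""
    cache = {} if cache is None else cache
    for c in crashes:
        key = cache_key(c["lat"], c["lon"], c["time"])
        if key not in cache:
            cache[key] = fetch_weather(*key)
        c["weather"] = cache[key]
    return crashes
```

The trade-off is precision for cost: a coarser grid cuts the request count (and the per-request fees noted elsewhere in this chapter) at the price of assigning the same observation to crashes up to a kilometer and an hour apart.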

Figure 16 shows the third-party weather records obtained to enrich crash data from Arizona, Florida, Maine, Ohio, Tennessee, Utah, and Wyoming.

Third-Party Weather API Assessment—Summary

The assessment of data from the weather API had positive results. The service was easy and inexpensive to use compared to its competitors, and it offered high-quality weather data. The weather API successfully provided a response for each of the approximately eight million requests the team made for crash times and locations across seven states and 17 years (as shown in Figure 16). It should be noted that the weather API was acquired by a technology company, and its API service was terminated on March 31, 2023. However, similar, alternative API weather services exist. Detailed findings of the weather API data assessment can be found in Appendix O posted with this report.

3.2.7 Third-Party CV Data Assessment

3.2.7.1 Overview of the Third-Party CV Data Collected and Assessed

The availability of connected vehicle (CV) data—generated from an equipped vehicle's electronic control units, Controller Area Network, and infotainment systems—is an emerging data source with many promising opportunities. With these data, detailed information about vehicle events can be explored, such as seatbelt status or the speed and location of the vehicle along its trip. A third party collects these data through partnerships with various automobile manufacturers, with much of the data made available directly from the CVs with low latency. Two types of data were used to support one of the use cases: driver event data and vehicle movement data. Driver event data capture car events that take place during a journey; data attributes include the geolocation of the vehicle, hard braking, airbag deployment, windshield wiper use, acceleration type, and speed changes, among others.
Table 14. Summary of WxDE data assessment.

Challenges/limitations
• WxDE server errors were encountered when downloading the data.
• The WxDE data that did download were not consistent for each five-minute interval or for each state, which resulted in missing data.
• In some states, such as Utah, historical data were unavailable. The team also discovered that certain states did not return any data in the real-time feed for over three hours, which would make the assembly of a historical dataset difficult.
• Not all stations report the same data elements, and not all stations report at the same time or on the same schedule.
• The WxDE may draw on fewer weather stations than MADIS, limiting potential data sampling (e.g., MADIS in Maine shows 18 stations, yet only 2 stations provide data to the WxDE).

Consequences for use in TIM big data use cases
• The WxDE data sample can only be used reliably to associate atmospheric and pavement conditions with traffic incidents that occur near weather stations, as it only contains observations from fixed weather stations.
• The legacy technology stack selected for the platform is ill-suited to handle the volume of sensor data available from mobile/telematics weather sensors.
• The WxDE platform cannot be used reliably to integrate weather data with other systems/datasets.

Recommendations
• According to the WxDE documentation, WxDE data should also contain observations from mobile/telematics weather stations. Such data would improve geographical and temporal completeness. (Some mobile/telematics datasets are listed on the WxDE website, but none were accessible at the time of assessment.)
• Updates to the WxDE are required to support historical data and real-time data feeds. At the time of assessment, neither historical data nor robust real-time data could be downloaded from the WxDE. The team developed code to set up the data feeds, yet the feeds remained unstable; some real-time data feeds did not return any data for over three hours, and downloads of historical data failed due to an internal website error.

Vehicle movement data consist of attributes such as

geolocation of the vehicle, vehicle speed, and vehicle heading. Both datasets contain timestamp information based on Coordinated Universal Time (UTC), which provides a consistent and reliable way to understand data details.

The team obtained one month of data, for November 2019, in Phoenix, Arizona. The general geographic area for the data is illustrated in Figure 17. The driver event dataset contained 3,775,333 unique trips, with approximately 18 million driver events recorded. The vehicle movement dataset contained 7,121,051 unique trips, with 1,835,181,659 vehicle movements recorded. Vehicle speeds are captured within the vehicle movement data every three seconds once a trip has started.

The two datasets provide unique insight into the network; for example, vehicle speeds in the vicinity of a known eastbound crash are shown in Figure 18, with lighter shades for slower speeds and darker shades for higher speeds. The use of these time-stamped speed data allows for a detailed analysis of how speeds on the network changed before, during, and after an event occurred, providing insight into the overall impacts of the crash on the network and how long it may have taken for speeds to return to normal.

3.2.7.2 Third-Party CV Data Assessment—Summary

A summary of the data assessment, including challenges, limitations, and recommendations, is provided in Table 15. Detailed findings of the third-party CV data assessment can be found in Appendix P posted with this report.

Source: © 2021 Mapbox; © OpenStreetMap.
Figure 16. Geographic distribution of weather data linked to crash reports.
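The before/during/after speed analysis described above can be sketched by bucketing time-stamped speed observations into windows relative to the incident time and averaging each bucket. The 15-minute window length is an illustrative parameter, not the study's documented choice.

```python
from datetime import datetime, timedelta

def mean_speed_by_window(observations, incident_time, window_min=15):
    """Bucket (timestamp, speed_kph) observations into 'before', 'during',
    and 'after' windows around `incident_time` and average each bucket.
    Observations outside all three windows are ignored."""
    w = timedelta(minutes=window_min)
    buckets = {"before": [], "during": [], "after": []}
    for ts, speed in observations:
        if incident_time - w <= ts < incident_time:
            buckets["before"].append(speed)
        elif incident_time <= ts < incident_time + w:
            buckets["during"].append(speed)
        elif incident_time + w <= ts < incident_time + 2 * w:
            buckets["after"].append(speed)
    return {k: (sum(v) / len(v) if v else None) for k, v in buckets.items()}
```

Comparing the "during" and "after" means against the "before" baseline (or, better, against the same time-of-day on comparable days) indicates both the severity of the slowdown and how long recovery took.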

Source: © OpenStreetMap—Basemap, used under the Creative Commons Attribution-ShareAlike 2.0 License (CC BY-SA 2.0), https://creativecommons.org/licenses/by-sa/2.0/legalcode (no changes made).
Figure 17. Geographic area of the CV data.

Figure 18. Speeds for CVs in event area (legend bins: 0 kph, 0–4.6 kph, 4.6–19.6 kph, 19.9–43.8 kph, and 43.8–107.1 kph, with the crash location marked).
Source: © OpenStreetMap—Basemap, used under the Creative Commons Attribution-ShareAlike 2.0 License (CC BY-SA 2.0), https://creativecommons.org/licenses/by-sa/2.0/legalcode (no changes made).

Table 15. Summary of third-party CV data assessment.

Challenges/limitations
• Direction challenges need to be considered, as CV event data points may not always be mapped, via latitude and longitude, with enough precision to snap to the proper side of the roadway. The data may require further analysis using vehicle movements to increase accuracy.
• Matching driver event data and vehicle movement data in real time is difficult, as an immediate relationship may not be clearly established.
• The capture rate will need to be considered until there is a significant increase in CV market penetration.

Consequences for use in TIM big data use cases
• The challenge of mapping driver event data to vehicle movement data needs to be considered. The two datasets capture different attributes during vehicular trips. Viewed together rather than separately, they would show full information for the whole vehicular trip; in that case, the possibility of identifying correlations between CV data and crash data increases significantly. Although the two datasets came from the same data provider, they do not share a common identifier, so they could not be connected. For the Phoenix dataset, the driver event and vehicle movement datasets could not be mapped together by trip journey IDs or timestamps; thus, the two datasets were evaluated separately to find correlations with the crash data in this use case.
• Until all vehicles are equipped to report when they have been involved in a crash, vehicle movement data will be needed to identify the location of crashes and non-crash traffic incidents.

Recommendations
• Despite capture rate concerns, the available vehicle movement data are detailed and complete. Used in concert with CV driver event data and/or event data from other sources, these data could provide the most efficient and complete view of the overall transportation network and may serve as the basis for automated detection of traffic incidents.
• Establish a consistent base roadway network against which the driver event data and vehicle movement data will be processed or snapped. The use of a base network for conflation of the data will provide access to additional analysis attributes, including roadway segmentation and roadway characteristics not inherently present in the data.
• Establish a methodology to determine expected trend data as a baseline for comparison with the vehicle movement dataset. This baseline can be used through automated measures to analyze the network in real time for departures from expected conditions, which may indicate that an incident has occurred.
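The recommendation to snap driver event and vehicle movement data to a consistent base network can be sketched as nearest-segment matching: compute the point-to-segment distance to each candidate and take the closest. The segment IDs and planar coordinates below are illustrative; a production map-matcher would use a spatial index, projected coordinates, and heading/side-of-road logic to address the direction challenge noted in Table 15.

```python
def point_segment_dist2(p, a, b):
    """Squared distance from point p to segment a-b (planar coordinates)."""
    ax, ay = a; bx, by = b; px, py = p
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return (px - ax) ** 2 + (py - ay) ** 2
    # Parameter t of the closest point on the segment, clamped to [0, 1].
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)))
    cx, cy = ax + t * dx, ay + t * dy
    return (px - cx) ** 2 + (py - cy) ** 2

def snap_to_network(point, segments):
    """Return the id of the nearest base-network segment. `segments` maps
    segment id -> (endpoint_a, endpoint_b); naive linear scan for clarity."""
    return min(segments, key=lambda sid: point_segment_dist2(point, *segments[sid]))

# Hypothetical base network: two parallel directional carriageways.
net = {"I10_EB": ((0, 0), (100, 0)), "I10_WB": ((0, 10), (100, 10))}
# snap_to_network((50, 2), net) -> 'I10_EB'
```

Snapping both datasets to the same base network gives them a shared key (the segment ID), which is exactly what the missing common identifier between the driver event and vehicle movement datasets otherwise prevents.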

Big data is evolving and maturing rapidly, and much attention has been focused on the opportunities that big data may provide state departments of transportation (DOTs) in managing their transportation networks. Using big data could help state and local transportation officials achieve system reliability and safety goals, among others. However, challenges for DOTs include how to use the data and in what situations, such as how and when to access data, identify staff resources to prepare and maintain data, or integrate data into existing or new tools for analysis.

NCHRP Research Report 1071: Application of Big Data Approaches for Traffic Incident Management, from TRB's National Cooperative Highway Research Program, applies the guidelines presented in NCHRP Research Report 904: Leveraging Big Data to Improve Traffic Incident Management to validate the feasibility and value of the big data approach for Traffic Incident Management (TIM) among transportation and other responder agencies.

Supplemental to the report are Appendix A through Appendix P, which detail findings from traditional and big data sources for the TIM use cases; a PowerPoint presentation of the research results; and an Implementation Memo.
