
Application of Big Data Approaches for Traffic Incident Management (2023)

Chapter 6: TIM Big Data Guidelines


This chapter presents guidelines for transportation agencies and TIM programs regarding the development and implementation of TIM big data pipelines. NCHRP Research Report 904 provides a basis for NCHRP Project 03-138; the report presents eight initial guidelines for agencies to explore the application of big data approaches for TIM and to begin positioning themselves for the shift to big data (Pecheux, Pecheux, & Carrick, 2019). Readers are encouraged to review these guidelines and their associated sub-guidelines, which can be found in Chapter 6 of NCHRP Research Report 904. High-level guidelines include

• Adopt a deeper and broader perspective on data use,
• Collect more data,
• Open and share data,
• Use a common data storage environment,
• Adopt cloud technologies for the storage and retrieval of data,
• Manage the data differently,
• Process the data, and
• Open and share outcomes and products to foster data user communities.

The guidelines presented in this chapter build from, enhance, and refine the big data guidelines presented in NCHRP Research Report 904. In addition, this chapter pulls from the big data/modern data management guidelines presented in NCHRP Research Report 952 (Pecheux, Pecheux, Ledbetter, & Lambert, 2020), which provides guidance, tools, and a big data management framework. NCHRP Research Report 952 lays out a roadmap for transportation agencies to begin to actively shift—technically, institutionally, and culturally—toward effectively managing big data in emerging technologies. The guidance and roadmap are not specific to a particular transportation program or type of data; rather, they can be applied agencywide or to any type of program. Concepts and methodologies concerning big data management and use are introduced, along with industry best practices. Case studies are provided from transportation agencies that have navigated the implementation of big data, including challenges and successes.

The guidelines in this report also add recommendations, techniques, and tips specific to developing the TIM big data pipelines in this project. NCHRP Project 03-138 demonstrates various early attempts to build TIM big data pipelines from available data and highlights some of the challenges in doing so, given the current state of the data (i.e., availability, quantity, and quality). Specific recommendations, techniques, and tips are provided to handle or mitigate some of these challenges. The resulting 18 expanded, enhanced, and refined guidelines are presented across the following six categories:

• Data acquisition and quality;
• Data environment, platform, and architecture;
• Data management;
• Data processing, tools, and mining techniques;
• Data pipeline development and operations costs; and
• Data sharing.

Each of the 18 guidelines also includes recommended actions that can be taken to implement the guideline. Table 24 presents an overview of the categories, TIM big data guidelines, and the associated implementation actions, and the following sections contain more information on each guideline.

6.1 Data Acquisition and Quality

1. Collect more data.
   – Gather as many TIM-relevant datasets as possible to build a solid foundation for TIM big data analytics. Make a case to partner agencies on the benefits of getting access to and using their data to support improved TIM.
   – Use extensive and detailed data, as opposed to cursory information (e.g., weather data from third-party providers versus weather data attributes from crash reports, respectively).
   – Augment human-collected data with machine-collected (sensor) data and other external data sources to obtain a more complete and detailed description of incidents and associated response activities.
   – Obtain transportation data and integrate with existing/partner data. Examples include
     ◾ State traffic records data (driver, vehicle, citation, and injury surveillance) and
     ◾ Law enforcement CAD/AVL data.
   – Consider opportunities to leverage emerging big data sources, including
     ◾ Crowdsourcing,
     ◾ Probe vehicles,
     ◾ CVs,
     ◾ Social media, and
     ◾ Videos.
   – Consider using commercial weather services that simplify and combine multiple weather data sources and make them available to the public for a fee in formats that are easier to consume. Continue to monitor these services for availability and costs over time.
2. Ready the data for big data analysis.
   – Standardize the data.
     ◾ Facilitate interoperability between datasets by using variable names within each dataset that are mapped to existing data standards, such as
       ♦ WGS 84 for coordinates;
       ♦ MMUCC guidance for crash reports so that crash data can be more easily integrated for multistate analyses; and
       ♦ MIRE guidelines for roadway and traffic data inventories, which provide a basis for a good/robust data inventory standard.
     ◾ Adopt/use the FHWA definition for "secondary crashes" to improve understanding and consistency in data collection. "'Secondary Crashes' are defined as the number of crashes beginning with the time of detection of the primary incident where the collision occurs either a) within the incident scene or b) within the queue, including the opposite direction, resulting from the original incident" (Owens, Armstrong, Mitchell, & Brewster, 2009).
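As a hedged illustration of the standardization actions above (not taken from the report), the following sketch maps state-specific crash-report columns to a common, MMUCC-style schema with WGS 84 latitude/longitude fields using pandas; all dataset and column names are hypothetical.

```python
# Minimal sketch: map state-specific crash columns to standard names (guideline 2).
# Column names below are hypothetical; a real mapping would follow MMUCC guidance.
import pandas as pd

# Per-state rename maps to a common, MMUCC-style schema.
COLUMN_MAPS = {
    "state_a": {"CrashDateTime": "crash_datetime", "Lat": "latitude", "Long": "longitude"},
    "state_b": {"ACC_DATE": "crash_datetime", "Y_COORD": "latitude", "X_COORD": "longitude"},
}

def standardize(df: pd.DataFrame, state: str) -> pd.DataFrame:
    """Rename columns to the common schema and enforce numeric WGS 84-style coordinates."""
    out = df.rename(columns=COLUMN_MAPS[state])
    out["crash_datetime"] = pd.to_datetime(out["crash_datetime"], errors="coerce")
    out["latitude"] = pd.to_numeric(out["latitude"], errors="coerce")
    out["longitude"] = pd.to_numeric(out["longitude"], errors="coerce")
    out["state"] = state
    return out

# Usage: standardized frames from different states can then be concatenated
# for multistate analyses.
# combined = pd.concat([standardize(df_a, "state_a"), standardize(df_b, "state_b")])
```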

Table 24. Summary of TIM big data guidelines and implementation actions.

Data Acquisition and Quality

1. Collect more data.
   • Gather as many datasets as possible.
   • Use extensive and detailed data.
   • Augment or replace human-collected data with machine (sensor) data.
   • Obtain transportation data and integrate with existing/partner data.
   • Consider opportunities to leverage emerging big data sources.
   • Consider using commercial data services (e.g., for weather data), but be cautious of the long-term availability of services.
2. Ready the data for big data analysis.
   • Standardize the data.
   • Improve the quality of the data.
   • Use open-source systems and file formats.

Data Environment, Platform, and Architecture

3. Leverage cloud infrastructure.
   • Clearly define the purpose of the cloud environment.
   • Select the desired type of cloud computing services and cloud service provider.
   • Store data in a cloud-based, object storage solution (i.e., data lake).
4. Centralize the data.
   • Store data in a common data environment (break down silos).
   • Migrate datasets progressively from agency silos to the data lake.
   • Transform data silos to make them interact with the data lake instead.
   • Focus on building a controlled data environment where use cases can grow and evolve.
   • Enable consistent APIs for access and analysis.
5. Develop a flexible and distributed data architecture.
   • Ensure that the chosen solutions are built on common architectures and possess effective, consistent commercial support.
   • Rely on cloud hosting services in combination with either cloud providers or open-source data storage services.
   • Follow a distributed architecture.

Data Management

6. Store, organize, and prepare data for big data analyses.
   • Store all data "as is" (i.e., raw, unaltered, and unaggregated), including both structured and unstructured data.
   • Load large files/objects directly into the data lake.
   • Organize data using a "regular file system" structure.
   • Annotate the data so each file can be identified and defined easily, including its content, provenance, and quality.
   • Ensure that data are uniquely identifiable.
   • Convert large datasets into column-oriented data file formats.
   • Filter and organize the stored raw data for each individual use case, as necessary.
   • Use cryptographic hashes (alphanumeric strings, such as SHA or MD5) widely across data stores to ensure sustainability of the datasets.
   • Consider processed data as a disposable artifact—focus instead on preserving the raw data.
   • Use/convert to strong typing for consistent querying across states' datasets.
   • Store metadata about the data elements in each dataset.
   • Adequately consider efficient archival processes (e.g., moving raw files to cold storage and preserving raw data indefinitely) and solutions (e.g., legal/policy archival requirements and business needs).
7. Maintain data accessibility.
   • Use and store data in accessible, open, nonproprietary standards and file formats.
   • Make all data available for analysis.
   • Use activity data to continuously optimize data access and policies as data and usage change.
   • Allow analysts to access and integrate datasets.
   • Control data at the file and folder level.
   • Share/distribute data governance responsibility between a central entity and the TIM program.

Data Processing, Tools, and Mining Techniques

8. Process data where they reside (in the common, cloud-based storage solution).
   • Move data processing software to the data on each of the servers in the cluster.
   • Adopt a distributed approach to data processing using cloud infrastructure.
   • Save analysis outputs to the same location.
9. Structure/optimize data for analysis and distributed computing.
   • Shape the data just enough to maximize analytical usability and integrability.
   • Consider directional issues with certain datasets.
   • Be aware of how incident-related timestamps are calculated or made available in the data.
   • Create time and spatial boundaries around reported incidents to increase the likelihood that associated data points are included in the analysis.
   • Use fuzzy tolerances to match contiguous segments in ARNOLD.
   • Consider how to integrate CV "driver event" data with "vehicle movement" data.
   • Understand potential limits of CV data given current levels of CV market penetration. Validation of these data against other datasets is crucial to determining when these data have met the acceptable threshold (which will vary based on the use case).
   • Filter CV data based on column attributes (e.g., "AccelerationType") to improve the efficiency and accuracy of matching with crash records.
   • Check for odd date formats or geometries that are not in the most common CRSs.
   • Understand the lingo used in text-based data before parsing it.
   • Ensure that the tools selected for big data analytics can process the data where they reside and that the algorithms are designed to run on data scattered across multiple servers.
10. Support/use many different analysis tools and software.
   • Avoid being overly prescriptive or limiting the use of specific data analytics tools.
   • Encourage data users to customize their analyses using the most appropriate tools and libraries.
   • Consider big data analytics techniques, like natural language processing (NLP), to extract relevant information about incidents.
   • Adopt open-source software as a basis for big data platforms.
11. Understand the ephemeral nature and limitations of big data analytics.
   • Investigate practical solutions (or similar or partial solutions) that might already exist.
   • Adopt an iterative approach to solution development.
   • Constantly monitor the analytics results, and redesign the system as needed.
   • Conduct regular reviews of the data pipeline and dashboard tools, and update them with more efficient and effective techniques and tools.
12. Delegate analytics control to individual analysts/groups.
   • Make individual data users/groups responsible for developing the extract, transform, load (ETL) processes, data accuracy needs, and quality checks for each analysis.
   • Make individual data users/groups responsible for the development, deployment, maintenance, and retirement of the data pipelines/products they develop.

Data Pipeline Development and Operations Costs

13. Use pay-as-you-go (or pay-for-what-you-use) cloud computing.
   • Consider data fluctuations in the initial stages of planning a data pipeline.
   • Consider alternative approaches for data pipelines to strike a balance between cost, efficiency, and responsiveness.
14. Use inexpensive cloud-storage solutions for inactive data.
   • Leverage archiving systems that automatically migrate data between normal and archive storage based on demand.
   • Move unused data to low-cost storage after a defined period of time.
   • Compress structured and unstructured data objects upon storage to minimize cloud costs.
15. Conduct a focused spatial/temporal search to minimize computing resources.
   • Evaluate whether the data are accurate enough when conducting spatial searches. Answer the following questions:
     – What is the spatial accuracy of the data?
     – What is the temporal accuracy of the datasets being compared?
     – Can the geofence be refined?
   • Keep the search area as small as possible to limit the number of potential matches that must be compared.

Data Sharing

16. Leverage cloud storage and computing scalability to share data, analyses (code), and products.
   • Apply the big data concept of "many eyes" (i.e., allow insights and decision-making across responder groups).
   • Share/customize data at various levels (i.e., storage, API, and interface).
   • Establish live data streams in addition to sharing historical data.
   • Continuously collect and review user data-sharing needs.
   • Share data extracts, data APIs, and even entire large datasets with external users.
   • Share data analysis process code.
17. Openly share data between TIM partners.
   • Open and share modified versions of the original data.
   • Obfuscate or encrypt sensitive data.
   • Create different versions of the datasets based on who they need to be shared with.
18. Open and share results of big data analyses through common data storage.
   • Use nonproprietary file formats and open web APIs for all data shared, both internally and externally.
   • Use open-format and open-source tools for data pipelines and products.
   • Share both data and code.
   • Use the Representational State Transfer (REST) protocol for APIs.

   – Improve the quality of the data. Focus on controlling and maintaining data quality across every dataset.
     ◾ Set up quality-rating methods and metrics for each dataset.
     ◾ Develop dashboards and alerts to better track and control overall data quality trends.
     ◾ Make data quality the responsibility of each business unit, in addition to a governing entity.
     ◾ Conduct quality assessments based on user-defined requirements instead of requirements predetermined by IT or others.
     ◾ Conduct targeted processing to correct for gaps in completeness, quality, and resolution.
     ◾ Use dynamic data crawling tools, dashboards, and alerts to continuously measure data quality trends across each stored dataset to better understand data quality and the needs of analysts. Adjust quality criteria and metadata as needed.
     ◾ Report/flag which data are erroneous or defective for specific use cases.
     ◾ Score and flag suspect data rather than removing them, and label/augment data with quality assessment tags. This gives analysts and researchers more flexibility and calls for more awareness of the relative veracity or trustworthiness of the data.
     ◾ Correct formatting issues (e.g., extra quotation marks in CSV and pipe-delimited formats) so that a variety of software tools can be used to ingest and process the data.
     ◾ Validate location coordinates for crashes/incidents.
     ◾ Conduct regular training for law enforcement to improve data quality. For example, FHWA recently developed a slide-based video training, titled "Improving Responder Safety Through Traffic Crash Reporting," that focuses on the collection of data associated with the four TIM performance measures via crash reports (Carrick, 2023).
   – Use open-source systems and file formats. Proprietary systems often use proprietary file formats, which may impact how effectively they can be used and shared. These systems may also be high risk in the long term, as commercial/licensed systems are prone to obsolescence. Plan for conversion from proprietary formats when possible.

6.2 Data Environment, Platform, and Architecture

Storing and organizing the data in silos (e.g., many data stores with various hardware, software, and data management methods across TIM partners) may have been sufficient for traditional data analysis, but it will not support big data analysis.

3. Leverage cloud infrastructure. While storing a range of data sources in support of TIM may be possible using on-premises data systems, the cost of doing so would quickly become overwhelming, and alternatives would soon need to be identified. Cloud infrastructure was designed to offer the flexibility, scalability, and redundancy needed for big data at a lower cost than most on-premises solutions.
   – Clearly define the purpose of the cloud environment (specific needs versus "catch all") to provide the best balance of cost and efficiency and to foster success.
   – Select the desired type of cloud computing services and cloud service provider. There are three main types of cloud computing services: Infrastructure-as-a-Service, Platform-as-a-Service, and Software-as-a-Service (SaaS).
   – Store all data in a cloud-based, object storage solution (i.e., data lake). Cloud storage solutions are a kind of object storage meant to store large binary objects that can be up to several terabytes each. Cloud data storage is elastic and flexible, and it can be provisioned on demand (i.e., adjusted up or down based on a change in the raw data); see the sketch below.
     ◾ If there are needs/use cases that have not been fully vetted, push associated data to the data lake to allow for the most cost-effective solution until focused use cases are defined.
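To make guideline 3 concrete, the following is a minimal sketch (not the project's pipeline code) of landing a raw crash-report extract "as is" in a cloud object store using the AWS SDK for Python (boto3); the bucket name, folder layout, and file name are hypothetical placeholders, and other cloud providers offer equivalent object-storage APIs.

```python
# Minimal sketch: land a raw file in a cloud data lake "as is" (guideline 3).
# Assumes AWS S3 via boto3; bucket, key, and file names below are hypothetical.
import boto3
from datetime import date

s3 = boto3.client("s3")

bucket = "example-tim-data-lake"                 # hypothetical bucket
local_file = "crash_reports_2023-05.csv"         # raw extract, unaltered
key = f"raw/crash_reports/year={date.today().year}/{local_file}"  # folder-style layout

# Upload the raw object without cleaning or restructuring it first;
# downstream use cases filter and shape copies of these data as needed.
s3.upload_file(local_file, bucket, key)
print(f"Uploaded s3://{bucket}/{key}")
```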

4. Centralize the data in a common storage environment to facilitate the integration of data. Common data storage has the potential to transform data analysis in an organization by providing a single repository for all the organization's data and enabling analysts to mine all the data.
   – Store data in a common data environment (break down silos).
   – Migrate datasets progressively from agency silos to the data lake.
   – Transform data silos to make them interact with the data lake instead.
   – Build a controlled data environment where use cases can grow and evolve.
   – Enable consistent APIs for access and analysis.
5. Develop a flexible and distributed data architecture that can apply many analytical technologies to stored data of interest (instead of creating an all-inclusive data model capable of organizing every data element, which would be challenging).
   – Ensure that the chosen solutions are built on common architectures and possess effective, consistent commercial support.
   – Rely on cloud hosting services, in combination with either cloud providers or open-source data storage services, to support quick responses to fluctuations and changes to data storage needs without excessive downtime and cost increases.
   – Follow a distributed architecture to allow data processes to be developed, used, maintained, and discarded without affecting other processes on the system.

6.3 Data Management

6. Store, organize, and prepare data for big data analyses.
   – Store all data "as is" (raw, unaltered, and unaggregated), including both structured and unstructured data. This differs from traditional data warehousing approaches, which first clean the data and then structure them according to a predesignated data model (i.e., schema) before storing them in a relational database.
   – Load large files/objects directly into the data lake.
   – Organize data using a "regular file system" structure.
   – Annotate the data so that each file, as well as its content, provenance, and quality, can be identified and defined easily. This type of annotation is typically done using predefined organizational or nationwide standards by embedding data definitions directly within each file as metadata tags or by creating metadata files associated with specific datasets.
   – Ensure that data are uniquely identifiable. When dealing with big datasets, it is often difficult to identify whether the data are accurate or whether they have been corrupted (i.e., a degraded or neglected version of a dataset). To remedy this issue, use cryptographic hashes widely across data stores to ensure sustainability of the datasets for common storage. A cryptographic hash is an alphanumeric string (e.g., SHA or MD5), generated by an algorithm, that can take a "snapshot" of the data upon storage in the common data store. A cryptographic hash that uniquely identifies the data can be generated and distributed across the dataset to ensure that the dataset has not been corrupted or manipulated. Given the volume of big datasets, the likelihood of silent (i.e., undetected) data corruption is high.
   – Convert large datasets (e.g., crash data) into column-oriented data file formats that are designed for efficient data storage and retrieval (e.g., Apache Parquet).
   – Filter and organize the stored raw data for each individual use case.
   – Consider processed data as a disposable artifact that can be easily recreated from the immutable raw data.
   – Use/convert to strong typing (to specify what data types are accepted) for consistent querying across states' datasets. Examples of strongly typed languages include Python, C#, Java, and Scala, while weakly typed languages include JavaScript and C++. However, with strong typing, some data elements are left ignored and others are unavailable in some states, which reduces the usefulness of many data elements when doing cross-state analyses. (A minimal sketch combining hashing, Parquet conversion, and explicit typing follows guideline 7.)
   – Store metadata about the data elements in each dataset.
   – Adequately consider efficient archival processes (e.g., moving raw files to cold storage and preserving raw data indefinitely) and solutions (e.g., legal/policy archival requirements and business needs).
7. Maintain data accessibility.
   – Use/store data in accessible, open, nonproprietary standards/file formats to maximize accessibility of the raw data. Example formats include CSV, JSON, Apache Parquet, and Apache Avro (https://avro.apache.org/).
   – Make all data available for analysis.
   – Use activity data to continuously optimize data access and policies as data and usage change.
   – Allow analysts to access and integrate datasets.
   – Control data at the file and folder level. With big data, there is no ability to control the use of data except by denying access to a file.
   – Share/distribute data governance responsibility between a central entity and the TIM program to optimize data access and data value extraction.
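The following minimal sketch (an illustration, not the project's code) shows three of the guideline 6 actions together: computing a SHA-256 hash of a raw file at ingest, reading it with explicit data types, and writing a column-oriented Parquet copy. File and column names are hypothetical; pandas and pyarrow are assumed to be available.

```python
# Minimal sketch: hash the raw file, enforce explicit types, write Parquet (guideline 6).
# File and column names are hypothetical; requires pandas and pyarrow.
import hashlib
import pandas as pd

raw_path = "raw/crash_reports_2023-05.csv"   # immutable raw extract in the data lake

# 1. Cryptographic "snapshot" of the raw object, stored alongside it as metadata
#    so later copies can be checked for silent corruption.
with open(raw_path, "rb") as f:
    sha256 = hashlib.sha256(f.read()).hexdigest()

# 2. Explicit (strong) typing at read time for consistent cross-state querying.
dtypes = {"crash_id": "string", "route_id": "string", "latitude": "float64",
          "longitude": "float64", "injury_count": "Int64"}
df = pd.read_csv(raw_path, dtype=dtypes, parse_dates=["crash_datetime"])

# 3. Column-oriented copy for efficient analytical retrieval; treated as disposable
#    because it can always be recreated from the raw CSV.
df.to_parquet("curated/crash_reports_2023-05.parquet", index=False)
print(f"sha256={sha256}, rows={len(df)}")
```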

   – Consider how to integrate CV "driver event" data with "vehicle movement" data. For security purposes, there is no unique identifier available to join the two datasets. However, the two datasets capture different attributes of vehicle trips, and together they would provide more complete trip information.
   – Filter CV data based on column attributes to improve the efficiency and accuracy of matching with crash records. Sometimes, this additional filtering can help improve matching between the CV data and crash records. The driver event dataset, for example, can be further filtered by "AccelerationType" to keep only the driver events that are coded as either hard braking or hard accelerating in order to simplify the matching process.
   – Check for odd, dated, or nonstandard data formats.
     ◾ Check for odd date formats or geometries that are not in the most common CRSs, such as the North American Datum of 1983 (NAD 83, https://geodesy.noaa.gov/datums/horizontal/north-american-datum-1983.shtml) and WGS 84.
     ◾ Re-project data from the NAD 83 CRS to the WGS 84 CRS so that the data can be merged with other datasets. While NAD 83 is the national georeferencing reference system used by most federal and state agencies in the United States for most data types, including CAD data, WGS 84 is the default coordinate system used for GPS.
     ◾ Re-encode Windows-1252 (or CP-1252) single-byte character encoding, which is common for crash data, to UTF-8 (a variable-length character encoding in Unicode Transformation Format) for easier processing.
     ◾ For example, in the case of the CHP CAD XML data feed, the data are published in XML format and use a different XML standard than the strict XML document standard. To be parsed by common XML tools and loaded into more easily managed formats, like JSON, that are usable by modern data analysis tools, the CHP CAD XML data require additional text processing to make them adhere to the strict XML standard.
   – Understand the lingo used in text-based data before parsing it. For example, while the most common status updates in CAD are reported as 10-codes or 11-codes, using these codes alone is not sufficient because police lingo is often used in the incident updates instead.
   – Ensure that the tools selected for big data analytics can process the data where they reside and that the algorithms are designed to run on data scattered across multiple servers.
10. Support/use many different analysis tools and software, including open-source, proprietary, and cloud services, to meet the needs of individual business units.
   – Avoid being overly prescriptive or limiting the use of specific data analytics tools. Use many, varied tools to meet the needs of individual use cases.
   – Adopt open-source software as a basis for big data platforms. Open-source software can help agencies avoid vendor lock-in (dependence on a particular vendor) and up-front licensing costs because it uses open standards and offers agencies more flexibility (e.g., quicker and less expensive to fix, expand on, and customize). However, open-source software may require specific skills/expertise to operate and maintain.
   – Encourage data users to customize their analyses using the most appropriate tools and libraries.
   – Consider big data analytics techniques, like natural language processing (NLP), to extract relevant information about incidents. For example, incidents in CAD data may be missing explicit TIM timestamps that could be inferred using an NLP analysis of the incident status updates, along with the times they were posted. While this approach may not be ideal for real-time data processing, it would allow more value to be extracted from CAD data. (A simplified sketch of this timestamp-inference idea appears at the end of Section 6.5.)
11. Understand the ephemeral nature and limitations of big data analytics.
   – Investigate practical solutions that might already exist; do not start from scratch or attempt to reinvent the wheel.
   – Adopt an iterative approach to solution development instead of a set-it-and-forget-it approach that assumes the analytical solution will perform well for years to come.

   – Constantly monitor the analytics results, and redesign the system to optimize performance and quality.
   – Conduct regular reviews of the data pipeline and dashboard tools, and update them with more efficient and effective techniques and tools.
12. Delegate analytics control to individual analysts/groups.
   – IT is not in charge of preparing and maintaining epurated (i.e., purified) datasets for all analyses. Allow data users to modify and customize their own data schema as needed. Do not restrict query design to structured data or a specific data model; any type of data and analysis is theoretically possible.
   – Make individual users/groups responsible for developing the ETL processes, data accuracy needs, and quality checks for each analysis.
   – Make individual users/groups responsible for the development, deployment, maintenance, and retirement of the data pipelines/products they develop. The commoditization of data and data analysis tools has fostered the adoption of self-service data preparation and analysis; a variety of end users, from novices to experts, can perform data tasks using a wide range of tools.

6.5 Data Pipeline Development and Operations Costs

13. Use pay-as-you-go (or pay-for-what-you-use) cloud-based computing.
   – Consider data fluctuations in the initial stages of planning a data pipeline to understand the peaks and resource allocation thresholds that may be needed. Pay-as-you-go SaaS can help to limit the cost incurred when copious amounts of data need to be analyzed for a brief period.
   – Consider alternatives for data pipelines to strike a balance between cost, efficiency, and responsiveness. Transactional costs in a cloud environment—or the processes to run specific functions—can be refined, which can result in significant cost savings.
14. Use inexpensive cloud-storage solutions for inactive data.
   – Leverage archiving systems that automatically migrate data between normal and archive storage based on demand.
   – Move unused data to low-cost storage after an established period. These data can be moved back to normal storage quickly for use when needed.
   – Compress structured and unstructured data objects upon storage to limit their impact on cloud costs.
15. Conduct a focused spatial/temporal search to minimize computing resources. Geofencing of disparate datasets (i.e., joining multiple sources that do not share a common data element) can be a significant cost factor in the cloud. Evaluate whether the data are accurate enough when conducting spatial searches. The search area should be kept as small as possible to limit the number of potential matches that must be compared. Answer the following questions to determine the accuracy of the data and the size of the geofence that may be required:
   – What is the spatial accuracy of the data? The lower the spatial accuracy of the data or the wider the coverage area, the larger the geofence or geohash will need to be, which can increase the cost of spatial analysis.
   – What is the temporal accuracy of the datasets being compared? Does the dataset contain date and time data formatted in a way that corresponds to the event being matched? The amount of data required for processing grows as this time window widens, so keeping the envelope (i.e., the window of time needed to match events from two different data sources) small requires fewer resources.
   – Can the geofence be refined? Understanding how to design the necessary geofence or geohash area can significantly reduce the time and resources needed. For example, searching by a bounding box or rectangular area will result in different areas and resource requirements compared to searching by a defined distance around a roadway.
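As a hedged illustration of guideline 15 (not from the report), the sketch below restricts candidate CV records to a small spatial box and time window around a crash before any expensive matching is attempted; the data frame, column names, and boundary sizes are hypothetical and would be tuned to the spatial and temporal accuracy of each source.

```python
# Minimal sketch of guideline 15: focused spatial/temporal search before matching.
# Column names and boundary sizes are hypothetical.
import pandas as pd

def focused_search(cv: pd.DataFrame, crash_lat: float, crash_lon: float,
                   crash_time: pd.Timestamp,
                   deg: float = 0.005,           # roughly a 500 m box at mid-latitudes
                   window: str = "30min") -> pd.DataFrame:
    """Return only CV records inside a small box around the crash and a time window."""
    t0, t1 = crash_time - pd.Timedelta(window), crash_time + pd.Timedelta(window)
    return cv[
        cv["latitude"].between(crash_lat - deg, crash_lat + deg)
        & cv["longitude"].between(crash_lon - deg, crash_lon + deg)
        & cv["timestamp"].between(t0, t1)
    ]

# Usage: only these candidates are passed to the more expensive matching logic.
# candidates = focused_search(cv_df, 38.58, -121.49, pd.Timestamp("2023-05-14 14:05"))
```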
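Returning to the NLP idea noted under guideline 10, the following simplified sketch infers a missing roadway-clearance timestamp from free-text CAD status updates. Keyword matching stands in here for a fuller NLP pipeline, and the update text, phrasing, and clearance patterns are hypothetical.

```python
# Simplified sketch: infer a missing roadway-clearance timestamp from CAD status updates.
# Keyword/regex matching stands in for a fuller NLP approach; text below is hypothetical.
import re
from datetime import datetime

updates = [  # (posted time, status text) for one hypothetical incident
    ("2023-05-14 14:02", "1125 veh blocking #2 lane"),
    ("2023-05-14 14:31", "tow on scene"),
    ("2023-05-14 14:55", "all lanes open, units clearing"),
]

CLEAR_PATTERNS = re.compile(r"all lanes open|lanes clear|roadway clear", re.IGNORECASE)

def infer_roadway_clearance(updates):
    """Return the posted time of the first update suggesting the roadway is clear."""
    for posted, text in updates:
        if CLEAR_PATTERNS.search(text):
            return datetime.strptime(posted, "%Y-%m-%d %H:%M")
    return None  # no clearance language found; the timestamp remains unknown

print(infer_roadway_clearance(updates))  # 2023-05-14 14:55:00
```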

6.6 Data Sharing

Opening and sharing data helps to build a data culture across an organization by increasing transparency and accountability; developing trust, credibility, and reputation; promoting progress and innovation; and encouraging public education and community engagement.

16. Leverage cloud storage and computing scalability to share data, analyses (code), and products. The low cost and access capability of the cloud allows data to be made available to a wide range of potential users while still retaining control over access and the amount of resources used. The cloud enables large and complex datasets to be searched and analyzed in many ways by many users (internal and external) without over-expending resources or costs.
   – Apply the big data concept of "many eyes" to allow insights and decision-making across responder groups.
   – Share/customize data at various levels:
     ◾ At the storage level—read-only access to stored files,
     ◾ At the API level—programmatic access to stored files, and
     ◾ At the user-interface level—stored files can be downloaded or explored using a browser or other programs.
   – Establish live data streams in addition to sharing historical data.
   – Continuously collect and review user data-sharing needs. Understanding user data needs and behaviors can help to update how data are shared on the system.
   – Share data extracts, data APIs, and even entire large datasets with external users by providing access to the data directly on the cloud environment, where users can search and analyze data at their own cost.
   – Share data analysis process code.
17. Openly share data between TIM partners.
   – Open and share modified versions of the original data to allow use of the data without sensitive information.
   – Obfuscate or encrypt sensitive data so they are only accessible to those users with permissions for decryption.
   – Create different versions of the datasets based on who they need to be shared with.
18. Open and share results of big data analyses through common data storage. As results are reviewed and analyses are recreated by other members of the community, over time there will be better outcomes as successes, flaws, errors, or previously undetected patterns emerge. Previously unexplored ways to leverage the data are more likely to be discovered by a broad community than by a small number of experts.
   – Use nonproprietary file formats and APIs for all data shared, both internally and externally. Open file formats and APIs are often included in vendor solutions but not enabled by default.
   – Use open-format and open-source tools for data pipelines and products so that they can be packaged, shared, and run within other environments.
   – Share data and code to enable anyone with adequate resources to quickly replicate an analysis on the cloud. Sharing data analytics processes in the cloud is much easier and less costly than traditional data-sharing approaches. Sharing of data and code allows other agencies, institutions, and universities to review, validate, and improve results.
   – Use the Representational State Transfer (REST) protocol for APIs, which is simpler, more flexible, faster, and less expensive than XML-based alternatives. (In the context of large, varied datasets, XML is too strict, difficult to change, and verbose.) Data are exchanged using more flexible file formats, such as JSON, that are easily read by humans.
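To illustrate guideline 18, the following is a minimal sketch of a REST endpoint that returns TIM analysis results as JSON. The Flask framework, route, district identifier, and measure values are illustrative assumptions; the report does not prescribe a specific web framework, and production figures would be read from the shared cloud data store.

```python
# Minimal sketch of guideline 18: share analysis results via a simple REST API (JSON).
# Framework, route, and values are illustrative; requires Flask.
from flask import Flask, jsonify

app = Flask(__name__)

# Placeholder results; in practice these would come from the shared cloud data store.
TIM_MEASURES = {
    "district-03": {"roadway_clearance_min": 34.2, "incident_clearance_min": 61.7,
                    "secondary_crashes": 12},
}

@app.route("/api/v1/tim-measures/<district>")
def tim_measures(district):
    """Return TIM performance measures for a district as JSON."""
    measures = TIM_MEASURES.get(district)
    if measures is None:
        return jsonify({"error": "unknown district"}), 404
    return jsonify({"district": district, "measures": measures})

if __name__ == "__main__":
    app.run(port=8080)  # e.g., GET /api/v1/tim-measures/district-03
```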

