
Application of Big Data Approaches for Traffic Incident Management (2023)

Chapter: Chapter 5 - Estimated Costs of Cloud Environments and Data Pipelines

Suggested Citation:"Chapter 5 - Estimated Costs of Cloud Environments and Data Pipelines." National Academies of Sciences, Engineering, and Medicine. 2023. Application of Big Data Approaches for Traffic Incident Management. Washington, DC: The National Academies Press. doi: 10.17226/27300.


Cloud-based platforms present a change in procurement and payment of data systems, storage, and processing for transportation agencies. Cloud services—whether shared hardware, software, or information resources—are provided as a metered service over the internet. The simplest pricing model—and default option—for cloud services is the “on-demand” or “pay-per-use” model.

It is important to understand the distinction between procuring cloud services and procuring traditional on-premises systems. Traditionally, agencies would scope and procure physical servers for a predetermined price based on predefined requirements. Agencies could then use these servers to run as many processes or analytics as possible, within specifications, without additional cost (excluding any expansion of the server). The on-demand pricing model for cloud services requires a complete paradigm shift, which includes both benefits and challenges for transportation agencies. There are few up-front costs for the cloud, as no servers need to be purchased. On-demand services do not require long-term commitments. Storage and processing are automatically scaled up and down based on requirements and demand. This flexibility adds complexity to procurement because the cost is dynamic; each process, query, or analysis has a direct cost. Cloud service providers may also increase the price of services when demand is high and decrease prices when demand is low. As such, agencies should consider and account for these cost fluctuations prior to developing cloud environments, data pipelines, and data products. Additionally, agencies should closely monitor and manage cloud environments to balance needs, performance, and costs, and they should redesign systems as necessary to minimize variability. There are services on the cloud to help users monitor and manage their system resources and—indirectly—their costs.
Cloud providers also offer a series of features to limit the risk of uncontrolled costs; they allow limits on computing resource use per month, week, or day to be implemented. They also offer guaranteed steady prices for systems that need to run continuously for several months or years. Overall, dynamic pricing means that cloud systems need to be monitored more closely than on-premises systems, processing and analyses need to be run at opportune times, and systems need to be designed to limit cost variability.

To discuss the various costs associated with a cloud data environment and the TIM big data pipelines developed in this project, a generic cloud data environment is illustrated in Figure 43. Major components of a cloud data environment include the following, and each of these components will come at a cost to agencies:

• Data sources, both internal and external;
• Data storage (including storage of raw, managed, and curated datasets in a data lake), processing, and analysis; and
• Data products, which may involve business intelligence, dashboards, reports, or APIs.
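Because each query and process is metered, monthly cloud spend can be reasoned about as simple arithmetic over usage, then checked against a configured cap. The sketch below illustrates this; the per-GB and per-query rates and the usage figures are illustrative assumptions, not any provider's actual pricing.

```python
# Hypothetical sketch of on-demand cost estimation and a budget check.
# The per-GB and per-query rates below are illustrative assumptions,
# not actual cloud provider pricing.

def monthly_cost(storage_gb, gb_rate, queries, query_rate):
    """Estimate one month's on-demand bill from metered usage."""
    return storage_gb * gb_rate + queries * query_rate

def budget_status(cost, cap):
    """Compare metered spend against a configured monthly cap."""
    return "over budget" if cost > cap else "within budget"

cost = monthly_cost(storage_gb=250, gb_rate=0.023,
                    queries=40_000, query_rate=0.0004)
print(f"${cost:.2f}/month, {budget_status(cost, cap=500)}")
```

In practice, the same threshold logic is what provider budget-alert features implement: spend is tracked continuously and compared against per-day, per-week, or per-month caps.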

This chapter provides information on the various costs of cloud environments. The costs of data sources; data storage, processing, and analysis; and data products are discussed in the context of the data environments and pipelines developed for this project. Due to the on-demand pricing model of cloud services, it is not practical to assign specific numeric values; the costs are highly variable, and they may quickly become outdated. Rather, estimated cost ranges for various components of the cloud environment/data pipelines are provided. These estimated cost ranges are categorized and defined in Table 20.

Figure 43. Example cloud data environment.

Table 20. Estimated cost ranges for components of cloud data environment.

• $—Low: Estimated to cost less than $500 a month. This includes basic system access and a low level of analysis or processing with more limited resources.
• $$—Medium: Estimated to cost between $500 and $1,500 a month. This is a more robust solution that involves more on-demand data storage and analytic resources, which are used to continually conduct real-time processes.
• $$$—High: Estimated to cost over $1,500 a month. The cost for these environments can begin to vary quickly and drastically based on dedicated resources and storage and the use of curated data (i.e., data made immediately available for analysis and processing in more costly resources, with the benefit of operating significantly faster than other cloud-based analyses).

5.1 Estimated Cost of Data Sources for TIM Big Data Environments and Pipelines

While most of the data sources used in the TIM big data pipelines were available for no charge, two sources of data—the weather API and CV data—were purchased for a fee. Most transportation agencies are accustomed to purchasing data from third-party vendors; for example, this can include probe vehicle data, which are used by most agencies. Increasingly, agencies are beginning to evaluate the value of CV data. The small sample of CV data used for this study (i.e., one month of data from Phoenix, Arizona) was negotiated for use in the research project only. This sample is not representative of a CV dataset that would be of interest to many transportation agencies; larger geographic and time windows would be desired for most applications.

A weather API provides a more robust and detailed source of weather data than most agencies have available in-house. While the weather API used for the data pipelines in this project is no longer available, there are many alternative weather APIs available, each of which offers its own type of services, data, and pricing structure. The team conducted a review of several weather APIs and compiled the following as example costs of weather data:

• Historical weather API—examples from two different service providers:
  – $150 per month includes 5,000 calls per day (one year back).
  – $180 per month includes 15,000 calls per day (three years back).
• Pay-as-you-call API:
  – 1,000 API calls per day for free; $0.0015 per call over the daily limit.
• Subscription plans:
  – There are no limits on the number of API calls; users pay for a subscription according to the actual use of the product.
• Professional subscription plans:
  – Fixed price per month and API call limits (number of API calls per minute and number of API calls per month). A few examples from one service provider:
    ◾ No monthly charge—60 calls per minute; 1,000,000 calls per month.
    ◾ $40 per month—600 calls per minute; 10,000,000 calls per month.
    ◾ $180 per month—3,000 calls per minute; 100,000,000 calls per month.

For a small fee, transportation agencies can have access to historical and real-time weather information to use in data pipelines. Agencies are encouraged to review the various offerings and select the service that balances the requirements of the application (e.g., types of data provided, historical versus real-time data, call frequency) and the cost of the service.
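The pay-as-you-call structure lends itself to a quick back-of-the-envelope projection before committing to a provider. The sketch below uses the quoted terms (1,000 free calls per day, $0.0015 per call beyond that); the polling intervals are assumed examples.

```python
# Cost sketch for the pay-as-you-call weather API terms quoted above:
# 1,000 free API calls per day, then $0.0015 per additional call.

FREE_CALLS_PER_DAY = 1_000
OVERAGE_RATE = 0.0015  # dollars per call beyond the free tier

def daily_cost(calls):
    """Cost of one day's calls after the free allowance."""
    return max(0, calls - FREE_CALLS_PER_DAY) * OVERAGE_RATE

def monthly_cost(calls_per_day, days=30):
    """Projected monthly cost at a steady daily call volume."""
    return daily_cost(calls_per_day) * days

# Assumed example: a pipeline polling every 30 seconds (2,880 calls/day)
print(f"${monthly_cost(2_880):.2f}")
```

At a 30-second polling interval the month comes to roughly $85, while a 60-second interval (1,440 calls per day) would cost roughly $20—a direct illustration of how call frequency drives cost under this pricing structure.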
5.2 Estimated Cost of Data Storage, Processing, and Analysis for TIM Big Data Environments and Pipelines

As shown in Figure 43, data storage, processing, and analytics are the primary components of a big data environment or pipeline. This section provides estimated cost ranges for the data storage, processing, and analytics components of the big data pipelines developed for this project.

5.2.1 Estimated Data Storage Costs

An example of a data lake configuration is shown in Figure 44, which illustrates different storage zones for raw, managed, curated, transient, and archived data.

• Raw data are created by data ingestion processes. These data are untouched (i.e., access is limited to data administrators), retained indefinitely, and immutable. The raw data are used to regenerate downstream datasets when needed.
• Managed data are created from the raw data by data processing pipelines. They are reformatted (more open) and often serialized (flattened). The managed data are augmented by adding quality and provenance labels.
• Curated data are created from the managed datasets by data processing pipelines. These data are enriched with additional information, can be considered “trusted,” and can be queried for easy consumption. Curated datasets are organized to maximize data value and delivery. They are often developed for specific use cases. Typically, curated, readily available data require more resources and processing capability, which in turn increases the cost of these datasets.
• The transient data zone holds data that are in transit between zones and stored in memory. This includes cached data and data used in cloud functions. Transient data need to be audited and cleaned up periodically, and they are removed as part of the data engineering process once the data are no longer needed.
• The archived data zone stores aged, curated data that are not needed for quick and efficient query (although the data can be restored for this purpose).

Figure 44. Cloud data storage in a data lake.

Estimated cost ranges for these types of storage (where relevant) are shown in Table 21 in terms of each of the use cases/data pipelines developed for this project.

5.2.2 Estimated Data Processing and Analysis Costs

Estimated cost ranges for data processing are shown in Table 22 in terms of each of the use cases/data pipelines developed for this project.

5.3 Estimated Costs of Data Products

Estimated cost ranges for data products for each of the use cases/data pipelines developed for this project are shown in Table 23.
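The zone structure described above can be expressed as a set of storage prefixes plus a lifecycle rule that moves aged curated data into the archive zone. This is a minimal sketch under assumptions: the bucket layout, record shape, and 90-day threshold are illustrative, not the project's configuration.

```python
# Illustrative data-lake zone layout and an archive lifecycle rule.
# Bucket/prefix names and the 90-day retention threshold are assumed.
from datetime import date, timedelta

ZONES = {
    "raw": "s3://tim-lake/raw/",              # immutable ingestion output
    "managed": "s3://tim-lake/managed/",      # reformatted, quality-labeled
    "curated": "s3://tim-lake/curated/",      # enriched, query-ready
    "transient": "s3://tim-lake/transient/",  # in-transit, periodically purged
    "archive": "s3://tim-lake/archive/",      # aged curated data
}

def archive_candidates(records, today, max_age_days=90):
    """Select curated records old enough to move to the archive zone.

    `records` is an iterable of (key, last_queried_date) pairs."""
    cutoff = today - timedelta(days=max_age_days)
    return [key for key, last_queried in records if last_queried < cutoff]
```

In a managed cloud environment, the same policy would usually be configured as a storage lifecycle rule rather than hand-written code, but the cost logic is the same: data that no longer need fast query are shifted to a cheaper tier.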

Table 21. Estimated cost ranges of storage for the TIM big data pipelines.

Use Case 1
• Real-time curated storage—$$ to $$$ (Medium to High): The curated-data storage environment requires ongoing resources to support interaction with the data (including data queries). As such, the cost of this environment depends on the amount of data being stored and the resources needed for analysis. Having these resources readily available increases the cost of these data, as the data remain in a ready state. To lower the cost and size of the data within this more resource-intensive environment, any records with a timestamp older than 12 hours were automatically removed.
• Archival storage—$ (Low): The raw data are stored in an archival/long-term data store. While it primarily meets retention and disaster recovery needs, it functions as a document store that does not support resource-intensive queries. As such, this data store is more cost-effective, as speed of response is not a priority. Archival storage was also used for the output files, supporting retention and review of past incidents that fell outside the 12-hour curated-data storage time frame.

Use Case 2
• Geofence streaming database—$ (Low): The data in this database are stored in memory for processing only. As such, the actual database storage cost is negligible because the results are held in memory for immediate use in queries.
• Crash document database—$ (Low): The crash document database stores the incoming and updated records that are continually refined by the data pipeline processes. It does not serve as a recordkeeping location; crash records are removed after each crash is closed.
• Archival storage—$ (Low): The updated crash documents are placed in archival storage for historical analysis of TIM performance (not intended for rapid query of the data).

Use Case 3
• Archival storage—$ (Low): All crash data were archived in the data lake in “raw” form (as received from the 10 states). While more than 10 million crashes is a large amount of data, in comparison to other use cases, this dataset is small (gigabytes of data).
• Curated storage—$ (Low): Even after the secondary crashes were standardized and enriched with ARNOLD and weather data, the data were still relatively small (gigabytes) and did not need to be available for real-time processing, which kept the cost low.

Use Case 4
• Managed storage—$$ (Medium): Only one storage mechanism was needed (the managed dataset) because the remainder of the data for the dashboard were stored within a GIS platform (covered under data product costs). The data packets for CV data required a moderate amount of storage. While the amount of data was large in comparison to the other use cases, the data were still manageable (gigabytes) because the time and area of analysis (one month of data from Phoenix, Arizona) were small. However, storage costs could increase significantly as the geographic area and data retention needs increase.
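The 12-hour retention rule that keeps Use Case 1's curated store small can be sketched as a simple pruning pass. The record shape and timestamps below are assumed for illustration; many cloud document stores offer a native TTL (time-to-live) index that performs this expiry automatically.

```python
# Sketch of the 12-hour curated-data retention rule for Use Case 1:
# records older than 12 hours drop out of the costly real-time tier.
# Record shape is assumed; a managed TTL index would do this natively.

CURATED_TTL_S = 12 * 3600  # 12 hours in seconds

def prune_curated(records, now_s):
    """Keep only records whose timestamp falls within the 12-hour window."""
    return [r for r in records if now_s - r["ts"] <= CURATED_TTL_S]

recent = {"id": "inc-1", "ts": 100_000}
stale = {"id": "inc-2", "ts": 100_000 - 13 * 3600}
print(prune_curated([recent, stale], now_s=100_500))
```

Only the recent record survives the prune; the stale one would instead live on in the cheaper archival store, matching the split described in Table 21.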

Table 22. Estimated cost ranges for processing and analysis within the TIM big data pipelines.

Use Case 1
• Data ingestion and processing—$ (Low): The processing functions included the ingestion and enhancement of the data (i.e., triggered events that performed the Collect, Snap, and Enrich functions, which prepared the data for use). Data were also transformed into a consistent format. The initial data pipeline was designed to make dedicated resources available (regardless of need) to achieve minimal latency in processing the data. A modified pipeline, which focused on resource scalability rather than dedicated resources, effectively reduced the cost of the pipeline while achieving the desired processing/update speed.
• Real-time analysis and query of curated data—$ to $$$ (Low to High): The data needed to be continuously readied for query, analytics, and display. While queries on the data were performed quickly and at fractional costs per request, costs could vary and change rapidly based on the number of users and requests. Costs will increase or decrease with the automatic scaling of resources based on requests. Outside of testing and validating this pipeline, queries were kept to a minimum, which kept the cost associated with the pipeline low. However, use of the data pipeline in a production environment would increase its cost.

Use Case 2
• Real-time data processing and analysis—$$ (Medium): This advanced spatial search was resource-intensive; it required a constant search and comparison across the data sources to identify spatial relationships within a geofence area. While each step was low cost, the volume of data and records being processed through the system increased the overall cost of the data pipeline. The use of tabular relationships and simple linking of events required little processing. During development, even as the volume of records increased across the initial pipeline, costs remained negligible until the geofence processing was enabled.

Use Case 3
• Historical query and analysis—$ (Low): The ability to perform a desktop-level analysis, even when conducted in the cloud, resulted in a low-cost option for both the data query and analytics. Leveraging the cloud environment provided an elevated level of flexibility; resources could be managed and scaled up to perform the spatial-temporal and cluster analyses without requiring a local desktop machine or a robust server (which would have increased resource needs). This intensive, yet intermittent, analysis in the cloud provided a powerful environment at a low cost.

Use Case 4
• Simulated real-time data query and analytics—$$ to $$$ (Medium to High): Comparisons and relationships were rapidly performed across locations, with variations in the search parameters. As the area of the search increased, the required resources also increased. The functions contained within the cloud environment included the analytic capability to process and enrich the data. While this involved many individual records, the project area and analysis time frame were kept small (i.e., one month of data from Phoenix, Arizona) and therefore required fewer resources. The larger the search area required to match CV data with a crash event, the higher the associated resource needs. Several variations were used to determine the geofence area that provided an optimal balance between successful matches and needed resources; a half-mile boundary around the crash locations yielded the best results in matching crashes with CV data. A date/time filter was used as part of the real-time simulation to limit the number of records that needed to be queried to find successful matches, which lowered the cost by significantly reducing the number of records that had to be searched.
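The half-mile geofence matching described for Use Case 4 can be sketched as a distance-plus-time filter. Field names and the 15-minute time window below are assumptions for illustration; beyond the half-mile radius, the report does not specify the filter's exact parameters.

```python
# Sketch of the Use Case 4 geofence match: pair CV data points with a
# crash when they fall within a half-mile radius and a time window.
# Field names and the 15-minute window are illustrative assumptions.
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_MI = 3958.8

def haversine_mi(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two latitude/longitude points."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_MI * asin(sqrt(a))

def match_cv_to_crash(crash, cv_points, radius_mi=0.5, window_s=900):
    """Return CV points inside the geofence and time window around a crash."""
    return [
        p for p in cv_points
        if abs(p["ts"] - crash["ts"]) <= window_s
        and haversine_mi(crash["lat"], crash["lon"], p["lat"], p["lon"]) <= radius_mi
    ]
```

Note the order of the filter: the cheap time-window check (the date/time filter described above) runs before the more expensive distance computation, which is exactly the record-reduction strategy that lowered the pipeline's cost. Widening `radius_mi` increases successful matches but also the resources consumed per search.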

Table 23. Estimated cost ranges for TIM big data products.

Use Case 1
• GIS platform online and hosted layer—$ (Low): The use of an online GIS platform allowed the data to be stored and readied for immediate display on the dashboard. The platform uses a cost system based on credits; using the platform’s credits for hosting and analysis provided an alternative option to store, query, and display the data. As an example, $500 a year can provide between 350 and 500 credits. The online dashboard for this use case required only about 0.4 credits per day (or about 12 credits per month).

Use Case 2
• Not applicable: No data products were created for this use case.

Use Case 3
• Results of analysis—$ (Low): The data product for this use case comprised the results of the analysis of the curated dataset (i.e., the same cost as the analysis component).

Use Case 4
• Online GIS and hosted layer—$ (Low): Using the GIS platform as a hosted feature allowed the final output to be readily queried and spatially available to users. The platform’s credit-based hosting feature provided a cost-effective method of data hosting and a readily available dashboard.

Cost Alternative Example

For Use Case 1, the pipeline was implemented for free navigation app data from Minnesota, Utah, and Massachusetts. The pipeline considered only the TIM-related navigation app events (i.e., minor accident, major accident, hazard on road, object on road, roadkill, and car stopped in road). The team observed that there are fewer TIM-related events than other events (e.g., construction, rain or snow). Consequently, instead of having to split the processing across several functions, only one function was needed to collect the continuously updated records and deduplicate them, snap them to a data point created along ARNOLD, calculate elapsed time, and push those records to a GIS platform. This alternative approach removed a costly in-memory database, which had cost $300–$400 per month to store 25 GB of ARNOLD data and provide 40 GB of processing. Given these high costs, opting for a pay-as-you-go model allowed Use Case 1 to be developed and the navigation app data to be cached in a NoSQL database in the cloud, designed to handle millions of requests from millions of locations. This database allows users to pay only for executed and stored queries. As such, the ARNOLD data were cached, and the necessary pipeline functions were run against the database while maintaining processing speed and reducing costs.
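The caching pattern behind this cost alternative—pay only for lookups actually executed, and never recompute a snap for a location already seen—can be illustrated with simple memoization. The in-memory cache here stands in for the cloud NoSQL table, and the grid-rounding "snap" is a placeholder, not the project's ARNOLD matching logic.

```python
# Minimal illustration of the caching idea above: memoize ARNOLD snap
# results so repeated lookups for the same location avoid recomputation.
# An in-memory cache stands in for the pay-per-request NoSQL table, and
# the grid-rounding "snap" is a placeholder for the real road-network match.
from functools import lru_cache

@lru_cache(maxsize=None)
def snap_to_arnold(lat, lon):
    """Placeholder snap: round to a coarse grid to mimic a cached lookup."""
    return (round(lat, 3), round(lon, 3))

# Repeated events at one location are served from cache, not recomputed
snap_to_arnold(33.44857, -112.07404)
snap_to_arnold(33.44857, -112.07404)
print(snap_to_arnold.cache_info())  # one miss, then one hit
```

Because navigation app events cluster at incident locations, most lookups after the first are cache hits, which is what makes a pay-per-request store cheaper than a continuously provisioned in-memory database for this workload.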


Big data is evolving and maturing rapidly, and much attention has been focused on the opportunities that big data may provide state departments of transportation (DOTs) in managing their transportation networks. Using big data could help state and local transportation officials achieve system reliability and safety goals, among others. However, challenges for DOTs include how to use the data and in what situations, such as how and when to access data, identify staff resources to prepare and maintain data, or integrate data into existing or new tools for analysis.

NCHRP Research Report 1071: Application of Big Data Approaches for Traffic Incident Management, from TRB's National Cooperative Highway Research Program, applies the guidelines presented in NCHRP Research Report 904: Leveraging Big Data to Improve Traffic Incident Management to validate the feasibility and value of the big data approach for Traffic Incident Management (TIM) among transportation and other responder agencies.

Supplemental to the report are Appendix A through Appendix P, which detail findings from traditional and big data sources for the TIM use cases; a PowerPoint presentation of the research results; and an Implementation Memo.
