CHAPTER 5

Estimated Costs of Cloud Environments and Data Pipelines

Cloud-based platforms represent a change in how transportation agencies procure and pay for data systems, storage, and processing. Cloud services, whether shared hardware, software, or information resources, are provided as a metered service over the internet. The simplest pricing model, and the default option, for cloud services is the "on-demand" or "pay-per-use" model. It is important to understand the distinction between procuring cloud services and procuring traditional on-premises systems. Traditionally, agencies would scope and procure physical servers at a predetermined price based on predefined requirements. Agencies could then use these servers to run as many processes or analytics as possible, within specifications, at no additional cost (excluding any expansion of the server). The on-demand pricing model for cloud services requires a complete paradigm shift, which brings both benefits and challenges for transportation agencies. There are few up-front costs for the cloud because no servers need to be purchased. On-demand services do not require long-term commitments, and storage and processing are automatically scaled up and down based on requirements and demand. This flexibility adds complexity to procurement because the cost is dynamic; each process, query, or analysis has a direct cost. Cloud service providers may also increase the price of services when demand is high and decrease prices when demand is low. As such, agencies should consider and account for these cost fluctuations before developing cloud environments, data pipelines, and data products. Additionally, agencies should closely monitor and manage cloud environments to balance needs, performance, and costs, and they should redesign systems as necessary to minimize variability. There are services on the cloud to help users monitor and manage their system resources and, indirectly, their costs.

Cloud providers also offer features to limit the risk of uncontrolled costs; for example, they allow limits to be set on computing resource use per month, week, or day. They also offer guaranteed steady prices for systems that need to run continuously for several months or years. Overall, dynamic pricing means that cloud systems need to be monitored more closely than on-premises systems, processing and analyses need to be run at opportune times, and systems need to be designed to limit cost variability.

To discuss the various costs associated with a cloud data environment and the TIM big data pipelines developed in this project, a generic cloud data environment is illustrated in Figure 43. Major components of a cloud data environment include the following, and each of these components comes at a cost to agencies:

• Data sources, both internal and external;
• Data storage (including storage of raw, managed, and curated datasets in a data lake), processing, and analysis; and
• Data products, which may involve business intelligence, dashboards, reports, or APIs.
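The monthly, weekly, or daily spending limits described above amount to comparing a running spend total against a budget. A minimal sketch of that monitoring logic, assuming a simple linear projection of month-end spend (the budget and usage figures are illustrative, not tied to any specific provider):

```python
# Hedged sketch: projecting month-end cloud spend from month-to-date usage
# so an agency can act before an on-demand bill exceeds its budget.
# The dollar amounts below are illustrative assumptions only.

def projected_monthly_cost(spend_to_date: float, day_of_month: int,
                           days_in_month: int = 30) -> float:
    """Linear projection of month-end spend from month-to-date spend."""
    if day_of_month < 1:
        raise ValueError("day_of_month must be >= 1")
    return spend_to_date * days_in_month / day_of_month

def over_budget(spend_to_date: float, day_of_month: int,
                monthly_budget: float, days_in_month: int = 30) -> bool:
    """True if the current run rate would exceed the monthly budget."""
    return projected_monthly_cost(spend_to_date, day_of_month,
                                  days_in_month) > monthly_budget

# Example: $300 spent by day 10 projects to $900 over a 30-day month,
# which would trip a $750 budget alert.
print(projected_monthly_cost(300.0, 10))   # 900.0
print(over_budget(300.0, 10, 750.0))       # True
```

In practice, the major cloud providers offer managed budget and alerting services that implement this kind of check; the sketch only illustrates the idea.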
Application of Big Data Approaches for Traffic Incident Management

This chapter provides information on the various costs of cloud environments. The costs of data sources; data storage, processing, and analysis; and data products are discussed in the context of the data environments and pipelines developed for this project. Due to the on-demand pricing model of cloud services, it is not practical to assign specific numeric values; the costs are highly variable, and they may quickly become outdated. Rather, estimated cost ranges for various components of the cloud environment/data pipelines are provided. These estimated cost ranges are categorized and defined in Table 20.

[Figure 43. Example cloud data environment.]

Table 20. Estimated cost ranges for components of cloud data environment.

$ (Low): Estimated to cost less than $500 a month. This includes basic system access and a low level of analysis or processing with more limited resources.

$$ (Medium): Estimated to cost between $500 and $1,500 a month. This is a more robust solution that involves more on-demand data storage and analytic resources, which are used to continually conduct real-time processes.

$$$ (High): Estimated to cost over $1,500 a month. The cost for these environments can begin to vary quickly and drastically based on dedicated resources and storage and the use of curated data (i.e., data made immediately available for analysis and processing in more costly resources, with the benefit of operating significantly faster than other cloud-based analyses).

5.1 Estimated Cost of Data Sources for TIM Big Data Environments and Pipelines

While most of the data sources used in the TIM big data pipelines were available at no charge, two sources of data, the weather API and CV data, were purchased for a fee. Most transportation agencies are accustomed to purchasing data from third-party vendors; for example, this can include probe vehicle data, which are used by most agencies. Increasingly, agencies are beginning to evaluate the value of CV data. The small sample of CV data used for this study (i.e., one month of data from Phoenix, Arizona) was negotiated for use in this research project only.
This sample is not representative of a CV dataset that would be of interest to many transportation agencies; larger geographic and time windows would be desired for most applications.

A weather API provides a more robust and detailed source of weather data than most agencies have available in-house. While the weather API used for the data pipelines in this project is no longer available, there are many alternative weather APIs, each of which offers its own type of services, data, and pricing structure. The team conducted a review of several weather APIs and compiled the following example costs of weather data:

• Historical weather API (examples from two different service providers):
  – $150 per month includes 5,000 calls per day (one year back).
  – $180 per month includes 15,000 calls per day (three years back).
• Pay-as-you-call API:
  – 1,000 API calls per day for free; $0.0015 per call over the daily limit.
• Subscription plans:
  – There are no limits on the number of API calls; users pay for a subscription according to the actual use of the product.
• Professional subscription plans:
  – Fixed price per month and API call limits (number of API calls per minute and number of API calls per month). A few examples from one service provider:
    ▪ No monthly charge: 60 calls per minute; 1,000,000 calls per month.
    ▪ $40 per month: 600 calls per minute; 10,000,000 calls per month.
    ▪ $180 per month: 3,000 calls per minute; 100,000,000 calls per month.

For a small fee, transportation agencies can have access to historical and real-time weather information to use in data pipelines. Agencies are encouraged to review the various offerings and select the service that balances the requirements of the application (e.g., types of data provided, historical versus real-time data, call frequency) against the cost of the service.
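The pay-as-you-call tier quoted above (1,000 free calls per day, then $0.0015 per call) can be turned into a simple cost estimate, which helps when comparing it against the fixed-price plans. A minimal, provider-agnostic sketch using only the numbers from the example tiers:

```python
# Sketch of the "pay-as-you-call" weather API pricing quoted above:
# 1,000 free calls per day, then $0.0015 per additional call.
# Provider-agnostic; the tier numbers come from the examples in the text.

FREE_CALLS_PER_DAY = 1_000
PRICE_PER_EXTRA_CALL = 0.0015  # dollars

def daily_cost(calls: int) -> float:
    """Cost of one day's API usage under the pay-as-you-call plan."""
    extra = max(0, calls - FREE_CALLS_PER_DAY)
    return extra * PRICE_PER_EXTRA_CALL

def monthly_cost(calls_per_day: int, days: int = 30) -> float:
    """Cost of a month of steady usage."""
    return daily_cost(calls_per_day) * days

# 5,000 calls/day -> 4,000 billable calls/day -> about $180/month,
# comparable to the fixed historical-API plans quoted above.
print(monthly_cost(5_000))
```

At roughly 5,000 calls per day, the pay-as-you-call plan lands near the $150 to $180 per month range of the fixed historical-API plans, so the choice between them turns mainly on call volume and its variability.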
5.2 Estimated Cost of Data Storage, Processing, and Analysis for TIM Big Data Environments and Pipelines

As shown in Figure 43, data storage, processing, and analytics are the primary components of a big data environment or pipeline. This section provides estimated cost ranges for the data storage, processing, and analytics components of the big data pipelines developed for this project.

5.2.1 Estimated Data Storage Costs

An example of a data lake configuration is shown in Figure 44, which illustrates different storage zones for raw, managed, curated, transient, and archived data.

• Raw data are created by data ingestion processes. These data are untouched (i.e., access is limited to data administrators), retained indefinitely, and immutable. The raw data are used to regenerate downstream datasets when needed.
• Managed data are created from the raw data by data processing pipelines. They are reformatted (more open) and often serialized (flattened). The managed data are augmented by adding quality and provenance labels.
• Curated data are created from the managed datasets by data processing pipelines. These data are enriched with additional information, can be considered "trusted," and can be queried for easy consumption. Curated datasets are organized to maximize data value and delivery. They are often developed for specific use cases. Typically, curated, readily available data require more resources and processing capability, which in turn increases the cost of these datasets.
• The transient data zone holds data that are in transit between zones and stored in memory. This includes cached data and data used in cloud functions. Transient data need to be audited and
cleaned up periodically; they are removed as part of the data engineering process once the data are no longer needed.
• The archived data zone stores aged, curated data that are not needed for quick and efficient query (although the data can be restored for this purpose).

Estimated cost ranges for these types of storage (where relevant) are shown in Table 21 for each of the use cases/data pipelines developed for this project.

5.2.2 Estimated Data Processing and Analysis Costs

Estimated cost ranges for data processing are shown in Table 22 for each of the use cases/data pipelines developed for this project.

5.3 Estimated Costs of Data Products

Estimated cost ranges for data products for each of the use cases/data pipelines developed for this project are shown in Table 23.

[Figure 44. Cloud data storage in a data lake.]
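The storage zones described above differ mainly in how long data live in each zone, which is what drives their costs. A minimal sketch of those lifecycle rules as a cleanup-job check, assuming hypothetical retention periods (only "raw data are retained indefinitely" comes from the text; the transient-zone limit is an illustrative assumption):

```python
# Hedged sketch of the data lake zones as retention rules a periodic
# cleanup job might apply. Zone names follow the description above; the
# one-day transient limit is an illustrative assumption.

RETENTION_DAYS = {
    "raw": None,        # immutable, retained indefinitely
    "managed": None,    # regenerable from raw; kept while pipelines need it
    "curated": None,    # kept while the use case is active
    "transient": 1,     # audited and cleaned up once no longer needed
    "archive": None,    # aged curated data, restorable on demand
}

def should_delete(zone: str, age_days: int) -> bool:
    """True if the periodic cleanup job should remove data of this age."""
    limit = RETENTION_DAYS[zone]
    return limit is not None and age_days > limit

print(should_delete("transient", 3))  # True: stale transient data
print(should_delete("raw", 3650))     # False: raw data are never deleted
```

In a real data lake these rules would typically be expressed as provider lifecycle policies on the storage buckets rather than custom code.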
Table 21. Estimated cost ranges of storage for the TIM big data pipelines.

Use Case 1
• Real-time curated: $$ to $$$ (Medium to High)
  – The curated-data storage environment requires ongoing resources to support interaction with the data (including data queries). As such, the cost of this environment depends on the amount of data being stored and the resources needed for analysis. Having these resources readily available increases the cost of these data, as the data remain in a ready state.
  – To lower the cost and size of the data within this more resource-intensive environment, any records with a timestamp older than 12 hours were automatically removed.
• Archival: $ (Low)
  – The raw data are stored in an archival/long-term data store. This data store primarily meets retention and disaster recovery needs; it functions as a document store and does not support resource-intensive queries. As such, it is more cost-effective, as speed of response is not a priority.
  – Archival storage was also used for the output files, for retention and review of past incidents that fell outside of the 12-hour curated-data storage time frame.

Use Case 2
• Geofence streaming database: $ (Low)
  – The data in this database are stored in memory for processing only. As such, the actual database storage cost is negligible because the results are held in memory for immediate use in query.
• Crash document database: $ (Low)
  – The crash document database stores the incoming and updated records that are continually refined by the data pipeline processes. It does not serve as a recordkeeping location; crash records are removed after each crash is closed.
• Archival: $ (Low)
  – The updated crash documents are placed in archival storage for historical analysis of TIM performance (not intended for rapid query of the data).

Use Case 3
• Archival: $ (Low)
  – All crash data were archived in the data lake in "raw" form (as received from the 10 states). While more than 10 million crashes is a large amount of data, this dataset is small in comparison to the other use cases (gigabytes of data).
• Curated: $ (Low)
  – Even after the secondary crashes were uniformized and enriched with ARNOLD and weather data, the data were still relatively small (gigabytes) and did not need to be available for real-time processing, which should keep the cost low.

Use Case 4
• Managed: $$ (Medium)
  – Only one storage mechanism was needed (a managed dataset) because the remainder of the data for the dashboard were stored within a GIS platform (covered under data product costs).
  – The data packets for CV data required a moderate amount of storage. While the amount of data was large in comparison to the other use cases, the data were still manageable (gigabytes) because the time and area of analysis (one month of data from Phoenix, Arizona) were small. However, storage costs could increase significantly as the geographic area and data retention needs increase.
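The 12-hour cutoff applied to the Use Case 1 curated store in Table 21 amounts to a time-to-live filter on record timestamps. A minimal sketch, assuming a hypothetical simplified record layout (older records remain available in archival storage):

```python
# Sketch of the 12-hour cutoff used for Use Case 1's curated store: records
# older than 12 hours are dropped from the costlier query-ready environment.
# The record layout is a hypothetical simplification.
from datetime import datetime, timedelta

CURATED_WINDOW = timedelta(hours=12)

def purge_curated(records, now):
    """Keep only records whose timestamp falls within the 12-hour window."""
    return [r for r in records if now - r["timestamp"] <= CURATED_WINDOW]

now = datetime(2023, 6, 1, 18, 0)
records = [
    {"id": "a", "timestamp": datetime(2023, 6, 1, 10, 0)},  # 8 h old: kept
    {"id": "b", "timestamp": datetime(2023, 6, 1, 2, 0)},   # 16 h old: dropped
]
print([r["id"] for r in purge_curated(records, now)])  # ['a']
```

Many managed document stores can enforce this kind of window natively through a TTL setting on a timestamp field, which avoids running a purge job at all.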
Table 22. Estimated cost ranges for processing and analysis within the TIM big data pipelines.

Use Case 1
• Data ingestion and processing: $ (Low)
  – The processing functions included the ingestion and enhancement of the data (i.e., triggered events that performed the Collect, Snap, and Enrich functions, which prepared the data for use). Data were also transformed into a consistent format.
  – The initial data pipeline was designed to make dedicated resources available (regardless of need) to achieve minimal latency in processing the data.
  – A modified pipeline, which focused on resource scalability rather than dedicated resources, effectively reduced the cost of the pipeline while achieving the desired processing/update speed.
• Real-time analysis and query of curated data: $ to $$$ (Low to High)
  – The data needed to be continuously readied for query, analytics, and display.
  – While queries on the data were performed quickly and at fractional costs per request, costs could vary and change rapidly based on the number of users and requests. Costs will increase or decrease with the automatic scaling of resources based on requests.
  – Outside of testing and validating this pipeline, queries were kept to a minimum, which kept the cost associated with the pipeline low. However, use of the data pipeline in a production environment would increase its cost.

Use Case 2
• Real-time data processing and analysis: $$ (Medium)
  – This advanced spatial search was resource-intensive; it required a constant search and comparison across the data sources to identify spatial relationships within a geofence area.
  – While each step was low cost, the volume of data and records being processed through the system increased the overall cost of the data pipeline.
  – The use of tabular relationships and simple linking of events required little processing. During development, even as the volume of records increased across the initial pipeline, costs remained negligible until the geofence processing was enabled.

Use Case 3
• Historical query and analysis: $ (Low)
  – The ability to perform a desktop-level analysis, even when conducted in the cloud, resulted in a low-cost option for both the data query and the analytics.
  – Leveraging the cloud environment provided an elevated level of flexibility; resources could be managed and scaled up to perform the spatial-temporal and cluster analyses without requiring a local desktop machine or a robust server (which would have increased costs). This intensive, yet intermittent, analysis in the cloud provided a powerful environment at a low cost.

Use Case 4
• Simulated real-time data query and analytics: $$ to $$$ (Medium to High)
  – Comparisons and relationships were rapidly performed across locations, with variations in the search parameters. As the area of the search increased, the required resources also increased.
  – The functions contained within the cloud environment included the analytic capability to process and enrich the data. While this involved many individual records, the project area and analysis time frame were kept small (i.e., one month of data from Phoenix, Arizona) and therefore required fewer resources. The larger the search area required to match CV data with a crash event, the higher the associated resource needs. Several variations were used to determine the geofence area that provided an optimal balance between successful matches and needed resources. A half-mile boundary around the crash locations yielded the best results in matching crashes with CV data.
  – A date/time filter was used as part of the real-time simulation to limit the number of records that needed to be queried to find successful matches, which lowered the cost because it significantly reduced the number of records that had to be searched.
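The Use Case 4 matching step in Table 22, a half-mile boundary around each crash plus a date/time filter, can be sketched as follows. Haversine distance stands in for the platform's spatial search, and the 30-minute window, coordinates, and record layout are illustrative assumptions (the text specifies only the half-mile boundary and the existence of a date/time filter):

```python
# Hedged sketch of the Use Case 4 matching step: a half-mile geofence
# around each crash location plus a date/time filter limits which CV
# records must be compared. All sample values are illustrative.
import math
from datetime import datetime, timedelta

HALF_MILE_M = 804.67                  # half mile in meters
TIME_WINDOW = timedelta(minutes=30)   # assumed width of the date/time filter

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    r = 6_371_000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def matches(crash, cv_records):
    """CV records inside the time window and within a half mile of the crash."""
    return [r for r in cv_records
            if abs(r["time"] - crash["time"]) <= TIME_WINDOW
            and haversine_m(crash["lat"], crash["lon"],
                            r["lat"], r["lon"]) <= HALF_MILE_M]

crash = {"lat": 33.4484, "lon": -112.0740, "time": datetime(2023, 6, 1, 12, 0)}
cv = [
    {"lat": 33.4500, "lon": -112.0740, "time": datetime(2023, 6, 1, 12, 10)},  # ~180 m away
    {"lat": 33.5300, "lon": -112.0740, "time": datetime(2023, 6, 1, 12, 10)},  # ~9 km away
]
print(len(matches(crash, cv)))  # 1
```

Applying the cheap time filter before the distance computation mirrors the cost observation in Table 22: shrinking the candidate set first reduces how many records the resource-intensive spatial comparison must touch.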
Table 23. Estimated cost ranges for TIM big data products.

Use Case 1
• GIS platform online and hosted layer: $ (Low)
  – The use of an online GIS platform allowed the data to be stored and readied for immediate display on the dashboard. The platform uses a cost system based on credits. Using the platform's credits for hosting and analysis provided an alternative option to store, query, and display the data. As an example, $500 a year can provide between 350 and 500 credits. The online dashboard for this use case required only about 0.4 credits per day (or about 12 credits per month).

Use Case 2
• Not applicable
  – No data products were created for this use case.

Use Case 3
• Results of analysis: $ (Low)
  – The data product for this use case consisted of the results of the analysis of the curated dataset (i.e., the same cost as the analysis component).

Use Case 4
• Online GIS and hosted layer: $ (Low)
  – Using the GIS platform as a hosted feature allowed the final output to be readily queried and spatially available to users. The platform's credit-based hosting feature provided a cost-effective method of data hosting and a readily available dashboard.

Cost Alternative Example

For Use Case 1, the pipeline was implemented for free navigation app data from Minnesota, Utah, and Massachusetts. The pipeline considered only the TIM-related navigation app events (i.e., minor accident, major accident, hazard on road, object on road, roadkill, and car stopped in road). The team observed that there are fewer TIM-related events compared to other events (e.g., construction, rain or snow). Consequently, instead of having to split the processing across several functions, only one function was needed to collect the continuously updated records, deduplicate them, snap them to a data point created along ARNOLD, calculate elapsed time, and push those records to a GIS platform. This alternative approach removed a costly in-memory database, which cost $300 to $400 per month to store 25 GB of ARNOLD data and provide 40 GB of processing. Given these high costs, opting for a pay-as-you-go model allowed Use Case 1 to be developed and the navigation app data to be cached in a NoSQL database in the cloud, designed to handle millions of requests from millions of locations. This database allows users to pay only for executed and stored queries. As such, the ARNOLD data were cached, and the necessary pipeline functions were run against the database while maintaining processing speed and reducing costs.
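The trade-off behind this cost alternative, a flat $300 to $400 per month for the in-memory database versus per-request billing for the NoSQL store, can be checked with back-of-the-envelope arithmetic. The per-request price below is an illustrative assumption, not a rate quoted in this project:

```python
# Back-of-the-envelope comparison behind the cost alternative above: a flat
# $300-$400/month in-memory database versus a NoSQL store billed per request.
# The per-request price is an illustrative assumption, not a quoted rate.

IN_MEMORY_MONTHLY_MIN = 300.0   # low end of the flat cost cited above
PRICE_PER_REQUEST = 0.0000005   # assumed $0.50 per million requests

def nosql_monthly(requests_per_month: int) -> float:
    """Pay-per-request monthly cost under the assumed rate."""
    return requests_per_month * PRICE_PER_REQUEST

# Even around 100 million requests per month (about $50 at the assumed
# rate), per-request billing can undercut the flat in-memory fee.
print(nosql_monthly(100_000_000) < IN_MEMORY_MONTHLY_MIN)  # True
```

The comparison flips only at very high sustained request volumes, which is why pay-per-request made sense for a pipeline dominated by a modest stream of TIM-related events.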