
Guidebook for Managing Data from Emerging Technologies for Transportation (2020)

Chapter: Chapter 3 - Roadmap to Managing Data from Emerging Technologies for Transportation

Suggested Citation:"Chapter 3 - Roadmap to Managing Data from Emerging Technologies for Transportation." National Academies of Sciences, Engineering, and Medicine. 2020. Guidebook for Managing Data from Emerging Technologies for Transportation. Washington, DC: The National Academies Press. doi: 10.17226/25844.


This Roadmap to big data represents an organic, bottom-up approach for transportation agencies that relies on an iterative process to grow big data use cases, pilot projects, and ultimately value for an organization. An agency may embark on this Roadmap and guidance for various reasons. For example, a data source may not be effectively used or processed due to limitations in an agency's existing traditional data system (e.g., connected vehicle or crowdsourced data too big or unstructured to fit within the system). There may be a specific problem that could be addressed using data (e.g., data from a third-party scooter provider to help manage scooter drop locations). Perhaps a city is implementing a smart city project that involves building a system that will handle the requisite volume and variety of data from all partners. Whether an agency is starting from scratch with a new technology data set, has an issue or problem that might be solved with emerging technology data, is already working on a big data project, or is even looking for a new enterprise data management solution, this guidebook will be applicable. The steps and guidance outlined in this Roadmap are designed to walk an agency through the process of developing the knowledge, projects, environment, and buy-in to move incrementally from a traditional data management approach to establishing data management policies, procedures, and practices that fully meet the needs of data from emerging technologies. The steps work in conjunction with the best practices and recommendations in the Modern Big Data Management Framework.
The Roadmap allows an agency to start small at the day-to-day operational level and to expand and grow interest and use both horizontally and vertically across the organization over time, with the ultimate goal of effective organizational change.

Overview of the Roadmap

The Roadmap includes eight steps. Figure 4 illustrates them, and each step is described briefly following the figure.

• Step 1. Develop an understanding of big data. With a new data source or a new big data project (or the desire for either one) at hand, the first step involves an agency champion or champions developing a general knowledge and understanding of big data using the information presented in this step; information in the Modern Big Data Management Framework section; and additional resources referenced herein. The goal at the end of Step 1 is to have enough understanding to promote a big data approach within the organization.

• Step 2. Identify a use case and an associated pilot project. In this step, the champions and team identify a use case and an associated pilot project that will resonate with their leadership. This use case is likely something that addresses the pain points of the group/division/business unit and that cannot easily be addressed without the use of the data sets of interest. In some cases, a use case may be handed to the champions from the top down, with the charge of demonstrating value for a particular data set or project.

• Step 3. Secure buy-in from at least one person from leadership for the pilot project. In this step, the champions and team work to communicate the value of the pilot project and to secure buy-in for the project from at least one person from their leadership. This is a critical step in that, without this buy-in, the project is likely to fail. One champion from leadership can be key to ensuring success of the pilot and expansion to other groups/divisions/business units within the agency.

• Step 4. Establish an embryonic big data test environment. As this is a big data initiative, it will require building an embryonic big data test environment or "playground," in which the pilot project can be developed. In this step, this embryonic environment is developed following as many as possible of the big data best practices and recommendations identified in the Modern Big Data Management Framework section of this guidebook.

• Step 5. Develop the pilot project within the big data test environment/playground. In this step, the team develops the big data pilot project within the test environment with iterative feedback from the leadership champion. This development will require the application of modern big data approaches and analytics and the development of data visualizations and products. As such, expertise with these techniques is required, which can be acquired in various ways, the pros and cons of which are presented.

• Step 6. Demonstrate the value of the data to other business units. In this step, the team and the leadership champion begin to market the data visualizations and products developed in Step 5 to other business units horizontally across the organization.
This horizontal organizational outreach will help to market the value of the data and the data products to other mid-level managers and to identify other potential use cases and pilot projects that can be developed within the test environment.

• Step 7. Demonstrate the value of the data to executive leadership. In this step, the team and the pilot project leadership champion begin to market the data visualizations and products developed in Step 5 (including any new use cases/pilot projects that have been developed by or for other business units) to other leadership/executives within the organization. This vertical organizational outreach will not only help to market the value of the data and the data products to executive management and to identify other use cases that can be developed within the test environment but also begin to gain executive support for organizational change.

• Step 8. Establish a formal data storage and management environment. After many iterations of Steps 2 through 7 (which could take several years), this step will establish a formal data storage and management environment and will institutionalize policies, procedures, and practices associated with this environment that represent an organization-wide shift from traditional management practices to modern data management practices. Recognizing that changes happen rapidly and that technology is now disposable, this step requires continuous improvement, including enhancing the use of data sets, re-evaluating data pipelines, and reviewing system architecture.

Figure 4. Big data Roadmap for transportation agencies.

Implementing modern big data management principles within an organization requires a multidisciplinary team, and the skills and personnel required will evolve over time. As an agency makes its way through this guidebook, those resource needs will become clearer. For example, early on in the process (Steps 1 through 3 of the Roadmap), staff with specific business knowledge and needs and others with influence are more involved. Developing the big data environment and projects (Steps 4 and 5 of the Roadmap) will involve more technical people, including cloud architects, big data analysts, and counterparts from the information technology (IT) department. After the environment has been built and projects have been developed, demonstrating value for the approach and data products (Steps 6 and 7 of the Roadmap) will involve those who can successfully show business value and who can promote and achieve wider organizational change.
Case Study: The Road to Big Data

The Kentucky Transportation Cabinet (KYTC) is most likely the best example (at the time of writing of this report) of a state transportation agency's journey to implement real-time transportation management solutions using a modern big data approach. The journey was sparked by three events: the increasing costs of snow-and-ice operations, the Cabinet's new data-sharing partnership with Waze, and interest in a new database technology called Hadoop.

In the winter of 2012−2013, KYTC experienced record costs for snow and ice operations. Costs for that winter were approximately $70 million, a significant hike from a historic average of approximately $50 million a year. Based on historic data, KYTC could expect to experience high costs for the following two winters as well. Therefore, decision makers set in motion a plan to better leverage existing real-time automatic vehicle location (AVL) data from snowplows to help control those costs. The director of Maintenance tasked the ITS development team with addressing this issue using the AVL data. By the summer of 2014, the ITS team had developed a rudimentary, real-time, proof-of-concept snow-and-ice system to show the value of tracking snowplow activities in conjunction with Doppler radar.

In September 2014, KYTC signed an agreement to be part of the Waze Connected Citizen Program. Executive leadership had been briefed on the partnership, but the full benefits of the partnership and the associated data were completely unknown at that time. The goal was simply to help Kentucky motorists better navigate the roadways and to provide additional reporting options on the 511 system, GoKY. Handling the details of the partnership and data-sharing agreement was again assigned to the ITS development team. When the ITS personnel responsible for the proof-of-concept snow-and-ice system discovered that road weather reports were also included in the Waze data, they decided to add the Waze data into the snow-and-ice system to see what would happen.

Two months later, Kentucky received the first snow of the season. At that point, the snow-and-ice system was processing real-time data from approximately 200 snowplows every 10 seconds and approximately 200−300 Waze reports every 2 minutes, and was pulling statewide Doppler radar images every 5 minutes. The system crumbled. It required constant reboots and did not provide the stability or throughput to sustain the operations. Worse yet, ITS personnel understood that KYTC would need much more data in future iterations to build a true, real-time statewide snow-and-ice decision support system.

Fortunately, the developer assigned to the task knew someone in the IT department who was trying to find a good use case for a new big data database and processing architecture called Hadoop. A few days following a meeting of the minds, the team started pushing data from the real-time snow-and-ice processor into Hadoop using one of the low-latency technologies contained within the system. This approach provided the stability and throughput the real-time processor needed to continue working.

Over the years, the system has grown by steadily meeting the needs of additional use cases, either by incorporating new data sources or by repurposing existing data to be used by a different group of specialists within the Transportation Cabinet (Figure 5). The system and much of the data have developed into an enterprise-ready solution and have far outlived the original snow-and-ice use case for which the system was originally designed. As of fall 2019, the system was slated to be incorporated into the long-term enterprise architecture plans of the organization, where it will undoubtedly take on a much larger and more fundamental role within the agency.
The system has matured through the phases of proof of concept, to being enterprise-ready with a few production use cases, to finally being recognized and adopted as an integral part of the enterprise for integrating, processing, storing, analyzing, reporting, and republishing data.

Figure 5. Kentucky Transportation Cabinet big data system growth (in number of servers, incoming data sources, shared data, and business use cases).
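To give a rough sense of the scale involved, the reporting rates quoted in the case study can be turned into a back-of-envelope daily volume. The Python sketch below is illustrative only; the per-message payload sizes are assumed planning numbers, not figures from KYTC.

```python
# Back-of-envelope estimate of the streaming load described in the case
# study: ~200 snowplows reporting every 10 seconds, and up to 300 Waze
# reports arriving every 2 minutes. Payload sizes are assumptions.

SECONDS_PER_DAY = 24 * 60 * 60

def messages_per_day(sources: int, interval_s: float) -> float:
    """Daily message count for `sources` emitters reporting every `interval_s` seconds."""
    return sources * SECONDS_PER_DAY / interval_s

avl_per_day = messages_per_day(sources=200, interval_s=10)   # snowplow AVL pings
waze_batches = messages_per_day(sources=1, interval_s=120)   # 2-minute polling cycles
waze_per_day = waze_batches * 300                            # worst case: 300 reports per cycle

# Assumed average payload sizes in bytes (hypothetical, for illustration).
AVL_BYTES, WAZE_BYTES = 500, 2_000

daily_gb = (avl_per_day * AVL_BYTES + waze_per_day * WAZE_BYTES) / 1e9
print(f"AVL messages/day: {avl_per_day:,.0f}")   # ~1.7 million
print(f"Waze reports/day: {waze_per_day:,.0f}")  # ~216,000
print(f"Estimated ingest: {daily_gb:.2f} GB/day")
```

The point of the exercise is that even a modest-sounding daily volume arrives as millions of individual messages, and it is that sustained message rate, not the raw storage size, that overwhelmed the original proof-of-concept system.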

Step 1. Develop an Understanding of Big Data

The first and most crucial challenge in building a modern data system is overcoming a lack of knowledge about big data. Accomplishing and completing the steps along the big data Roadmap will rely on the following:

• An agency will need to inform and educate more and more stakeholders within the agency about the value of big data, how the big data approach differs from the existing traditional approach, why the agency needs to make fundamental changes in how it views and manages data, and what organizational outputs and outcomes can be anticipated from these changes.

• One or more champions within the organization will need to gain a solid foundational knowledge of big data concepts.

These data champions need not be experts, nor must an organization hire data scientists or a chief data officer to begin this process. However, every organization must know what big data is, how to handle big data, and why it matters to them. Furthermore, these champions must also be comfortable enough in that knowledge to be able to effectively communicate it to anyone, from top-level executives to front-line data users.

While foundational information on the basics of big data is covered in this chapter, the Modern Big Data Management Framework included in this guidebook and Web-Only Document 282: Framework for Managing Data from Emerging Transportation Technologies to Support Decision-Making are other good sources to reference as part of this step. Both reports contain industry best practices, and the Framework contains associated recommendations for creating, storing, using, and sharing data. Developing a rudimentary understanding of proven modern big data management approaches early on will pay dividends throughout the process because costly errors and pitfalls are avoided.

What Is Big Data?

Big data is a popular term, but what does it really mean?
Below are just a few of the many definitions of big data:

• Big data may refer to data sets, typically consisting of billions or trillions of records, that are so vast and complex that they require new and powerful computational resources to process (Big Data 2019).

• Big data is an approach to generating knowledge, in which a number of advanced techniques are applied to the capture, management, and analysis of very large and diverse volumes of data. These are data so large, so varied, and analyzed at such speed that the data exceed the capabilities of traditional data management and analysis tools (Burt, Cuddy, and Razo 2014).

• Big data may encompass all of the non-traditional strategies and technologies needed to gather, organize, process, and generate insights from large data sets (Ellingwood 2016).

• Big data are extremely large data sets that may be analyzed computationally to reveal patterns, trends, and associations, especially relating to human behavior and interactions (Lexico Powered by Oxford 2019).

• Big data is a term that describes the large volume of data—both structured and unstructured—that inundates a business on a day-to-day basis. But it is not the amount of data that is important; what organizations do with the data is what matters. Big data can be analyzed for insights that lead to better decisions and strategic business moves (What is Big Data 2019).

• Big data is a new attitude by businesses, non-profits, government agencies, and individuals that combining data from multiple sources could lead to better decisions (Press 2014).

Within these few definitions, a number of concepts are revealed that demonstrate that the term big data represents more than just the volume of data: big data is an "approach," it is about "analyzing" the data to extract information to inform better decision-making, it is an "attitude."

16 Guidebook for Managing Data from Emerging Technologies for Transportation “We have so much data and so much technology giving us data that there is no human that can keep up with it.” —Delaware DOT The last definition is an acknowledgment that storing data in “data silos” has been the key obstacle to getting the data to work in ways to improve businesses, work, and lives. Big Data Characteristics Most people have encountered the five characteristics or “Vs” of big data: volume, variety, velocity, veracity, and value. Without some context, however, these terms may seem nebulous. Following is a brief review of these terms as they apply to transportation agency data. Volume characterizes the main aspect of a big data set. Big data are generally considered to be over a terabyte (TB); however, the size characterization of big data is continuously changing. As an example, Walmart processes 2,500 terabytes of data every hour (Marr 2017). While most transportation agencies are still working with “small” data, the amount of data available to transportation agencies continues to grow. Examples of data volumes that are reaching big data levels for transportation agencies include the following: • 0.222 TB per year—Waze alerts for one state • 0.65 TB per year—data generated by 300 traffic management center (TMC) field devices (Gettman et al. 2017) • 1.2 TB per year—data from 300 CCTV cameras if stored (Gettman et al. 2017) • 2.3 TB per year—statewide vehicle probe speed data for one state • 4 TB of data every day—estimate for a single automated vehicle (Swaney 2019) Variety refers both to the “structured” and “unstructured” data present in big data work- flows, as well as to the ability to combine and use these various data types to gain insights that were difficult or impossible to obtain prior to big data analytics. 
Structured data are data that are organized to be easy for machines to handle, especially when it comes to searching, sorting, or storing data in relational databases. Unstructured data are the opposite: video files, audio files, free-form text, and other data that do not conform to traditional data structures and are therefore difficult for machines to categorize. Entire fields within data science, such as natural language processing and computer vision, are devoted to helping machines do more with unstructured data.

Just as traditional relational databases cannot store different varieties of data, traditional data analysis techniques cannot quantify various data types against each other. With the advent of modern big data techniques, however, data from two separate sources can be combined to generate new insights. One example of this is traffic incident management, where a structured table of incident report data can be enhanced with roadside camera data captured at the time and location of an incident. Using computer vision techniques, vehicles and license plate numbers can be identified and speeds can be calculated that could then be added to the incident report. Another example is extracting slowdown or weather information from unstructured commuter tweets and using that to add context to speed data coming from global positioning system probes.

Velocity is defined as the speed at which data are generated. What is not often covered or explained is the variation of velocity that can take place for any given data source. Crowdsourcing and social media are good examples. The velocity of these data sources is typically much higher than traditional transportation data sources and can be highly responsive to newsworthy events. One agency utilizing social media and crowdsourced data asked users to report road closures due to water over the roadways during flooding.
Users responded by sending in twice as many reports as usual, jumping from 1,500 to 3,000 in a single day. If an agency is unprepared for high, and highly variable, volumes of data, it will not be able to make effective use of the data.

Veracity refers to how accurate or truthful a data set may be. In the context of big data, veracity is not just about the quality of the data itself but also how trustworthy the data source,

Roadmap to Managing Data from Emerging Technologies for Transportation 17

type, and processing of the data are (see Veracity 2019). For example, using sentiment analysis to extract information from traveler tweets is inherently more uncertain than analyzing road sensor data. Other common big data techniques, such as classification and predictive analytics, do not generate an exact result but rather a predicted value with an associated confidence score, resulting in reported values or statistics that have confidence bands and levels of uncertainty rather than absolute “truths.”

Just as big data analytics deal with confidence levels, the management and preparation of big data for analysis deal with confidence levels as well. With traditional systems, the rigid structure of relational database management systems requires that data go through extensive preprocessing before they can be loaded into a database. This means that incomplete or aberrant data are often purged before ever being stored in the system, producing cleaner data at the cost of analytical flexibility. Because big data techniques make use of all data, even incomplete entries or outliers, the modern approach instead calls for scoring and flagging suspect data rather than removing them. This approach allows analysts and researchers more flexibility in the data they use but calls for more awareness of the relative veracity or trustworthiness of the data, as they have not necessarily gone through a strict preprocessing step like traditional data must.

Value denotes how big data sets contribute to improving the status quo. Value involves determining a benefit and estimating the significance of that benefit across any conceivable circumstance. If a new data set provides answers to important questions of interest, provides new business opportunities, or leads to better decisions, then it can be deemed valuable to an organization.
Because of this, value is perhaps the most important of the five Vs (as evidenced by the previously listed big data definitions).

Big Data Concepts

Beyond the five Vs, there are a number of important concepts to understand about big data. They are:

• Data lake. A data lake is a system in which a variety of data are communally stored in their raw, unprocessed format. This is the opposite of “siloed data,” which are stored on disconnected systems that cannot easily communicate with each other. A data lake architecture is particularly beneficial when working with big data analytics, which rely heavily on large amounts of raw data and can effectively combine data from multiple types and sources to produce valuable insights.

• Cloud. Online accessible virtual infrastructure, software, or other IT services that are hosted on large external server clusters rather than in house. Cloud services are popular and almost ubiquitous when working with big data because they offer scalability, flexibility, reliability, availability, and cost-effectiveness that cannot be obtained using on premise infrastructure alternatives, due to the large number of servers and the pay-as-you-go model. Cloud storage services are typically the first cloud service organizations adopt, as they can greatly reduce the costs associated with storing, managing, archiving, sharing, and securing large amounts of data (as compared with on premise).

• Distributed computing. A method of performing a single computing task more efficiently by dividing it across multiple servers. An analogy is assembling a team of horses to pull a carriage as opposed to using a single large horse. The concept of distributed computing is widely used in big data, as individual servers are often too small to handle big data processing tasks on their own. Distributed computing is implemented through distributed computing frameworks that run directly on a cluster of servers.
Distributed computing frameworks allow computing tasks to scale easily: improving performance only requires adding new servers to the cluster rather than upgrading or replacing existing machines. Apache Hadoop is the most well-known framework for creating clusters of distributed computing resources.

• Distributed storage. Similar to distributed computing, distributed storage is the technique of storing large amounts of data on a distributed network or cluster of drives/servers. This technique requires a distributed file system to manage the files and present the storage as a single file system, the most widely known of these being the Hadoop Distributed File System (HDFS). Cloud service providers typically handle the management of this process in a manner that is transparent to the user.

• Nonrelational databases. Because big data involve a large variety of data changing at a rapid pace, it can be difficult, at times even impossible, to fit these data neatly and efficiently into a single relational table structure. To remedy this, nonrelational databases have been developed for storing and processing big data. Nonrelational databases are also called “NoSQL” or “Not Only SQL” databases. NoSQL databases do not comply with the ACID (Atomic, Consistent, Isolated, and Durable) model, which guarantees safe data operations on relational databases; rather, they use the BASE (basically available, soft state, eventually consistent) model, which is looser than ACID but allows rapid adjustment to changes in the data. While these databases often use their own query languages, many have recently adopted SQL-like syntax for ease of use.

• Common big data analytics techniques. A few of the common techniques enabled by big data include classification (a classifier is trained on known data to be able to sort new data entries into a particular class); prediction (existing data points are used to predict future data points); and natural language processing (language as it is naturally written or spoken is converted into machine-readable data).
Transportation applications of these techniques can vary from predicting the location and severity of traffic crashes to extracting road condition data from tweets and blog posts.

When to Pursue Big Data

While a single, new, large data set (such as data from a connected vehicle pilot project or probe speed data purchased from a third party) may drive a transportation agency to pursue big data, it is not necessarily the volume of data at hand that indicates an agency’s readiness or need to pursue big data. The driver could be a combination of Vs, or the need for a new enterprise data management system coupled with the recognition that more flexibility and scalability are needed. While transportation agencies certainly are dealing with larger data sets than they ever have, as well as a variety of data types, value might just be the single most important V driving the need for big data.

Most agencies are driven to pursue modern data management practices for one of two reasons: either they encounter an exciting new data set or use case that requires big data management in order to be used, or they see an opportunity to reduce costs, improve efficiencies, or gain some other benefit from modernizing their current data approaches. Most agencies have already encountered one or both of these two situations and, with the rapid advancement of new technologies and data management techniques, it is inevitable that every agency will be forced to deal with big data at some point soon. Even in the increasingly rare situation where an agency does not yet see a need for big data, by preparing for big data and updating data management approaches now, agencies will be much better equipped to deal with the transition when it becomes an absolute necessity.

The copious amounts of unstructured data common among connected vehicles, automated vehicles, and other emerging technologies cannot feasibly be stored or analyzed using traditional data management techniques.
These technologies, along with probe speed data, crowdsourced data, and data from IoT devices, are used by public and private organizations today. Without knowledge of big data management techniques, public agencies may find themselves too far behind the curve to realistically catch up with private sector capabilities, leaving some forced to rely more heavily on third-party contractors than they would prefer.
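As an illustration of the natural language processing technique mentioned above (extracting road condition information from unstructured tweets), the following sketch uses simple keyword matching in place of a trained language model. The condition labels, patterns, and sample tweets are hypothetical, chosen only to show the idea of turning free-form text into structured, queryable labels.

```python
import re

# Hypothetical lexicon mapping road-condition labels to text patterns.
# A production system would use a trained NLP model rather than keyword
# matching, but the unstructured input and labeled output are the same idea.
CONDITION_PATTERNS = {
    "flooding": re.compile(r"\b(flood(ed|ing)?|water over the road)\b", re.IGNORECASE),
    "slowdown": re.compile(r"\b(jam(med)?|stand ?still|crawl(ing)?|backed up)\b", re.IGNORECASE),
    "ice": re.compile(r"\b(icy|black ice|slick)\b", re.IGNORECASE),
}

def tag_tweet(text):
    """Return the set of road-condition labels mentioned in one tweet."""
    return {label for label, pattern in CONDITION_PATTERNS.items()
            if pattern.search(text)}

# Unstructured inputs become structured labels.
tweets = [
    "I-95 NB is at a total standstill near exit 4, avoid!",
    "Water over the road on Route 9 after last night's storm",
    "Beautiful day for a drive",
]
labels = [tag_tweet(t) for t in tweets]  # [{'slowdown'}, {'flooding'}, set()]
```

Labels extracted this way could then be joined with probe speed data by time and location to add context, as described earlier in the discussion of variety.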

Common Misconception: Transportation Agencies Only Need to Purchase More Servers to Store Big Data

Despite what the name may imply, big data is not a larger version of traditional data. Rather, big data are so radically different from traditional data that they cannot adequately be collected, stored, or analyzed using traditional techniques. As big data are not “traditional data, only larger,” big data management cannot be “traditional data management, only larger.” For example, if an organization has a traditional relational database and wants to add real-time streaming data from a connected vehicle project, it will be impossible to do so effectively regardless of how performant the relational database system is. The challenge is not simply a matter of bandwidth or processing speed; it is that the traditional database is ill suited to handle the kind of live geospatial data common among connected vehicle technologies. No number of additional servers will allow a data system to manage data structures for which the system was not designed.

Even when not strictly necessary, adopting modern data management approaches, such as a cloud-based data lake environment, can result in less administrative overhead and more efficient workflows. This increased efficiency may help make a business case for developing modern data capabilities now in preparation for the big data applications that will surely present themselves in the future. It takes time and effort to build big data management capabilities; starting to demonstrate the value of more modern data management—even on a set of pilot data—can get agencies moving in the right direction so that they are not in a position of needing to play catch-up after it is too late.
Case Study: The Importance of Understanding Big Data

In one transportation agency, the lack of education about big data architecture and methodologies among decision makers resulted in several mistakes in hardware procurement, development, data maintenance, and reporting. Traditional thinking, and the lack of understanding concerning the benefits and pitfalls of horizontal versus vertical scaling, resulted in the purchase of servers with inappropriate specifications for the data being managed. Utilizing on premise architecture requires a certain level of understanding about properly scaling central processing unit (CPU), random access memory (RAM), and storage ratios.

In addition to misunderstanding the concept of scaling, the agency was not fully aware of the breadth of software solutions available to perform similar functions. In this case, the agency chose a complex data aggregation tool, typically used for IoT data, to run simple once-per-day batch jobs. The agency also had trouble understanding the concept of a data lake, specifically how it contains raw data as opposed to processed data. The perception was that preserving and storing raw data, much of which were outside the scope of the original pilot project, served little to no value. Ultimately, the agency found that these raw data, once thought to be useless, enabled several new use cases that went above and beyond the original expectations of the project.
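The value of preserving raw data, as the case study above illustrates, comes from the "schema-on-read" idea: fields that seem useless at ingest can power use cases no one anticipated. The following is a minimal sketch of that idea; every field name and value is a hypothetical placeholder, not a real connected vehicle schema.

```python
import json

# Raw connected vehicle messages kept in the data lake exactly as received.
# All field names and values here are illustrative only.
raw_records = [
    '{"ts": "2020-03-01T08:00:00", "speed_mph": 61.2, "lat": 39.74, "lon": -75.55, "wiper_status": "on"}',
    '{"ts": "2020-03-01T08:00:01", "speed_mph": 22.4, "lat": 39.75, "lon": -75.54, "wiper_status": "on"}',
    '{"ts": "2020-03-01T08:00:02", "speed_mph": 58.9, "lat": 39.76, "lon": -75.53, "wiper_status": "off"}',
]

# Parsing happens at read time, not at ingest (schema-on-read).
records = [json.loads(r) for r in raw_records]

# Original pilot use case: average observed speed.
avg_speed = sum(r["speed_mph"] for r in records) / len(records)  # about 47.5

# Later, unanticipated use case: wiper status as a proxy for where it is
# raining. Possible only because the "extra" fields were never purged on ingest.
rain_locations = [(r["lat"], r["lon"]) for r in records if r["wiper_status"] == "on"]
```

Had the records been trimmed to only the fields the original use case needed, the second analysis would have been impossible without re-collecting the data.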

Additional Resources

Suggested additional resources for review follow.

An Introduction to Big Data Concepts and Terminology. This journal article provides a definition of big data, addresses why big data systems are different, discusses the concepts and tools associated with the various steps of the big data life cycle, and provides a big data glossary (Ellingwood 2016).

Big Data—Concepts, Applications, Challenges, and Future Scope. This journal article introduces big data concepts and provides multiple case studies showing how big data are used in real-world applications today. It also includes a helpful section outlining common obstacles organizations face in implementing big data. The authors offer insight into why these practical challenges persist and how they can be overcome (Mukherjee and Shaw 2016).

Big Data and Cloud Computing: Innovation Opportunities and Challenges. This journal article provides clear descriptions of current big data technologies, along with a frank discussion of the challenges present when adopting big data technologies. Notable chapters include Chapter 3, which offers insight into obstacles faced within specific aspects of big data management, and Chapter 5, where cloud computing benefits and approaches that overcome these obstacles are discussed (Yang, Huang, Li, Liu, and Hu 2017).

New Horizons for a Data-Driven Economy: A Roadmap for Usage and Exploitation of Big Data in Europe. This book provides in-depth information on big data storage, processing, and analysis that is generalizable to nearly any organization, including U.S. transportation agencies. Of particular interest is Chapter 7 on data storage, offering informative detail on modern databases, distributed platforms, and methods to secure sensitive data on each (Cavanillas, Curry, and Wahlster 2016).

Beyond the Hype: Big Data Concepts, Methods, and Analytics.
This journal article provides a clear and concise definition of big data. The paper also provides an overview of available big data applications, including text mining, video/audio analytics, and predictive analytics (Gandomi and Haider 2015).

NCHRP Research Report 865: Guide for Development and Management of Sustainable Enterprise Information Portals. This report sets forth guidance and recommendations for state transportation agencies to build sustainable systems to collect, store, analyze, and disseminate data. Sustainability refers to the ability of a system to handle changes and disruptions (e.g., sudden growth in data volume, sudden changes in technology, sudden changes in data quality, or security breaches) without being taken down and rebuilt at a large cost. The concept of sustainability is foundational for the collection, aggregation, analysis, and dissemination of data from emerging technologies. The report includes technology recommendations for building sustainable platforms, data governance guidance, software deployment guidance, acquisition recommendations, and more (Pecheux, Shah, and Miller 2019).

NCHRP Research Report 904: Leveraging Big Data to Improve Traffic Incident Management. This report provides guidelines and recommendations for transportation agencies to leverage big data to improve traffic incident management (TIM), although the guidelines extend well beyond the TIM use case. The report provides an in-depth explanation of big data, big data architecture, and big data analytics techniques. The report presents and applies two data maturity models to 30+ transportation-related data sources, including data from emerging technologies (Pecheux, Pecheux, and Carrick 2019).

Big Data’s Implications for Transportation Operations: An Exploration. The purpose of this white paper was to expand the understanding of big data for transportation operations, the value it could provide, and the implications for the future direction of the U.S.
Department of Transportation’s Connected Vehicle Real-Time Data Capture and Management Program.

This paper also identified two additional, broad areas where big data analytical approaches may be able to provide further value: transportation system monitoring and management, and traveler-centered transportation strategies (Burt, Cuddy, and Razo 2014).

Big Data and Transport: Understanding and Assessing Options. This report examined issues relating to the arrival of massive, often real-time, data sets whose exploitation and amalgamation can lead to new policy-relevant insights and operational improvements for transportation services and activity. The report comprises three parts. The first part gives an overview of the issues examined. The second part broadly characterizes big data and describes its production, sourcing, and key elements in big data analysis. The third part describes regulatory frameworks that govern data collection and use and focuses on issues related to data privacy for location data (OECD/ITF 2015).

Step 2. Identify a Use Case and an Associated Pilot Project

By this point, the agency champions should have developed a general understanding of big data concepts, benefits, applications, and best practices in Step 1. In Step 2, the agency champions will identify a use case for the data of interest and an associated pilot project in which to demonstrate the value of the data. If leadership has already identified a use case or pilot project, and the agency champion is charged with carrying out the project, proceed to Step 4.

In most cases, agencies will be driven to this guidebook by the need to work with new data that do not fit with their traditional database management systems and practices. To give the reader more context, Table 2 lists a few potential drivers for change and examples of associated use cases and pilot projects.
Select a Use Case and Pilot Project that Align with Business Units, Leadership, and Organizational Goals

The primary goals of the pilot project are to demonstrate value for the data at hand (as well as the associated modern data management approach) and to create a success story that will drive additional use cases and pilot projects throughout the agency. To improve the chance that the pilot project chosen will resonate with leadership, select a use case and pilot project that meet one or more of the following:

• Addresses a clear and evident need of the business unit. The use case and pilot project will inevitably be linked to the business unit of the champion leading the initiative and should address a clear and evident need for that business unit. For example, a TMC manager (champion) knows that there are routes outside the current safety service patrol areas where incidents are increasing and where incident response and clearance times are longer than those inside the service area. He or she needs data that will help to make the business case for additional routes and vehicles. There are new data that can provide information on incident types, locations, and speeds that can be assessed to help make this business case.

• Addresses leadership pain. While the pilot project might help the business unit with a problem, leadership may care more about solving other problems under their purview. Pay attention to leadership. Figure out what is bothering them and how the pilot project can help them. For example, instead of presenting the project as a way to make things better for staff, frame it as a way to make staff more productive.

• Helps leadership meet their goals. Understand where leadership wants to be (i.e., their goals) and how the selected project can help them get there. They will want to know how the project can help them too.
Delaware DOT conducted a 30- to 60-day pilot project with a few selected data sets as a proof of concept for a new cloud-based solution. This pilot project duplicated critical data from on premise storage to the cloud for more efficient access and use.

Table 2. Example drivers for change, big data sources, and use cases/pilot projects.

Driver for change: An agency faces an issue or problem that requires new data and new methods, as the issue or problem cannot be addressed easily or efficiently with the current systems and data alone.
Example big data source—Crowdsourced data: Crowdsourced data generated through mobile apps such as Waze can help address transportation issues or problems easily and efficiently. Crowdsourcing turns transportation system users into sensors, providing real-time data on traffic conditions, operations, and driver behaviors well beyond the boundaries of the fixed sensors and cameras currently available to transportation agencies.
Example use cases/pilot projects:
• Demonstrate early incident detection to improve traffic incident response and clearance times statewide.
• Combine with automatic vehicle location (AVL) data to improve the treatment for snow and ice during winter storms, while reducing costs.

Driver for change: An agency has acquired a new data set, but it is not being used to its fullest potential due to limitations in infrastructure, tools, and skill sets; or a business unit within an agency wants to purchase (or is currently testing) a new data set and needs to demonstrate the business case for purchasing it.
Example big data source—Vehicle probe speed data: Like actively crowdsourced data, mobile apps can produce passively crowdsourced data such as vehicle probe speed data. These data, which are usually purchased from a third party, can provide agencies with new insights into the status of their roadways beyond the boundaries of their fixed sensors and cameras.
Example use cases/pilot projects:
• Manage traffic better through work zones and detours.
• Re-time traffic signals more frequently and without sending staff to conduct field studies.
• Inform transportation planning strategies and investments.

Driver for change: An agency is conducting a connected vehicle pilot project, which is producing data that cannot be used directly by the agency due to its size and structure.
Example big data source—Connected vehicle data: Anonymous signals in connected vehicles are generating new data about how, when, and where vehicles travel, as well as vehicle status information such as vehicle position, heading, speed, and predicted path. This new data-rich environment is the beginning of new safety and mobility applications that will improve safety, help to keep traffic flowing, and make it easier for people to plan their travel.
Example use cases/pilot projects:
• Assess the quality, reliability, and trustworthiness of the data being generated and potential uses and applications for the agency.
• Combine with agency data to understand better where, when, and why crashes occur.
• Exchange data with roadside infrastructure like traffic signals to improve mobility and safety.

Driver for change: An agency needs access to data about new mobility solutions that are emerging within its city to develop more informed policies and plans to help maximize the benefits of these services, while reducing the potential drawbacks.
Example big data source—Mobility and shared mobility data: Data generated by private shared fleet operators deploying vehicles, including Uber, Lyft, scooters, and bikes. These services are amassing large amounts of data on when, where, and how people travel.
Example use cases/pilot projects:
• Identify where to expand micro-mobility infrastructure (e.g., lanes and parking) to bring about needed change in how streets are allocated.
• Assess and incentivize expanding access to mobility in low-income and historically underserved areas.

Driver for change: An agency needs an efficient method for meeting data and reporting demands and requirements, and the current database management system is not getting the job done. Agencies may already be experiencing the limitations of their existing systems with respect to data storage and processing.
Example use case/pilot project: Demonstrate throughput and cost efficiencies associated with modern data management systems and practices.

• Aligns with or links to organizational goals or objectives. Beyond the needs of the leadership of the business unit, select a pilot project that aligns with, or links to, the overall goals, objectives, or mission of the organization. This will resonate with executive leadership and help the business unit leadership look good.

• Addresses policy makers’ concerns or goals. Even beyond the executive leadership, how can the pilot project address the concerns or goals of policy makers?

• Expands to include other data sets and use cases. Start small but think big. The pilot project is not a one-off; it can be expanded to include other data and use cases to demonstrate the benefit of the data and the approach to other business units.

Engage Others in the Cause

In addition, consider engaging others to help identify a project that will resonate with leadership. For example:

• Internal to business unit. Others within the business unit, particularly those on the front lines of the day-to-day operations, are likely to be the most interested in and supportive of use of the data, especially if they understand the potential uses and benefits. These staff understand the existing limitations and gaps and can help to advocate for the need for change.

“Start small but think big.”
—Kentucky Transportation Cabinet

Case Study: Portland Urban Data Lake Pilot Project

The goals of the Portland Urban Data Lake (PUDL) pilot project are to collect and store data from a variety of sources; develop analytics that create new insights from the data; and explore technologies and architectures for providing standardized, documented access to data for public sector agencies and local innovators.
The city of Portland recognized that it did not yet have good, centralized systems in place for managing, integrating, and analyzing the data it has, much less the large volumes of data coming from Smart Cities technologies like sensors, connected vehicle infrastructure, and private sector services. As an example, the city found itself ill equipped to handle the streaming data from a Mobility on Demand (MOD) scooter project, which put a tremendous strain on its existing data systems. This project underscored the need for a more advanced way of managing these types of technology data.

By building a unified data lake environment, Portland helps “City leadership and City staff to make and evaluate decisions, design and evaluate policies and programs, enhance community engagement, and allow us to better partner with the private sector, researchers, and non-profits to meet City goals around livability, affordability, safety, sustainability, resiliency and equity” (Portland Urban Data Lake n.d., retrieved November 2019, from Portland Bureau of Transportation: https://www.portlandoregon.gov/transportation/article/681572).

PUDL handles various data, including data from IoT devices, origin-destination data from MOD scooters and bikes, Waze traffic data, pedestrian counts, and more. Having these data in one place allows analysts to merge data sets and perform deep analyses more efficiently than they would otherwise be able to do. This data lake also supports the Portland Bureau of Transportation’s goal of achieving a more nimble, agile, and efficient process of deploying smart cities projects.
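The kind of cross-source merge a shared data lake enables can be sketched in a few lines. The street segments, trip counts, and jam report counts below are invented for illustration; a real analysis would run over full extracts with a distributed engine or cloud query service, but the join pattern is the same.

```python
from collections import defaultdict

# Hypothetical extracts from two data lake sources, keyed by street segment.
scooter_trips = [("Broadway", 120), ("Burnside", 85), ("Broadway", 140)]
waze_jam_reports = {"Broadway": 4, "Hawthorne": 2}

# Aggregate trips per segment, then join against jam report counts.
trips_by_segment = defaultdict(int)
for segment, count in scooter_trips:
    trips_by_segment[segment] += count

merged = {
    segment: {"trips": trips, "jam_reports": waze_jam_reports.get(segment, 0)}
    for segment, trips in trips_by_segment.items()
}
# merged["Broadway"] -> {'trips': 260, 'jam_reports': 4}
```

With both sources in one environment, this join is a few lines of analysis; with the sources siloed in separate systems, it would first require a data export and transfer project.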

• Internal, cross-business unit. Other business units may also have an interest in or use for the data at hand and may be willing to support the pilot test or even offer up potential or future use cases for the data. Most of the emerging technology data sets offer potential across a wide range of use cases, especially when combined with other data sources, including agency siloed data.

• Junior or mid-level staff. Younger generations are often more tech savvy and may have an interest in applying new technology to problems both old and new. They are often interested in getting involved in using technology to address challenges and may be willing to put in extra hours for the charge.

• External partners. There is wide potential for the involvement of external partners, depending on the use case. Local partners such as cities that manage traffic signals, metropolitan planning organizations that conduct regional planning activities, and partners such as law enforcement could all be good advocates for the pilot project.

Having conversations with an extended group of potential champions to identify uses and associated benefits could give the project more clout. Remember, there is strength and value in numbers. The more people interested and the more diverse the interest, the better the chance of gaining support. Collaborating with other business units or organizations could also help to defray the costs of the project by sharing resources across the groups and organizations.

Step 3. Secure Buy-In from at Least One Person from Leadership for the Pilot Project

In Step 3, the agency champions and team will work to secure buy-in for the pilot project from at least one person from leadership (likely the next-in-line or senior manager) within the business unit. The success or failure of the pilot project will largely depend on this buy-in and support.
Securing buy-in for the use of a new data set, as well as for a modern approach to data management, can be a challenge. Because the goals or benefits of the project may not be well understood, the project may be seen as unnecessary or frivolous, or the price of the data/project may be perceived as too high. The champions and team will need to translate their project needs and goals into business needs and goals, demonstrate the value of the project, and do so within a 5-minute “elevator pitch.”

In addition to the efforts made in Step 2 to select a pilot project that resonates with leadership, following are a few tips for increasing the likelihood of gaining buy-in:

• Establish and clearly communicate the value proposition for the pilot project.
• Create a sense of urgency and a fear of missing out (FOMO).
• De-risk the decision by identifying and communicating risks and other potential barriers.
• Know how to make the pitch.

Establish and Clearly Communicate the Value Proposition for the Pilot Project

VALUE PROPOSITIONS
The smartest way to get around —Uber
Rides in minutes —Lyft
All your tools in one place —Slack
Save money without thinking about it —Digit

The team has already selected a pilot project that meets a business need, addresses leadership pain points, and/or aligns with organizational goals. Now, the team needs to communicate clearly the value proposition for the project. The value proposition is a statement that communicates to leadership why they should support the project (e.g., how it will solve problems) and makes the benefits of the project and the resulting products clear from the onset. Whenever possible, develop and communicate the anticipated or estimated return on investment or benefit-cost ratio, even if it relies on ballpark estimates. Having tangible numbers for leadership to mull over can help the cause. If this is not possible, consider using quantitative or qualitative/

anecdotal benefits from other transportation agencies to support the argument. Table 3 lists various example projects, along with their value propositions and associated questions to assist in further developing the pitch. There is natural overlap between these examples and the corresponding questions.

The potential benefits will depend on the nature of the project. High-level benefits that are likely to resonate with leadership include the following:

• Reduces congestion/travel times
• Improves safety
• Increases efficiency
• Reduces costs
• Increases productivity
• Makes things faster
• Makes things easier
• Increases awareness
• Improves processes/procedures
• Develops new capabilities
• Supports new plans/policies
• Balances inequalities
• Reduces negative environmental impact

An example of the "improved performance, less effort" or the "addressing a challenging issue at a low cost" value proposition is the case of traffic operations and signal timing in Louisville, Kentucky. The city of Louisville now uses free crowdsourced data, low-cost cloud storage, and free business intelligence software to optimize signal timing. To demonstrate the value of this approach, the city conducted a pilot on a corridor in a fast-growing part of the city that served 40,000 vehicles per day. The city had recently implemented a new traffic control plan for the corridor to account for a 15% increase in traffic and wanted an efficient method of verifying the results. Using the crowdsourced data, a small team developed a dashboard that verified the effectiveness of the newly implemented plan, observing a 30% overall drop in traffic jam reports and a 38% drop during peak hours.
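Tangible numbers like these feed directly into the ballpark benefit-cost ratio recommended above. The short sketch below shows the arithmetic; every figure in it is an invented assumption for illustration, not an actual Louisville or agency result.

```python
# All values below are hypothetical, illustrative assumptions.
annual_staff_hours_saved = 400       # assumed retiming effort avoided, hours/year
loaded_hourly_rate = 65.0            # assumed fully loaded staff cost, $/hour
annual_delay_savings = 25_000.0      # assumed monetized travel-time benefit, $/year
annual_data_cost = 0.0               # crowdsourced data assumed free
annual_cloud_cost = 3_600.0          # assumed storage + analytics cost, $/year

benefits = annual_staff_hours_saved * loaded_hourly_rate + annual_delay_savings
costs = annual_data_cost + annual_cloud_cost

benefit_cost_ratio = benefits / costs
print(f"Estimated benefit-cost ratio: {benefit_cost_ratio:.1f}")
```

Even a rough ratio like this gives leadership something concrete to weigh against the project's price tag.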
Table 3. Example projects, value propositions, and questions to assist in developing the pitch.

Project: The project will use new data in an effort to improve performance in a particular area with less effort from staff.
Value proposition: Improved performance, less effort.

Project: The project will address an issue or solve a problem that cannot be addressed or solved (either efficiently or at all) with existing data or approaches.
Value proposition: Addressing a challenging issue at a low cost.

Project: The project will allow the agency to explore and better understand how a promising new data source can be managed and used to an agency's benefit.
Value proposition: From exploration to improvements.

Project: The project will leverage new data in an effort to develop more informed policies and plans to maximize the benefits of emerging technology services, while reducing the potential drawbacks.
Value proposition: Maximizing benefits, minimizing drawbacks.

Project: The project will make the business case for the procurement of data from a third party.
Value proposition: Data for informed and effective decision-making.

Questions to assist in developing the pitch:
• What is the issue/problem?
• How is the issue/problem being addressed?
• Why is the current approach insufficient and where is it lacking?
• Why is the new data/proposed approach superior? How can it improve over the current approach?
• What is the nature of the data (size or structure)?
• How accurate and reliable are the data?
• What is the cost of the data versus the potential value its use brings to the agency?
• Who are the potential users of the data/resulting data products?
• What are their use cases?
• How might the data be used or combined with other data sets to improve organizational efficiency or performance?
• How will performance be improved with the new approach?
• How will performance be measured?
• What quantitative or qualitative/anecdotal benefits are available from other transportation agencies that could support the argument?

The traffic jam dashboard is now available to everyone and

used by traffic engineers across Louisville to identify signal-timing needs. The project inspired expansion of the concept and work with multiple data providers to enhance the signal timing of the other 1,000+ signals across the city. The effort associated with retiming isolated signals and area-wide corridors has been greatly reduced as staff have become familiar with data-driven decision-making.

Create a Sense of Urgency and a Fear of Missing Out

Many people and organizations are professionally conservative: they can be averse to change and may avoid bold decisions, such as taking on new organizational policies or procedures, that might risk their reputations. It is often easier and less risky to kill an idea than it is to risk failure. However, champions can overcome this position by creating a sense of urgency and a fear of missing out (de Ternay 2018).

Sense of Urgency

When champions create a sense of urgency, they alert the agency (in this case, their direct leadership) to why change must occur, and they begin preparing the agency for the change process (via the pilot project). Urgency is important because meaningful organizational change cannot occur without the cooperation of the affected stakeholders. This is why creating a sense of urgency for a needed change is the first step champions should take to gain the cooperation of leadership (Llewellyn 2015). Champions can create a sense of urgency by

• Selling the value of a future state. What is the future state, and why should the business unit strive to achieve this state (what's in it for them)?
• Demonstrating that the status quo is a dangerous place for leadership to remain. Clarify the consequences of inaction.
• Communicating clearly, effectively, and consistently to demonstrate confidence in the proposed approach.
• Being outcome-focused (instead of task-focused). What are the anticipated outcomes of the pilot project, as opposed to the steps in the process?
• Identifying causes of complacency and identifying how to eliminate them.
• Getting to the point quickly. Leadership should understand the project's goals, benefits, and the consequences of inaction within a 5-minute pitch.
• Securing internal and external partner/stakeholder input and buy-in (as discussed in Step 2).

The impending influx of data and the associated data management and use requirements are not seen as issues by most transportation agencies. The following are a few statements heard from transportation agency representatives regarding big data from emerging technologies:

• "The systems we have right now are meeting our needs; anything new is an additional cost."
• "In aggregate, transportation agencies do not understand the need for a shift in how data are managed and used."
• "Unless you are a local city testing connected vehicles, then you are not getting a flood of data. What is the data overload you are talking about? We are not seeing it."
• "Even though we understand at a high level that big data are coming, we are also balancing the needs of many jurisdictions that are not facing a data problem. It is not currently a priority for most jurisdictions."

When making a pitch, champions need a sense of urgency to overcome these perceptions.

MISCONCEPTION: "Our systems and processes are good enough."

Create a compelling narrative that tells leadership why it is not in the organization's best interest to stay in its current state.

Fear of Missing Out

Mentioning that other agencies are already using the data, and the benefits they are reaping, can trigger FOMO. FOMO is a pervasive apprehension that others might be having rewarding

experiences from which one is absent. A strong case can be made by establishing that some agencies have implemented a similar project/approach with success. Telling the stories of peer agencies can also help leadership understand how the project would work in its own agency, and buy-in is more likely to be achieved if these comparisons involve agencies that leadership respects (de Ternay 2018).

De-Risk the Decision by Identifying and Communicating Risks and Other Potential Barriers

Leadership is more likely to support a project that is low risk and high reward. The reward is how helpful the pilot project will be to them (e.g., the project helps them solve their problems, achieve their goals, or makes them look good inside and outside the organization). The risk involves costs, the likelihood of failure, and the consequences of failing (de Ternay 2018). While inevitably there will be risks and other barriers to success, the lower the risk, the more attractive the pilot project will be. Therefore, potential risks and barriers to success, along with plans to mitigate them, should be identified and communicated up front to leadership when seeking buy-in. A visible risk-mitigation strategy gives leadership more confidence in the overall approach.

The primary risks associated with a new data project will likely involve resistance from the IT department, as well as the traditional procurement process. IT may push back on the establishment of an embryonic big data test environment in the cloud because it does not follow recognized processes or make use of approved tools. There may be perceived security issues related to storing and analyzing data in the cloud. The cost model of the cloud is also likely to be misunderstood or even rejected. The champions and team should be prepared to defend the concept of, and need for, the embryonic big data environment.
It helps to demonstrate that the environment will be separate from formal organizational systems and processes, that it will foster the assessment, exploration, and analysis of a new data set, and that it will help to demonstrate the benefits of these data to the organization.

Know How to Make the Pitch

There are different situations and ways to make the pitch to leadership. The pitch can be discussed informally over coffee, a meeting with other stakeholders can be arranged, or a one-pager can first be developed and shared to present the idea. The approach that will work best will depend largely on the parties involved and their relationship. A rule of thumb is to keep it low profile at the beginning; this makes things less scary and more human (de Ternay 2018).

If efforts to recruit a champion from leadership are not successful, speak to others in leadership positions wherever possible. If no one from leadership is willing to support the project, it may be necessary to return to Step 2 and refine the project proposal or choose a new project altogether. It is not recommended to continue on to Step 4 without backing from leadership. Creating a successful project will be nearly impossible under such circumstances and, even if the project does succeed, there is no guarantee that it will lead to organizational change. It is far more efficient to spend extra time finding the right pilot, use case, and value proposition early on than it is to press forward on a project that goes nowhere. If the efforts put forth thus far fail to convince leadership to support the project, return to Step 2 and refine or select a new project that better meets the needs of leadership. It may take several attempts before landing on a project that gains buy-in.

TALKING POINT: Due to the size of the data and the need for flexibility and scalability, this effort will require a different approach to data storage/management.

Step 4. Establish an Embryonic Big Data Test Environment

After gaining support from leadership for the pilot project, the next step is to establish an embryonic big data environment. This data environment is often referred to as a "playground." It is a test environment in which there is little risk associated with the use of the data. Typically, the playground is a scalable, developmental platform used to explore an organization's data sets through interaction and collaboration. The playground is primarily for business units to explore new data in a big data context, using new data analysis tools and leveraging advanced analytical methods not currently in use by the business unit or the agency.

This playground should support the needs of the pilot project and should follow as many big data best practices as possible from the Modern Big Data Management Framework so that it will be easily scalable to allow for the addition of more data sets or analytics when needed. The playground should provide the capability to work with both small and large data sets coming from both historical and streaming data feeds. It should allow users to perform data analyses, from simple analyses such as aggregation to complex analyses requiring massive parallel processing, large amounts of memory, and high-capacity storage and input/output (I/O) capacity. The playground needs to be separated from production data warehouses to facilitate data experimentation.

The playground is created using cloud services, as they allow access to large storage and computing power on demand on a pay-as-you-go model, which dramatically reduces the cost of running the test environment. Setting up the big data test environment within the agency will require collaboration with, support from, and approval of the agency's IT department.
It needs to be understood, however, that the goal at this stage is not to propose or force an organizational change. Rather, the goal is to establish a separate, independent test environment in which the benefits of the data can be evaluated on a small scale and the benefits of the platform can be demonstrated to others with different use cases, to develop interest and drive adoption across the organization. Step 4 includes the following activities:

• Establish buy-in from IT.
• Establish the test environment.
• Take ownership and responsibility for analytical projects.

Establish Buy-In from IT

The environment developed in Step 4 is a test environment or playground that needs to comply with the modern data systems and management approach described in Table 1 and the big data best practices and recommendations presented in the Modern Big Data Management Framework (e.g., it needs to be scalable and flexible and allow access to a range of users). If IT demands that a traditional, more rigid, and controlled approach be adopted instead, the test environment simply will not work or be successful.

There will almost certainly be a need to share the big data knowledge gained in Step 1, the information contained in this step (Step 4), and the recommendations in the Modern Big Data Management Framework with IT, especially if this is the organization's first foray into big data. A clear understanding of how hardware, software, and cloud pay-as-you-go models differ from traditional procurement is needed, as they are likely to be quite different from what the IT team normally encounters. For example, most cloud-based data storage and processing services charge monthly usage-based fees that can be difficult to compare with the costs of purchasing and maintaining local hardware.
When planning the big data project and associated environment, make sure to refer to the industry best practices and recommendations in the "Store" section of the Modern Big Data Management Framework in this guidebook. Also, refer back to Figure 3, as it visually demonstrates the differences between traditional data architecture and modern data architecture.

Explaining that the computing needs of the test

environment are of an unpredictable, elastic nature, and that for a few hours they can peak to levels higher than all the computational capability found in the organization, will be needed. Preparing an estimate of data access needs ahead of time will also make it easier for the IT team to understand and compare the costs involved. It may also be helpful to recruit at least one IT professional to be integrated into the pilot project team to support close collaboration and clear communication.

If this approach fails, or if the cloud is not allowed, the test environment technically could be developed on premises, either by deploying appliances from typical vendors such as IBM or Oracle or by deploying an on-premises cloud setup. It should be noted, however, that this approach is highly discouraged. Building a cloud-like environment on premises is a large and challenging task. Such environments require advanced server-clustering expertise that is often not found within transportation agencies and that is expensive to acquire. They are also much more expensive and time-consuming to deploy, operate, and maintain than their cloud counterparts, which as a consequence leads them to be deployed under strict control policies that limit data experimentation. The deployment and management of an on-premises, cloud-like environment require a significant amount of resources, including but not limited to:

• Purchase of a large quantity of commodity servers and network hardware to build a cluster.
• Constant replacement of the cluster hardware due to failure, obsolescence, and scaling.
• Deployment and constant maintenance of the cluster software stacks.
• Real-time monitoring of the cluster software stacks and hardware resources.
• Real-time monitoring of users and data to ensure both openness and security.
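The cost estimate mentioned above can be as simple as a short script or spreadsheet. The sketch below uses entirely hypothetical unit prices (real cloud pricing varies by provider, region, and service tier) to show why pay-as-you-go billing suits an elastic, bursty test environment: storage accrues continuously, but compute is paid for only while a job runs.

```python
# Both unit prices are invented assumptions, not any provider's real rates.
STORAGE_PRICE_PER_TB_MONTH = 20.0    # assumed object-storage price, $/TB/month
COMPUTE_PRICE_PER_NODE_HOUR = 0.50   # assumed price of one worker node, $/hour

def monthly_cloud_cost(stored_tb: float, burst_nodes: int, burst_hours: float) -> float:
    """Pay-as-you-go: storage is billed continuously, compute only while used."""
    storage = stored_tb * STORAGE_PRICE_PER_TB_MONTH
    compute = burst_nodes * burst_hours * COMPUTE_PRICE_PER_NODE_HOUR
    return storage + compute

# A playground holding 5 TB that runs a 50-node cluster for 10 hours a month
# pays for 500 node-hours, not for 50 idle nodes around the clock.
print(monthly_cloud_cost(stored_tb=5, burst_nodes=50, burst_hours=10))  # -> 350.0
```

Presenting IT with a table of such scenarios makes the usage-based fee model much easier to compare against the fixed cost of purchasing equivalent peak capacity outright.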
These challenges are the reason why online cloud services were created: mainly to allow organizations small and large to share the burden of managing large computer clusters. The benefits offered by online cloud environments are numerous and make the adoption of an on-premises cloud reasonable only for very large systems such as global banking or video streaming. The following are some of the benefits that online cloud service providers offer over an on-premises cloud:

• Quick adoption of new hardware technologies such as graphical processing units, field-programmable gate arrays, and solid-state drives
• Inexpensive data storage
• Very large computing capabilities
• Built-in data and user management
• A large choice of software implementations, from vendor products to open source
• Shared management and security

While cloud architecture is highly recommended, there may be situations where a cloud environment is not feasible, for example, where data use relies on software that cannot run in a cloud environment for compatibility (such as aging software) or legal reasons. Such situations are increasingly rare, however, leaving most transportation agencies free to pursue the benefits of adopting modern cloud-based data architectures.

The establishment of a big data playground will not come without barriers. The following is a list of barriers that business areas should expect to encounter when establishing a data playground:

• The pay-as-you-go cost model for cloud computing is often perceived as an issue (resulting from a shift from centralized IT procurement/billing to decentralized/distributed billing of groups across an organization that are using the cloud services).
• Data privacy concerns due to a lack of confidence or knowledge with regard to securing data in the cloud.

"IT management wants a clear definition of why you need a data source and what parts of that data source you'll be using.
It took us a few years, but we finally convinced them that people need to see the data first; they can’t just read a 70-page API document and expect to understand what they need.” —Kentucky Transportation Cabinet

• Resource concerns originating from not knowing the amount of resources needed to establish the environment or how much it will really cost in addition to maintaining the current infrastructure.
• Resistance to establishing a new data environment when there is already an infrastructure in place with people maintaining it.

Depending on the business model, IT teams may push back on big data development. This pushback stems not only from a lack of understanding but also from a fear that big data adoption will reduce the team's size or importance to the organization. Finding the right funding structure, so that big data initiatives can use modern technology without fear of eliminating IT payroll positions, is a long-term problem that may require extensive retraining or reorganization. The adoption of a cloud-based data playground should not be dependent on the adoption of cloud at the organizational level (this will come much later). Leadership champions should work on obtaining IT department support in the short term so that pilot project development is not delayed while a long-term solution is sought. The process of obtaining buy-in from IT will most likely be met with resistance, and it is expected that it will take time and require the involvement and support of reputable and highly trusted individuals within the organization.

Common Misconception: Data Stored in the Cloud Are Less Secure Than Data Stored Locally

Storing data on a cloud service does not make it any less secure than storing it locally. In fact, data are often more secure on cloud services, because modern cloud service providers employ large teams of cybersecurity experts who focus on securing data as a critical aspect of their business. In contrast, most transportation agencies are under budgetary constraints that limit how much cybersecurity expertise they can develop and retain in house.
Major cloud service providers therefore have stricter security procedures that employ more up-to-date algorithms than most transportation agencies.

Establish the Test Environment

Establishing a big data test environment/data playground differs from the traditional IT system deployment that agencies currently follow. Allocating a server or a relational database on the cloud as a data playground will not achieve the intent and goals of the playground, as this approach will inherently limit the amount of data that can be explored, as well as how, and how fast, the data can be explored. This approach will also require data preparation before moving data into the playground, which will dramatically reduce the value that can be derived from the data. Rather, the data playground needs to be established in a more flexible yet controlled fashion that is split into two independent layers: a data storage layer and a data processing layer.

Data Storage Layer

The data storage layer is the part of the data environment where the data to be explored are stored. This part of the data environment should be implemented on a cloud storage service to benefit from its ability to easily and inexpensively store, organize, and secure very large data sets and make them available to many different data processing software applications.

Once the storage service is acquired, the data of interest need to be moved to the playground storage. This can sometimes be a problem, as some organizations do not trust that cloud services can store data securely, especially when data are, or are perceived to be, sensitive. It is then essential to assure management, with the support of IT, that the data can be stored securely on the cloud. A solution that is often implemented to help secure data and alleviate the fear of exposing the data in a public cloud environment is a virtual private cloud. Virtual private clouds are on-demand, configurable pools of shared computing resources allocated within a public cloud environment and are designed to provide a certain level of isolation between the different stakeholders using the cloud resources. This solution should be strongly considered to reduce the risks and eliminate existing fears (perceived or real) of exposing sensitive data in the cloud.

While the data playground only uses a simple storage solution to store data sets, the way in which the data are stored will need to satisfy several conditions in order for the data playground to provide the most benefits to its users and avoid becoming messy (i.e., a data swamp):

• The data need to be stored as is, or raw, that is, unedited and untransformed from the way they were provided to the agency (from sensors or third-party APIs). This is rather important, as early transformation of the data may inadvertently filter out data that may be perceived as useless but that in fact are essential to the exploration of the data and the establishment of their veracity and value.
• The data need to be stored under strict "read-only" privileges for all users except the individuals in charge of uploading and managing the data.
Indeed, within the playground, no users should have the ability to alter or delete the uploaded raw data, as they represent the ground source of truth for each individual data exploration effort and need to remain unaltered.
• The data need to be organized logically so that they can be easily found and understood by analysts. To do so, a taxonomy created from keywords describing each data set, along with documentation describing each data set and its content, should be developed. The taxonomy will then be the basis for a simple folder structure into which each data set will be stored. The resulting folder structure and documentation can then be shared within the playground to help users understand which data sets are where and how they are structured. This goes a long way toward providing a head start for most data analysis projects.
• Should some of the data be sensitive, access should be restricted to a few users only. While a virtual private cloud implementation will protect the data from other cloud users, some data will also need to be protected from cloud users within the organization. There are two options available to secure sensitive data in the data playground storage: folder access restriction and data encryption. Depending on the data and the nature of the processing of the data, either or both solutions can be implemented. Folder access restriction should be considered in the taxonomy/folder structure so that limiting access to a folder where sensitive data are stored does not also limit access to non-sensitive data. Data encryption should be considered when there is a need to provide a clean version of a sensitive data set so that analyses can still be performed on the data without risk. Encryption algorithms should be selected carefully, as many have become compromised over the last few years.

Once the data are stored and organized in the data playground storage, they are ready to be explored using cloud analytics services.
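The conditions above (raw, read-only, logically organized) can be mocked up locally before any cloud service is involved. The sketch below uses only the Python standard library; the taxonomy branch names are invented examples, and on a real cloud store the read-only rule would be enforced with bucket policies or access roles rather than file permissions.

```python
import stat
import tempfile
from pathlib import Path

# Invented example taxonomy: source / data set / year.
TAXONOMY = [
    "crowdsourced/traffic-jams/2019",
    "crowdsourced/traffic-jams/2020",
    "sensors/signal-controllers/2020",
]

def build_playground(root: Path) -> None:
    """Create the raw-data folder structure for the playground storage layer."""
    for branch in TAXONOMY:
        (root / "raw" / branch).mkdir(parents=True, exist_ok=True)

def land_raw_file(root: Path, branch: str, name: str, payload: bytes) -> Path:
    """Store a data file as-is, then mark it read-only (the 'ground truth' rule)."""
    target = root / "raw" / branch / name
    target.write_bytes(payload)  # stored unedited, exactly as received
    target.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)  # read-only for all
    return target

root = Path(tempfile.mkdtemp())  # a throwaway root for this demonstration
build_playground(root)
raw_file = land_raw_file(
    root, "crowdsourced/traffic-jams/2020", "jams.jsonl", b'{"id": 1}\n'
)
print(raw_file.relative_to(root), oct(raw_file.stat().st_mode & 0o777))
```

The same taxonomy doubles as the shared documentation of "which data sets are where," since the folder path itself encodes the keywords.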
Data Processing Layer

The data processing layer is the part of the big data test environment where the data sets stored in the data playground storage are processed to create new data that are then stored back into the data storage layer.

Cloud storage services have allowed Delaware DOT to more easily integrate their data sources (which was a challenge to do in house), store all the data in a single repository so end users can access them directly, and leverage cloud tools that they otherwise would not have access to.

Understanding the concept and utilization of a cloud storage layer (i.e., a data lake) is extremely important at this stage. Agencies that do not fully grasp the need for storing raw, unprocessed data are far more prone to mistakes and pitfalls as the architecture and use cases mature over time. One agency described the data lake as the "big undo button." This agency has been forced to deal with several versions of the same data set due to its inability to implement a data lake much sooner in the process.

Again, as with the playground storage layer, the processing layer should be able to support multiple data analysis tools as needed to explore the data in the storage layer. Providing a relational database or traditional statistical software to users in the playground will not suffice, as these will not provide the analytical capabilities needed to process large data sets in parallel, or to process the unstructured and semi-structured data sets stored in the storage layer. They will also not support a varied enough set of analyses to explore the data sets, as traditional analytical solutions often can apply only a limited set of analyses to well-curated data sets.

The data processing layer should not prescribe specific analytical tools; rather, it should allow users to pick and choose the tools they would like to use to analyze the data. This approach is a complete departure from traditional IT management and methods used to control data processing; but in order to explore the potential value of the raw data sets, multiple analysis tools will be needed to discover the characteristics and hidden patterns in the data and develop those into data analytics pipelines. Furthermore, the questions of interest and the skills of individual users will also affect the choice of solutions that should be used for each project. The data processing layer should therefore make available to its users as many tools as possible, ranging from the tools provided by the cloud service provider to open source tools:

• Cloud provider solutions are the easiest way to deploy analytical solutions in a cloud environment but are designed to lock the user in to the specific cloud.
• Open source solutions are ideal for implementing cloud analytics solutions without risking cloud vendor lock-in, but they require additional skills, such as container development and management, to be deployed.
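As a toy illustration of why the processing layer must go beyond a relational database, the sketch below scans semi-structured JSON-lines records (fields vary from row to row, as raw feeds often do) in parallel using only Python's standard library. In a real playground the same map-style pattern would run on a managed or open-source cluster engine rather than local processes; the record format here is invented.

```python
import json
from multiprocessing import Pool

# Semi-structured records: not every row has the same fields.
RAW_LINES = [
    '{"type": "jam", "severity": 3, "street": "Main St"}',
    '{"type": "jam", "severity": 5}',
    '{"type": "alert", "subtype": "HAZARD"}',
    '{"type": "jam", "severity": 2, "street": "3rd Ave"}',
]

def parse_severity(line: str) -> int:
    """Extract jam severity from one raw record, tolerating missing fields."""
    record = json.loads(line)
    return record.get("severity", 0) if record.get("type") == "jam" else 0

if __name__ == "__main__":
    with Pool(processes=2) as pool:  # fan the raw lines out to worker processes
        severities = pool.map(parse_severity, RAW_LINES)
    print("total jam severity:", sum(severities))  # -> total jam severity: 10
```

A schema-on-read step like `parse_severity` is written per question, per analyst; a rigid, pre-curated schema would have discarded the irregular rows before they could be examined.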
The use of non-cloud provider, commercial data analysis solutions is not recommended in the data processing layer, as their costs are often non-negligible, which makes them better suited to a production system than to a system meant for exploration. Indeed, cloud provider analytics solutions offer flexibility at a relatively low financial commitment, and open source technologies can avoid the need to pay substantial sums for every new component. By following these recommendations, the cost of the data playground can remain reasonable without forcing the data processing layer into a commercial analytical solution.

Take Ownership and Responsibility for Analytical Projects

Managing and controlling the playground data storage and processing layers may also raise concerns with IT and management. Indeed, with many users developing multiple data analyses and using multiple analytical solutions, performing data management across the data playground can be seen as overly complex, overwhelming, or even impossible from a traditional point of view. The risk of the environment becoming out of control and overly costly could compromise its flexibility or even its existence after only a few months. To remain in control of the data playground, a different approach needs to be taken that combines the following:

• Complete ownership of the data project by the champion and team. They should be held responsible for development, maintenance, and expenditures and, only in exceptional cases, should they place the burden of supporting the project on the IT department.
• Clear control of who has access to what data, including real-time tracking of who accesses and processes what data and the implementation of alerts and triggers to avoid abuse and violations. This can be implemented by the data governance team or the IT department and should be done using the activity log analysis tools made available by the cloud provider.
• A clearly defined starter budget for each data analysis project in the playground and a defined process for additional funding to limit excessive spending.
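The access tracking and alerting described above can be prototyped as a simple rule over exported activity-log records. The record fields, user names, and threshold below are illustrative assumptions, not any specific cloud provider's log schema; real audit logs (e.g., from the cloud provider's activity log service) are richer but the alerting idea is the same:

```python
from collections import Counter

def flag_suspicious_access(log_records, max_reads_per_user=100):
    """Return users whose read count exceeds a threshold.

    Each record is a dict with (assumed) keys: 'user', 'action', 'dataset'.
    This only sketches the alert/trigger idea; production monitoring would
    use the cloud provider's own activity log analysis tools.
    """
    reads = Counter(r["user"] for r in log_records if r["action"] == "read")
    return sorted(user for user, n in reads.items() if n > max_reads_per_user)

# Illustrative usage with synthetic records
records = (
    [{"user": "analyst_a", "action": "read", "dataset": "probe_speeds"}] * 150
    + [{"user": "analyst_b", "action": "read", "dataset": "rwis"}] * 20
)
print(flag_suspicious_access(records))  # ['analyst_a']
```

A real implementation would also log which data sets were touched and feed flagged users to the data governance team rather than simply printing them.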

Roadmap to Managing Data from Emerging Technologies for Transportation 33

By implementing such a big data test environment, agencies will be able to safely explore new data and achieve many benefits without massive outlay. By investing in the development of team knowledge and giving the team access to the type of data environment and responsibilities typically under the authority of IT departments, agencies will also start to develop a data understanding from within the organization that can begin to foster a culture of data.

A major concern shared among almost every transportation agency is the liability or risk involved with handling sensitive data and PII. Many agencies protect themselves from PII by immediately aggregating away sensitive information and permanently deleting the original data. This approach effectively removes the risk of a data breach, as there is no sensitive data to be compromised. It also removes the risk of being subpoenaed, as personal information cannot possibly be identified from data that no longer exist. The downside of this approach, however, is that if anomalies are found in the data, it may be impossible to determine whether the values are legitimate outliers or data collection errors. If the source data are preserved in an encrypted format on a separate, restricted-access server, a similar level of protection can be achieved while avoiding the loss of data usability.

Common Misconception: Transportation Agencies Must Regularly Delete Data to Keep Data Storage Affordable

It has long been good practice to regularly review data for archival or destruction in order to keep data storage costs low, even to the point that data management life cycles typically included a step to destroy or purge data. In the world of big data, destroying old data no longer needs to be a focus. Most cloud storage providers will automatically transition less-used data to archival storage with no input required from the data owner.
Furthermore, under the usage-based fee structure that is common among cloud providers, the less data are used, the less they cost to store. These factors, along with an overall decrease in cloud storage costs and an overall increase in the value of data, create an environment where retaining as much data as possible is much more feasible than it traditionally has been.

Case Study: On Premise Versus Cloud

The Kentucky Transportation Cabinet started down the road to big data in 2014. Cloud computing during that era was still considered to be too new and too risky for government agencies. As such, for the first 2 years, the team developed and scaled the system using on premise architecture and expertise developed in house. After a staffing change on the development team, and the need for additional resources, the proof-of-concept pilot project graduated to an official project within the Office of Information Technology. Soon after hiring more staff in early 2017, the agency decided to continue using on premise architecture. Even though cloud computing had matured by this point, it still was not an architecture approved by the centralized IT department. Instead of challenging that policy, the development team was told to continue using and scaling on premise architecture. Executive leadership still did not fully understand the benefits and potential of big data or the further benefits of a cloud architecture. During the process of scaling data inputs and processing, the on premise method of scaling big data started to experience very serious pain points. As layers of complex tools and additional data were added, the system stopped operating as expected, and the team was forced to dedicate entire sprints (3-week time periods) to system tuning and optimization. Issues such as small files and CPU/RAM resource management for processing became extremely problematic, taking valuable time away from use case development.
An outside organization was eventually hired for the sole purpose of managing the network servers so the developers could spend more time working toward business use case functionalities. In 2019, the development team finally received the go-ahead to proceed with a proof of concept to move the real-time data pipeline to the cloud.
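The automatic archival tiering described in the misconception box above is usually expressed as a lifecycle rule on a storage bucket. The rule below is a sketch in the general shape of S3-style lifecycle configurations; the rule name, prefix, day thresholds, and tier names are illustrative assumptions, and applying such a rule would be done through the provider's API or console:

```python
# A lifecycle rule in the general shape used by S3-style object stores:
# transition objects to cheaper tiers as they age instead of deleting them.
lifecycle_rule = {
    "ID": "tier-down-raw-data",          # illustrative rule name
    "Status": "Enabled",
    "Filter": {"Prefix": "raw/"},        # assumed layout: raw feeds under raw/
    "Transitions": [
        {"Days": 90, "StorageClass": "INFREQUENT_ACCESS"},
        {"Days": 365, "StorageClass": "ARCHIVE"},
    ],
}

def storage_class_for_age(rule, age_days):
    """Return the storage tier an object of the given age would occupy."""
    tier = "STANDARD"
    for t in sorted(rule["Transitions"], key=lambda t: t["Days"]):
        if age_days >= t["Days"]:
            tier = t["StorageClass"]
    return tier

print(storage_class_for_age(lifecycle_rule, 30))   # STANDARD
print(storage_class_for_age(lifecycle_rule, 400))  # ARCHIVE
```

The point of the sketch is that retention becomes a declarative policy: nothing is destroyed, and the owner never has to intervene as data age into cheaper tiers.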

Step 5. Develop the Pilot Project Within the Big Data Test Environment/Playground

In Step 5, the champions and team will work within the big data environment established in Step 4 to develop the pilot project. The development itself should be viewed as an iterative process with multiple feedback loops and revisions expected as the project evolves. The outputs of Step 5 will include data products (e.g., visualizations, dashboards, or maps) that demonstrate the value of the data to the department. In Step 5, the champions and team should:
• Develop/ensure availability of the right expertise.
• Develop the project applying a data science perspective.
• Iteratively develop/improve the project and associated outputs.

Develop/Ensure Availability of the Right Expertise

To support the development of the big data environment and pilot project, an interdisciplinary team must be gathered. This team should include cloud architects, modern data management specialists, big data analysts, and business area specialists. This team can consist of in-house personnel, contractors, university personnel, or all three. When choosing which resources to engage in the development of the project, keep in mind that while the pilot project should provide benefit to the organization, that is not its only goal. This project will eventually serve as evidence when arguing for the adoption of modern big data management practices across the organization. Therefore, there is an inherent benefit to using in-house staff resources as much as possible; they will naturally develop into data champions as they work on these projects. This in-house knowledge, experience, and enthusiasm for big data often proves invaluable to generating the momentum needed for organizational change.
There are options for developing/ensuring the availability of the right expertise to develop the big data pilot project within the test environment. These options include
• Developing the expertise with staff already in-house
• Acquiring/hiring new staff with big data expertise
• Contracting with trusted contractors or universities
• Contracting with big data experts/consultants

Table 4 provides a high-level overview of some of the pros and cons of each of these options for developing or working with different resources to develop the big data pilot project. How an agency decides to obtain the right expertise will be dealt with on an agency-by-agency basis. Some agencies may prefer to develop the expertise in house. Developing the expertise in house requires a long-term commitment from the agency to not only develop the appropriate job descriptions and skill requirements [e.g., modern big data pipelines; modern tools (machine learning or natural language processing); Python, JSON, object-oriented programming, or applied statistics] with commensurate and competitive salaries but also to ensure personnel receive ongoing training to keep their skills up to date given the fast-changing environment. As private companies can adjust more quickly to changes, other agencies may prefer to contract the necessary resources to big data experts/consultants; however, it is strongly recommended that in-house staff develop a core set of skills through education and/or experience to provide oversight, ask the right questions, and verify the quality of data and work. Agencies are cautioned regarding the use of their traditional, on-call contractors, as many have also not yet developed or hired staff with the requisite skill sets to develop big data systems and projects.

“Training in-house staff is a major issue, and we will always argue that in-house staff need to be trained, even if the contractor performs the work or builds the solution.
This cannot be understated, even if your role is just to lead and advise. We put in the hours, learned the technology, and designed the system instead of just asking for things without understanding what we were asking for. As a result, the agency has benefited greatly.” —Kentucky Transportation Cabinet

Develop the Project Applying a Data Science Perspective

Given the overall goals of the pilot project, it is important that the team approach the project from a data science perspective (as opposed to a more traditional approach), that is, extracting value from the data. Project development steps applying a data science perspective include the following:
• Identify the goal of the project.
• Collect raw data.
• Process and clean the data.
• Perform exploratory data analyses.
• Build data science pipelines.
Each of these project development steps is discussed in more detail herein.

Table 4. Pros and cons of different potential support resources.

Training or hiring in-house personnel
Pros: The same resources that built the system will support the system. Skills developed during the pilot will be retained for other projects. Resources become data champions within the organization. New staff hired with requisite skill sets can be immediately effective and productive.
Cons: Difficult to attract big data professionals to transportation agencies (e.g., salaries not competitive with private sector). Can be costly to hire for big data skill sets. Training for big data skill sets requires time and dedication. Training in-house staff makes them more marketable to other organizations (may lose them after they are trained).

Trusted contractors and university partners
Pros: Trusted/vetted resources. May be able to get things done more quickly if in-house resources are limited and there are competing priorities. Usually local or pre-approved. Cost accountability.
Cons: Do not often employ staff with the requisite big data skill sets. Cannot verify quality of work without some in-house expertise. High turnover rate. University students involved in projects have little business experience.

Big data experts/consultants
Pros: Already possess requisite big data skill sets. Understand languages and tools.
Cons: Cannot verify quality of work without some in-house expertise. High turnover rate. Not vetted. Not usually local or pre-approved. Comes at a price, that is, expertise more widely available but still not common.

Be sure to reference the Modern Data Management Framework section of this guidebook for more details.

Identify the Goal of the Project

While the champion and team likely accomplished at least some of this in Step 2 of the Roadmap, this first step of project development involves developing a clear sense of what the team is trying to achieve through analyzing the data sets in question. Questions answered at this stage include (Turner 2019):
• What decisions need to be made from the data?
• What questions does the team wish to answer?

• For answers, what level of confidence would the team be happy with?
• Can the team formulate hypotheses relating to these questions? What are they?
• How much time does the team have for the exploration?
• What decisions would the team like to make from the data?
• What would the ideal results look like?
• How is the team to export and present the final results?
The team should consider brainstorming, whiteboarding, and workshopping during this stage to answer these and other questions to develop a clear line of sight to the overall objectives of the project.

Collect Raw Data

The next step in developing the project is to identify and collect data that will help to provide insights needed to formulate a solution for the problem at hand. This part of the process involves thinking through what data the project team will need and how they can be obtained. The latter is often the harder of the two. Data can be obtained in two ways: by obtaining historical data sets and by collecting data directly from real-time data feeds. Unlike some traditional data providers, most modern third-party data providers generate so much data that they do not provide historical data sets, just a live data stream. The project team will need to make sure to collect enough data from the data streams to ensure a good understanding of the data collected. It is also useful to ensure that the team has a bigger-picture understanding of what is there. Questions asked during this stage include (Turner 2019):
• What is the size of the data?
• How many files are there?
• To what extent does the data originate from different sources?
• Automated exports or manual spreadsheets?
• Does the data have consistent formats (dates, locations, etc.)?
• What is the overall data quality?
• What is the level of cleaning required?
• What do the various fields mean?
• Are there areas in which bias could be an issue?
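Several of the questions above (data size, consistent formats, missing values) can be answered quickly with a first profiling pass over the data. A minimal sketch using pandas, with a synthetic data set and assumed column names standing in for a real feed:

```python
import pandas as pd

def profile(df: pd.DataFrame) -> dict:
    """Summarize basic quality characteristics of a raw data set."""
    return {
        "rows": len(df),
        "columns": list(df.columns),
        "missing_per_column": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
    }

# Synthetic stand-in for a raw probe-speed feed (column names are assumptions)
raw = pd.DataFrame({
    "sensor_id": ["A1", "A1", "B2", "B2", "A1"],
    "speed_mph": [61.0, 61.0, None, 48.5, 55.2],
    "timestamp": ["2020-01-01 08:00", "2020-01-01 08:00",
                  "2020-01-01 08:00", "2020-01-01 08:05", "2020-01-01 08:05"],
})
summary = profile(raw)
print(summary["rows"], summary["duplicate_rows"])  # 5 1
```

Running a pass like this on each new feed gives the team the "bigger picture" of size, duplication, and missing values before any cleaning effort is committed.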
Answering these and other questions about the data can aid the team in deciding how to go about the analyses and determining which aspects of them are most important for the analyses. As a start, organizations should consider adding the following data sets to their data playground:
• Agency traffic/speed data
• Road weather information systems (RWIS) data
• Operations/traffic management center data (incident reports, etc.)
• Social media and/or crowdsourced data (e.g., Waze)
• Third-party vehicle probe speed data

Process and Clean the Data

Once the raw data have been collected and a copy has been stored, the next step is to transform and clean a working copy of the data before performing any in-depth analyses. In many cases, raw data can be quite messy (duplicate values, missing values, corrupted values, nonstandard timestamp formats, time zone differences, or unexpected data in columns or fields).

As such, this processing and cleaning can take a long time and can be relatively tedious work, but the results are well worth the effort. This step includes (Turner 2019):
• Combining all data into a single, indexed database.
• Identifying and removing data that are of no relevance to the defined project goal.
• Identifying and removing duplicates.
• Ensuring that important data are consistent in terms of format (dates, times, or locations).
• Dropping data that are clearly not in line with reality; these are outliers that are unlikely to be real data.
• Fixing structural errors (typos or inconsistent capitalization).
• Handling missing data by either dropping or interpolating values.
• Labeling and organizing the data efficiently so that there is no confusion about what is contained within the data or what the data mean.
Processing and cleaning can be a bit of a manual discovery process and are greatly facilitated by data analysis experience and domain knowledge expertise. To uncover errors, the project team will want to look through various aggregates and plots of the data and assess whether the values make sense. Once uncovered, depending on the findings and problem to be solved, the project team will likely need to correct, rename, or remove data; however, not all errors need to be removed through this process. In some cases, the errors need to be treated as trusted data to solve certain problems. Additionally, the project team may want to enrich or augment the data sets by adding new values needed to solve the problem. This is often done by joining the data with another data set containing the data to be added. An example of enrichment is adding historical weather data to incident data by joining the two data sets by location and time.
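The cleaning and enrichment steps above can be sketched with pandas. The column names and data values are illustrative assumptions; the weather join uses `merge_asof`, which pairs each incident with the nearest-in-time weather observation at the same location:

```python
import pandas as pd

# Synthetic incident and weather data (column names are assumptions)
incidents = pd.DataFrame({
    "location": ["MM-101", "MM-101", "MM-205"],
    "timestamp": pd.to_datetime(
        ["2020-03-01 07:02", "2020-03-01 07:02", "2020-03-01 09:40"]),
    "severity": [2, 2, 3],
})
weather = pd.DataFrame({
    "location": ["MM-101", "MM-205"],
    "timestamp": pd.to_datetime(["2020-03-01 07:00", "2020-03-01 09:30"]),
    "condition": ["rain", "clear"],
})

# Clean: drop exact duplicates, then sort (merge_asof requires sorted keys)
clean = incidents.drop_duplicates().sort_values("timestamp")
weather = weather.sort_values("timestamp")

# Enrich: nearest weather reading at the same location, within 1 hour
enriched = pd.merge_asof(
    clean, weather, on="timestamp", by="location",
    direction="nearest", tolerance=pd.Timedelta("1h"),
)
print(enriched[["location", "condition"]].to_dict("records"))
```

The `tolerance` argument keeps the join honest: an incident with no weather reading within an hour simply gets a missing value rather than a misleading match.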
Perform Exploratory Data Analysis

Once the data have been processed and cleaned, the project team can begin inspecting, exploring, and modeling the data to find patterns and relationships that were previously unknown. This process is heuristic and requires a lot of poking and testing to uncover patterns from the data. This may include (Turner 2019):
• If there are time-based data, exploring whether trends exist in certain fields over time, usually using time-based visualization software.
• If there are location-based data, exploring the relationships of certain fields by area, usually using mapping software.
• Exploring correlations (R-values) between different fields.
• Classifying text using natural language processing methods.
• Implementing various machine learning techniques to identify trends between variables/fields.
• If there are many variables/fields, using dimensionality reduction techniques to reduce these to a smaller subset of variables that retain most of the information.
Here again, experience in big data analysis and domain expertise can be of great help. The difficulty of this step is to come up with ideas and tests that can quickly lead to valuable patterns so as not to lose too much time and money exploring areas of the data that do not provide value. This can be difficult when the data sets are very large, very small, or unstructured and when the domain they cover is not well known. From the uncovered insights and patterns, the project team can now develop a more in-depth analytic pipeline. This step can sometimes reveal patterns in the data that may require the data to be transformed and cleaned in a different way than was done in the previous step, in which case the process-and-clean step should be repeated.
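Two of the exploratory checks above, time trends and field correlations, take only a few lines with pandas and NumPy. The fields below are synthetic stand-ins built so that a pattern exists to find (speed deliberately constructed to fall as volume rises):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)          # seeded so the exploration repeats
n = 24 * 7                              # one week of hourly observations
df = pd.DataFrame({
    "hour": pd.date_range("2020-06-01", periods=n, freq="60min"),
    "volume": rng.poisson(400, n),      # synthetic hourly vehicle counts
})
# Synthetic speeds, inversely related to volume plus noise
df["speed"] = 70 - 0.1 * df["volume"] + rng.normal(0, 2, n)

# Time trend: average volume by hour of day
hourly_profile = df.groupby(df["hour"].dt.hour)["volume"].mean()

# Correlation (R-value) between fields; strongly negative by construction
r = df["volume"].corr(df["speed"])
print(round(r, 2))
```

With real data the interesting part is when checks like these contradict expectations — that is usually either a data quality problem (send it back to the process-and-clean step) or a genuinely new pattern.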

Build a Data Science Pipeline

This step focuses on developing a prototype analytical pipeline based on the findings of the previous steps. A data science pipeline is a sequence of processing and analysis steps applied to data for a specific purpose. Figure 6 shows an example of a modern data pipeline, developed from data in a data lake. The development of a data science pipeline is composed of two main sub-processes: one sub-process that first applies a refine, transform, and clean process to the raw data, then a subsequent sub-process that applies descriptive, inferential, predictive, prescriptive, or causal analyses to the transformed and cleaned data to generate a resulting data product. After the raw data are cleaned, better understood, and prepared for analyses and the exploratory analyses are conducted on the data, the team will likely be ready to develop a pipeline that will generate a specific end product, such as a report or dashboard, which may run automatically to continuously inform a specific business unit function. Once established, this data pipeline will automatically pull in the appropriate raw data, transform the data as necessary, and apply pre-defined and pre-established data analytics techniques to develop the end data product. When developing the analytical pipeline, the team should consider the use of the following:
• Open source, data science friendly programming languages such as Java, Scala, Python, R, and Go as opposed to traditional programming languages such as C# or Visual Basic, as these support few data science libraries.
• Open source data science frameworks such as pandas, NumPy, Matplotlib, Jupyter Notebooks, MapReduce, scikit-learn, TensorFlow, and Keras as opposed to commercial relational database modules, as these are too costly, limited, and slow to evolve for prototype pilot projects.
• Advanced data science algorithms such as stochastic gradient descent, random forest, Lasso, ElasticNet, naive Bayes, and deep learning, as opposed to classic statistical techniques such as linear regression, logit, and least squares.
The project team should use whatever they can afford to build a pipeline that is of value. Once the pipeline is successful and capable of demonstrating a value product, the project team should prepare a “data story,” combining qualitative insights and quantitative analyses designed to move people to action. There is no need at this point to spend resources “above and beyond” those necessary in an attempt to perfect the data pipeline or the products; the project is still at a prototype stage and these extra resources may not provide a good return on investment. Data science and cloud solutions are evolving constantly and at a rapid pace; as such, a prototype pipeline needs to be modified constantly to keep up with these advancements. Therefore, a pipeline that is “good enough” is more resource efficient than a “perfect” data pipeline even at the production stage.

Figure 6. Example of a modern data pipeline.

Delaware DOT found the cloud was more efficient both in processing speed and in the available pipeline tools. Even with experienced IT personnel on staff, developing and maintaining a large suite of data tools was less efficient than simply using the tool suite that the cloud vendor had already built, tested, secured, and optimized. One data process that used to take days for their on-premise hardware and software to process was completed in an afternoon on the cloud.

Most modern data science pipelines are ephemeral, that is, they are built, deployed, and run

at start time and dismantled at stop time, leaving many opportunities to update them between runs without ever being perfected.

Iteratively Develop/Improve the Project and Associated Outputs

The development of the project is likely to be more successful if the project is treated as an iterative process with multiple feedback loops and revisions taking place throughout the development process. This iterative development process will serve as a “test run” of a larger organization-wide evolution of data management practices, where challenges are encountered and overcome, and lessons are learned and documented by a small and dedicated team before they are later faced by the agency as a whole. While being able to obtain valuable results from the developed pipeline is important, the ability for these results to be useful to the rest of the organization is just as important and even essential to adoption of more modern data practices within the organization down the road. As a project becomes more and more polished, it may be prudent to add additional analyses, both to make the project more successful and to experiment with different techniques. Many big data analyses require a relatively clean and complete data set in order to be effective, so this is likely to be the first time they can be attempted. Applying such methods successfully will help the project display not only the benefit of modern data handling practices but also the benefit of modern data analyses that are only possible with big data. This is also a great time to experiment with additional data sources where applicable. Seek out additional data sources that can be used to augment the original data. If there are additional data sources that are unstructured, ill formed, or otherwise difficult to work with, now is a good time to attempt to clean and process them.
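The two sub-processes that make up a data science pipeline — refine/transform/clean the raw data, then analyze it into a data product — can be expressed as composable steps. A minimal, framework-free sketch with assumed field names and thresholds (a real pipeline would schedule stages like these with the cloud provider's orchestration tools):

```python
def refine(raw_records):
    """Refine/transform/clean: drop malformed records, normalize fields."""
    cleaned = []
    for r in raw_records:
        if r.get("speed_mph") is None or not (0 <= r["speed_mph"] <= 120):
            continue  # drop missing or implausible values
        cleaned.append({"segment": r["segment"].upper(),
                        "speed_mph": float(r["speed_mph"])})
    return cleaned

def analyze(cleaned):
    """Descriptive analysis: average speed per segment (the data product)."""
    totals = {}
    for r in cleaned:
        n, s = totals.get(r["segment"], (0, 0.0))
        totals[r["segment"]] = (n + 1, s + r["speed_mph"])
    return {seg: round(s / n, 1) for seg, (n, s) in totals.items()}

def pipeline(raw_records):
    return analyze(refine(raw_records))

raw = [
    {"segment": "i64-e", "speed_mph": 58.0},
    {"segment": "I64-E", "speed_mph": 62.0},
    {"segment": "I64-E", "speed_mph": 999},   # corrupt reading, dropped
    {"segment": "I75-N", "speed_mph": None},  # missing value, dropped
    {"segment": "I75-N", "speed_mph": 71.0},
]
print(pipeline(raw))  # {'I64-E': 60.0, 'I75-N': 71.0}
```

Because the stages are plain functions, each one can be swapped or extended between runs, which is exactly the "good enough, never perfected" posture the text recommends for ephemeral pipelines.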
The most convincing arguments for adopting modern data management practices can be made by pointing to applications that are difficult or impossible to achieve without said practices. This stage of development is an ideal time to stretch the team’s limits and attempt challenging data work, because the cost of failure is low and the rewards for success are high. The lessons learned and challenges overcome during this phase will improve the pilot project and future projects to come. Putting analytic pipeline results back into the data playground storage layer and sharing them with others as a potential data source to solve new problems is also crucial. This is how the organization will be able to quickly build an advanced data practice and adopt a more data-driven approach.

Case Study: Negotiating Technical Contracts for Data Services

The Los Angeles County Metropolitan Transportation Authority (LA Metro) initiated a Mobility on Demand (MOD) project to provide first mile/last mile transportation services. To develop this project, LA Metro partnered with Via, a third-party shared-ride transportation company. Under this partnership, Via not only provides shared rides to LA citizens but also provides data services, including collecting, storing, cleaning, preparing, and aggregating data associated with the service. Via also generates data visualizations and makes them available to LA Metro via an online dashboard. In order to make this project and partnership a success, LA Metro needed to overcome significant challenges with contract development and data sharing.

At the time the partnership agreement was negotiated, few employees at LA Metro had any experience with highly technical IT contracts, so the agency had to work closely with legal counsel to fully understand and navigate all the details. Both LA Metro and Via wanted to maintain maximum control over the data while limiting liability. Via did not want to share raw data to avoid inadvertently disclosing trade secrets, while LA Metro did not want to be held liable for potential security breaches on systems over which they had no control. It took a considerable amount of time and effort, but LA Metro was able to successfully navigate this process and draft a mutually beneficial contract for both parties. Data sharing was another issue that took time and attention to resolve. At the time the agreement was developed, LA Metro did not yet have any organization-wide internal data-sharing policies. As such, they had to coordinate individual tasks with each data silo owner, many of whom had reservations about sharing their data with an outside company like Via. Talking to these stakeholders to individually address their concerns about data security, appropriate data use, and competent data analyses was time-consuming but proved vital to the success of the project.

Case Study: Building Data Knowledge

In their transportation technology strategy document Urban Mobility in a Digital Age, Los Angeles DOT (LADOT) recognizes the need for modern approaches to use vast amounts of data, noting that “LADOT and other city departments must access and understand underlying data to make strategic decisions about prioritization. While the city generates large volumes of data, it lacks comprehensive, quality data to plan for all modes, evaluate existing programs, and understand how to adapt” (Hand 2016).
Since that document’s publication in August 2016, LADOT has worked to select projects that provide specific benefits while advancing their data management practices generally. One of these projects was a data inventory. When working with the city council and mayor to fund broader transportation technology initiatives, LADOT found it useful to begin with a data inventory to assess how the city’s many departments and business units obtained, managed, and shared data. This inventory resulted in immediate benefits to the city by identifying areas where data efficiency could be improved, including business units that heretofore had been operating with no concept of internal data sharing. The inventory has also helped support the building and expansion of other data-related initiatives by the DOT.

Step 6. Demonstrate the Value of the Data to Other Business Units

Once the project has developed to a point where it generates real value for the business unit, it is time to share it with others outside the business unit. The goal of Step 6 is to share the approach, outcomes, and value of the data, project, and resulting data products to develop interest and buy-in on the project and approach from these groups. To begin to drive change within the organization, others need to know about the project. In Step 6, the champions and team should:
• Build support for the data and project horizontally within the organization.
• Use the data to tell the story of success.
• Get others involved in sharing and using their data within the test environment.

Build Support Horizontally

Depending on the organization, it is most likely the leadership champion’s role (with support from the pilot project champion) to share and demonstrate the value of the pilot project to others within the organization. A good place to start is with other mid-level/branch managers that may have an interest in the data, project, and data products (or similar products) for their own business areas. To reach a larger potential group of stakeholders in this step, volunteer to give presentations or demonstrations at various organizational events (e.g., coalition meetings, district meetings/conferences, or statewide maintenance conferences). Be careful with sharing the project too soon with those at the executive level. Sharing the project vertically (Step 8) is likely to be more successful after multiple iterations of Steps 2 through 6 have generated more successful use cases.

Use the Data to Tell a Story

The data (and the resulting data products) are the best ammunition for selling the benefits of the data and the approach.
Using the data and the resulting data products, the project team will need to craft a compelling story using understandable and persuasive visualizations that tie the insights uncovered in the data to the ability to address an issue or solve a problem of the business units.

Case Study: Building Data Knowledge (Continued)

One success story resulting from these projects has been the formalization of a Bureau of Transportation Technology. The growth of LADOT’s nascent Bureau of Transportation Technology is sustained by making efficient use of available resources and avoiding potential budgetary or administrative roadblocks. Six positions were moved from the IT department into the Bureau as a means of quickly onboarding skilled talent. When partnering with consultant services, the internal team prioritizes knowledge transfer to augment internal training and development.

When seeking buy-in from other business units, create a storyline for the project that includes the following:
• Introduction (characters and setting are introduced). Explain who/what group within the organization is involved in the pilot project, their roles and functions, and their goals.
• Problem (main characters face a series of conflicts). Detail the challenges and roadblocks currently keeping the characters from effectively/efficiently completing their functions and meeting their goals.
• Climax (the most exciting part of the story). Describe the new and exciting data, the potential for the data to improve work functions and meet goals, how the data pipeline was established and what data were combined, how the data were assessed, what was discovered (about quality and potential uses of the data), new and exciting tools and analyses conducted, and challenges and how they were overcome. Visualize the data to enhance the story. Make the visualizations beautiful and easy to understand.
• Resolution (events leading to the end of the story where the outcome is revealed). Reveal the resulting data products and how they resolved the problem. State that big data sets, including messy and “dirty” data, and big data tools and analytics techniques can be used within the agency and that this approach adds value by providing insights that the agency simply would not have without the data and the approach.
• Conclusion (the end of the story, judgment/decision reached). Conclude with how these data and this approach have fundamentally changed the way the characters view data and how they will use data in the future. Deduce that the same approach could work for other business units (and the organization as a whole).
The results should be shared in two ways. The first way is to create the compelling story that will be used to communicate to management and other business areas.
The second way is by sharing the results with others in the data playground environment so that they can be used as a data source for other analytics projects.

Get Others Involved in Sharing and Using Their Data Within the Test Environment

A primary goal of Step 6 is to generate interest in the data and the embryonic data environment that results in more data and additional use cases and pilot projects. This Roadmap to big data represents an organic, bottom-up approach that relies on an iterative process to grow use cases and pilot projects and to build a stronger case for the data and the big data environment across the agency. Figure 7 illustrates this iterative process. After demonstrating the value of the pilot project to various business units and generating interest from one or more of these groups in Step 6, loop back to Step 2 to identify new pilot projects (and associated data) specific to these groups (this iterative process is also illustrated in the full Roadmap in Figure 4). If necessary, secure buy-in from business unit leadership (note, however, that given the process, these new pilot projects might be more top-down initiatives). Once new use cases and pilot projects are identified, work with the other business units to develop their projects and data products within the data environment and subsequently share these results across the organization.

It is important at this point to not try to force change. Let others see the value of the data and the environment and envision how they might develop a similar pilot project for their business unit. Enough iterations will prepare leadership champions to market the data, data products, and data environment to organizational executives in Step 7. Some agencies may find that they are ready to go to executive management after the first pilot project, while other agencies may iterate through multiple projects before they are ready. The structure, culture, and relationships within each agency will largely drive this decision.

Delaware DOT sees firsthand the value of the data they have been collecting and building. They realize now more than ever the importance of storing raw data and then integrating and analyzing these data to drive decisions rather than basing decisions on experience, conjecture, or gut feelings. They are now able to back up decisions with real data and are helping others within the department, as well as partners outside the department, to better understand this approach.

"We didn't force anyone to change as we implemented the project and explored the data within our test environment. We simply asked if we could copy others' data into our system. It was a very non-confrontational approach. Most people were flattered that we wanted their data and they're now impressed with the outcome."
—Kentucky Transportation Cabinet

If the efforts put forth thus far fail to result in interest or support beyond the business unit, return to Step 2 and refine and expand on the current project and/or develop a new pilot project with data from other business units to better meet their needs and appeal to them. It may take several attempts before landing on a project that gains widespread buy-in throughout the organization. Remember that the business unit can still derive value from the data and the data environment even if widespread interest is not generated immediately or even after several iterations or projects. With continued development of use cases and projects within the business unit, eventually the team will have more to communicate, and others within the organization will catch on and want to learn more. Once it takes off, it will be impossible not to talk about it.

KYTC first implemented Elasticsearch, an open source tool for storing and discovering data, simply so users could browse the data. This helped with resource management and allowed users a safe, read-only environment for accessing, interacting with, and reporting on data. The visualizations of the data also greatly helped to sell the idea of investing in big data to upper management. "Everyone who has interacted with our system has had something to say about it to someone." —Kentucky Transportation Cabinet

Figure 7. Iterative process to generating interest and buy-in horizontally across the organization.

Case Study: Iterative Success and Growth

For KYTC, the original proof of concept (pilot project) for big data was to develop a real-time snow and ice decision support system, with the goal of making efficient use of materials, trucks, and labor.
The first iteration of the system was based on just three data sources: Doppler radar, snowplow AVL, and Waze. After demonstrating the ability to process a variety of data sources with different volumes and velocities, the project was considered a success and quickly gained attention and support from leadership. Over time, as the platform evolved through the iterative process, additional data sources were added. Eventually, the system combined 11 additional data sources with the original three, all in real time. These additional sources include data from HERE Traffic, iCones, Twitter, the Kentucky Mesonet, CoCoRaHS, TMC reports from two TMCs, RWIS, county activity reports from 120 counties, dynamic message signs, and truck parking.

The capabilities of the system and the data drew considerable attention from multiple divisions within KYTC, including the Division of Environmental Analysis, the Intelligent Transportation Systems group, the Office of Highway Safety, the Department of Motor Carriers, the Division of Planning, the Division of Traffic, and the Division of Maintenance. Each of these divisions now benefits in some manner from the data collected and processed through the big data system:

• The Division of Environmental Analysis uses some of the data to help it report on environmental impacts to the roadway network as part of the MAP-21 requirements.
• The Intelligent Transportation Systems group uses the system for real-time crash detection across the entire roadway network, publishes information to Waze to help mitigate congestion due to road closures or other issues, and leverages big data to provide real-time monitoring of the roadway with 0.01-mile and 2-minute precision.
• The Office of Highway Safety can leverage the system to produce precise after-action reviews of crashes.
• The Department of Motor Carriers now publishes real-time, post-processed data to mobile applications such as PrePass to inform the trucking community of roadway issues.
• The Division of Planning leverages the big data architecture to quickly run calculations on historic traffic data to measure the performance of the roadway network.
• The Division of Traffic uses the system to assist with signal optimization research in addition to providing data to an existing software vendor for dynamic signal timing.
• The Division of Maintenance fully leverages big data to provide a robust and comprehensive snow and ice decision support system.

KYTC was also able to replace the legacy, vendor-based 511 system with a new in-house traveler information system, eliminating a $750,000-per-year contract. With the big data environment in place, this repurposing of the data and recreation of the 511 system required less than 200 hours of combined labor.

Step 7. Demonstrate the Value of the Data to Executive Leadership

After one or more iterations of Steps 2 through 6 and growing the number of use cases and pilot projects within the big data environment, it may be time to begin marketing the data, environment, results, outputs, and benefits to higher-level executives within the agency. While Step 7 is similar to Step 6 in many ways, the task of selling the data and approach vertically within the organization will likely come with a different set of challenges than selling them horizontally. In Step 7, do the following:

• Present the success stories and business case to executives.
• Continue to build support, foster data sharing, and grow incrementally.
• Push for organizational change and adoption of a formal big data environment.

Present the Success Stories and Business Case to Executives

The importance of communicating effectively with leadership in terms that they value cannot be overemphasized. Make certain to address clear business needs within the first few minutes of the conversation or presentation. Avoid discussing specific technical details except where absolutely necessary; focus instead on presenting the project results in terms of measurable benefits, resources saved, or new capabilities or efficiencies gained. Create a logical link between the demonstrated benefits of the pilots and the benefits of expanding them more widely across the organization. Similar conversations or presentations should be sought with executives representing a variety of business areas. Anticipate that executives across the organization will have special needs or concerns when it comes to migrating their data from local data siloes to a unified data lake architecture.
Taking the time to educate leaders about data access management and security within a data lake environment could prove invaluable in reducing hesitation and avoiding implementation delays.

Each step of building the big data management capabilities to effectively handle data from emerging technologies has involved communicating concepts and educating colleagues. As the scope of effort expands from individual pilot projects to organizational change, many agencies will find it is no longer feasible to have a small team shoulder the burden of spreading knowledge. Unless the organization is very small, recruiting more data champions across as many teams within the organization as possible is strongly recommended. By delegating some of the education effort to these champions, nearly every member of the organization can gain essential understanding without placing an unreasonable burden on any one team.

Continue to Build Support, Foster Data Sharing, and Grow Incrementally

Just as iterations of Steps 2 through 6 may be necessary to get to Step 7, further iterations involving Step 7 (as illustrated in Figure 8) may also be required to gain buy-in from business units or higher-level executives. In other words, as various data products are demonstrated to executives, there will likely be questions and requests that require circling back to the use case development cycle of Steps 2 through 6. A successful outcome of Step 7 (and of any associated iterations) would be for an executive to declare, "This is the way we're going to go!"

Common Misconception: Data Owners Have Less Control Over Their Data After Uploading the Data to a Data Lake

Business unit siloes, and associated data siloes, are and will continue to be a barrier to the ability of transportation agencies to leverage data from emerging technologies. Rigid processes around how data are managed and analyzed reflect a traditional approach to data management that was developed to facilitate fast query speeds and low error rates when working with relational database management systems. Today this approach is outdated and cannot be applied to new, big data sources.

One common misconception that drives the traditional approach of "keeping the data close to the vest" is that data owners will have less control over their data if the data are moved into a common data lake environment. However, cloud-based storage follows the same concepts of user authentication and authorization as traditional storage: one is no more "open" than the other. With cloud storage, the data owner can restrict the data to as many or as few users as they like. Unauthorized users can be prevented from accessing the data or even from seeing that the data exist.

In addition to the fear of losing control over data access, some data owners also believe that migrating data to a data lake will mean a loss of control over data quality. These data owners may have been the sole authoritative source of the data, or they may have business needs that require the data to be cleaned or manipulated in some particular way prior to being presented. Such data owners may mistakenly perceive that if their data are shared in a data lake, then other business units or analysts may interpret their data incorrectly or present conflicting results.
Such a situation can be easily avoided by limiting access to data that may be prone to misinterpretation or misuse, or by making only the final version of those data visible to all users.
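The authentication-and-authorization point above can be illustrated with a minimal sketch. The dataset names, roles, and policy structure below are hypothetical, and a real data lake would enforce this through the cloud provider's identity and access management (IAM) policies rather than application code; the sketch only shows the concept that the owner's policy decides both who can read a dataset and who can even see that it exists.

```python
# Minimal sketch of per-dataset access control in a shared data lake.
# Dataset names and roles are hypothetical; cloud providers enforce the
# same idea through IAM policies attached to buckets or prefixes.

# The data owner declares which roles may read each dataset.
ACCESS_POLICY = {
    "crash_records_raw": {"safety_analyst"},                         # restricted
    "traffic_counts_daily": {"safety_analyst", "planner", "public"}, # broadly shared
}

def can_read(role: str, dataset: str) -> bool:
    """Return True only if the owner's policy grants this role access."""
    return role in ACCESS_POLICY.get(dataset, set())

def list_visible(role: str) -> list:
    """Unauthorized users cannot even see that a restricted dataset exists."""
    return sorted(d for d in ACCESS_POLICY if can_read(role, d))

# The owner keeps full control: a planner never sees the raw crash records.
assert can_read("safety_analyst", "crash_records_raw")
assert not can_read("planner", "crash_records_raw")
assert list_visible("planner") == ["traffic_counts_daily"]
```

The design point is that the policy travels with the data, not with the infrastructure: moving a dataset into shared storage changes where it lives, not who may read it.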

Figure 8. Iterative process to generating interest and buy-in vertically within the organization.

Case Study: Buy-in from Executive Leadership

At KYTC, after the big data proof-of-concept system had proved useful for snow and ice operations and real-time monitoring for the Division of Maintenance and Intelligent Transportation Systems, the architecture and data gained additional attention and support from executive leadership.

Push for Organizational Change and Adoption of a Formal Big Data Environment

After what will likely be many iterations of Steps 2 through 7, there will come a point at which there are enough use cases, projects, and data users throughout multiple business units across the agency, and enough recognition of the benefits of the data and the big data environment by leadership. This data sharing and use throughout the agency should support the claim that the organization is not only ready for change but also that this change is vital to continue to develop and support data-driven decision-making organization-wide. At this point, it is time to push for organizational change and adoption of a formal big data environment. This push ultimately will need to come from executive leadership as a top-down initiative within the agency. As necessary, arm these executives with as many success stories and business cases as possible to support the argument that change is absolutely necessary for the long-term success of the agency.

Step 8. Establish a Formal Data Storage and Management Environment

Step 8 ends the focus on individual pilot projects and the test environment and begins the establishment of organization-wide big data management. As illustrated in Figure 9, an agency will likely arrive at Step 8 only after many iterations of Steps 2 through 7. The key focus of this step should be on scaling up existing capabilities to serve a wider audience within the organization. There will be many possible new capabilities to develop or improvements to pursue, but it will be more effective to first expand the reach of the existing technology and proven data management approaches to avoid getting lost among the fine details. The better developed the initial data projects are, and the more the data playground reflects the planned data lake infrastructure, the easier this process will be.

To successfully complete Step 8, the agency will need to commit to a paradigm shift, a culture change, and a process capable of building on the data playground projects to shift the organization from opinion-driven to data-driven decision-making. As a state DOT participant in this research said, "Let's be honest, it's 'data-informed,' not 'data-driven.' We just use that term because people understand what it means. There's no way we'll ever let data make the decisions over a human because we can't always trust our data." However, as has been stated, the data are becoming too big for humans to process. Nevertheless, the shift will occur progressively as people within an organization begin to trust in the data, the process, and the novel data products generated.

Case Study: Buy-in from Executive Leadership (Continued)

With this additional executive support, the three-person big data team became the single point of contact within the agency for all things related to real-time data.
The project was officially renamed to portray a broader, agency-wide scope to increase acceptance from other business units. Executive support escalated to the point that all other development efforts were compared and contrasted against the big data architecture and development processes. In one specific case, a business unit within the agency had a project that included creating a real-time system for notifying a third-party application about roadway hazards. Once executive leadership learned of the specifics of the project, both the project team and the big data team were told to meet and determine where they could eliminate redundancies. Once the redundancy of the system was identified, the project team was told to use the existing big data architecture.

Another phenomenon that happened rather organically was that executive leadership started referring other divisions to consult with the big data team any time they perceived that the data or architecture might be of assistance to different areas of the agency. Over time, the big data group became something of an internal consulting service to the other departments within the agency, thereby growing the influence and exposure of big data. One example of this was when the agency wanted to solve an issue of frequent crashes and congestion within a specific corridor. In addition to the traditional engineering team members, leadership tasked the big data team to develop an analysis of the corridor. In the end, the results obtained by the big data team greatly reinforced what the engineering staff already understood about the corridor, but the extent of the problems came as a surprise to the team.

First, business areas that are the easiest to migrate and that boast the most supportive and enthusiastic teams should be shifted. Then, more difficult business areas should be progressively migrated. Slowly, as business areas are added incrementally, an organization-wide modern data environment supporting both pilot and production projects, along with associated modern data management policies, should be developed.

This step is often where organizations fall into a trap: they go too fast and try to integrate the newly created data pipelines into their traditional IT infrastructure. This is very common and results in promising modern data projects being redesigned into traditional ones in order to comply with traditional IT policies. The result is that the progression toward modern data practices is completely stalled. Instead, organizations need to proceed with the transition slowly and progressively by growing the data playground, in parallel with the traditional data projects, into an organization-wide data environment that eventually will become the state of the practice in data management within the agency. To achieve this, organizations need to focus on five objectives:

1. Establish a clear vision and goals. It is very important to develop a vision before introducing modern data practices agency-wide, and executives will need to develop and present this vision to the rest of the organization. In the vision statement, it will be vital to provide rational arguments for this change and shift in organizational culture, present the benefits and future plans, and resolve doubts.
2. Make data accessible yet secure. Data are at their most valuable when they are accurate, completely secure, and trusted; this has traditionally been achieved by applying data governance policies that limit access to the data.
Yet a data-driven culture requires data openness, letting teams access data and consider new data-driven approaches. Executives will need to develop policies that allow employees across the organization to easily access and process data while at the same time keeping the data under control and secure.

WHAT IS DATA-DRIVEN DECISION-MAKING? Progress in an activity is compelled by data and not by intuition, personal experience, or political agenda. Data are now too large and too fast, and they change too quickly, for the latter.

Figure 9. Iterative process to arrive at Step 8.

3. Integrate at the data level. Traditionally, data integration has been accomplished at the IT infrastructure level based on predefined requirements. This approach is too rigid and costly to support the many rapidly changing data integration needs arising within a data-driven organization. Instead of adopting IT infrastructure-level integration, executives need to support integration at the data level and develop governance policies that will allow data located in the shared modern data storage to be integrated by each business area as the business area sees fit, using the tools of their choice. By integrating at the data level, the organization will be able to adapt more quickly to changes in data sources, data tools, business goals, and business objectives without requiring an expensive infrastructure expenditure each time.
4. Connect data to business goals. Executive and middle management need to tie data to business goals by developing data-based goals through the establishment of a set of valuable business key performance indicators that are satisfactory to the end users and can be tracked and monitored to support the organization's internal processes.
5. Use data to make decisions. Finally, executives should lead by example and foster the use of data for decision-making within the organization. Most organizations are still hesitant to make data-driven decisions due to the lack of fidelity, granularity, and detail found in traditional data sets and the risk associated with making the wrong decision. To create a data-centric culture in an agency, executives should design processes that support data-driven decisions.

Step 8 will encompass all four stages of the modern big data management framework: create, store, use, and share.
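The data-level integration described in objective 3 can be sketched with an in-memory SQLite database standing in for the shared data store: each business area combines the shared tables with its own ad hoc query, using its own tools, rather than asking IT to build a fixed integration pipeline. The table names, columns, and figures below are hypothetical examples, not data from any agency.

```python
import sqlite3

# In-memory SQLite stands in for shared modern data storage. Table and
# column names (hypothetical): crash counts and traffic volumes by segment.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE crashes (segment_id TEXT, crash_count INTEGER);
    CREATE TABLE volumes (segment_id TEXT, aadt INTEGER);
    INSERT INTO crashes VALUES ('I-64-12', 14), ('KY-4-03', 3);
    INSERT INTO volumes VALUES ('I-64-12', 85000), ('KY-4-03', 12000);
""")

# A safety analyst's own integration of the two shared tables:
# crashes per 10,000 vehicles of daily traffic, worst segments first.
rows = con.execute("""
    SELECT c.segment_id, ROUND(10000.0 * c.crash_count / v.aadt, 2) AS rate
    FROM crashes c JOIN volumes v USING (segment_id)
    ORDER BY rate DESC
""").fetchall()

print(rows)
```

The integration lives entirely in the analyst's query; a planner could join the same shared tables differently the next day without any infrastructure change, which is the point of objective 3.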
If not done already, it is strongly recommended to complete the accompanying Data Management Capability Maturity Self-Assessment (DM CMSA) before moving forward. Doing so will help an agency understand its strengths and weaknesses and effectively plan an expansion into a formal data storage and management environment. In addition to the DM CMSA, this guidebook contains a description and framework for big data governance, as well as a tool for tracking the big data governance roles and responsibilities within an agency (pages 75–99).

During the expansion, questions may be encountered that had not been previously considered, or some approaches that worked at a project level may be determined to be inadequate for data management at a larger scale. Do not be afraid to remain flexible with developing plans and processes even at this stage. Continue to seek input from other stakeholders and iterate the evolving data governance plans and procedures.

If not already done, put a priority on merging existing projects into the same data infrastructure. This will help data users recognize this effort as a nascent organizational big data management effort and not simply a collection of pet projects. Delegating additional outreach efforts to identified data champions will also help fight this perception.

It is important to understand that, even after the data playground has evolved into a production-ready data lake and relevant stakeholders have adopted all associated projects across the organization, the efforts to evolve and improve these modern data management practices never truly end. New systems, data sets, and best practices will emerge that provide opportunities to improve and refine the approaches developed while following this Roadmap. Below is a list of items to be periodically reviewed and revised as part of continuous improvement efforts.

• Currently used and potential data sets.
A catalog of available data sets that includes the contents of each data set, the applications that use the data, and the potential opportunities for new users of the data will assist stakeholders in prioritizing their data development efforts and reduce costs. The accompanying Data Usability Assessment Tool can be used to assist with this process.
• Currently used and potential technology. A list of technology, including what the technology is used for and the costs involved, is intended to calculate the potential ROI of migrating from an aging piece of technology to a newer piece of technology. This could entail replacing a part of the data pipeline with a more efficient process or updating data analysis/visualization tools with more feature-rich options.
• Processes and procedures. Reviewing procedures includes both finding areas where more uniform processes would be helpful to reduce confusion and finding areas where over-regulation or red-tape bureaucracy is hindering efficiency.
• Documentation. Identify areas that have missing or outdated documentation. One agency has adopted the use of wikis to handle this issue. Two different wikis are used for documentation: one for developers, where detailed technical documentation and emergency contact information exist, and another for the entire organization, where users can find details about the different data sets and hints on how some of the data are already being used. This has greatly reduced the burden of having that documentation maintained by a few select people.
• Security and privacy protection. Identify new cybersecurity technology and techniques that have been developed, as well as which algorithms and methods are no longer secure. Ensure all security software is up to date. If working with a cloud provider who manages security, ensure the provider continues to meet contractual obligations.
• Metadata catalog. Find areas where additional metadata fields may be helpful. Review existing data to ensure that metadata enrichment is applied appropriately.

When reviewing these areas for continuous improvement, agencies may find it useful to complete the accompanying DM CMSA. This tool is designed to help identify areas in which an agency's data management approaches can be updated and improved.
The tool can also be used to review the accompanying modern big data management framework and compare current practices with the best practices and recommendations listed in each stage of the data management life cycle.

Common Misconception: Transportation Agencies Must Permanently Delete All Sensitive Data Immediately to Protect Themselves Against the Risks Involved with PII

Total avoidance of collecting or storing any personally identifiable information (PII) or other sensitive data is rarely necessary to protect an organization, and doing so may cripple data analysis now and in the future. The modern approach is instead to preserve and secure the raw data first, then anonymize or aggregate sensitive data as needed. These techniques are quite effective at removing sensitive data, and, by storing the raw sensitive data on a separate server with restricted access, those data can be made very secure.

Raw data are the lifeblood of big data analytical techniques, so taking simple steps to preserve raw data is critical to modern data management. Even when using the most basic analytical techniques, it is important to have raw data to check data quality and investigate outliers. If a data analyst sees something unusual in the data, it is often impossible to confirm whether the anomalous readings are legitimate if the raw data have been discarded. Due to both the value of raw data and the relative ease with which raw data can be preserved, it is usually preferable to secure and use sensitive data rather than discarding or avoiding them.
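The preserve-then-anonymize approach described above can be sketched with a salted hash: the raw record, identifier included, stays on restricted storage, while the shared copy replaces the identifier with an irreversible pseudonym. The field names and salt handling below are illustrative only; a real deployment would keep the salt in a secrets manager and choose the anonymization technique to fit the agency's privacy requirements.

```python
import hashlib

# Sketch of pseudonymizing a PII field before sharing. The raw record is
# preserved under restricted access; only the shared copy is altered.
# Field names and the salt-handling scheme are hypothetical.
SALT = b"agency-secret-salt"  # in practice, stored in a secrets manager

def pseudonymize(identifier: str) -> str:
    """Replace an identifier with a truncated, salted SHA-256 digest."""
    return hashlib.sha256(SALT + identifier.encode()).hexdigest()[:16]

# Raw record: kept as-is on the restricted server.
raw_record = {"device_id": "A1B2-C3D4", "speed_mph": 62, "link": "I-64-12"}

# Shared record: identical except the identifier is irreversibly replaced.
shared_record = {**raw_record, "device_id": pseudonymize(raw_record["device_id"])}

# The same device always maps to the same pseudonym, so trips can still
# be linked in analysis without exposing the original identifier.
assert shared_record["device_id"] == pseudonymize("A1B2-C3D4")
assert shared_record["device_id"] != raw_record["device_id"]
```

Because the hash is deterministic, analysts keep the ability to group records by device; because it is salted and one-way, the shared copy alone does not reveal who the device belongs to.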

Case Study: Continued Room for Growth

While many organizational units within KYTC have started to understand and experience the benefits of big data architecture, the system still has much room to grow before all the benefits can be fully realized. The democratization of data and end-user engagement remains a barrier to this day, and the reasons are complicated, as with any new technology adopted by any agency. One reason for slow adoption is simply the lack of time available to end users to learn about new technology that may not be critical to an individual's day-to-day tasks. With public sector employees taking on additional responsibilities as staffing dwindles, it can be difficult to prioritize learning something slightly outside a worker's norm in the hope that it may produce a gain. Another reason is simply the infrastructure, in terms of network bandwidth, available to users in remote offices and the tools they may have available to them. Using client-side business intelligence tools can be time-consuming and frustrating for end users connected by slower-than-optimal network connections. But this also adds to the urgency of adopting a different architecture so that agencies can move computing and reporting to lighter, faster cloud-based tools.


With increased connectivity between vehicles, sensors, systems, shared-use transportation, and mobile devices, unexpected and unparalleled amounts of data are being added to the transportation domain at a rapid rate, and these data are too large, too varied in nature, and will change too quickly to be handled by the traditional database management systems of most transportation agencies.

The TRB National Cooperative Highway Research Program's NCHRP Research Report 952: Guidebook for Managing Data from Emerging Technologies for Transportation provides guidance, tools, and a big data management framework, and it lays out a roadmap for transportation agencies on how they can begin to shift – technically, institutionally, and culturally – toward effectively managing data from emerging technologies.

Modern, flexible, and scalable “big data” methods to manage these data need to be adopted by transportation agencies if the data are to be used to facilitate better decision-making. As many agencies are already forced to do more with less while meeting higher public expectations, continuing with traditional data management systems and practices will prove costly for agencies unable to shift.

Supplemental materials include an Executive Summary, a PowerPoint presentation on the Guidebook, and NCHRP Web-Only Document 282: Framework for Managing Data from Emerging Transportation Technologies to Support Decision-Making.
