CHAPTER 5

Supporting Tools

This chapter contains a variety of tools in support of the Roadmap. They are as follows:

• Data Management Capability Maturity Self-Assessment (DM CMSA). The DM CMSA is built from the modern big data benchmark and assessment methodology presented in NCHRP Web-Only Document 282: Framework for Managing Data from Emerging Transportation Technologies to Support Decision-Making. Questions within each of 15 data management focus areas guide transportation agencies through a self-assessment to gauge their current data management practices and to identify specific areas for improvement as they shift from traditional data management practices to modern practices capable of handling data from emerging technologies.
• Big Data Governance Roles and Responsibilities. This section provides a list of recommendations to consider when developing a modern data governance approach, a description of and framework for big data governance, and a tool for tracking big data governance roles and responsibilities within an agency.
• Data Sources Catalog Tool. This tool assists transportation agencies in cataloging existing and potential data sources. The tool is useful for better understanding an agency's data assets, prioritizing data sources, and informing the selection of the sources that might offer the most value before pursuing further development.
• Frequently Asked Questions (FAQs). The FAQs comprise a list of questions and responses regarding big data implementation, management, governance, use, and security.

Data Management Capability Maturity Self-Assessment

This section contains the Data Management Capability Maturity Self-Assessment (DM CMSA) developed as part of the research. The DM CMSA allows transportation agencies to gauge their data management practices and to identify specific areas for improvement. The self-assessment was designed for ease of completion and to provide a high-level starting point for further inquiry.

The self-assessment consists of 104 questions divided across 15 focus areas. Eleven of these focus areas were derived from the data management "knowledge areas" described in the DAMA-DMBOK2 (shown in Figure 13), as follows:

• Data architecture—The overall structure of data and data-related resources as an integral part of the enterprise architecture.

• Data modeling and design—Analysis, design, building, testing, and maintenance of data.
• Data storage and operations—Deployment and management of structured physical data asset storage.
• Data security—Ensuring privacy, confidentiality, and appropriate access to data.
• Data integration and interoperability—Acquisition, extraction, transformation, movement, delivery, replication, federation, virtualization, and operational support.
• Document and content management—Storing, protecting, indexing, and enabling access to data found in unstructured sources (electronic files and physical records), and making these data available for integration and interoperability with structured (database) data.
• Reference and master data—Managing shared data to reduce redundancy and to ensure better data quality through standardized definition and use of data values.
• Data warehousing and business intelligence—Managing analytical data processing and enabling access to decision-support data for reporting and analysis.
• Metadata—Collecting, categorizing, maintaining, integrating, controlling, managing, and delivering metadata.
• Data quality—Defining, monitoring, and maintaining data integrity, and improving data quality.
• Data governance—Planning, oversight, and control over the management of data and the use of data and data-related resources.

[Figure 13. DAMA-DMBOK2 knowledge areas (DAMA International 2017). The figure is the DAMA-DMBOK2 Guide Knowledge Area Wheel, showing the knowledge areas listed above. ©DAMA International, www.dama.org. Single use permission; no redistribution rights. Contact the Data Management Association for use in other documents.]

In addition, the following four focus areas were included to expand the scope of data management to consider the full life cycle of big data (i.e., create, store, use, and share):

• Data collection—Acquiring new data, either directly or through partnerships, in such a way that the value, completeness, and usability of the data are maximized without compromising privacy or security.
• Data development—Designing, developing, and creating new data products, as well as augmenting, customizing, and improving existing data products.
• Data analytics—Investigating processed data to drive actionable insights and answer questions of interest for an organization.
• Data dissemination—Sharing data products and data analysis results effectively with appropriate internal and external audiences.

The DM CMSA contains 15 tables (Tables 5 through 19), with one table corresponding to each of the 15 data management focus areas. The questions in each table were derived from the foundational principles of big data management and the corresponding benchmark methodology, which are presented in NCHRP Web-Only Document 282: Framework for Managing Data from Emerging Transportation Technologies to Support Decision-Making. Every question calls for a self-assessment score of "low," "moderate," or "high" and provides examples of the practices or procedures that would merit each score.

At the end of the self-assessment, a summary scoresheet (Table 20) is provided, in which all recorded answers can be totaled across the 15 data management focus areas. The summary scoresheet provides an overall measure of an organization's data practices, particularly as they relate to managing data from emerging technologies for transportation. Given the rather siloed nature of data within most transportation agencies, it is recommended that representative groups from across an agency take the self-assessment.

After completing the self-assessment, the individual responses should be reviewed to identify the areas where the most improvement can be made. The descriptive examples included in each question will help identify changes to be made and goals to be pursued that will advance the organization's data management processes and practices.

Big Data Governance Roles and Responsibilities

Data governance is a collection of practices and processes that help to ensure the formal management of data assets within an organization, including planning, oversight, and control over the management of data and the use of data and data-related resources. Data governance puts in place a framework to ensure that data are used consistently and consciously within the organization. Data governance also deals with quality, security and privacy, integrity, usability, integration, compliance, availability, roles and responsibilities, and the overall management of internal and external data flows within an organization (Roe 2017).

Traditionally, data governance dealt with the strict, authoritative control of data systems and users. According to Wells, traditional data governance operates on the fundamental premise that data cannot be governed; only what people do with the data can be governed. While this may have been a feasible approach for traditional data systems, modern data systems, which incorporate agile development, big data, and cloud computing, have made this approach much more challenging to implement.

Table 5. Focus area: data collection.

Considering the data collected by the agency, how relevant are the data to current agency needs?
   Low: Data collected are not relevant to current agency needs.
   Moderate: Data collected are somewhat relevant to current agency needs.
   High: Data collected are highly relevant to current agency needs.

Are the data collected relevant to future agency needs?
   Low: Data collected are not relevant to future agency needs.
   Moderate: Data collected are somewhat relevant to future agency needs.
   High: Data collected are highly relevant to future agency needs.

How significant are the gaps in the current data for meeting agency needs?
   Low: Significant gaps in current data to meet agency needs.
   Moderate: Some gaps in current data to meet agency needs.
   High: Few or no gaps in current data to meet agency needs.

In what formats are the data collected by the agency?
   Low: Data collected are in outdated or proprietary formats.
   Moderate: Data collected are in a usable, but not ideal, format.
   High: Data collected are in modern, open source formats.

To what extent are the source data collected by the agency preserved prior to any editing or other modification?
   Low: Source data are usually deleted or modified.
   Moderate: Source data are modified but not deleted.
   High: Source data are never deleted or modified.

How is personally identifiable information (PII) within the source data handled?
   Low: PII is collected and handled via an insecure process.
   Moderate: PII is secured or anonymized at some point after collection.
   High: PII is collected securely.

Are documented data collection procedures available and routinely updated?
   Low: There are no documented data collection procedures.
   Moderate: Documented data collection procedures exist but are infrequently reviewed/updated.
   High: Documented data collection procedures are frequently reviewed and updated.

Tally: # of Low Scores ___ | # of Moderate Scores ___ | # of High Scores ___
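For the PII question in Table 5, one common way to earn a "high" score is to pseudonymize direct identifiers at the moment of collection so that raw values never reach storage. The following is a minimal Python sketch of that idea; the field names, the keyed-hash approach, and the secret-management comment are illustrative assumptions, not prescriptions from this guidebook.

```python
import hmac
import hashlib

# Hypothetical secret kept outside the data lake (e.g., in a key vault).
SALT_KEY = b"replace-with-agency-managed-secret"

def pseudonymize(value: str) -> str:
    """Replace a direct identifier (e.g., a license plate) with a keyed hash
    so records remain linkable for analysis without exposing the raw value."""
    return hmac.new(SALT_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

def ingest(record: dict) -> dict:
    """Secure PII at the moment of collection: the raw plate never reaches storage."""
    safe = dict(record)
    safe["plate_id"] = pseudonymize(safe.pop("plate"))
    return safe

print(ingest({"plate": "ABC-1234", "speed_mph": 47, "sensor": "I-40_MM12"}))
```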

Table 6. Focus area: data modeling and design.

Has the agency referenced any existing data models or frameworks when designing their data architecture and processes?
   Low: No data models or frameworks referenced.
   Moderate: Some research performed prior to data workflow design.
   High: Extensive knowledge of applicable data frameworks used.

Has the agency performed any data usability assessments for the current data sources?
   Low: No data usability assessments have been performed.
   Moderate: Basic inventory of data sources performed with limited information.
   High: Full data usability assessment performed and regularly updated.

How flexible is the data workflow model?
   Low: Data model does not allow for ad hoc data augmentation or other continuous development practices.
   Moderate: Data model allows for some ad hoc data augmentation or other continuous development practices.
   High: Data model is designed to fully implement continuous development practices, including ad hoc data augmentation.

Does the data model include augmenting data sets with metadata designed to enhance usability, such as data quality and provenance identifiers?
   Low: Data model does not account for adding any usability-focused metadata.
   Moderate: Data model has steps to add some usability-focused metadata.
   High: All appropriate metadata additions that enhance usability are fully accounted for by the data model.

Does the data model allow for data-masking techniques to be applied, where sensitive information is anonymized or obscured?
   Low: Data model does not include any data-masking techniques; source data are deleted rather than masked.
   Moderate: Data model includes insufficient masking techniques, or masking techniques are inconsistently applied throughout the model.
   High: Data masking is fully accounted for by the data model in all areas where it may be useful.

Has the agency designed new processes specific to handling data from emerging technologies?
   Low: No new processes for emerging technologies have been considered.
   Moderate: Some modifications have been made to existing data models and processes to handle emerging technology data.
   High: New processes have been designed from the ground up to support emerging technology data.

Tally: # of Low Scores ___ | # of Moderate Scores ___ | # of High Scores ___

Table 7. Focus area: data architecture.

How well is the agency's data organized?
   Low: Data are organized haphazardly.
   Moderate: Data are organized adequately.
   High: Data are organized optimally.

Do the file and folder names follow a documented naming convention?
   Low: Folders and files follow no common standard or convention.
   Moderate: Folder and file names generally make sense but do not follow documented conventions.
   High: All folders and files follow a documented naming convention.

How well do the data schemas meet the needs of your analysts?
   Low: Data schemas do not meet the needs of analysts.
   Moderate: Data schemas are generally functional.
   High: Data schemas that best meet the analysts' needs are used.

Are data tables generally "well formed," with one subject per column and one piece of information per row?
   Low: Tables are not well formed.
   Moderate: Tables are generally organized.
   High: Tables are fully well formed.

Do data administrators, both internal and external, provide sufficient support to maintain the data architecture?
   Low: There is little or no ongoing support for maintaining the data architecture.
   Moderate: There is adequate support for most maintenance needs.
   High: There is optimal support for all maintenance needs.

If the agency needs to quickly respond to a change in data storage needs, how difficult or costly would that change be?
   Low: Data architecture is outdated, rigid, or vendor locked; rapid change is impossible.
   Moderate: Data architecture relies on closed source software/hardware or is not well documented; rapid change is difficult and costly.
   High: Data architecture is built on well-understood open source services; rapid change is possible with little difficulty.

Is the data architecture distributed enough for some processes to be changed or discarded without affecting the whole?
   Low: All processes are fully dependent on each other, such that any change in one process necessitates system-wide modifications.
   Moderate: Some processes have well-documented dependencies that require additional work to be done if they are to be updated or changed.
   High: All processes are independent enough to be individually modified with minimal impact on the other processes.

Tally: # of Low Scores ___ | # of Moderate Scores ___ | # of High Scores ___
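Documented naming conventions (the second question in Table 7) are easiest to sustain when compliance can be checked automatically. Below is a minimal Python sketch of such an audit; the convention encoded in the regular expression and the root path are invented examples, not a recommended standard.

```python
import re
from pathlib import Path

# Hypothetical documented convention: <domain>_<source>_<YYYYMMDD>.<ext>
NAMING_RULE = re.compile(r"^[a-z]+_[a-z0-9]+_\d{8}\.(csv|json|parquet)$")

def audit_names(root: str) -> list[str]:
    """Return paths of files that violate the documented naming convention."""
    return [str(p) for p in Path(root).rglob("*")
            if p.is_file() and not NAMING_RULE.match(p.name)]

for offender in audit_names("/data/lake/raw"):
    print("non-conforming:", offender)
```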

Table 8. Focus area: data storage and operations.

How much of the data collected by the agency (and that could be useful/relevant to the organization's needs) are stored?
   Low: Only some data are stored.
   Moderate: Most data are stored.
   High: All relevant data are stored.

Are the agency's data consolidated in a central location for storage and analysis?
   Low: Data are stored in separated data silos.
   Moderate: Data are centrally stored but must be copied to a separate system for analysis.
   High: Data are stored in a fully functional data lake architecture.

How long are data preserved/stored for future use?
   Low: Some data are stored only for a short period of time due to size or legal reasons.
   Moderate: Some data are stored only long enough to perform needed analyses; data that are not perceived to be useful are seldom stored.
   High: All data are stored as long as possible to support current and future analyses, even if those analyses are not actively in use today.

In what type of format are the data stored?
   Low: Data are stored in an outdated or proprietary format.
   Moderate: Data are stored in a usable but obscure or difficult format.
   High: Data are stored in a well-known, modern open source format.

How often are backups of the data created?
   Low: Backups are rarely performed.
   Moderate: Backups are performed every so often.
   High: Backups are frequently performed and verified.

Where are the backup data stored?
   Low: Backup data are stored onsite.
   Moderate: Backup data are stored both onsite and at a single offsite location, such as a separate on-premises storage facility or data center.
   High: Backup data are stored at multiple offsite locations or by a reputable cloud service provider.

How quickly can the agency recover data from backup storage after a disruption?
   Low: Unacceptable time to recovery.
   Moderate: Adequate time to recovery.
   High: Excellent time to recovery.

Are records of the data's history and origin maintained?
   Low: No history or origin of the data files is maintained.
   Moderate: The history and origin of some data files are maintained.
   High: The history and origin of all data files are maintained.

Does the agency maintain a documented disaster recovery plan?
   Low: No disaster recovery plan.
   Moderate: Some processes are in place for disaster recovery, but they are not frequently reviewed or updated.
   High: Disaster recovery plan is frequently reviewed and updated.

Does the data system architecture fully meet the needs of analysts?
   Low: Data system architecture is insufficient and error prone.
   Moderate: Data system architecture is adequate.
   High: Data system architecture is optimal for analysts' needs.

Does the organization rely on closed source/proprietary software to manage the data?
   Low: Software and environment are closed source and proprietary.
   Moderate: Some software used is open source.
   High: All software used is well supported and open source.

Does the architecture incorporate cloud-based systems where appropriate?
   Low: The data architecture is entirely on-premises.
   Moderate: Some systems are cloud-based.
   High: All systems are in a cloud-based environment.

Are the agency's data processes designed to be independent of the underlying systems used?
   Low: Data processes are built directly into the system and are difficult to update.
   Moderate: Some data processes are kept independent of the system itself.
   High: All data processes are independent and can be upgraded or replaced easily.

Tally: # of Low Scores ___ | # of Moderate Scores ___ | # of High Scores ___

Table 9. Focus area: data security.

How is sensitive information/PII within the data stored?
   Low: Sensitive information/PII is stored in plain text.
   Moderate: Sensitive information/PII is stored in a somewhat secure manner.
   High: All sensitive information/PII is fully secured from collection to data product.

Are privacy filters applied to anonymize the data?
   Low: No privacy filters are applied.
   Moderate: Some privacy filters and/or encryption are employed for PII.
   High: Privacy filters and other safeguards are applied at the time of collection.

Are privacy filters granular enough to allow different analyses to be performed at different levels of access?
   Low: No granularity in the privacy filters; a record is either flagged as sensitive and filtered or it is not.
   Moderate: Records are labeled with some granular level of sensitivity, but no easy means exists to selectively filter records based on this label.
   High: Records are given detailed labels that describe their sensitivity, and analytical processes are able to filter or obfuscate records at varying levels according to their access and needs.

How are the network and other infrastructure secured?
   Low: No network encryption or endpoint protection.
   Moderate: Basic level of network and endpoint security.
   High: All relevant security software and procedures are employed.

Does the agency employ secure user authorization processes?
   Low: Insecure authentication processes fail to prevent unauthorized use of data.
   Moderate: Outdated or inadequate authentication processes fail to fully secure data.
   High: Authorization processes are up to date and fully prevent all unauthorized use.

Do the authentication processes hinder authorized access to the data?
   Low: Rigid authentication structures hinder authorized use of data.
   Moderate: Authorization structures somewhat hinder authorized use of the data.
   High: Fluid and convenient authorization structures do not hinder authorized use of the data.

How efficiently is the agency able to add or manage user access to the data?
   Low: A great amount of time and effort is required to grant access to a new user.
   Moderate: Some amount of time and effort is required to grant access to a new user.
   High: It is easy to grant new users access when warranted.

Are customized privacy protocols, tailored to the different data stakeholders, employed?
   Low: No changes in privacy protocols are made with respect to data stakeholder groups.
   Moderate: For some, but not all, stakeholders, customized privacy protocols are applied.
   High: All relevant data stakeholder groups are handled with their own customized privacy protocols.

Is outsourced cybersecurity expertise utilized?
   Low: Outside expertise is never utilized on any cybersecurity matter.
   Moderate: Outside expertise is infrequently consulted, or the outputs from that consultation cannot be independently audited.
   High: Outside expertise is heavily utilized, and in-house experts are qualified to independently audit and verify third-party findings and recommendations.

Tally: # of Low Scores ___ | # of Moderate Scores ___ | # of High Scores ___
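The granular-filtering question in Table 9 describes records carrying sensitivity labels that analytical processes can filter at varying levels of access. A minimal Python sketch of that pattern follows; the label names, clearance levels, and record fields are illustrative assumptions.

```python
# Sensitivity labels and clearance levels are illustrative, not from the guidebook.
LEVELS = {"public": 0, "internal": 1, "restricted": 2}

records = [
    {"segment": "I-40_MM12", "speed_mph": 47, "sensitivity": "public"},
    {"segment": "I-40_MM12", "plate_hash": "9f2c...", "sensitivity": "restricted"},
]

def readable(records: list[dict], clearance: str) -> list[dict]:
    """Filter rather than delete: every record keeps its label, and each
    analysis sees only what its access level permits."""
    max_level = LEVELS[clearance]
    return [r for r in records if LEVELS[r["sensitivity"]] <= max_level]

print(readable(records, "internal"))    # restricted record filtered out
print(readable(records, "restricted"))  # full view for cleared analysts
```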

Table 10. Focus area: data quality.

Is data quality monitored?
   Low: Data quality is unknown.
   Moderate: Data quality is somewhat known.
   High: Data quality is fully known and actively monitored.

Is a standardized system of ranking data quality employed?
   Low: No data quality rankings are performed.
   Moderate: Basic data quality rankings are performed.
   High: Detailed data quality rankings are performed.

Are dashboards used to visualize data quality statistics?
   Low: No data quality dashboards are available.
   Moderate: Some data quality dashboards are available.
   High: Many data quality dashboards and supporting tools are available.

How are suspect data flagged for review?
   Low: Few processes are in place to flag low quality data.
   Moderate: Manual processes are in place to flag low quality data.
   High: Both automated and manual processes are used to flag low quality data.

Are users and data processes able to select a level of data quality to use for their analysis?
   Low: No ability to select data based on data quality rankings or flags.
   Moderate: Users can filter data at a low granularity or with some difficulty.
   High: Users are able to easily filter data based on data quality rankings at a high level of granularity.

When working with data quality, are the original data ever corrected, modified, or deleted?
   Low: When data quality concerns arise, the source data are almost always deleted or heavily modified.
   Moderate: When data quality concerns arise, the source data are sometimes modified, corrected, or deleted.
   High: When data quality concerns arise, the source data are flagged and scored but never modified or deleted.

Are any data crawling tools employed to continuously monitor data quality trends?
   Low: No data crawling is performed at any level.
   Moderate: Data crawling is performed infrequently or through a manually initiated process.
   High: Fully automated data crawling is continuously performed, generating timely and detailed alerts on data quality trends.

Are users able to report data quality issues?
   Low: End users of the data are unable to report data quality issues.
   Moderate: End users may report data quality issues, but those reports are infrequently reviewed using a fully manual process.
   High: End users may report data quality issues, and those reports are frequently and easily reviewed via a partially automated process.

Tally: # of Low Scores ___ | # of Moderate Scores ___ | # of High Scores ___
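Table 10 rewards flagging suspect records without ever modifying the source data. The following is a minimal sketch of an automated, rule-based flagger in that spirit; the specific rules and field names are invented for illustration.

```python
def quality_flags(record: dict) -> list[str]:
    """Rule-based checks that annotate suspect records without altering them."""
    flags = []
    if record.get("speed_mph") is None:
        flags.append("missing_speed")
    elif not 0 <= record["speed_mph"] <= 120:
        flags.append("speed_out_of_range")
    if record.get("sensor") is None:
        flags.append("missing_sensor_id")
    return flags

raw = {"sensor": "I-40_MM12", "speed_mph": 212}
# Flag and score; the source record itself is never corrected or deleted.
annotated = {**raw, "quality_flags": quality_flags(raw)}
print(annotated)
```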

Table 11. Focus area: data governance.

Does the agency have full ownership of and unrestricted access to the data that they obtain from third parties?
   Low: In most cases, the third party owns the data and severely restricts access and use.
   Moderate: In some cases, the agency owns the data from the third party but must comply with rigid use restrictions.
   High: In most cases, the agency owns the third-party data and may fully use it with few restrictions.

Is the agency limited by high costs in accessing and using data relevant to their needs?
   Low: Very high cost to use data.
   Moderate: High cost to use data.
   High: Reasonable cost to use data.

Are users restricted to using only certain tools when analyzing data of interest?
   Low: Users are restricted to a small number of proprietary analysis tools.
   Moderate: Users can use a range of tools (with some restrictions), mostly proprietary but some open source, to analyze the data.
   High: Users can use any number of tools to analyze the data, and with few restrictions.

Is access, use, or analysis of data limited by agency data management policies and practices?
   Low: Agency data management policies and practices severely limit access, use, or analysis of data.
   Moderate: Agency data management policies and practices somewhat limit access, use, or analysis of data.
   High: Agency data management policies and practices do not limit access, use, or analysis of data.

Is data management software used by the agency?
   Low: No data management software is used.
   Moderate: Data management software is used, but it is not optimal.
   High: Optimal data management software is used.

Does the agency follow a documented data management plan?
   Low: The agency has no documented data management plan.
   Moderate: The agency follows a loose, largely undocumented data management plan.
   High: The agency follows a documented and frequently updated data management plan.

Does the agency have in-house data management experts, or does the agency outsource data system management to one or more third parties?
   Low: The agency relies on outside parties for data management.
   Moderate: Some data systems are outsourced while others are managed in-house.
   High: The agency conducts all data system management in-house.

Does the agency actively monitor their data management systems?
   Low: There is no active system monitoring.
   Moderate: Some system activity dashboards are available.
   High: The agency employs both reactive and proactive monitoring of their data management system.

Tally: # of Low Scores ___ | # of Moderate Scores ___ | # of High Scores ___

Table 12. Focus area: data integration and interoperability.

Are data that the agency uses in a format that allows for easy integration into new systems?
   Low: Most data cannot be integrated into new systems without significant effort.
   Moderate: Some data must be converted into a new format before integrating into a new system.
   High: All data can be integrated without conversion or modification.

Do all systems that process data within the agency rely on a centralized data source?
   Low: Each system uses its own data type and siloed data source(s), making integration between separate systems difficult.
   Moderate: Some systems connect to the same data source(s), while some retain their own siloed version of the data.
   High: All systems referencing the same data connect to a common data source for that data.

Is operational support provided for the agency's integrated data systems?
   Low: No operational support is provided for the agency's integrated data systems.
   Moderate: Some general support from IT is available for the agency's integrated data systems.
   High: Full support from skilled resources with advanced system knowledge is available for the agency's integrated data systems.

Are variable names in data sets mapped to existing data standards?
   Low: Variable names change from data set to data set, with no standardized nomenclature.
   Moderate: Some data sets have variable names mapped to some data standard.
   High: All applicable data sets are mapped to the same standard wherever possible, such that they can be easily joined.

How are data organized across data sets?
   Low: No uniform organization plan; folder structures are unique to each data set.
   Moderate: Some, but not all, data sets are organized using the same folder structure.
   High: All data sets are organized using a single planned folder structure.

How are data classified across data sets?
   Low: No uniform classification taxonomy.
   Moderate: Some data sets use similar classification taxonomies.
   High: All data sets conform to a single, documented classification taxonomy.

Are identification metadata consistently applied across data sets?
   Low: No uniform metadata enrichment is performed.
   Moderate: Some data sets have similar identifying metadata fields.
   High: All data sets are enriched with a uniform set of identifying metadata.

Tally: # of Low Scores ___ | # of Moderate Scores ___ | # of High Scores ___
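Mapping variable names to a shared standard (the fourth question in Table 12) is often implemented as a simple crosswalk applied on ingest. A minimal sketch follows, with invented source names and standard field names.

```python
# Hypothetical crosswalk from source-specific column names to one agency standard.
CROSSWALK = {
    "vendor_a": {"spd": "speed_mph", "loc": "segment_id", "ts": "observed_at"},
    "vendor_b": {"velocity": "speed_mph", "link": "segment_id", "time": "observed_at"},
}

def standardize(record: dict, source: str) -> dict:
    """Rename columns to the shared standard so data sets join cleanly."""
    mapping = CROSSWALK[source]
    return {mapping.get(k, k): v for k, v in record.items()}

a = standardize({"spd": 47, "loc": "S-101", "ts": "2020-01-01T08:00"}, "vendor_a")
b = standardize({"velocity": 52, "link": "S-101", "time": "2020-01-01T08:00"}, "vendor_b")
print(a["speed_mph"], b["speed_mph"])  # same field name from both sources
```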

Table 13. Focus area: data warehousing and business intelligence.

Do stakeholders feel that the organization is getting its worth out of the data?
   Low: Few stakeholders recognize the value of the data; data are seldom used to meet real business needs.
   Moderate: Some business needs are met, but data operations are not highly valued or prioritized.
   High: Most stakeholders regularly derive real value from the data.

Does the agency have access to sufficient business intelligence (BI) products?
   Low: Few or no useful BI products are available.
   Moderate: Some useful BI products are available.
   High: All current needs are satisfactorily met by a suite of BI products.

Does the agency have sufficient data visualizations to reference and understand their data?
   Low: Few or no useful data visualizations have been created.
   Moderate: Some useful data visualizations are infrequently used.
   High: A variety of relevant and useful data visualizations are frequently referenced.

Are data users able to develop their own BI products and visualizations?
   Low: It is not possible for data users to develop or customize their own BI products or visualizations within current processes and procedures.
   Moderate: Data users can perform some limited customization of products and visualizations, but only after a lot of administrative red tape.
   High: Data users can develop new BI products and visualizations with minimal administrative red tape or oversight.

Are successful BI products and visualizations shared with other users who could benefit from them?
   Low: BI products are siloed, so only the original stakeholders for the original use case use them.
   Moderate: Some BI products and visualizations are infrequently shared among stakeholders.
   High: BI products and processes are regularly reviewed so that the most successful ones can be shared and emulated.

Are stakeholders empowered to select the BI tools that are most useful for them?
   Low: Technical limitations or data format incompatibilities limit what BI tools can be used.
   Moderate: Some variety of BI tools is technically possible but limited by policy or organizational red tape.
   High: Stakeholders are able to choose their own BI tools without undue technical or administrative limitations.

Tally: # of Low Scores ___ | # of Moderate Scores ___ | # of High Scores ___

Table 14. Focus area: data analytics.

Do the data need to be moved to a separate system for analysis?
   Low: Full migration to a separate system is necessary to perform any analysis.
   Moderate: Some data must be migrated to a separate system to perform some analyses.
   High: All analysis can be performed without copying or moving data.

Are the data analyses run, and the results of the analyses saved, in the same location where the data are stored?
   Low: All analysis results are saved on a separate system(s) from the data being analyzed.
   Moderate: Some analysis results are written to the same location as the data.
   High: All analysis results are written to the same location as the data.

Does the agency employ data analysis techniques designed for big data?
   Low: Only traditional analytical techniques and processes are used.
   Moderate: Some traditional analytical processes have been modified for infrequent use with big data.
   High: Relevant big data analytical techniques are actively used.

Does the agency leverage analyses that have been designed by other agencies or the online community?
   Low: No outside analyses have been referenced, copied, or built upon.
   Moderate: Outside analyses are infrequently referenced or rarely used for production data.
   High: The agency frequently reviews, learns from, and uses relevant analyses from multiple outside sources.

Does the agency have the means to perform analyses on live streaming data?
   Low: No capabilities to analyze streaming data.
   Moderate: Some capabilities to analyze live streaming data exist, but they are limited or infrequently used.
   High: Fully capable and actively deriving value from streaming data analyses.

Tally: # of Low Scores ___ | # of Moderate Scores ___ | # of High Scores ___
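For the streaming question in Table 14, the key technique is incremental aggregation: statistics are updated message by message instead of re-running batch jobs over copies of the stream. A minimal sketch follows; the message shape and segment identifiers are invented.

```python
from collections import defaultdict

class RunningMean:
    """Incremental aggregation: each message updates the statistic in place,
    so no batch copy of the stream is ever required."""
    def __init__(self):
        self.n, self.mean = 0, 0.0
    def update(self, x: float) -> None:
        self.n += 1
        self.mean += (x - self.mean) / self.n

speeds_by_segment = defaultdict(RunningMean)

def on_message(msg: dict) -> None:
    # msg could arrive from any streaming source (e.g., a message queue).
    speeds_by_segment[msg["segment_id"]].update(msg["speed_mph"])

for m in [{"segment_id": "S-101", "speed_mph": 47},
          {"segment_id": "S-101", "speed_mph": 53}]:
    on_message(m)
print(speeds_by_segment["S-101"].mean)  # 50.0
```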

Table 15. Focus area: data development.

Does the agency perform or oversee the development of customized data products?
   Low: No customized data products are developed; out-of-the-box solutions are used exclusively.
   Moderate: Some data product development is outsourced, with little input from the agency.
   High: New data products are frequently developed and effectively used.

Is a review process in place to identify effective new data enrichment possibilities?
   Low: No review process performed regarding new data enrichment.
   Moderate: New data enrichment is infrequently considered via an undocumented process.
   High: Data enrichment opportunities are regularly reviewed via a well-documented process.

Is a review process in place to identify effective new data products?
   Low: No review process performed regarding new data products.
   Moderate: New data products are infrequently considered via an undocumented process.
   High: New data products are frequently considered via a well-documented review process.

Can new analytical products be built on existing tools, or must they be developed from scratch?
   Low: All new data products must be built from scratch.
   Moderate: Offshoot data products can be built with some difficulty.
   High: Current tools support easy development of additional products and visualizations.

Can new analytical products be swiftly developed and iterated on?
   Low: New products are developed slowly and "perfected" before being put into use.
   Moderate: New products take considerable development work before they can be put into use.
   High: New products can be developed and put into use swiftly.

Tally: # of Low Scores ___ | # of Moderate Scores ___ | # of High Scores ___

Table 16. Focus area: document and content management.

Does the agency maintain documentation for all data products and processes?
   Low: No documentation of data products or processes is maintained.
   Moderate: Some documentation of data products or processes is maintained in an offline format.
   High: Detailed documentation of data products or processes is available in an online, web-based format.

Is the documentation regularly reviewed and revised?
   Low: No reviews or revisions of the documentation since creation.
   Moderate: Some documentation is sporadically reviewed.
   High: All documentation is regularly reviewed, revised, and updated.

How often is documentation used by stakeholders?
   Low: Documentation is rarely read or followed.
   Moderate: Some stakeholders are aware of documentation but seldom make use of it.
   High: All relevant parties uniformly follow procedures as documented.

Is documentation available in an easy-to-access web documentation framework?
   Low: Documentation is unavailable online.
   Moderate: Documentation is available as a pdf download link only.
   High: Documentation is available in a live, searchable, online documentation framework.

Are groups or stakeholders held accountable for the accuracy and availability of their documentation?
   Low: No clear ownership of documentation responsibilities.
   Moderate: Documentation ownership is clear, but there is no incentive for owners to keep documentation updated.
   High: All documentation is regularly reviewed, and owners are encouraged to update regularly.

Are any automated processes employed to update data documentation?
   Low: No automated processes are employed.
   Moderate: Automated status checks are made, but there is no automated document editing available.
   High: Automated processes regularly update web documentation with information extracted from live data sets.

Tally: # of Low Scores ___ | # of Moderate Scores ___ | # of High Scores ___
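The last question in Table 16 envisions automated processes that refresh web documentation from live data sets. One minimal sketch of that idea follows, assuming data sets stored as JSON lists of records and Markdown doc stubs; the file layout, field extraction, and scheduling are invented and left to the agency.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def refresh_doc(data_file: str, doc_file: str) -> None:
    """Extract basic facts from a live data set and rewrite its doc stub,
    so documentation cannot silently drift out of date."""
    rows = json.loads(Path(data_file).read_text())
    stub = (
        f"# Data set: {data_file}\n"
        f"- Rows: {len(rows)}\n"
        f"- Columns: {sorted(rows[0]) if rows else 'n/a'}\n"
        f"- Last verified: {datetime.now(timezone.utc).isoformat()}\n"
    )
    Path(doc_file).write_text(stub)

# Run on a schedule (e.g., nightly) for each cataloged data set.
```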

Table 17. Focus area: reference and master data.6

Is reference documentation maintained for all databases and storage?
   Low: No documentation of databases and storage is maintained.
   Moderate: Some documentation of databases and storage is maintained in an offline format.
   High: Detailed documentation of databases and storage is available in an online, web-based format that is updated regularly.

Are reference data uniform across all business units?
   Low: There are mismatches in reference data across groups.
   Moderate: Reference data are siloed or duplicated but uniform.
   High: Reference data exist in one location as a single source of truth for all users.

Are master data values and identifiers consistently used across all systems?
   Low: Master data values are inconsistent.
   Moderate: Master data values exist in multiple siloed locations but are generally consistent.
   High: Master data values are stored and managed in one accessible location.

Does a visual representation exist that shows how the data sets relate to each other and/or how they can be combined?
   Low: No visual representation of data set relations exists.
   Moderate: A visual representation exists, but it is outdated or otherwise inaccurate.
   High: A visual representation exists in a regularly updated and highly legible format.

Are data users able to easily access this visual representation of data set relations?
   Low: Regular data users are unable to access the visual representation.
   Moderate: Data users can only access the visual representation with some difficulty.
   High: Data users can readily access the visual representation.

Can this visual representation of data set relations be quickly and easily updated as more data sets are created?
   Low: Visual representation is stored in a format that is difficult to update (e.g., pdf).
   Moderate: Visual representation is infrequently updated via a manual process.
   High: Visual representation is regularly updated through a largely automated process.

Tally: # of Low Scores ___ | # of Moderate Scores ___ | # of High Scores ___

6 Reference data are data that define a set of permissible values to be used by other fields. Master data represent objects, and all associated information about those objects, that are relevant to the organization. In both cases, reference and master data management involve ensuring that these data remain consistent across all data sets in the organization. (Reference Data. (n.d.). Retrieved December 2019, from https://en.m.wikipedia.org/wiki/Reference_data)

Table 18. Focus area: metadata.7

Does the agency keep and maintain a metadata catalog?
   Low: No metadata catalog is maintained.
   Moderate: A metadata catalog is maintained, but it only applies to some data.
   High: A metadata catalog is maintained for all applicable data.

Does the agency enrich the data with additional metadata fields?
   Low: No enrichment/additional metadata fields created.
   Moderate: Some enrichment/additional metadata fields created.
   High: Optimal enrichment/additional metadata fields created.

Are metadata practices regularly revised and updated?
   Low: Metadata practices are seldom reviewed or revised.
   Moderate: Metadata practices are infrequently reviewed or are ad hoc.
   High: Metadata practices are regularly reviewed and updated following a documented process.

Are metadata transparent and available to those with access to the data?
   Low: Metadata are never made available to data users.
   Moderate: Some users may be able to access metadata fields for some data sets.
   High: All metadata for all data sets, along with associated documentation, are made available wherever appropriate.

Is there a means of collecting feedback from data users regarding the available metadata?
   Low: No means of collecting or implementing feedback from data users.
   Moderate: Feedback from data users is not solicited or regularly reviewed but may sometimes be implemented if received.
   High: Feedback from data users is openly solicited and regularly reviewed.

Are all metadata fields that apply to multiple data sets applied uniformly across those data sets?
   Low: All metadata are data set dependent.
   Moderate: Some groups of similar data sets are augmented with similar metadata fields.
   High: All data sets are augmented with the same well-documented metadata fields wherever possible.

Tally: # of Low Scores ___ | # of Moderate Scores ___ | # of High Scores ___

7 Metadata are data about data. They are found in a metadata catalog, where users or programs can locate information about the data, such as how large a file is, what format the file is in, when the file was last modified, what data types are stored within each column of a table, or whether a numeric value represents hours or minutes.
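Building on the footnote's description, a metadata catalog entry can be as simple as one structured record per data set. A minimal sketch follows, with invented field values; the chosen fields mirror the examples in the footnote (size, format, last modified, column types, units).

```python
from dataclasses import dataclass, asdict

@dataclass
class CatalogEntry:
    """One metadata catalog record; fields follow the footnote's examples."""
    name: str
    file_format: str
    size_bytes: int
    last_modified: str
    column_types: dict
    units: dict

entry = CatalogEntry(
    name="traffic_xml_feed",
    file_format="xml",
    size_bytes=4_816_320,
    last_modified="2020-03-02T06:15:00Z",
    column_types={"speed": "float", "segment_id": "string"},
    units={"speed": "mph"},
)
print(asdict(entry))
```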

Table 19. Focus area: data dissemination.

How open are the data sets within the agency?
   Low: Data are unavailable to all but a few users (e.g., IT).
   Moderate: Data are available to selected users who are expected to have use for them (some use within business units).
   High: Data are available to whoever may have a potential use for the data, with the exception of sensitive data.

Has the agency implemented an open data policy?
   Low: No thought has been given to implementing open data policies.
   Moderate: Some open data policies are in use.
   High: Open data policies are applied wherever possible.

Are there any technical barriers that prevent users from reaching the agency's open data?
   Low: Data can only be accessed internally.
   Moderate: Some technical barriers exist, or data are only available via simple download.
   High: Data are easily reachable via APIs and/or hosted analytics platforms with no technical barriers.

Are users with access to the data able to access the data directly where stored?
   Low: All users must copy or download the entire data set first before any analysis can be performed.
   Moderate: Some users are able to analyze some data sets directly through a process where the organization shoulders all costs involved.
   High: All authorized users are able to access data directly where stored and analyze at their own cost.

Are any developed data products available to users via an open sharing portal?
   Low: No data products are shared.
   Moderate: Some data products are shared via an unmonitored process.
   High: All relevant data products are shared with authorized users whose usage is monitored and who may bear some of the costs involved.

Are users able to easily use their own tools and code with your open data API?
   Low: Proprietary file formats or closed access prevent the use of nearly all data tools.
   Moderate: Open file formats are used, but outdated or incorrect documentation hinders the use of non-standard data tools.
   High: Open file formats and common protocols are used for maximum compatibility with a wide range of current and future data tools.

Tally: # of Low Scores ___ | # of Moderate Scores ___ | # of High Scores ___
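The summary in Table 20 (below) is simply column totals over the tallies recorded in Tables 5 through 19. A minimal Python sketch of that bookkeeping follows, assuming answers are recorded as "low"/"moderate"/"high" strings; the sample answers are invented.

```python
from collections import Counter

# Hypothetical recorded answers: focus area -> one score per question.
answers = {
    "Data Collection": ["high", "moderate", "low", "high", "moderate", "high", "moderate"],
    "Data Security": ["low", "low", "moderate", "high", "low", "moderate", "low", "moderate", "low"],
}

grand = Counter()
for area, scores in answers.items():
    counts = Counter(scores)
    grand += counts
    print(f"{area}: low={counts['low']}, moderate={counts['moderate']}, high={counts['high']}")

print(f"GRAND TOTALS: low={grand['low']}, moderate={grand['moderate']}, high={grand['high']}")
```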

Table 20. Self-assessment summary. (For each row, record the # of Low, # of Moderate, and # of High scores from the corresponding focus area table, then total each column.)

Data Life Cycle Management Component | Focus Area
CREATE | Data Collection
CREATE | Data Modeling & Design
CREATE | Subtotals
STORE | Data Architecture
STORE | Data Storage & Operations
STORE | Data Security
STORE | Data Quality
STORE | Data Governance
STORE | Data Integration & Interoperability
STORE | Subtotals
USE | Data Warehousing & Business Intelligence
USE | Data Development
USE | Data Analytics
USE | Subtotals
SHARE | Document & Content Management
SHARE | Reference & Master Data
SHARE | Metadata
SHARE | Data Dissemination
SHARE | Subtotals
GRAND TOTALS

Below is a list of recommendations to consider when developing a modern data governance approach, based on The Next Generation of Data Governance by Dave Wells. Each recommendation is grouped under one of several aspects of data governance to consider during development (Wells 2017).

• Agile data governance—Governance that adapts quickly to changes in data or analysis.
  – Focus on the value produced, not on methodology and processes. This includes value to the project and enterprise value produced by meeting governance goals.
  – Govern proactively. Introduce constraints as requirements at the beginning of a project instead of seeking remedial action at the end.
  – Strive for policy adoption over policy enforcement. Make it easy to comply with policies, communicate the reasons for policies, and communicate the value that the policies create.
  – Write brief, concise, clear, and understandable policies. Use simple language that is not ambiguous or subject to interpretation.
  – Include data governors and stewards on project teams. They bring valuable knowledge and are generally great collaborators.
  – Think "governance as a service" instead of "authority and control."
• Big data governance—Governance well suited to handling very large amounts of data.
  – Do not attempt to govern all data. Writing policies that govern all data in a big data environment, not to mention enforcing such policies, is an enormous task.

  – Focus on policies for privacy-intensive, security-sensitive, and compliance-sensitive data. This will direct governance efforts to where they will have the most impact.
  – Use automated methods to classify data. This can help identify what data are most important to govern, a process for which manual approaches often prove to be unfeasible.
  – Consider a govern-at-access approach. This approach determines the permissions of any given user at the time the user attempts to access the data, allowing for more flexibility, reactivity, and scalability than manually establishing such access beforehand (see the sketch after Figure 15).
  – Automatically detect and flag suspect access patterns. Attempts to access sensitive data on an unrecognized device or during unusual hours should be treated as suspicious.
• Cloud data governance—Governance of data on centralized cloud-based storage architectures.
  – Do not rely on physical server separations to enforce data governance. In a cloud-based data lake environment, all data reside in a central location with unified data governance. Therefore, all data policies must consider all data users, since there are no physical separations or data silos segregating user access.
  – Review national and local regulations. Some regulations have a direct impact on how cloud storage can be used.
  – Know the physical location where cloud data are stored. This may have an impact on what data regulations apply.
  – Understand how governance is enforced by cloud partners. If a partner's implementation of data governance is insufficient, a new service provider can be sought.
• Next generation data governance—Horizontal governance rather than hierarchical governance.
  – Build a governance community.
  – Focus on proactive prevention and real-time intervention. Ideally, enforcing data governance rules after the fact should be a last resort, used only if prevention and intervention efforts have failed.
  – Embrace minimalist policymaking. A small number of important policies is more scalable and interpretable than a large, complex collection of minor policies.

Frameworks for big data governance have been developed to guide the transition of organizations from traditional data governance to more modern data governance by decomposing and structuring the new data governance goals and objectives. Figure 14 presents one of these frameworks. The stated goals of this big data governance framework are to protect personal information, preserve the level of data quality, and define data responsibility (Kim and Cho 2018).

The IBM Information Governance Council Maturity Model, represented in Figure 15, establishes a multi-level process for organizations to migrate from traditional data governance to next generation data governance (Soares 2018). The model includes setting goals associated with clear business outcomes that can be communicated to executive leadership; ensuring "enablers," including having the right organizational structure and awareness to support data stewardship, risk management, and policy; establishing the core disciplines of data quality management, information life cycle management, and information security and privacy; and finally establishing the supporting disciplines of data architecture, classification and metadata, and audit information, logging, and reporting. As an organization develops capabilities within the core and supporting disciplines, it progresses further toward more modern data governance.

[Figure 14. Big Data Governance Framework (Kim and Cho 2018). Available for use under the Creative Commons License: https://creativecommons.org/licenses/by/4.0/. This image has been re-created to aid readability.]

[Figure 15. IBM Information Governance Council Maturity Model (Soares 2018).]
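To make the govern-at-access recommendation above concrete, here is a minimal sketch of an access decision evaluated at the moment of access rather than provisioned in advance. The policy details (registered devices, business hours, clearance labels) are invented for illustration, and a real implementation would log and review denials rather than silently reject them.

```python
from datetime import datetime

def allow_access(user: dict, resource: dict, now: datetime) -> bool:
    """Govern at access: the decision happens when access is attempted,
    not when accounts are provisioned. Policy details are illustrative."""
    if resource["sensitivity"] == "public":
        return True
    if user["device_id"] not in user["registered_devices"]:
        return False  # unrecognized device: treat the attempt as suspect
    if not 6 <= now.hour <= 20:
        return False  # unusual hours: deny and flag for review
    return resource["sensitivity"] in user["clearances"]

user = {"device_id": "laptop-17", "registered_devices": {"laptop-17"},
        "clearances": {"internal"}}
print(allow_access(user, {"sensitivity": "internal"}, datetime(2020, 3, 2, 14, 0)))
```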
Roles

The commoditization of data and data analysis tools has fostered the adoption of self-service data preparation and analysis, in which data tasks that were traditionally handled by an


The commoditization of data and data analysis tools has fostered the adoption of self-service data preparation and analysis, where data tasks that were traditionally handled by an expert statistician or data analyst are now performed directly by a variety of end users using visual and code-less tools requiring less technical expertise. To accommodate this move toward a distributed use of data, a distributed form of data governance has been adopted by many organizations. This approach builds on the concept of data governance roles, adding new roles that best support an expanded community of data users. Following is a brief list of data governance roles to be found in a distributed data governance model:

• Traditional roles
– Data Owner—Responsible for data access and administrative controls
– Data Steward—Responsible for data quality and meaning
– Data Custodian—Responsible for IT tasks and technical controls
• Additional roles for distributed data
– Data Curator—Responsible for cataloging and describing data sets
– Data Coach—Responsible for training and assisting data users

Within the literature, some agencies advocate for tracking additional data governance roles, such as data sponsors, data users, and data stakeholders (Wells 2019). Some agencies may find tracking additional roles, or even inventing new ones, to be useful. The key is to maintain a clear record of specific responsibilities without adding so many roles as to create unnecessary confusion or overhead. For most organizations, especially those adopting a distributed data governance approach for the first time, it is recommended to begin by focusing only on roles that are well known and well defined in the literature.

Data Governance Tracking Tool

To assist with assigning and tracking data governance roles, two template forms are included. The first, the information gathering form (Table 21), is best used when determining roles for a given data set. This form begins with basic identifying information, including the name of the data set, its logical storage address, a description of the data, and what potentially sensitive information the data set contains. It then lists each data role, provides a description of the role (including what personnel typically take it on), and provides a space for the name of the organization member filling that role. The second, the information cataloging form (Table 22), collects the data roles for each data set and condenses them into a single spreadsheet. This format allows executives or data team members to see at a glance all of their data sets and the relevant personnel associated with each (a short code sketch below shows one way to keep these forms machine readable).
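The following Python sketch models the role assignments for one data set and flattens many such forms into the single-spreadsheet layout of the cataloging form. The class and field names are illustrative assumptions patterned on Tables 21 and 22, not a prescribed schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# The five roles used in this guidebook's distributed governance model.
ROLES = ["Data Owner", "Data Steward", "Data Custodian",
         "Data Curator", "Data Coach"]

@dataclass
class DataSetGovernance:
    """One information gathering form (Table 21) for a single data set."""
    name: str
    location: str
    description: str
    sensitivity: str
    # Maps role name -> person assigned; unassigned roles are omitted.
    assignments: Dict[str, str] = field(default_factory=dict)

    def unassigned_roles(self) -> List[str]:
        """Roles that still need a person, useful when filling out the form."""
        return [r for r in ROLES if r not in self.assignments]

def cataloging_rows(forms: List[DataSetGovernance]) -> List[List[str]]:
    """Condense many forms into the single-spreadsheet layout of Table 22."""
    header = ["Data Name"] + ROLES
    rows = [header]
    for form in forms:
        rows.append([form.name] + [form.assignments.get(r, "") for r in ROLES])
    return rows

# Hypothetical example mirroring the Live Traffic Feed entry in Table 21;
# the assigned name is illustrative only.
feed = DataSetGovernance(
    name="Live Traffic Feed",
    location="Z:/DataLake/LiveFeeds/Traffic_XML/",
    description="XML data pulled from roadside sensors every 10 seconds",
    sensitivity="No sensitive information or PII",
    assignments={"Data Owner": "J. Smith"},
)
print(feed.unassigned_roles())     # roles still to assign
print(cataloging_rows([feed])[0])  # spreadsheet header row
```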

Table 21. Information gathering form.

Data Name: Live Traffic Feed
Data Location: Z:/DataLake/LiveFeeds/Traffic_XML/
Data Description: XML data pulled from roadside sensors every 10 seconds
Data Sensitivity: No sensitive information or PII

Data Governance Roles (the Personnel Filling Role column is left blank in the template):
• Data Owner—Exercises administrative control over the data. Concerned with risk management and determining appropriate access to data. This role is typically filled by the most senior executive within the division that controls, created, or most often uses the data.
• Data Steward—Ensures the quality and fitness of the data. Concerned with the meaning and correct use of data. This role is typically filled by a division subject matter expert (SME) with domain knowledge relevant to the data or by a member of the data team.
• Data Custodian—Exercises technical control over the data. Concerned with implementing safeguards, managing access, and logging information. This role is typically filled by IT personnel, such as system or database administrators.
• Data Curator—Manages the inventory of data sets. This includes cataloging the data, maintaining descriptions for the data, and recording the data utility. This role is typically filled by senior IT personnel or by a member of the data team.
• Data Coach—Collaborates with business data users to improve skills and promote data utility. This role is typically filled by a member of the data team or by a data SME within a division.

Table 22. Information cataloging form.

Data Sources Catalog Tool

It is strongly recommended that transportation agencies periodically assess what data sources are in use and what data sources are available to be used. Not only does such an assessment help prevent an agency from overlooking data sources that could be vital to current or future projects, but it also provides a better understanding of how data sets are connected, supporting the creation of a metadata catalog, planning for storage, development of new data pipelines, and better organization of an agency data lake structure. Maintaining a detailed catalog of data sources is one of the first and best ways to understand the nature of an agency's data and guide the development of the data analytics processes that can be built on it.

Provided herein is an example of how to structure a data sources catalog to summarize the specifics of each data source in a single table (see Table 23). To best review and assess available data sources, each data source is represented by its own row, with columns briefly describing the various facets of the data source. Should detailed information be needed, it is recommended to place it in an appendix to preserve the readability of the catalog. Below is a list of facets (table columns) that could be used to describe each data source, along with example entries for each (a short code sketch after Table 23 shows one way to check such a catalog programmatically):

• Data Source—The name of the data source.
– Examples—"Local PD incident data," "Traffic light sensor data," etc.
• Description—Additional distinguishing details about the data source.
– Examples—"Traffic incident performance measures from 2015 onward," "signal data from all intersections in the downtown area," etc.
• Ownership—Who owns or ultimately controls the data. This could be recorded simply as internal versus external, or the actual owners can be listed by name.
– Examples—"internal," "external," "Vendor A," "FHWA," etc.
• File Format—The format the data are provided in. This is typically an open or closed file format but can also be an API or online dashboard when working with third-party vendors.
– Examples—"csv," "xml," "json," "pdf," "API," "web-based report," "proprietary data format," etc.
• Size—How much capacity is required to store the data. This can be represented as total storage used and/or additional storage required per month, depending on the nature of the data source.
– Examples—"10TB total," "5GB per month," "300MB daily," "600MB + 50MB per month," etc.
• Cost—How much it costs to use the data. For external data sources, this is simply the amount charged by the vendor. For internal data sources, this number represents the various upkeep costs to process and manage the data.
– Examples—"$500 per month," "$15 on average per day," "$5,000 upfront and $250 per month for 4 years," etc.
• Security Level—The level of security called for by the data. The exact classifications may vary between organizations but, at a minimum, should include four basic levels: "PII" to identify the presence of PII that must be anonymized; "PII-possible" to identify data that are not identifiable by themselves but may become identifiable if combined with other data; "sensitive" to identify data that otherwise require special attention or care; and "standard" for data that call only for the standard level of security and encryption.
– Examples—"PII," "PII-possible," "sensitive," "standard," "top secret," "secret," "confidential," etc.
• Granularity—How granular or specific the data are. Typically, the lowest level of granularity is having each individual item or event represented by one row or record in the data. As data are aggregated, each row/record may represent a group of many individual items/events, which affects how the data can be used and combined with other data sets.
– Examples—"1 row per incident," "1 row per city block," "10 families per record," "incidents aggregated within 2-mile road segments each hour," etc.
• Restrictions—What restrictions are in place on how the data can be shared or used. Most commonly, these restrictions are found with external data sources whose contracts limit how the data can be used. However, they can also apply to internal data sources that use proprietary or restrictive file formats.
– Examples—"cannot distribute," "no access to raw data," "proprietary data format," "only usable with software from Vendor A," "limited to authorized users only," etc.
• Update Frequency—How often the data are updated. Most streaming data are updated in near real time, while non-streaming data may or may not have a set update schedule.
– Examples—"true real time," "near real time," "monthly," "weekly," "daily," "hourly," "upon request," "no longer updated," etc.
• Projects—A list of current or potential projects for which this data source could be useful. Very large organizations with many projects may find it helpful to separate this into two columns for better visibility: one for projects that currently use the data source and another for projects that could potentially use it.
– Examples—"In use by Project A," "potential use for Project B," "evaluation in progress for Projects C and D," "vital component of System A," "necessary for monthly newsletter," etc.
• Last Reviewed—The date when this data source was last reviewed. This field is useful both for timing regular reviews and for identifying at a glance whether new projects were created before or after the most recent data source review. A new project that could potentially use a data source is a good reason to perform a new review of that data source.
– Examples—"2019-10-01," "Q3 2019," "05/10/2019," etc.

Table 23. Data source assessment example.

• Waze Incidents—Description: traffic speeds based on global positioning system probe data; Ownership: internal; Format: XML; Size: 2.1 TB total; Cost: $70,000/year; Security Level: proprietary; Granularity: predefined roadway segments; Restrictions: cannot share without permission; Update Frequency: 1 minute; Projects: Work Zones, Signal Timing; Last Reviewed: 03/12/2019
• Snowplow AVL—Description: probe data from snowplows; Ownership: internal; Format: REST API; Size: 4 TB total; Cost: $4/truck; Security Level: no PII; Granularity: 0.01-mile point; Restrictions: none; Update Frequency: 1 minute; Projects: DOTPJ, Work Zones; Last Reviewed: 01/15/2019
• CoCoRahs—Description: certified crowdsourced weather reports; Ownership: CoCoRahs Network; Format: XML; Size: 380 MB total; Cost: free; Security Level: no PII; Granularity: interpolated from number of reports; Restrictions: none; Update Frequency: 24 hours; Projects: SNIC, possibly DOTPJ; Last Reviewed: 04/03/2019
• Incident Reports—Description: individual incident reports collected from participating local agencies; Ownership: internal; Format: CSV; Size: 500 MB total; Cost: $15/month; Security Level: sensitive; Granularity: 1 row = 1 incident; Restrictions: none; Update Frequency: monthly batch upload; Projects: A-110, possible use in A-123; Last Reviewed: 02/22/2019
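The facets above lend themselves to simple programmatic checks. The sketch below, a minimal Python example assuming catalog entries are kept as dictionaries with a last-reviewed date, flags entries that have not been reviewed recently; the field names and the 180-day threshold are illustrative choices, not recommendations from the guidebook.

```python
from datetime import date, timedelta

# Illustrative catalog entries patterned on Table 23.
CATALOG = [
    {"data_source": "Waze Incidents", "update_frequency": "1 minute",
     "last_reviewed": date(2019, 3, 12)},
    {"data_source": "CoCoRahs", "update_frequency": "24 hours",
     "last_reviewed": date(2019, 4, 3)},
]

def stale_entries(catalog, today, max_age_days=180):
    """Return sources whose last review is older than max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return [e["data_source"] for e in catalog if e["last_reviewed"] < cutoff]

print(stale_entries(CATALOG, today=date(2019, 12, 1)))
# -> ['Waze Incidents', 'CoCoRahs']  (both reviewed more than 180 days ago)
```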
Frequently Asked Questions (FAQs)

Q. What exactly is big data?
A. Big data is more than a catch phrase. At its core, big data is a set of concepts and methodologies that allows for the storage, processing, management, and analysis of extremely large, diverse, and fast-changing data sets.

As these data sets differ greatly from traditional data sets in terms of their volume, variety, and velocity, they require new and powerful ways of dealing with the data. Multiple definitions for big data are provided on page 15 of the guidebook. Table 1 on page 7 of the guidebook contrasts the fundamental differences between the traditional data systems and management approach of most transportation agencies and the modern big data approach that is needed to effectively manage data from emerging technologies.

Q. Why do we need big data?
A. Data from emerging technologies have tremendous potential to offer new insights and to identify unique solutions for delivering services, thereby improving outcomes. However, the volume and speed at which these data are generated, processed, stored, and sought for analysis is unprecedented and will fundamentally alter the transportation sector. With increased connectivity among vehicles, sensors, systems, shared-use transportation, and mobile devices, unexpected and unprecedented amounts of data are being added to the transportation domain, and these data are too large, too varied in nature, and will change too quickly to be handled by traditional database management systems. As such, modern big data methods to collect, transmit/transport, store, aggregate, analyze, apply, and share these data at a reasonable cost need to be accepted and adopted by transportation agencies if the data are to be used to facilitate better decision-making.

Q. Do other local or state agencies use big data?
A. Yes, and you can learn from their experiences. This guidebook references several agencies that have transitioned to or applied big data architectures and methodologies in different capacities and with differing levels of success.

Q. How will this guidebook help my agency?
A. This guidebook provides guidance, tools, and a big data management framework, including more than 100 recommendations, and lays out a roadmap for transportation agencies on how they can begin to shift—technically, institutionally, and culturally—toward effectively managing data from emerging technologies. The guidebook will help transportation agencies identify a jumping-off point for managing big data, as well as a step-by-step process for gradually and incrementally building toward organizational change. Figure 1 and the associated discussion on page 2 of this guidebook can help an agency understand where it might begin to apply the guidance and tools provided here.

Q. What is a data lake and how does it relate to big data?
A. Simply put, a data lake is a location where raw, unprocessed data are stored in their native form and organized, to be subsequently accessed and used by various entities within an organization. Data lakes are simple and similar to a very large folder structure where data files are collected. They are meant to store data as long as possible and at a low cost, allowing for the collection of all generated data and the creation of very large data archives. Data stored in data lakes are available to data users in a read-only format to help guarantee that the original data will never be altered or modified. Data lakes also allow data to be used by many users at once, even if they use very different analytical tools.
This contrasts with traditional data workflows, where requirements define what data should be collected and how data should be modified and stored in order to support predefined analysis tasks. When using data lakes, agencies are able to capture and store raw and unfiltered data and then explore the data and develop multiple use cases to support different areas of the organization. Indeed, within a raw data set stored in a data lake, cleaned data from the data set may be of interest to one business unit, outliers from the same data set may be of interest to another, and only a few fields from that data set may be of interest to yet another.

The data lake allows each business unit to use the same raw data independently of the others and shape them to the specific needs of their applications, business intelligence tools, and/or static reports.

Q. What skills are required for big data?
A. There is a typical set of skills required for big data work: knowledge of programming (Python, Java, Scala, or Go); modern data warehousing (data lake management); big data computation frameworks such as MapReduce, Hadoop, or Apache Spark (a brief Spark sketch follows this answer); statistics and linear algebra (e.g., summary statistics, probability distributions, or hypothesis testing); and, last but not least, business domain knowledge to have a good understanding of what hides behind the data. The skill levels can vary greatly depending on the complexity of the big data analysis to be undertaken. Additional skills may also be needed, depending on an agency's approach to big data. If the big data solution is implemented on premise, it will require a much higher level of technical expertise than a cloud solution implementation. On premise implementations require expertise in the development and management of the hardware and software of very large server clusters, and this expertise is not easily found or affordable. As such, on premise big data implementations are known to be difficult, tedious, time-consuming, and costly to implement. Cloud implementations do not require the acquisition of such skills, as those tasks fall under the responsibility of the cloud provider, leaving agencies with only the need to acquire big data expertise.
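To give a feel for the computation-framework skill named above, here is a minimal PySpark sketch. It assumes a working Spark installation (the pyspark package); the input file and its columns are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start (or reuse) a local Spark session; on a cluster the same code
# would distribute the work across many nodes.
spark = SparkSession.builder.appName("speed-summary").getOrCreate()

# Hypothetical sensor extract: columns segment_id, speed_mph, observed_at.
df = spark.read.csv("traffic_speeds.csv", header=True, inferSchema=True)

# Summary statistics per roadway segment, computed in parallel.
summary = (
    df.groupBy("segment_id")
      .agg(F.avg("speed_mph").alias("mean_speed"),
           F.min("speed_mph").alias("min_speed"),
           F.max("speed_mph").alias("max_speed"),
           F.count("*").alias("observations"))
)
summary.show()
spark.stop()
```

The same few lines run unchanged whether the input is a small test file or terabytes of sensor data, which is much of the appeal of these frameworks.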
Q. Can I do this on premise?
A. Yes, but this approach is not advisable. Not only does an on premise implementation require a great deal of expertise, it also generally requires multiple people to implement and maintain. System administrators will need to know how to administer a very large cluster of commodity servers and deal with constant failure and optimization. Developers will need to know not only languages such as Java and Scala and a variety of distributed computing frameworks such as Kafka, Apache Spark, and Hadoop, but also how to tune them so the performance of the processing jobs they develop remains acceptable. Instead, it is a recommended best practice for big data to use cloud services and architecture.

Q. Why would I move to the cloud?
A. Cloud solutions were developed to allow organizations to benefit from large computing resources without having to bear the full cost on their own. Indeed, big data projects could far exceed an organization's entire annual budget if developed and managed on premise. This is because big data projects require large bursts of computing power for short periods of time which, when implemented on premise, lead to the design of very large clusters that are seldom used to their full potential. Cloud solutions solve this problem by adopting a shared server cluster model to maximize use. To allow the use of this shared cluster of servers, cloud providers have done the heavy lifting to make data storage and processing accessible through automation or easy-to-use APIs that can be leveraged with more common scripting languages such as Python (see the sketch following this answer). This greatly reduces the amount of time and resources spent on maintaining technology and allows for more time and resources to focus on deriving a better understanding of the organization and its operations from the data.
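To make the point concrete, the following minimal Python sketch uses the boto3 library to store and list raw files in an Amazon S3 bucket, a common building block for a cloud data lake. The bucket and file names are hypothetical, the sketch assumes AWS credentials are already configured, and other cloud providers offer comparable SDKs.

```python
import boto3

# Create an S3 client; credentials are read from the standard AWS
# configuration (environment variables, config files, or an IAM role).
s3 = boto3.client("s3")

BUCKET = "agency-data-lake"  # hypothetical bucket name

# Land a raw sensor file in the data lake under a folder-like prefix.
s3.upload_file("traffic_feed_2019-10-01.xml",
               BUCKET,
               "raw/traffic/traffic_feed_2019-10-01.xml")

# List what has been collected so far under the raw traffic prefix.
response = s3.list_objects_v2(Bucket=BUCKET, Prefix="raw/traffic/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```

No server provisioning or cluster administration appears anywhere in this code; the provider handles that behind the API.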

Q. When would I need to use machine-learning algorithms?
A. Machine-learning algorithms are a subset of data science techniques that use a multi-step machine-learning process to perform advanced analyses. Because these applications rely on a huge amount of unstructured data to be effective, they are not something that most transportation agencies will need to concern themselves with until they have built a very mature set of big data management approaches. It will be more effective for agencies to focus first on using the guidance in this document to collect large amounts of data that are properly cleaned, stored, enriched, analyzed, and visualized before diving into deep learning. That said, once a strong data management foundation is in place and appropriate data have been acquired, deep learning algorithms can be used to support computer vision applications, classify existing data, and predict the attributes of future data. Computer vision, in which a machine can be trained to distinguish and classify objects in image and video files, can be useful in turning roadside camera recordings into traffic observations or passenger records.

Q. What is data governance?
A. The DAMA Dictionary of Data Management defines governance as "the exercise of authority, control, and shared decision making (e.g., planning, monitoring, and enforcement) over the management of data assets" (DAMA International 2011). Data governance is a collection of practices and processes that help to ensure the formal management of data assets within an organization, including the planning, oversight, and control over the management of data and the use of data and data-related resources. Data governance puts in place a framework to ensure that data are used consistently and consciously within the organization. Data governance also deals with quality, security and privacy, integrity, usability, integration, compliance, availability, roles and responsibilities, and the overall management of internal and external data flows within an organization (Roe 2017).

Q. What is big data architecture?
A. Big data architecture is the overarching system definition that an organization uses to build its big data environment and steer its data analytics work. Big data architecture is the foundation for a big data environment and consists of four logical layers:
• Big data sources layer
• Data massaging and storage layer
• Analysis layer
• Consumption layer
In addition to the logical layers, four major processes operate cross-layer in a big data environment (Taylor 2017):
• Data source connection
• Governance (privacy and security)
• Systems management (large-scale distributed clusters)
• Quality control

Q. When do you know it is time to begin working with big data?
A. For many agencies, the best time to start working with big data was probably a decade ago; the second best time is today. There are two reasons to begin adopting modern data management practices: when new approaches will reduce costs or improve efficiencies, or when a new use case or application is identified that requires them. Often, migrating from siloed data storage to a cloud-based data lake alone will result in enough workflow improvements and cost reductions to make the pursuit of big data management worthwhile. Any agency that is unsure if it is the right time to modernize data management approaches may be well served by having a small team review current practices to identify potential cost savings from adopting new approaches.

This same team may also review available data sets or recent big data-enabled achievements from their closest peers to see if they could benefit from pursuing new data products. The accompanying data management capability maturity self-assessment (DM CMSA) tool can be useful in identifying areas of improvement in data management, while the data sources catalog tool can be useful in identifying the potential of new and existing data sets.

Q. How do we ensure that the third parties we work with are keeping the data secure?
A. When working with third-party data providers, the best way to ensure strong data security practices is to incorporate clear requirements into the negotiated contract. These requirements should be flexible enough to accommodate updated technology while including specific requirements that leave no room for ambiguity or loopholes. The agency must also have some means of monitoring for non-compliance so that these requirements can be effectively enforced. When working with cloud service providers, an agency may not have the leverage or opportunity to include data security enforcement clauses in the contract. If this is the case, then the next best option is to fully research and understand the standard security measures used by the provider. Fortunately, due to economies of scale and the vital importance of trusted data to their business model, nearly all major cloud service providers maintain standard levels of security that exceed those found at most transportation agencies.

Q. Some data management frameworks include a step where data are destroyed, where old or seldom-accessed data are deleted to preserve space in the system. Why is that step not included in the data management framework in this guidance?
A. Traditionally, there was a focus on managing free space on a data storage system in order to avoid unnecessary costs. One of the benefits of modern data management approaches is that managing free space on a server is automatic. For example, most modern cloud storage providers will monitor how frequently data are accessed and automatically migrate data that are seldom used to archival storage, in a process that is transparent to the end user (a sketch of such a lifecycle rule follows this answer). This obviates the need to hire data maintenance workers to manually monitor and move data from active storage to archival storage. Furthermore, the costs and benefits of storing unused data have changed. In traditional data storage models, the server costs are the same whether the data are accessed or not. In the modern use-based fee models that are common among cloud providers, data that are seldom accessed are less expensive to store, making long-term storage of unused data more feasible. With the advent of big data analytical techniques that can turn large amounts of seemingly uncorrelated data into actionable insights, the potential value of collected data is generally higher as well. Because data are generally more useful to retain, less expensive to store, and easier to archive, there is no longer the need to expend as much energy on purging data. There are exceptions, where data are sufficiently large or expensive that long-term storage becomes unfeasible, but these exceptions are rare enough that it is advisable to preserve data as much as possible; they are not sufficient to merit the inclusion of a "destroy" step in the framework.
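As an illustration of the automatic archival mentioned above, the sketch below uses boto3 to attach a simple age-based lifecycle rule to an S3 bucket that transitions objects to the low-cost Glacier storage class after 90 days (access-frequency-based tiering services also exist). The bucket name, prefix, and 90-day threshold are illustrative assumptions, and equivalent mechanisms exist on other cloud platforms.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical rule: after 90 days, move objects under archive-eligible/
# to Glacier-class storage; retrieval remains possible, just slower.
lifecycle = {
    "Rules": [
        {
            "ID": "archive-rarely-used-data",
            "Status": "Enabled",
            "Filter": {"Prefix": "archive-eligible/"},
            "Transitions": [
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
        }
    ]
}

s3.put_bucket_lifecycle_configuration(
    Bucket="agency-data-lake",  # hypothetical bucket
    LifecycleConfiguration=lifecycle,
)
```

Once such a rule is in place, no staff time is spent moving old data; the provider applies the policy continuously.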
Q. What does it mean to be data driven in a big data environment?
A. To be data driven means that progress in an activity is compelled by the data itself, not by intuition, personal experience, or political agenda. While transportation agencies have been using data to make "informed" decisions for many years, big data are too large, too fast, and change too quickly to be processed and understood by humans for informed decision-making. Through the processing of integrated and complex data sets, big data methodologies can provide decision-makers with more detailed, intricate, and timely outputs on which to base their decisions, outputs that simply cannot be offered by the siloed transportation agency data of today.

Q. Is the flood of big data really coming?
A. It is already here! Private industry has been using and generating data at an increasingly rapid rate for several years now. Some transportation agencies may be waiting for the flood of data to hit them before they pursue modern data management practices. It is commonly the case that agencies do not recognize or pursue a new data opportunity and thus are never forced to act; they simply lose the chance to exercise control over the data as private industry fills the gap. One example of this is online 511 systems. Private sector offerings like Waze and Google Maps have exceeded the capabilities of most, if not all, public transportation agencies. Had these agencies developed high-quality online 511 systems in the first place, they could have exerted beneficial control over them, such as not posting the locations of law enforcement vehicles, which can affect the safety of officers. Now that market share has become dominated by private sector offerings, that window of opportunity may be difficult for any agency to re-open. If agencies do not invest in their own data systems, they will be forced to pay third-party vendors if they ever want to use emerging technology data, and these vendors may feel free to charge whatever rates they choose if they perceive they are working with an agency that has no other options.

Q. Who determines the quality of the data?
A. When obtaining data from third-party sources, it is recommended to include minimum expected levels of data quality in the negotiated contract, along with clear repercussions for failing to meet them. The contracting agency should then employ some base level of in-house data expertise so that the agency can independently verify the quality of the data coming in. With such an arrangement, the third-party data provider is responsible for providing high-quality data while the contracting agency takes on responsibility for validating that quality. When dealing with internal data, the dynamics are the same: the data creator or data pipeline owner is responsible for the quality of the data, which is independently verified. Effective data quality verification processes employ both an automated validation process and periodic manual reviews. Because different applications require different levels of data quality, it is recommended that low-quality data be flagged with a data quality score rather than discarded entirely (see the sketch following this answer). This allows data analysts to make an informed decision as to which data sets are of sufficiently high quality to be included in any given analysis.
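The following minimal Python sketch illustrates the flag-rather-than-discard approach, assuming speed records arrive as dictionaries; the validity rules and the 0-to-1 scoring scheme are illustrative assumptions, not quality thresholds from the guidebook.

```python
# Illustrative validity checks for a speed record; each returns True
# when the record passes that check.
CHECKS = [
    ("has_segment", lambda r: bool(r.get("segment_id"))),
    ("speed_present", lambda r: r.get("speed_mph") is not None),
    ("speed_plausible", lambda r: r.get("speed_mph") is not None
                                  and 0 <= r["speed_mph"] <= 120),
]

def score_record(record):
    """Attach a 0-to-1 quality score instead of discarding bad records."""
    passed = sum(1 for _, check in CHECKS if check(record))
    record["quality_score"] = passed / len(CHECKS)
    return record

records = [
    {"segment_id": "I-80-12", "speed_mph": 61.5},
    {"segment_id": "", "speed_mph": 310.0},  # fails two of three checks
]
for r in records:
    print(score_record(r))
# Analysts can then filter on quality_score per application, e.g.,
# keeping only records with quality_score >= 0.67 for a speed study.
```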
Q. What can be done if an agency is unable to obtain any big data?
A. Even if an agency does not have the inclination or resources to pursue large data sets immediately, there are still benefits to modernizing data management practices. The advice contained in this guidebook relating to eliminating data siloes, implementing procedures for data quality management, and creating effective practices for data product development can be useful even when working only with smaller, more traditional data sources. Following such guidance wherever it is relevant may not only improve the handling of traditional data sources but will also help prepare an agency for managing big data if the agency decides to pursue such data sets in the future.

Q. What is the value proposition in sharing data with others?
A. Most of the impetus behind open data policies and sharing data with external stakeholders is to foster innovation and promote the development of data products that may or may not directly benefit the agency sharing the data.

That said, there are situations where cost sharing may be appropriate. If two or more agencies collaborate on a single project, they may agree to contribute equally to the project's development. There may also be situations where a transportation agency that is providing data to a partner gains compensation or control from the arrangement. For example, one transportation agency shares data with private mobility-on-demand companies through a live API; however, that API will not respond to location requests sent from inside public parks. By sharing data in this way, the agency was able to gain a measure of control over private industry behavior that benefited the public it serves.

Q. What formats and standards are used when sharing data?
A. Using open source data formats is strongly recommended whenever data are shared across internal applications or with external users, to ensure that the data can be applied to multiple use cases without transformation. For example, the XLSX format is specific to the Microsoft Excel application, so spreadsheet data shared in this format will only be accessible to users who have licensed Microsoft Excel. If that spreadsheet data are instead shared in the open source CSV format, then the data can be accessed by far more applications and therefore reach a wider audience (see the sketch following this answer). Other examples of commonly used open source data formats include JSON, GeoJSON, and KML. Several successful open data platforms share data in multiple formats, allowing users to select the format that works best for them. Consideration should also be given to the audience of the data and the intended use. Data aimed at a non-technical audience for manual review may be best delivered as an interactive web-based visualization, whereas data provided to partner companies for real-time application queries are better served through an API. Regardless of the delivery system that is used, the associated metadata and retrieval processes ought to be documented as well as possible to minimize the number of questions that must be fielded by department personnel over the lifetime of the data-sharing system.
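As a small illustration of opening up a closed format, the sketch below uses the pandas library to convert a spreadsheet into CSV and JSON. It assumes pandas is installed along with an Excel reader engine such as openpyxl, and the file names are hypothetical.

```python
import pandas as pd

# Read a closed-format spreadsheet (requires an Excel engine such as
# openpyxl to be installed alongside pandas).
df = pd.read_excel("incident_summary.xlsx")

# Publish the same data in open formats that nearly any tool can consume.
df.to_csv("incident_summary.csv", index=False)
df.to_json("incident_summary.json", orient="records")
```

Publishing both formats side by side, as several open data platforms do, lets each consumer pick whichever suits their tools.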
