National Academies Press: OpenBook

Improving Management of Transportation Information (2013)

Chapter: Part 1 - Terminology and Categorization Standardization

Suggested Citation:"Part 1 - Terminology and Categorization Standardization." National Academies of Sciences, Engineering, and Medicine. 2013. Improving Management of Transportation Information. Washington, DC: The National Academies Press. doi: 10.17226/22504.


The information produced, managed, and used by transportation professionals has undergone a transformation from primarily static narrative documentation to dynamic databases used to produce reports and visualizations. GIS have heavily affected transportation information: transportation infrastructure and use are location-based, so geospatial visualization appeals to transportation data users. Narrative reports and other types of documents are still important, but most of these are produced digitally, in serial versions that need to be managed and may need to be synchronized with the related data sources. These factors have created new challenges for information professionals, who must identify, describe, and manage transportation information items so that they can be found and used by transportation professionals to accomplish their daily tasks.

Part 1 provides an overview of the current transportation information management landscape and makes recommendations related to

• Developing a common categorization scheme for transportation information management and identifying enhancements in detail or scope of information that should be included in such schemes.
• Strategies for developing a common terminology and categorization scheme that could be made available for use by state DOTs.

Part 1 also discusses practices from other fields that may be adapted for improving DOT management of transportation information (e.g., guidance for file formats, naming conventions, and information preservation strategies).

Types of Transportation Information That Need to Be Managed

DOTs are responsible for numerous reports and considerable information and data, including project information, systems conditions/performance data and reports, research reports, administrative information, and inventories.
A common way to assess the types of information an organization needs to manage is to consider the key business functions or activities of the organization and the types of information produced or consumed in accomplishing each function. Exhibit 1-1 illustrates how the scope of transportation information created as part of common DOT business functions includes data or data-generated information formats such as computer-aided design (CAD), GIS, and other computer-generated graphics. Often these data-based visualizations are generated for use as part of a document or for presentation as a document.

Maximize Information Use and Value

DOT information is created to support a business function or activity, although the intended use of the information may not always be readily apparent. It is even more difficult to anticipate and envision future secondary and potential tertiary uses of that information. For example, information may be generated as part of an immediate operational activity such as accessioning assets,¹ which is part of the DOT asset management function. Later that same information may be analyzed to produce an asset maintenance plan. If DOT information is to realize its full value as a resource, it must be created and maintained in a form that will support such primary, secondary, and tertiary activities. To maximize the value of DOT information, consider how information can be structured to maximize its potential uses. Also consider when it is appropriate to archive and/or purge it from an active collection.

To maximize the usefulness of DOT information, it is necessary to consider the key information management activities in the transportation domain. Consider the primary purpose of each application and how it acquires, organizes, retrieves, secures, and maintains information to support specific objectives.

The Relationship Between Data and Content Management

The specific types of resources that transportation data and information applications have handled have changed over time. Prior to the mid-1980s, most information was in databases and was primarily numerical in its basic format. Although word processing was available prior to the adoption of the personal computer (PC), this was primarily a centralized operation organized along the lines of a typing pool. As the PC emerged, word processing became ubiquitous and the production of text files and presentations led to an explosion of documents. The emergence of information networks and widespread adoption of email and the web accelerated the trend; today social media continues to drive the growth.
These trends have motivated an evolution in thinking about transportation information, from data management to content management.

Exhibit 1-1. Types of transportation information by function.
Format columns: Documents, Data Tables, CAD, GIS, Graphics.
Function: Information Types
Project Information: Engineering (e.g., Drawings); Specification; Performance Test; Planning Study; Technical Reports (e.g., Materials Research); Environmental Report
System Condition/Performance: Traffic Data; Safety Data; Performance Report
Research: Research Report
Administration: Financial; Contact Information
Inventory: GIS Data; Asset Inventory Database

¹ According to AASHTO, “Transportation Asset Management is a strategic and systematic process of operating, maintaining, upgrading, and expanding physical assets effectively throughout their lifecycle.” Accessioning assets is the process of adding a new asset (such as streets, signs, curbs, gutters, or rights-of-way) into an asset management system. See FHWA Asset Management Overview, December 2007 (http://www.fhwa.dot.gov/asset/if08008/assetmgmt_overview.pdf).

Data management comprises processes and technologies for collecting and managing data so that it can be used to accomplish tasks effectively. For transportation information management, we are primarily concerned with (1) ensuring the quality of authoritative data, (2) identifying and providing appropriate access to authoritative data to accomplish business functions, (3) generating visualizations or analyses based on processing data, and (4) synchronizing visualizations and narratives with source data. A key aspect of data quality is to ensure that data values exist and that they are consistent. Other aspects of data quality include accuracy, timeliness, validity, and completeness. When data quality needs to be maintained across data sources, extra work is required to obtain a set of consistent values (e.g., for the names of organizations such as government agencies or contractors). Data values that have been assembled and mapped from across multiple data sources are called reference data. Alternatively, a common set of data values may be used (e.g., ISO 3166 to identify countries).

Content management encompasses processes and techniques for collecting, managing, and publishing information so that it can be found and used to accomplish tasks effectively. For transportation information management, we are primarily concerned with (1) identifying and providing appropriate access to authoritative versions of content items, (2) providing adequate descriptions of content items so that they can be found and used, (3) linking to data sources for visualizations and analyses included in narratives, and (4) linking to related content items. Effective content management requires the generation and use of complete and consistent metadata values.
Although system-generated metadata (e.g., unique identifiers and last-modified dates) are readily available, descriptive metadata (e.g., topic) frequently do not exist because they are not required by organizational business processes. With full-text search readily available, the value of author-generated metadata is not always considered worth the effort.

Content Structure

Although the definitions of data and content management are aligned, the types and formats of information being managed are different. Given that most information forms today include structures amenable to automated processing, it is more useful to describe content items using a continuum of more to less structured. Every type of content item has some data or metadata associated with it. However, data management applications are typically structured to answer questions such as “What is the current balance in a program fund account?” It is much more difficult to ask “What was the balance in a program fund account a year ago?” Similarly, computer file servers can reveal the last-modified date of a piece of content, but it is usually not possible for them to reveal the effective date of a piece of content unless a human editor has added that information. For example, the effective date of a regulation is normally different from the date of the legislation or the date of an announcement in the Code of Federal Regulations (CFR).

Processing

Another trend is the processing of content to identify meaningful patterns (e.g., identifying patterns among the words and phrases in the text and extracting named entities, such as locations, organizations, or people mentioned in the text) and presenting these patterns using some form of visualization such as GIS maps. Analytics is the processing of content into a data representation, where all types and forms of content can be reduced to a set of data values. Managing content across an enterprise encompasses both data and content management. Sometimes this is referred to as enterprise content management and sometimes simply as data management.

Linking Source Data to Published Analysis in Documents

An important challenge is managing heterogeneous content (i.e., narrative content), which may be based on structured data sets and include visualizations of that data. Providing dynamic methods to directly link narrative content to such source data is becoming necessary. It is no longer sufficient to manage such narrative content simply as a static content item. For example, a research report on highway safety that includes tables of data, charts, and maps needs to be linked explicitly back to the data sources so that further analysis of the same data set can readily be replicated, or new analyses performed.

Lifecycle, Workflow, Archiving

A content set will typically evolve through drafts and versions and will often have associated annotations and commentary. Today’s information manager must manage and synchronize multiple versions of overlapping sets of heterogeneous sources. For example, a PowerPoint report on material properties of highway surfaces will typically be developed through many drafts and versions for different audiences (e.g., engineers and budget analysts). The manager must keep track of the multiple versions and determine which is the most current or which is the official document of record. Although this is a difficult task, it can be addressed with versioning software available in document management systems.
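The distinction drawn above between a system-generated last-modified date and an editor-supplied effective date, together with the need to tell the most current version from the official document of record, can be sketched as a small metadata record. This is only a minimal illustration in Python: the `ContentVersion` class, its field names, and the sample values are hypothetical and are not taken from any DOT system or document management product.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ContentVersion:
    """One version of a content item (all field names are illustrative)."""
    identifier: str                 # system-generated unique identifier
    version: int                    # draft or version number
    last_modified: date             # system-generated, e.g., from a file server
    effective_date: Optional[date]  # must be supplied by a human editor
    audience: str                   # e.g., "engineers" or "budget analysts"
    is_record: bool = False         # True for the official document of record


def most_current(versions: List[ContentVersion]) -> ContentVersion:
    """Return the version with the latest last-modified date."""
    return max(versions, key=lambda v: v.last_modified)


def version_of_record(versions: List[ContentVersion]) -> Optional[ContentVersion]:
    """Return the version flagged as the official record, or None."""
    for v in versions:
        if v.is_record:
            return v
    return None


# Hypothetical sample data: three versions of one report.
versions = [
    ContentVersion("rpt-001", 1, date(2012, 3, 1), None, "engineers"),
    ContentVersion("rpt-001", 2, date(2012, 4, 15), date(2012, 5, 1),
                   "engineers", is_record=True),
    ContentVersion("rpt-001", 3, date(2012, 6, 2), None, "budget analysts"),
]

print("Most current version:", most_current(versions).version)
print("Version of record:", version_of_record(versions).version)
```

In this sketch the most recently modified version and the version of record are deliberately different, which is the situation the information manager has to resolve; the effective date stays empty until someone records it.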
Metadata Standards

Metadata standards provide basic guidelines to ensure common description of content so that it can be found and used within and across applications, repositories, and organizations. Metadata should be associated with all types of content items, including documents, data sets, and visualizations. Metadata may be embedded in the content item or in a separate metadata database with identifiers to link the content item and the metadata database record. Metadata will be generated when the content item is created as well as each time the content item is used throughout its lifecycle. Ideally, metadata should provide a longitudinal record covering the life of the content item.

Two ISO metadata standards are particularly relevant to transportation information: ISO 15836, Dublin Core; and ISO 19115, Geographic Information - Metadata. Another potentially relevant ISO standard is ISO 11179, Information Technology - Metadata Registries (MDR).

ISO 15836 (Dublin Core)

Referred to as the Dublin Core (which refers to Dublin, Ohio, the site of the meeting where this standard originated), ISO 15836² is the standard for describing content published on the web. This ISO standard defines 15 properties for use in resource description as shown in Exhibit 1-2.

² http://www.dublincore.org/documents/dces/.

Dublin Core properties can be expressed as HTML meta tags or as RDFa, an HTML extension useful for marking and publishing metadata as linked data (i.e., publishing structured data so that it can be interlinked with data items from different data sources). Dublin Core has been widely adopted in government and business as the basic properties for describing content items. OMB Circular A-130 recommends that federal public websites use encoded Dublin Core metadata in the headers of their HTML pages.³ These properties are available in commercial software such as Microsoft SharePoint (MS SharePoint), where they are represented as metadata columns.

Exhibit 1-2. Dublin Core properties.
Metadata Grouping: Core Properties (Purpose)
Subject: Subject, Type, Coverage (answers the questions What, Where, and Why)
Use: Date, Language, Rights (answers the questions When and How)
Asset: Identifier, Creator, Title, Description, Publisher, Format, Contributor (answers the question Who)
Relational: Source, Relation (provides links)

ISO 19115 and the Federal Geographic Data Committee (FGDC)

ISO 19115 is the standard for describing geographic information and services. The FGDC endorsed this geographic information metadata standard in 2010. This common metadata standard for the description of GIS data sets has enabled the building of the National Spatial Data Infrastructure (NSDI) Clearinghouse Network. Geographic data, imagery, applications, documents, websites, and other resources have been cataloged for the NSDI Clearinghouse Network. ISO 19115 metadata can be searched to find geographic data, maps, and online services. A description of the NSDI Clearinghouse Network from the FGDC Data and Services webpage follows:

“The Clearinghouse Network is a community of distributed data providers who publish collections of metadata that describe their map and data resources within their areas of responsibility, documenting data quality, characteristics, and accessibility. Each metadata collection is hosted by an organization to advertise their holdings within the NSDI. The metadata in each registered collection is harvested by the geo.data.gov catalog to provide quick assessment of the extent and properties of available geographic resources.”⁴

DOTs can participate in the NSDI Clearinghouse Network to help meet their GIS needs and to make use of their GIS resources. This is one way to benefit from adopting the FGDC metadata standard.

³ http://www.howto.gov/web-content/manage/categorize/meta-data.
⁴ http://www.fgdc.gov/dataandservices.
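Expressing Dublin Core properties as HTML meta tags, as mentioned above, can be sketched as follows. This is a minimal illustration in Python rather than a prescribed implementation: the function name `dublin_core_meta_tags` is hypothetical, the `DC.` name prefix with a `schema.DC` link follows the common DCMI convention for HTML, and the sample record reuses the citation details given at the top of this page.

```python
from html import escape

# The 15 Dublin Core element names.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def dublin_core_meta_tags(record):
    """Render a dict of Dublin Core values as HTML <meta> tags.

    Keys that are not Dublin Core element names are ignored, and
    elements without a value are skipped.
    """
    lines = ['<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">']
    for name in DC_ELEMENTS:
        value = record.get(name)
        if value:
            content = escape(str(value), quote=True)
            lines.append('<meta name="DC.%s" content="%s">' % (name, content))
    return "\n".join(lines)


record = {
    "title": "Improving Management of Transportation Information",
    "publisher": "The National Academies Press",
    "date": "2013",
    "identifier": "doi:10.17226/22504",
}

tags = dublin_core_meta_tags(record)
print(tags)
```

The output is intended for the head section of an HTML page. Descriptive properties such as subject would still have to be supplied by an author or cataloger, which is the effort the text notes is often skipped.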

Open Data and Digital Government Initiatives

Under the past two administrations, the White House has promoted information management good practices in U.S. government agencies that take advantage of the current and emerging networked information ecosystems. These initiatives have sought to improve customer service, efficiency, effectiveness, accountability, and transparency.

Digital Government Strategy

Early websites consisted of static HTML pages usually authored by hand. Web content management (WCM) applications provide interfaces for authoring web pages and include features such as being able to preview what the HTML coding will look like before the page is published to a website. WCM applications also provide templates for creating different types of content and for publishing those pages with a particular design. Such templates can query a database or content repository to find content items to populate presentation templates, thus enabling an early version of dynamic web pages. This model of separating content creation from content publishing has become a key part of ECM strategies.

The Digital Government Strategy is a conceptual model with three service layers: (1) the information layer, (2) the platform layer, and (3) the presentation layer.⁵ The information layer contains structured, digital information (e.g., traffic and highway safety data) as well as unstructured information (content) (e.g., fact sheets, press releases, and compliance guidance). The platform layer includes the systems and processes used to manage this information. The presentation layer defines how information is organized for presentation to users via websites, mobile applications, or other modes of delivery. These three layers, as illustrated in Exhibit 1-3, separate information creation from information presentation, thus allowing content and data to be created once and then used in different ways.

Exhibit 1-3. Digital government services model. Source: Conceptual Model. In: Office of Management and Budget. Digital Government Strategy (http://www.whitehouse.gov/sites/default/files/omb/egov/digital-government/digital-government.html).

⁵ http://www.whitehouse.gov/sites/default/files/omb/egov/digital-government/digital-government.html.

There are similarities between the Digital Government Strategy and the Transportation Knowledge Network (TKN) as described in NCHRP Report 643.⁶ The TKN vision is one of a portal enabled by information standards as well as communities of practice (COPs). The intention is very much the same as that of the Digital Government Strategy, but the architecture has been updated. Whether implemented as a portal, wiki, or service-oriented architecture, the intention is the same: to enable wider access to transportation information at the local, regional, and national levels.

Open Data

For many years, publishing government data (both datasets and bibliographic metadata) was done primarily by commercial publishers, and database publishing continues to this day. With the advent of the web, public government data began to be more widely and freely published, but on a voluntary basis.

The President’s 2010 Memorandum on Transparency and Open Government, OMB Memorandum M-10-06, the Open Government Directive,⁷ made it a priority for the public to be able to easily find, download, and use datasets generated and held by the federal government. Data.gov was created as a catalog to provide descriptions of these datasets. The federal open data policy has greatly accelerated the trend to make datasets from all levels of government publicly available. Under the data.gov model, DOTs can develop applications based on the datasets they have for traffic, safety, and other areas of public interest or simply publish the datasets and let third parties develop those applications. Exhibit 1-4 is a dynamically generated traffic map made available by a third party based on California DOT (Caltrans) data.

Transparency, Application Programming Interfaces (APIs), Third-Party Applications

The Digital Government Strategy is intended to facilitate the development of services that use the information layer through the federal government open data policy and thus provide transparency. These services may be developed by the government agencies that produce the data, or they may be developed and deployed by other government agencies or other parties outside government. There are several architectures for doing this: (1) download the dataset and implement it as a stand-alone service on your own server, (2) dynamically query the information layer, or (3) link to the information layer. Dynamically querying the information layer can be done using web services description language (WSDL) or, if not supported, via an API or custom application. Linking to the information layer requires the agency to publish its information as a linked data service (i.e., publishing structured data so that it can be interlinked with data items from different data sources). (Linked data is the basis for the semantic web, which is discussed in the next section.)

Making APIs available for web developers has been popular. The Google Maps API is well known and widely used to represent data that has geospatial information on a Google map. All mapping applications have APIs to represent geospatial data using one or more of their map layers. Exhibit 1-4 is an example of a mashup⁸ that combines traffic data with a GIS map representation. Several information services make APIs available so that developers can access and reuse data. For example, the New York Times makes APIs available for article search, Congressional information, location concepts, and many others.⁹

⁶ http://onlinepubs.trb.org/onlinepubs/nchrp/nchrp_rpt_643.pdf.
⁷ http://www.whitehouse.gov/sites/default/files/omb/assets/memoranda_2010/m10-06.pdf.
⁸ A mashup is a web application that combines data from two or more sources and presents it in a single webpage.
⁹ http://developer.nytimes.com/docs.
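The mashup idea described above, combining data from two sources for display on a map, can be sketched in a few lines. This is an illustrative example only: the station identifiers, speeds, and coordinates are hypothetical sample values, no particular agency API is assumed, and a real application would retrieve both datasets from published services rather than from in-memory dictionaries. The output uses the GeoJSON format, which many mapping libraries accept as a layer.

```python
import json

# Hypothetical source 1: current speeds keyed by station identifier.
traffic = {
    "S1": {"speed_mph": 42},
    "S2": {"speed_mph": 65},
}

# Hypothetical source 2: station locations keyed by the same identifier.
stations = {
    "S1": {"lat": 38.58, "lon": -121.49},
    "S3": {"lat": 34.05, "lon": -118.24},
}


def mashup(traffic, stations):
    """Join traffic readings to station locations as a GeoJSON FeatureCollection.

    Readings without a known location are skipped.
    """
    features = []
    for station_id in sorted(traffic):
        location = stations.get(station_id)
        if location is None:
            continue
        features.append({
            "type": "Feature",
            # GeoJSON orders coordinates as [longitude, latitude].
            "geometry": {
                "type": "Point",
                "coordinates": [location["lon"], location["lat"]],
            },
            "properties": {
                "station": station_id,
                "speed_mph": traffic[station_id]["speed_mph"],
            },
        })
    return {"type": "FeatureCollection", "features": features}


layer = mashup(traffic, stations)
print(json.dumps(layer, indent=2))
```

The join depends on both sources sharing a station identifier, which is one practical reason the consistent reference data discussed earlier matters for third-party reuse.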

Exhibit 1-4. Traffic data map generated using web API (www.trafficpredict.com).

The Semantic Web

The term “semantic web” refers broadly to an extension of the World Wide Web that enables people to share content beyond the boundaries of applications and websites, the principles and practices underlying that extension, and the loose organization of people and institutions engaged in developing and institutionalizing those principles and practices. Underlying the semantic web is the idea that meaningful content on web pages (such as names of people or organizations, locations of events or resources, and dates) can be tagged or presented in standard formats to facilitate finding and presenting that content. This semantic coding enables, for example, web browsers to recognize semantic content so that unstructured documents can be used like structured data and linked dynamically to data sources and visualizations. The semantic web draws on and is giving rise to various tools and methods such as the Standard Generalized Markup Language (SGML), Resource Description Framework (RDF), and Extensible Markup Language (XML); those most likely to be relevant to transportation information management are described below.

RDF Schema (RDFS) Vocabulary Description Language

The Resource Description Framework (RDF) is a general-purpose language for representing information on the Web; it is commonly serialized in XML. RDFS is an RDF extension used to describe groups of related resources and the relationships between these resources. A vocabulary description language is a way to represent the components of a schema and the relationships between these components. For example, Dublin Core or ISO 19115 can be represented in RDFS.

Namespaces for XML Schema Elements and Attributes

A namespace is a specification for unique identifiers for labels. A namespace disambiguates labels that are otherwise the same (e.g., homographs) by giving each a unique and referenceable identifier.
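As an illustration only, the disambiguating role of a namespace can be sketched with Python's standard XML parser. The two elements below share the label "title" but belong to different namespaces, so the parser reports them under different qualified names. The Dublin Core namespace URI is the published one; the second namespace (example.org) and the sample values are hypothetical.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"   # Dublin Core elements namespace
HR_NS = "http://example.org/hr/terms/"       # hypothetical namespace, for illustration

# Two homograph labels ("title") qualified by different namespaces.
document = f"""
<record xmlns:dc="{DC_NS}" xmlns:hr="{HR_NS}">
  <dc:title>Improving Management of Transportation Information</dc:title>
  <hr:title>Research Librarian</hr:title>
</record>
""".strip()

root = ET.fromstring(document)

# ElementTree expands each prefixed tag to "{namespace-uri}local-name".
qualified_tags = [child.tag for child in root]

# The namespace, not the bare label, selects the intended element.
document_title = root.find(f"{{{DC_NS}}}title").text
job_title = root.find(f"{{{HR_NS}}}title").text

print(qualified_tags)
print(document_title, "|", job_title)
```

Because each label is resolved against its namespace URI, an application can tell a document title from a job title even though both are written as "title" in the source record.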

For example, an application could query a schema via a web service when it is necessary or useful to interact with an application that uses that schema. Exhibit 1-5 shows the Dublin Core namespace, which is used to discover the semantics and syntax of the Dublin Core elements. In this way, the semantics of XML schema elements and attributes are a type of vocabulary, i.e., a controlled list of values with a specific meaning and purpose.

Exhibit 1-5. Dublin Core Metadata Initiative® elements namespaces.

Term Name | URI
contributor | http://purl.org/dc/elements/1.1/contributor
coverage | http://purl.org/dc/elements/1.1/coverage
creator | http://purl.org/dc/elements/1.1/creator
date | http://purl.org/dc/elements/1.1/date
description | http://purl.org/dc/elements/1.1/description
format | http://purl.org/dc/elements/1.1/format
identifier | http://purl.org/dc/elements/1.1/identifier
language | http://purl.org/dc/elements/1.1/language
publisher | http://purl.org/dc/elements/1.1/publisher
relation | http://purl.org/dc/elements/1.1/relation
rights | http://purl.org/dc/elements/1.1/rights
source | http://purl.org/dc/elements/1.1/source
subject | http://purl.org/dc/elements/1.1/subject
title | http://purl.org/dc/elements/1.1/title
type | http://purl.org/dc/elements/1.1/type

Namespaces for Category Values for Named Entities

Other types of vocabularies that are important for the semantic web are “named entity” vocabularies. Named entities are the names of people, organizations, locations, events, topics, and other things with proper names or other specific, controlled names. Authoritative lists of such names are named entity vocabularies. The RDFS vocabulary description language can also be used to describe the values in named entity vocabularies, the relationships among those values, and related values within a particular vocabulary or within another namespace.

SKOS (Simple Knowledge Organization System) is the World Wide Web Consortium (W3C) specification for representing knowledge organization systems using RDF. SKOS makes an important distinction: the discrete components of a categorization scheme (often called nodes) are “concepts,” and various labels and other types of information can be associated with a concept. Exhibit 1-6 illustrates an example of a concept, the U.S. FHWA Research Library. The actual concept is an identifier, in this case the Library of Congress Name Authority File identifier for the U.S. FHWA Research Library (http://id.loc.gov/authorities/names/no2012007308). Each lexical relationship (in this case, equivalent relationships) can be represented as a subject-predicate-object triple, as shown in the table at the bottom of the diagram. A SKOS representation can easily be extended to add information about a concept by adding another row in the triple table, with the concept’s identifier as the Subject.
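The idea of a concept identifier carrying several labels as triples can be sketched in a few lines of Python. This is a minimal illustration, not a reproduction of Exhibit 1-6: the concept URI is the one cited in the text and the SKOS namespace is the published one, but the label strings are placeholders chosen for the example rather than values taken from the Library of Congress record.

```python
# Subject: the concept identifier cited in the text.
CONCEPT = "http://id.loc.gov/authorities/names/no2012007308"
# Predicates come from the SKOS core namespace.
SKOS = "http://www.w3.org/2004/02/skos/core#"

# Each row is one (subject, predicate, object) triple.
# The label values are illustrative placeholders.
triples = [
    (CONCEPT, SKOS + "prefLabel", "U.S. FHWA Research Library"),
    (CONCEPT, SKOS + "altLabel", "FHWA Research Library"),
]


def objects_for(store, subject, predicate):
    """Return every object recorded for the given subject and predicate."""
    return [o for s, p, o in store if s == subject and p == predicate]


# Extending the description means appending another row with the same subject.
triples.append((CONCEPT, SKOS + "altLabel", "Federal Highway Administration Research Library"))

preferred = objects_for(triples, CONCEPT, SKOS + "prefLabel")
alternates = objects_for(triples, CONCEPT, SKOS + "altLabel")
print(preferred, alternates)
```

Keeping the identifier as the subject means that labels can be added, removed, or re-ranked without changing what the concept itself refers to.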

Exhibit 1-6. Some semantic relationships for the U.S. FHWA research library.

An emerging trend in library authority files and other types of authoritative lists of named entities (e.g., people, organizations, locations, events, and topics) is to publish them on the web using uniform resource identifiers (URIs). The most common URIs are uniform resource locators (URLs), or webpage addresses. URIs are unique identifiers on the web, so assigning URIs to authority records allows them to be referenced persistently on the web. The idea is to enable organizations to publish and reference named entities on the web. For example, FHWA could publish and maintain the authoritative list of FHWA agency names, programs, projects, and so forth. The authoritative list of FHWA names would then be available to DOTs as well as application developers, just as traffic and safety data are available for mashups.

Review of Terminology and Categorization Schemes

This section provides a brief overview of controlled vocabularies, which can be used to describe and categorize transportation-related content to make such content easier to find and use.

Ways to Categorize Content Collections

Working with digital content, users can either (1) browse for content using a file manager or (2) search for content using a local search engine. Browsing for content relies on how file directories and files themselves are organized and named. This method is often ad hoc, even when files are kept on a shared file store. Searching relies on how the internal search engine has been configured, including considerations such as what content has been indexed, whether indexing is full-text or metadata-driven, how search results are presented, and whether or not search refinements can be made to the query or the results.

File Directory Methods

The most common way to organize content is to put it in a physical directory. This is similar to the single access method used with paper files, where content items are filed in folders in filing cabinets. Everyone who has a PC has to decide how to set up file manager directories and folders. When there are shared network drives, a standard method of naming directories and folders is usually established. Folders are organized differently depending on the business activity. Examples are as follows:

• Records management is based on a record retention schedule. These schedules are typically set up by business functions (e.g., accounting, administration, environment, finance, and human resources), then by content type (e.g., annual report, best practice, correspondence, datasheet, handbook, and form), and then by date.
• Project management is based on a work breakdown structure (WBS), usually by technical discipline, subdivided by task, and then chronologically.
• Administrative files are often organized chronologically by date or in alphabetical order (i.e., A-Z).

Metadata Description Methods

In addition to putting content in a file directory, it is helpful to associate descriptive metadata with the content item. Metadata supports content retrieval for authors and content managers so they can

• Reliably find whether a content item exists,
• Determine ownership of a content item and whether it can be reused,
• Enable alerts to new content or subscription to a pre-defined query, and
• Keep content items current, accurate, and in compliance with regulations.
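The retrieval tasks listed above can be sketched as queries over descriptive metadata records. The following is a minimal illustration under stated assumptions: the field names, the sample records, and the two helper functions are all hypothetical and are not drawn from any DOT system.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ContentItem:
    """Descriptive metadata for one content item (field names are hypothetical)."""
    identifier: str
    title: str
    owner: str          # organizational unit responsible for the item
    content_type: str   # e.g. "handbook", "datasheet", "form"
    reusable: bool      # whether the item may be reused elsewhere
    review_due: date    # when the item should next be checked for currency


# Invented sample records, for illustration only.
CATALOG = [
    ContentItem("doc-001", "Bridge Inspection Handbook", "Structures", "handbook", True, date(2013, 6, 30)),
    ContentItem("doc-002", "Pavement Datasheet", "Materials", "datasheet", False, date(2014, 1, 15)),
    ContentItem("doc-003", "Records Retention Form", "Administration", "form", True, date(2012, 12, 1)),
]


def find_items(catalog, **criteria):
    """Return the items whose metadata equals every given property value."""
    return [
        item for item in catalog
        if all(getattr(item, name) == value for name, value in criteria.items())
    ]


def due_for_review(catalog, today):
    """Return identifiers of items whose review date has been reached."""
    return [item.identifier for item in catalog if item.review_due <= today]


reusable_handbooks = find_items(CATALOG, content_type="handbook", reusable=True)
overdue = due_for_review(CATALOG, date(2013, 7, 1))
print([item.title for item in reusable_handbooks], overdue)
```

Filtering on property values in this way is only possible when the values themselves are consistent, which is the motivation for the controlled vocabularies discussed in this part of the report.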
Metadata supports content publishing and general use such as the following:

• Faceted search based on metadata properties
• Search optimization
• Dynamic content delivery based on standard categorization
• Content reuse in multiple distribution channels (e.g., web, mobile, and Really Simple Syndication [RSS] alerts)
• Content reuse in FAQs (Frequently Asked Questions) on specific topics and other categories
• Orienting those searching on public websites (even when they land on a page 15 layers deep)
• Ensuring consistent values for analytics across channels.

Exhibit 1-7 presents various types of semantic schemes along a continuum based on the types of relationships that characterize the scheme. Generally, the schemes are arrayed from simple to complex in terms of the difficulty of establishing these types of relationships:

• Synonym Ring. A synonym ring is a set of words or phrases that can be used interchangeably for searching (e.g., fringe parking and park and ride).
• Controlled Vocabulary. A controlled vocabulary is a list of preferred terms (e.g., a pick-list in a data entry form).
• Taxonomy. A taxonomy is a system for identifying and naming things and classifying them according to a set of rules (e.g., a biological taxonomy, or most shopping websites, which aim to arrange products according to a set of rules).
• Classification Scheme. A classification scheme is an arrangement of knowledge that is usually enumerated but does not follow taxonomy rules (e.g., the Dewey Decimal System).

• Thesaurus. A thesaurus is a tool that controls synonyms and identifies the semantic relationships among terms (e.g., the Transportation Research Thesaurus [TRT]).
• Ontology. An ontology is a faceted taxonomy, but uses richer semantic relationships among terms and attributes, and strict specification rules.

Exhibit 1-7. Semantic schemes: simple to complex. (After: Amy Warner, Metadata and Taxonomies for a More Flexible Information Architecture.)

Relevant Transportation and Related Terminology Resources

This section discusses authoritative transportation-related terminology resources. The resources were found primarily by organic web searching. No libraries or specialized resources were consulted. Thus the resources listed are not exhaustive and are intended to be representative. The research team considers federal, state, and some industry association sources to be authoritative and has not included resources whose provenance could not be verified.

Transportation-Related Glossaries

A glossary is an alphabetical list of terms in a specific subject area. Dictionaries usually have a broader scope. In the context of this project, glossary and dictionary mean the same thing. Glossaries typically include a definition for each term entry. Glossaries are often created to support the function of an organizational unit, a project, a policy initiative, compliance with a legislative mandate, or some other specific purpose. As such, glossaries are useful resources when building a terminology scheme for a subject area. They represent important concepts in the domain. Because glossaries have definitions, they are helpful in understanding the nuances that may be associated with certain terms and phrases in a discipline. Glossaries can also be a useful source of synonyms and quasi-synonyms; abbreviations, initialisms, and acronyms; and other term variants.

Exhibit 1-8 lists authoritative vocabularies related to transportation in general and air, ground, rail, and water transportation more specifically. Exhibit 1-9 lists thesauri that are not specific to transportation.

Transportation-Related Thesauri

A thesaurus is a controlled vocabulary. A thesaurus that conforms to the Z39.19 standard includes equivalent, hierarchical, and associative relationships among terms and is intended

to support the indexing and retrieval of documents. Several controlled vocabularies are currently in use for transportation information. The TRT is used for indexing the Transportation Research Information Services (TRIS) Database. The TRT originated in research sponsored by the NCHRP in the 1990s and is designed to cover all modes and aspects of transportation. It is managed by TRB with input and content development done by the TRT Subcommittee. Since 2010, TRB has assumed responsibility for the content of and further development of the TRT. (Readers can refer to Principles for the Organization of the TRT and TRT Development and Maintenance Procedures for more information about the TRT.[10])

[10] Available at http://ntl.bts.gov/tools/trt/.

Exhibit 1-8. Transportation-related glossaries.

Category | Title | Source | URL
General | Bureau of Transportation Statistics Dictionary | USDOT. RITA. Bureau of Transportation Statistics. | http://www.bts.gov/dictionary/
 | Glossary of Terms | USDOT. RITA. Intelligent Transportation Systems. | http://www.standards.its.dot.gov/terms.asp
 | MnDOT Glossary | Mn DOT. | http://www.dot.state.mn.us/information/glossary.html
 | Transportation Glossary of Terms and Acronyms | Florida DOT. Office of Policy Planning. | http://www.dot.state.fl.us/planning/glossary/glossary.pdf
Air | Air Cargo Glossary | Cargo Airlines | http://www.cal.co.il/glossary
 | Air Traffic Management Glossary of Terms | USDOT. FAA | http://www.fly.faa.gov/Products/Glossary_of_Terms/glossary_of_terms.html
 | Glossary of Airport Acronyms | USDOT. FAA | http://www.faa.gov/airports/resources/acronyms/
 | Glossary of Civil Aviation and Air Travel Terminology | airodyssey.net (blog) | http://airodyssey.net/reference/glossary
Freight | Glossary of the American Trucking Industry | Wikipedia | http://en.wikipedia.org/wiki/Glossary_of_trucking_industry_terms_in_the_United_States
 | Planning Glossary | USDOT. FHWA. Office of Planning, Environment, & Realty | http://www.fhwa.dot.gov/planning/glossary/
 | Truck and Bus Glossary | University of Michigan Transportation Research Institute | http://www.umtri.umich.edu/divisionPage.php?pageID=201
 | TWNA Glossary - Trucking terms | TWNA.org | http://www.twna.org/trucking_terms.htm
Rail | Glossary of North American Railway Terms | Wikipedia | http://en.wikipedia.org/wiki/Glossary_of_North_American_railroad_terminology
 | Intermodal Glossary | Union Pacific | http://www.uprr.com/customers/intermodal/integlos.shtml
 | Railroading Glossary | Kalmbach Publishing Co | http://trn.trains.com/glossary
Maritime | Glossary of Maritime Terms | American Association of Port Authorities | http://www.aapa-ports.org/industry/content.cfm?itemnumber=1077&navitemnumber=545
 | Glossary of Shipping Terms | USDOT. Maritime Administration | http://www.marad.dot.gov/documents/Glossary_final.pdf

Exhibit 1-9. Non-transportation-specific thesauri.

Title | Source | URL
Gale Transportation Thesaurus | Cengage | Not publicly available.
Intellisophic Urban Transportation Systems | Intellisophic | Not publicly available.
NASA Thesaurus | NASA | http://www.sti.nasa.gov/sti-tools/
Thesaurus for the Australian Transport Index | ARRB Group | http://www.arrb.com.au/admin/file/content2/c7/Aust%20Transpt%20Index%202010.pdf

Library of Congress Subject Headings

The Library of Congress Subject Headings (LCSH) is the standard subject vocabulary used by university and research libraries in the United States to categorize materials in collections. The LCSH was designed as a pre-coordinated list of subject headings for subject indexing in library catalogs. Although originally intended for content published in books and for library card catalogs, the LCSH has evolved and been adapted to more contemporary models of subject indexing and information search and retrieval. The basis for determining entries is still literary warrant (i.e., based on the emergence of concepts and labels in publications). LCSH headings consist of topics or proper names (e.g., locations, events, and personal or organization names), which can be subdivided by genre, location, and/or time period. The entries include references to preferred labels and some cross-references showing relationships; some parent and child thesaurus relationships have been added where appropriate. The LCSH includes terminology directly and indirectly relevant to transportation.

Recently the Library of Congress (LC) implemented a linked data service for authorities and vocabularies.[11] The LC linked data service includes the LCSH, as well as the Name Authority File (i.e., entries for the names of people and organizations) and other miscellaneous vocabularies. The linked data service is intended to enable people and machines to access LC authority data.

[11] http://id.loc.gov/.

FGDC Topic Categories

FGDC (ISO 19115) Topic Categories, a set of 19 high-level subject categories, provides a standardized way to quickly sort and access thematic information. The FGDC Topic Categories are as follows:

• Farming
• Biota
• Boundaries
• Climatology/meteorology/atmosphere
• Economy
• Elevation
• Environment
• Geoscientific information
• Health
• Imagery/base maps/earth cover
• Intelligence/military
• Inland waters
• Location
• Oceans
• Planning/cadastre
• Society
• Structure
• Transportation
• Utilities/communication

Developing a Common Categorization Scheme for Transportation

While having a single, generally accepted scheme for categorizing and organizing transportation information might facilitate information management, no such scheme is likely to emerge spontaneously from the past work summarized here. A focused effort would likely be required, either led by an authoritative body or self-organized from the information-user community. The following paragraphs discuss some of the principal issues that such an effort would have to address.

Requirements

The TRT was originally developed to support bibliographic information retrieval, but the transportation information management landscape has changed since the 1990s. Today most transportation information is “born digital.” There is an enormous volume of data about transportation assets, particularly related to their location and use over time. Although asset-based information has not yet been fully integrated, we can expect further integration over time so that eventually there will be an up-to-date digital representation of every asset throughout its life, as well as a longitudinal record that can present a series of asset information snapshots over time. We should expect that reports making use of data and visualizations will be linked directly to those data sources. The two big future trends are (1) more data and (2) more integration. The key requirement to support these trends is metadata that will enable integration. That metadata will require common topical categories, as well as common sets of proper names for organizations, persons, computer programs, places, and other relevant named entities.

Flexibility

The transportation information categorization scheme must be applicable to all types of content in all presentation formats. Transportation information takes many forms and exists in many formats. Specifically, the categorization scheme must be readily applicable to database schemas, dataset metadata, document properties, and static and time-based visualizations, as well as heterogeneous content forms that include one or more of these content types or need to be synchronized across types (e.g., the dataset associated with a visualization in a document). The scheme needs to be applicable to a collection, an item, and a component, and these should be able to inherit properties from the more general to the more specific level. It should not matter where or how the container holding the values associated with a particular item is implemented; it may be embedded within the object or stored as an external database record referenced to the object using an identifier.
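The inheritance of properties from collection to item to component can be sketched as a layered lookup, in which the most specific level that defines a property wins. The sketch below uses Python's standard `collections.ChainMap`; the property names and values are hypothetical, and the choice of a layered dictionary is an assumption made for illustration, not a mechanism prescribed by the report.

```python
from collections import ChainMap

# Hypothetical metadata at three levels, from general to specific.
collection = {"publisher": "Example DOT", "subject": "Pavement condition", "rights": "Public"}
item = {"title": "2012 pavement survey dataset", "format": "CSV"}
component = {"title": "District 4 extract", "subject": "Pavement condition, District 4"}


def effective_metadata(*levels_most_specific_first):
    """Merge metadata levels so that more specific levels override general ones."""
    # ChainMap searches its mappings in order and returns the first match,
    # so the most specific level must come first.
    return dict(ChainMap(*levels_most_specific_first))


component_view = effective_metadata(component, item, collection)
item_view = effective_metadata(item, collection)
print(component_view)
```

Here the component overrides the title and subject, inherits the format from its item, and inherits the publisher and rights from the collection. The same lookup works whether each level is embedded in the object or fetched from an external record by identifier.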

Semantics

The transportation information categorization scheme should be based on standards and good practices for representing semantics between and among categories, including (1) ANSI/NISO Z39.19-2005, Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies,[12] and (2) SKOS, the W3C specification on how to represent knowledge organization systems using RDF.

The scheme needs to support hierarchical, equivalent, and associative relationships. Hierarchical means broader and narrower concepts. The scheme needs to handle multiple parent (broader) concepts, called polyhierarchy. Equivalent means synonyms, quasi-synonyms, near synonyms, abbreviations, acronyms, and other alternate labels. The scheme needs to support identifying regional variations where one label might be preferred in one region while another label would be preferred in another region (e.g., Park and Ride vs. Fringe Parking). Associative relationships are related concepts, called Related Terms (RTs) in the TRT. OWL, the Web Ontology Language, may also be used to specialize associative relationships.[13]

Currency and Governance

A transportation information categorization scheme must be frequently updated to reflect the terminology needs of DOTs and remain current in a rapidly evolving area that overlaps many related disciplines and locales. However, it is also important that the scheme is stable and is not changed just in reaction to immediate events. A governance process is needed that (1) defines roles and responsibilities, (2) identifies appropriate policies and procedures, and (3) provides a communication plan to promulgate the scheme and governance processes and to communicate changes to all parties.
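The three relationship types named under Semantics, together with regional label preferences, can be sketched as a small concept record. This is a minimal illustration under stated assumptions: the `ex:` identifiers, the parent and related concepts, and the region codes are hypothetical, and only the "Fringe parking" and "Park and ride" labels come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """One concept with hierarchical, equivalent, and associative relationships."""
    identifier: str
    pref_label: str
    alt_labels: list = field(default_factory=list)       # equivalent labels
    regional_pref: dict = field(default_factory=dict)    # region code -> preferred label
    broader: list = field(default_factory=list)          # more than one parent = polyhierarchy
    related: list = field(default_factory=list)          # associative links

    def label_for(self, region=None):
        """Return the label preferred in a region, or the default preferred label."""
        return self.regional_pref.get(region, self.pref_label)

    def matches(self, text):
        """True if the text equals any known label, ignoring case."""
        labels = [self.pref_label, *self.alt_labels, *self.regional_pref.values()]
        return text.casefold() in {label.casefold() for label in labels}


# Hypothetical identifiers and relationships, for illustration only.
fringe_parking = Concept(
    identifier="ex:fringe-parking",
    pref_label="Fringe parking",
    alt_labels=["Park and ride"],
    regional_pref={"region-b": "Park and ride"},
    broader=["ex:parking-facilities", "ex:transit-access"],
    related=["ex:carpools"],
)

print(fringe_parking.label_for("region-a"), "|", fringe_parking.label_for("region-b"))
```

Because both labels resolve to the same identifier, content tagged with either one can be retrieved together, while each region still sees its own preferred wording.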
Localization

A transportation information categorization scheme must support localization so that a DOT can choose which of several alternative labels available for a concept to use, but still be able to align the label with the core concept. The relationship among alternative labels must be defined so as to enable retrieval of federated search results from across different agencies, which may use different alternative labels. For example, although “Fringe parking” is the preferred LCSH heading (and the preferred TRT category), “Park and ride” or some variation of this label is frequently used instead. Exhibit 1-10 shows alternate labels for “Fringe parking” based on the LCSH linked data service.[14] The TRT also has the category “Fringe parking,” which is represented by the TRT unique notational code identifier “Brddf”. Localization can also take the form of subsets of categories to support specific user communities, however those communities may be defined (e.g., by locale, function, expertise, project, or any other subdivision).

[12] http://www.niso.org/apps/group_public/download.php/6487/Guidelines%20for%20the%20Construction,%20Format,%20and%20Management%20of%20Monolingu.
[13] http://www.w3.org/TR/owl-ref/.
[14] http://id.loc.gov/authorities/subjects/.

Ease of Use

A transportation information categorization scheme must be easy to use for all stakeholders. The report classifies DOT information managers as (1) information professionals, (2) IT professionals, or (3) data managers. Additional stakeholders include DOT GIS offices and all levels of management, transportation professionals in local and regional transportation agencies, federal transportation-related agencies, transportation research organizations, and academic institutions. Those doing business with DOTs and other transportation-related agencies,

NGOs, and citizens who have an interest in, or a need or desire to know about, transportation information should be added to this list of stakeholders.

Exhibit 1-10. Alternate labels for the concept Fringe parking.

Although DOT information managers may be the primary audience, all stakeholders need to be able to understand and use the categorization scheme to some extent. The expectation on the web is that categorization schemes need to be understandable without any training; they need to be as easy to use as Google. This does not mean that training, experience, and subject matter expertise are not important and valuable for obtaining more effective use of a categorization scheme, but it does mean that, on the surface, the scheme needs to be usable “out of the box.” The usability of a categorization scheme is usually measured by (1) discreteness of broad categories, (2) consistency in indexing information, and (3) consistency in finding information.

Card Sorting. The discreteness of categories can be measured by closed card sorting. In this test, commonly used terms selected from query logs and analytics are sorted into broad categories by representative users. The sorting results are compared to a baseline to measure consistency. This provides an independent assessment of how distinct the scheme’s broad categories are perceived to be. Seventy to eighty percent consistency is considered a high usability validation for a categorization scheme. Card sorting is usually done using online tools with iterative sets of 15 to 20 participants sorting up to 50 terms. There is some debate in the usability community about how many participants are required to provide meaningful results and at what point adding participants does not add meaningfully to the results.[15] The research team’s opinion is that 15 to 20 participants provide meaningful results.

[15] See for example: Jakob Nielsen. “How Many Test Users in a Usability Study?” Alertbox (June 4, 2012) http://www.useit.com/alertbox/number-of-test-users.html.

Indexing Consistency. Inter-indexer consistency can be measured by having representative users index a set of representative types of information (data, visualizations, and documents). The values used to index each information item are compared to the baseline to measure

completeness and consistency. Alternative values are identified, assessed, and applied to the scoring scheme as appropriate. This provides an independent assessment of the categorization scheme’s usability for indexing. Seventy to eighty percent consistency is considered a high validation. Indexing is usually done as a paper exercise with iterative sets of 15 to 20 participants indexing 5 to 10 information items.

Findability. Findability can be measured by having representative users look for, or describe how they would look for, a set of representative content items. The categories and subcategories used to find content items are compared to a baseline to measure consistency. Alternative category paths are identified, assessed, and applied to the scoring scheme as appropriate. This provides an independent assessment of the categorization scheme’s usability for finding information. Seventy to eighty percent consistency is considered a high validation. Finding is done as a computer- or paper-based exercise with iterative sets of 15 to 20 participants searching for 5 to 10 information items.

Reengineering the TRT

The previous section outlined the key requirements for a transportation information categorization scheme. Exhibit 1-11 shows scoring by the research team of the strengths and weaknesses of the TRT based on these requirements. The scale used is 1 to 5, with 1 being a major weakness and 5 being a major strength. “0” denotes not applicable. The major strengths of the TRT are in the areas of Semantics and Currency. The major weaknesses of the TRT are in the areas of Flexibility, Localization, and Ease of Use. Overall, the TRT is a well-designed thesaurus, but may not be suitable for digital information in the future.

TRT Strengths

The major strengths of the TRT are in the areas of Semantics and Currency.
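The category scores in Exhibit 1-11 appear to be simple means of the item scores within each requirement. The sketch below transcribes the item scores from the exhibit and recomputes the category values. Counting items scored 0 in the mean is an assumption, adopted because it reproduces the printed category scores. The sketch does not attempt to reproduce the overall score, since the source does not state how that figure was aggregated.

```python
# Item scores transcribed from Exhibit 1-11, grouped by requirement.
ITEM_SCORES = {
    "Flexibility": [1, 2, 4, 3, 2, 3, 0],
    "Semantics": [4, 0, 5, 5, 5, 0, 5, 0],
    "Currency and Governance": [4, 5, 4],
    "Localization": [0, 0],
    "Ease of use": [3, 2, 2],
}

# Assumption: zeros are counted as scores rather than excluded from the mean.
category_scores = {
    name: round(sum(scores) / len(scores), 2)
    for name, scores in ITEM_SCORES.items()
}

# Rank requirements from strongest to weakest.
ranked = sorted(category_scores, key=category_scores.get, reverse=True)

for name in ranked:
    print(f"{name}: {category_scores[name]:.2f}")
```

Under this reading, the two highest-ranked requirements are Currency and Governance and Semantics, and the three lowest are Ease of use, Flexibility, and Localization, which matches the strengths and weaknesses named in the text.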
The research team considers Semantics a strength primarily because the TRT follows the Z39.19 NISO standard for thesaurus construction. However, the TRT does not follow, use, or support SKOS or OWL.

Strength 1: Semantics. The TRT uses RT both to associate siblings (e.g., Private Trucking and Tank Trucking, Trucking and Bus Transportation, and so forth) and to associate terms in different facets. Although the Z39.19 standard discusses several RT relationships between terms in the same hierarchy (overlapping sibling terms, mutually exclusive sibling terms, and derivational relationships), these associations are problematic to model in other standards. For example, in SKOS, one cannot relate concept A to B if concept A is already broader than B. Also, in the interest of usability, it is often the editorial practice not to make associative relationships to terms that are siblings, or even to terms that are nearby in the same tree (e.g., child to grandparent). Overuse of generic associative relationships diminishes their usefulness and value. Such overuse adds complexity to the syndetic[16] structure without necessarily adding any value.

[16] The cross-references and relationships between terms.

Strength 2: Currency. The research team considers Currency a TRT strength because of the well-defined and documented governance processes. Areas for improvement include expanding participation in the governance process by subject matter experts from the wider group of information management stakeholders, particularly to include data and GIS managers who can represent the needs and requirements of DOTs. The modes and formats of communication need to use the latest methods for receiving and routing term requests and for subscribing to and being sent notifications of all term request status changes. This (together with the recommendation for handling subsets and syndication of other authoritative sources discussed in the next section) will help improve the currency of the TRT.

Exhibit 1-11. TRT scored against requirements.

Requirement | TRT | Category
Flexibility | | 2.14
Applicable to all types of content | |
Database schemas | 1 |
Data set metadata | 2 |
Document properties | 4 |
Static visualizations | 3 |
Time-based visualizations | 2 |
Heterogeneous | 3 |
Synchronize across types | 0 |
Semantics | | 3.00
Support standards | |
Z39.19 | 4 |
SKOS | 0 |
Relationships | |
Hierarchical | |
Parent/Child (BT/NT) | 5 |
Polyhierarchy | 5 |
Equivalent | |
Synonyms (UF) | 5 |
Alternates | 0 |
Associative | |
RTs | 5 |
OWL support | 0 |
Currency and Governance | | 4.33
Roles and responsibilities | 4 |
Processes | 5 |
Communications | 4 |
Localization | | 0.00
Regional variations | 0 |
Other subsets | 0 |
Ease of use | | 2.33
Broad category discreteness | 3 |
Indexing information | 2 |
Finding information | 2 |
OVERALL SCORE | | 2.65

TRT Weaknesses

The major weaknesses of the TRT are in the areas of Flexibility, Localization, and Ease of Use.

Weakness 1: Flexibility. The TRT has been designed to be used primarily on document-based content, especially technical reports. For example, as illustrated in Exhibit 1-12, the Information organization facet of the TRT includes a set of terms for types of Documents, but not for datasets or data-based visualizations. GIS is a category under Information management > Information systems. In this example, the TRT mixes information organization functions with

Part 1—Terminology and Categorization Standardization 23 information forms. The current best practice is to separate categories that are forms from cate­ gories that are functions into separate broad divisions, facets, or, even into separate schemes. An important information management requirement is to synchronize data with its repre­ sentation in documents. While synchronization itself requires a system of identifiers that will be understandable to people and also be interpretable by systems, a common categorization scheme to describe both datasets and documents is crucial. Weakness 2: Localization. The TRT offers no way to identify alternate terms and use them instead of the TRT preferred form. The Z39.19 NISO standard for thesaurus construction pro­ vides guidance on how and when to choose one form of a term or another as the preferred form based on criteria related to usage, spelling, abbreviations, jargon, trade names, popular names, loan words, and proper names. The reality is that these criteria are somewhat subjective and variations may be preferred for similar subjective reasons. Therefore, as many variations as pos­ sible must be collected in a categorization scheme, and it must be possible for users to apply their own criteria to choose which variation will be the preferred term in their local implementation. As discussed above, SKOS solves this problem with the representation of a concept as a unique identifier and terms being associated with that identifier via a type of relationship. Localization is a key functional requirement for the categorization scheme. Weakness 3: Ease of Use. The TRT has been designed for “indexers, content managers and librarians in the transportation community,”17 and not the broader group of transportation Exhibit 1-12. TRT information organization facet showing selected subcategories. 17TRT “About” webpage (http://ntl.bts.gov/tools/trt/)

24 Improving Management of Transportation Information stakeholders. The TRT is intended to be comprehensive and exhaustive. It is very large, but it does not include any proper names for geographic locations, DOTs, or federal transportation programs or provide links to authoritative sources for such relevant named entities. The cur- rent expectation for ease of use is that a categorization scheme must be intuitive and “natural” to use for tagging and finding content. Practically, the TRT needs to provide methods to create more usable subsets to support specific applications, while maintaining a master authoritative resource. Recommendation to Address Weaknesses: Microthesaurus The Z39.19 standard for thesaurus construction briefly discusses the concept of a micro- thesaurus as a subset of a broader thesaurus created to be used in “specific indexing products.”18 According to the standard, microthesaurus requirements include the following: • There should be a defined scope for a specialization of the broader thesaurus. • Terms and relationships extracted from the broader thesaurus should have integrity (i.e., there should be no orphan terms). • Additional specific descriptors can be added, but should be mapped to the structure of the broader thesaurus. Providing a way to enable DOTs and other TRT users to generate and maintain micro- thesaurus subsets would address the flexibility, localization, and usability weaknesses discussed. There are two broad models for implementing a microthesaurus—centralized and decentral- ized. In the centralized model, a new term property and set of controlled values is defined to be used to code any thesaurus term so that it can be included in a microthesaurus. In this model, the so-called broader thesaurus is, in effect, a microthesaurus so each term must be identi- fied. To define a microthesaurus, preferred and variant terms would be individually selected. 
In this model, term relationships generally remain unchanged from the broader thesaurus. Exceptions are automatically handled so that (1) broken hierarchical relationships will be collapsed, (2) orphaned equivalent and associative term relationships will be ignored, and (3) relationships will be added for any new more specific terms. As shown in Exhibit 1-13, a thesaurus management tool that can provide distributed access is usually required to implement such a centralized service. Solid lines indicate tight integration; dotted lines indicate loose integration.

Exhibit 1-14 illustrates a decentralized service that would be decoupled (or only loosely coupled) to the broader thesaurus infrastructure. In this model, all or selected TRT terms would be downloaded to a microthesaurus site. That site would have its own thesaurus management system (tools and processes). No coding would be required to identify that a TRT term is to be included in the microthesaurus. Exceptions to relationships would be handled by editors so that (1) broken hierarchical relationships would be rationalized, (2) orphaned equivalent and associative relationships would be resolved, and (3) relationships would be added for any new more specific terms. If more specific terms are outside the scope of the TRT, then they would not be submitted as candidate terms. Instead, TRT would maintain a catalog or registry of microthesauri. It would also be valuable for TRT to maintain a catalog of external terminologies for proper names from authoritative sources such as the Library of Congress.

The decentralization is not only applicable to microthesaurus building, but also to distributing the work of identifying, building, and maintaining terminology subsets as a general vocabulary management model.

18 Z39.19-2003, p. 33.
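The automatic exception handling described for the centralized model can be sketched in a few lines. This is only an illustration under stated assumptions: the `Term` class, its field names, the `extract_microthesaurus` function, and the sample terms are all hypothetical and do not reflect how the TRT is actually stored or managed. The sketch covers rules (1) and (2); rule (3), adding new more specific terms, is left out.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Term:
    """One thesaurus term. Field names are illustrative assumptions."""
    term_id: str
    label: str
    broader: Optional[str] = None              # parent term_id, None for a top term
    related: Set[str] = field(default_factory=set)  # associative links (term_ids)


def extract_microthesaurus(terms: Dict[str, Term], selected: Set[str]) -> Dict[str, Term]:
    """Copy the selected terms, repairing relationships to unselected terms."""
    subset: Dict[str, Term] = {}
    for term_id in selected:
        term = terms[term_id]
        # (1) Collapse a broken hierarchy: climb until a selected ancestor is found.
        parent = term.broader
        while parent is not None and parent not in selected:
            parent = terms[parent].broader
        # (2) Ignore associative links whose target was not selected.
        related = term.related & selected
        subset[term_id] = Term(term.term_id, term.label, parent, related)
    return subset


# Hypothetical three-level hierarchy plus one associative link.
broader_thesaurus = {
    "t1": Term("t1", "Pavements"),
    "t2": Term("t2", "Pavement maintenance", broader="t1"),
    "t3": Term("t3", "Crack sealing", broader="t2", related={"t4"}),
    "t4": Term("t4", "Sealing compounds"),
}
micro = extract_microthesaurus(broader_thesaurus, {"t1", "t3"})
print(micro["t3"].broader)   # "t1": the unselected middle level is collapsed
print(micro["t3"].related)   # empty set: the orphaned associative link is dropped
```

In the decentralized model the same two situations would instead be flagged for an editor to rationalize or resolve by hand.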

Exhibit 1-13. Centralized microthesaurus/subset model.

Exhibit 1-14. Decentralized microthesaurus/subset model.

Community-Based Vocabulary Management Model

This section describes how a community-based transportation information terminology management model could work.

What Does a Community-Based Model Look Like?

A community-based model for transportation information terminology management delegates responsibility for creating and maintaining terms and groups of terms to distributed responsible organizations such as DOTs. The central office functions as the overall terminology editor and coordinates the distributed effort by
• Managing the terminology management environment;
• Identifying who will be responsible for what term subsets;
• Routing candidate term requests to distributed editors based on expertise, volume, or other criteria;
• Communicating editorial policies and answering editorial questions;
• Providing training to distributed editors;
• Communicating additions and changes to subscribing users; and
• Coordinating communication with the community at large.

Governance Model

Terminology governance models need to specify the (1) roles and responsibilities, (2) policies and procedures, and (3) plan for communicating with the community about the program as well as providing notifications about terminology additions and changes.

Roles and Responsibilities. Exhibit 1-15 provides an overview of the terminology governance roles and the relationships among them.

Exhibit 1-15. Terminology governance roles.

Governance Board. The governance board provides executive sponsorship, arbitrates disputes and disagreements, is responsible for long-term decision making, and endorses policies and procedures as needed.

The governance board provides leadership and final decision-making authority for the core team of terminology editors on matters related to the transportation information terminology

(hereafter referred to as TransIT). The role of the governance board is to define the vision of improving transportation information retrieval using XML tagging, metadata, and controlled vocabularies. The board should be limited to four or five members, representing the major stakeholder groups, to foster agile decision making. This governance board would arbitrate disputes and disagreements between stakeholders regarding usage of TransIT. Members would meet quarterly with the TransIT core team to review execution of strategic plans and to ratify recommended new policies and procedures.

Terminology Manager. The terminology manager provides day-to-day operational support for TransIT including
• Routing change requests to distributed editors or directly handling them,
• Responding to requests regarding content classification,
• Enforcing the terminology governance policies and procedures,
• Reviewing requests for changes that may affect the overall terminology structure, and
• Ensuring that procedures for tagging content are effective.

The terminology manager will ensure execution of the daily activities needed to support TransIT governance and strategy. The manager will be a primary owner of the terminology management tool to document, edit, and publish the TransIT. The manager also needs to understand the context of key terms, term relationships, and vocabularies necessary for users to retrieve information from transportation information management systems. The manager will advise on how to consistently and appropriately use TransIT to populate metadata in transportation information management systems.

Core Team. The core team represents various stakeholders. The team will
• Be responsible for creating, updating, and cataloging terminology subsets that have been assigned for their stewardship;
• Keep the terminology dynamic, updated, and relevant; and
• Understand terminology editorial principles.
These individuals will be the primary point of contact for the terminology manager on terminology issues and should have a good understanding of taxonomy and metadata principles as well as the overall vision of the governance board.

Content Owners. These individuals determine appropriate content as well as access control, content provisioning, and compliance with regulations. Content owners advise the core team and terminology manager on initiatives to improve accuracy and usability of metadata and vocabularies when applied to various types of content.

Policies and Procedures

The TransIT terminology should be governed by a documented set of policies and procedures. The current TRT procedures could be used as a model and updated to reflect a community-based vocabulary management model. At a high level, these principles can be summarized as follows:
• All primary actions related to editing the terminology should be driven by policies and procedures;
• To ensure sufficient review and communication prior to adding or editing a concept, a change request should be submitted;
• All change requests should be logged and acknowledged by the terminology manager; and
• An assessment of the benefits and effect of change requests should be completed and communicated quickly.

The foundation for any terminology subset service is to implement a persistent method to identify TRT terms. The current system of unique notational codes is not an adequate term identifier system. The TRT should adopt a Uniform Resource Identifier scheme.

Communication Plan

As the TransIT terminology continues to evolve, communications and outreach to more constituents will be imperative for long-term success. It will be important for the terminology governance board to ensure consistent communications are released that
• Explain the process to request a change;
• Explain the actions of the terminology governance board and associated roles, including overall goals, terminology and metadata strategy, and decision-making process; and
• Present the value of the terminology in a meaningful and concise manner.

As a first step, the TransIT terminology governance strategy should be made publicly available to all current and future stakeholders. This will enable all interested parties to understand the immediate scope, efforts, and decision-making constraints of the taxonomy. In short, by making clearly visible the processes governing the TransIT, the team will be encouraging a greater understanding of its efforts for all transportation information management constituents. Some of the greatest benefits of this approach should be an increase in valuable feedback, more complete requests, and up-to-date terminology.

Practices from Other Fields

This section discusses good practices from other disciplines and applications that can improve the management of transportation information. Each section provides a short overview of the discipline or application, followed by some specific practices applicable to transportation information management.

Enterprise Content Management (ECM)

ECM includes processes and applications to create and manage all types of content across an organization.
Data and content management are converging. The practical implications are that data management policies and processes are being applied to more information types.

A best practice relevant to transportation information management is the emergence of faceted taxonomy for categorizing and retrieving information with ECM applications. A taxonomy is a system for identifying and naming things and classifying them according to a set of rules. A faceted taxonomy is a set of mutually exclusive taxonomy divisions that can be used to classify a content item. Faceted taxonomies are frequently used in online shopping websites to help customers find products to purchase. For example, shopping for shoes online would typically include facets for gender, brand, style, material, and color. The same principles used in designing a shopping website are being applied to classifying content in ECM applications. For transportation applications, the taxonomy facets might be content type, location, asset type, asset material, and asset age. For ECM, some taxonomy facets emerging as common practice include
• Document type;
• Business function;
• Organizational unit;
• Location;
• Time period; and
• Topic.
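How facets support retrieval can be shown with a small sketch. It assumes each content item carries one value per facet; the item records, facet names, and helper functions below are invented for illustration and are not drawn from any ECM product.

```python
from typing import Dict, List

# Hypothetical content items, each tagged with one value per facet.
ITEMS: List[Dict[str, str]] = [
    {"title": "Bridge deck inspection summary", "content_type": "Technical Report",
     "location": "Richmond, VA", "asset_type": "Bridge"},
    {"title": "Tunnel lighting drawings", "content_type": "Drawing",
     "location": "Richmond, VA", "asset_type": "Tunnel"},
    {"title": "Bridge joint replacement drawings", "content_type": "Drawing",
     "location": "Vancouver, WA", "asset_type": "Bridge"},
]


def filter_by_facets(items: List[Dict[str, str]], **selected: str) -> List[Dict[str, str]]:
    """Keep the items that match every selected facet value."""
    return [item for item in items
            if all(item.get(facet) == value for facet, value in selected.items())]


def facet_counts(items: List[Dict[str, str]], facet: str) -> Dict[str, int]:
    """Count items per value of one facet, as a browse interface might display."""
    counts: Dict[str, int] = {}
    for item in items:
        value = item.get(facet)
        if value is not None:
            counts[value] = counts.get(value, 0) + 1
    return counts


# Narrow by two independent facets, then show what remains under a third.
drawings_of_bridges = filter_by_facets(ITEMS, asset_type="Bridge", content_type="Drawing")
print([item["title"] for item in drawings_of_bridges])
print(facet_counts(ITEMS, "content_type"))
```

Because the facets are independent of one another, users can combine them in any order, which is the behavior familiar from shopping sites.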

Exhibit 1-16 describes facets generally applicable to all types of content and some transportation examples.

Although it seems as if there would be standard lists of values for types of resources, there are not. Dublin Core includes a type vocabulary as part of the Dublin Core® Metadata Initiative (DCMI) Terms19; however, the list of values (i.e., Collection, Dataset, Event, Image, Interactive Resource, Moving Image, Physical Object, Service, Software, Sound, Still Image, and Text) is not appropriate for ECM. A resource type vocabulary is almost always uniquely developed for each organization. The lists may be similar from one organization to another, but one agency’s “whitepaper” may be another agency’s “technical report.” Resource types may be a component of a records management schedule as discussed below.

A records retention schedule is the listing of the types of organizational records, with guidelines for their transfer or disposal, including the length of time records need to be retained. Most states have regulations about the retention and disposition of state and local agency records, including those from DOTs. Records retention schedules are usually organized by function and then by type.

Records Management

Records management is the set of processes related to managing an organization’s information resources over their lifecycle from the time they are created until they are disposed of. While ECM includes creating, managing, and using the resources to support organizational activities, records management includes collecting, classifying, storing, and destroying or preserving the resources. DOTs need to manage their records to
• Comply with state regulations;
• Preserve an historical record; and
• Support the long-term maintenance of assets and applied research.
Exhibit 1-16. Common taxonomy facets used in ECM with transportation examples.

Facet Name | Description | Transportation Examples
Type | The type of resource | Technical Report, Data Set, Whitepaper, Drawing
Function | The purpose of the content and any business activity or function that the content is about or related to | Procurement, Human Resources, Engineering, Policy Making
Organization | The organizational unit that the content is about or related to | FHWA; TRB; VDOT
Location | The geographic location or facility that the content is about or related to | Vancouver, WA; San Antonio Federal Complex; 800 E Leigh St, Richmond, VA
Coverage | The period when the content is effective or that it refers to | January 1, 2010–December 31, 2010; September 30, 2009, 5:00 pm; Twentieth Century
Subject | Other themes that the content is about or related to | Maintenance Practices, Domestic Transportation, Liability

19 http://dublincore.org/documents/2012/06/14/dcmi-terms/?v=terms.

DOTs as public agencies need to comply with state requirements to maintain records according to an appropriate retention schedule and also need to be able to make their records available on request from the public as required by the Freedom of Information Act (FOIA). DOTs also have a broader responsibility to preserve records with historical value and need to apply appropriate criteria to select those records, which should be preserved even beyond the required retention period. Finally, DOTs have to maintain transportation assets that may have long lifecycles (e.g., bridges and tunnels). This includes maintenance, repair, and replacement of transportation assets, as well as gathering information over the whole lifecycle that can be used for applied research to inform transportation engineering decisions and good practices over time.

Content Framework

An information management framework, referred to hereafter as the information management process, is provided in Exhibit 1-17.

Exhibit 1-17. Information management framework. [Figure: Producers CAPTURE (Create, Collect); Curators ADMINISTER (Manage, Store, Organize, Archive, Preserve, Deliver, Disseminate, Distribute); Users RETRIEVE (Find, Search).]

Digital Preservation

Most resources today are born digital: they are created, published, and used electronically. The electronic version of the content is the resource of record. This may be a PDF file for a technical report, a set of database layers for a GIS visualization, or a CAD file format for a blueprint. All of these are electronic files that require specialized software to be rendered human readable. Simply preserving the bits will not ensure that the files will be readable in the future. Although an electronic version of a technical report may be retrievable based on the words and phrases

that occur in the file, GIS, CAD, and many other types of visual electronic content do not contain descriptions that can be used to retrieve them. Descriptive information, called metadata, needs to be created and linked to the file so information can be stored and retrieved.

Because most DOT resources are being produced electronically, digital preservation is becoming the primary focus of records management. Digital preservation, as with other forms of records management, requires processes to identify, collect, classify, store, and preserve digital resources.

Content Harvesting

With most resources being produced electronically, there is the opportunity to systematically collect resources using automated methods such as spiders and robots that are the same as those used by web search engines.

Document Management

Document management is a set of processes and applications for managing and storing electronic documents. Generally, these sorts of applications manage document-like content but not databases or visualizations generated from data (e.g., GIS or CAD). However, information output from a database such as a spreadsheet, table, or other form of columnar report no longer dynamically tied to the source database is a “document” for the purposes of document management. Similarly, a static map, diagram, or blueprint output from a GIS or CAD system is a “document.” For DOTs, many documents are generated as part of the administration, project management, performance, research, inventory, and other business functions.

File Directory Structures

When documents are uploaded to a document management system, they need to be put in a shared file directory structure. Thus an important part of configuring such a system is to design a specification for how the directory structure will be set up and maintained.
The goal of a file directory structure scheme is to organize the directory and name files so that it will be easy to find documents based on the directory structure and sort order of the file names. Thus file directory structures should provide an intuitive method that will enable (1) consistent filing of documents in the appropriate place in the folder structure and (2) a reliable method for finding documents in the system.

Personal Working Documents. According to Hicks, “the most common criteria for naming directories are the purpose or function (85%), the name of the project (55%) and the date (20%).” (p. 23:36) Working documents are commonly organized by project, then by context (such as function), then by year, and then by type. Although these components are commonly found, their order varies depending on the function that owns the file store. Hicks studied personal electronic files, which are likely to be found on a personal or shared file store. Engineers do not typically set up document management systems, so the particular scheme described by Hicks is not necessarily the one likely to be found in such a system. The best practice for personal working documents is a folder structure based on document type, chronology, subject, or workflow.

Organization Shared Files. The best practice for shared file stores is to use the organizational structure as the overall directory structure. This practice dates from the era of paper files in filing cabinets and is still common practice. The problem with using organizational structure is that when there is a re-organization, the directory structure needs to be changed. This was true with paper files in filing cabinets and is still the case with electronic files. However, it is much easier to change a directory name than it was to change the name of a department in paper files.

With electronic files, problems may arise when content items are linked to (hyperlinked or referenced) if the directory structure and file name are used as the identifier for the content item. So for content items stored on a shared file store, it is not a good idea to change the name of a directory, folder, or file name.

The preferred way to handle the effect of re-organization on a file structure based on organizational structure is to stop using the old folder and create a new folder for the re-named or new organizational entity. After all, content should be stored (and labeled) with the name of the organizational entity that existed when the content was created. That was a valid entity at one time. Although this principle is useful, it is often better to create a directory structure based on a scheme that will be more stable than organizational structure. The best practice is to use business function categories. For example, even though the organizational entity called “Risk Evaluation, Analysis & Management” may be changed to “Risk Evaluation & Management,” the business function is still “Risk Management.”

Website Content Management. For a website (as opposed to a general-purpose file store), it is also useful to group content by type. This facilitates maintenance, administration, and approvals because different types of content have different lifecycles. For example, a “Policy” has a different approval process than “Home Page.” Root folders should be labeled with the content types, for example
• . . . /Application/Model/
• . . . /Document/Policy/
• . . . /Image/Photo/
• . . . /Web Page/Home-Page/
• . . . /Component/Org-Chart/

File Naming Conventions

Files should be named so that they are grouped usefully by alphabetical and chronological sorting. This should be done using a multi-part structure, usually with three segments such as purpose or function, short title, and date.
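A three-segment file name of this kind can be generated mechanically. The sketch below follows the guidance in this section (a function code from the controlled vocabulary in Exhibit 1-18, a short title of at most 20 characters, and a content date formatted as YYYYMMDD). The underscore separator, the small stop-word list, and the function name `build_file_name` are assumptions made for illustration; the exact structure shown in Exhibit 1-19 may differ.

```python
import re
from datetime import date

# Function codes as listed in Exhibit 1-18.
FUNCTION_CODES = {"Adm", "ETS", "Fin", "Gov", "HES", "HR", "IT",
                  "Lead", "PA", "Res", "Risk", "SCM"}

# Assumed sample of articles, pronouns, and conjunctions to drop from titles.
STOP_WORDS = {"a", "an", "the", "and", "or", "but", "it", "its", "this", "that"}

MAX_TITLE_LENGTH = 20


def build_file_name(function_code: str, title: str, content_date: date, extension: str) -> str:
    """Compose <function code>_<short title>_<YYYYMMDD>.<extension>."""
    if function_code not in FUNCTION_CODES:
        raise ValueError(f"Unknown function code: {function_code}")
    words = [w for w in re.findall(r"[A-Za-z0-9]+", title) if w.lower() not in STOP_WORDS]
    short_title = "".join(w.capitalize() for w in words)[:MAX_TITLE_LENGTH]
    # content_date should be the date shown in the document, not the last-modified date.
    return f"{function_code}_{short_title}_{content_date:%Y%m%d}.{extension}"


print(build_file_name("ETS", "The Bridge Inspection Plan", date(2012, 3, 15), "pdf"))
# ETS_BridgeInspectionPlan_20120315.pdf
```

Putting the function code first groups files by business function when sorted alphabetically, and the YYYYMMDD form makes files with the same function code and title sort chronologically.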
File names should start with a function code based on a controlled vocabulary of business functions (see Exhibit 1-18).

Short titles should be no more than 20 characters long. Titles could automatically be the first 20 characters of the Microsoft Office Document Properties Title field or a title defined by the document creator. Titles should describe the purpose and intent of the content. Articles, pronouns, and conjunctions should be avoided as much as possible.

Dates should be related to the content (e.g., the report date explicitly shown in the document), not the system-generated last-modified date. Dates should be formatted in a standard way that sorts chronologically (e.g., YYYYMMDD).

Exhibit 1-19 illustrates the basic file naming structure and provides examples.

Information Science

Information Science is an interdisciplinary field at the intersection of library science, computer science, linguistics, philosophy, communications, and other disciplines. This section discusses information retrieval (IR) methods that can automate content classification. Automated categorization and text mining use various analytical methods to (1) identify words and phrases relevant to the meaningful use of content items and (2) generate statistical representations of content items based on an analysis of those words and phrases. Using IR methods can help

DOTs classify large volumes of content without hiring armies of librarians. This can help DOTs comply with records management regulations and open government policies and also participate in data publishing initiatives like data.gov.

Exhibit 1-18. Example of controlled vocabulary of business functions.

File Name Function Code | Top-Level Function Category
Adm | Administrative Resources
ETS | Engineering & Technical Services
Fin | Finance
Gov | Governance & Ethics
HES | Health, Environment & Safety
HR | Human Resources
IT | Information Technology
Lead | Leadership
PA | Public Affairs
Res | Research
Risk | Risk Management
SCM | Supply Chain Management

Exhibit 1-19. Basic file naming structure and examples.

Automated Categorization

When people categorize content it is labor intensive, and the results may be incomplete and inconsistent. Once configured, automated methods are less expensive and very consistent, but it is difficult to consistently produce accurate results. There is always a tradeoff between accuracy and completeness, which is called “precision” and “recall” in information retrieval. The best scenario is when automated methods are used to suggest classifications and subject matter experts (SMEs) review and improve those classifications.

Several ways exist for computers to tag content, ranging from very simple to quite complex. This section reviews several of these techniques.

Keyword and Regular Expression Matching. The most obvious method to automatically tag content is to associate several keywords and key phrases with each category. The text of each

content item is searched. If there is a match, the item is assigned to the matching category. Many tools include keyword matching among their capabilities. Another technique is to match regular expressions (patterns that include wildcards and other features) to offer broader or narrower matches than simply an exact match against a keyword or key phrase. Simple keyword and regular expression matching is likely to lead to false positives and false negatives. False positives are when an item is incorrectly part of a result set; false negatives are when an item is not retrieved but should be. The accuracy of this approach can be improved through more complex approaches to the rules and various ways of preprocessing the text before the rules are applied (e.g., to narrow the matching to words that are nouns, and phrases to noun phrases through part of speech (POS) analysis).

Templates and Business Rules. Another basic method is for metadata values to be hard-coded into the template used for adding content to the system. For example, the Procurement office of a DOT may have a template staff use for creating Requests for Quotation (RFQs). That template may have several metadata fields with pre-defined values (e.g., the Type, Organization, Function, Project, and Format). This method is cheap and pragmatic: it helps reduce the workload on the people entering the content, keeps tagging costs low, and reduces staff resistance to tagging. Its flexibility can be increased by adding small dropdown picklists based on the identity of the person filling out the form, the business process in which the form is being used, or other simple contextual interactions.

Complex Pattern Categorizers. There are many ways to improve keyword matching. One way is to introduce scoring, instead of a simple match/no match decision.
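The simple match/no match approach described under Keyword and Regular Expression Matching, and the way it produces false positives and false negatives, can be sketched as follows. The category names, patterns, and sample sentences are invented for illustration only.

```python
import re
from typing import List

# Hypothetical rules: each category lists regular expressions with wildcards.
CATEGORY_PATTERNS = {
    "Pavement Maintenance": [r"\bpothole\w*", r"\bresurfac\w*", r"\bcrack seal\w*"],
    "Finance": [r"\bbank\w*", r"\bbudget\w*"],
}


def tag(text: str) -> List[str]:
    """Assign every category that has at least one matching pattern."""
    matched = []
    for category, patterns in CATEGORY_PATTERNS.items():
        if any(re.search(pattern, text, flags=re.IGNORECASE) for pattern in patterns):
            matched.append(category)
    return matched


# Correct match: the wildcard pattern catches "resurfaced" and "potholes".
print(tag("Crews resurfaced the road and repaired potholes."))
# False positive: "bank" here has nothing to do with finance.
print(tag("Erosion of the river bank closed the road."))
# False negative: "repaved" is relevant but no pattern covers it.
print(tag("The agency repaved two lanes."))
```

Scoring, weighting, and exclusion rules are ways of reducing exactly these two kinds of error.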
For example, the number of matches for each category within a content item can be counted, and the item assigned to the most frequently occurring category. Another enhancement is to weight the matches by where they appear and what is being matched. Matches in the title are normally weighted more heavily than matches in the body. Phrase matches are normally weighted more heavily than single word matches. Negative matches, known as exclusions, can be used to reduce false positives. For example, the “financial institutions” category might match “bank,” but not “river bank” or “Georges Bank.” Clues from the surrounding context can be used to boost or reduce the score of a match.

Each category needs its own rules to describe the things that match the category. Those rules start very simple, frequently just the category name and a synonym or two. They are iteratively improved by analyzing ever-increasing amounts of content, tagging the content with rules, then looking for the false positives and false negatives. This can be quite a task if the taxonomy is large and/or the content uses a diverse vocabulary.

Entity Extraction. Entity extractors are software routines that scan the text and find mentions of entities such as people, places, organizations, and products, as well as addresses, dates, currency amounts, job titles, and topics. Extractors can assign a suggested category to these terms and make these terms available for consumption by a target application. Recognizing entities does not, by itself, categorize the content. Categorization software can, however, score the occurrences of the entities and assign categories based on that score, similarly to the way text matches can be weighted and scored.

Recognized entities can be more robust than simple text matches for several reasons. First, entity recognition does a better job of distinguishing company and personal names than simple text matching does.
For example, is “Charles Schwab” a mention of a person or a company? Entity recognizers can use clues such as POS tagging of the content and patterns in the neighborhood of the text to identify noun phrases that are more likely to be a company name than a personal name. Second, entity recognizers typically make use of large authority files (lists of the names of people, places, organizations, products, and so forth), which also include variations

on those names. Authority files provide patterns to be recognized that are not obvious variations of the full name of a company. In addition to complex patterns, entity recognizers also make use of heuristics (i.e., guesses). For example, a number followed by a proper name that ends in St., Dr., or Blvd. is probably a street address. A noun phrase ending in a gerund, such as Minnesota Mining and Manufacturing, is probably an organization name.

Trained Categorizers. Machine learning techniques can be applied when a collection of items has already been categorized. A collection of pre-categorized items is known as the training set. Using the training set, the categorizer “learns” the words that occur, and co-occur, in each category. Although it is called learning or training, what is really happening is that the algorithm is building lists of words and computing statistics about them. Once the algorithm has built up information on the relative frequencies of various words in the different categories, new documents have their word information compared to the data stored for each category. The document is then tagged with the category that is best matched.

The Vector Space Model. The Vector Space Model was first used by the SMART Retrieval System developed by Gerard Salton at Cornell University in the 1960s. In this model, each document is represented by a term vector (i.e., a column of numbers generated by analyzing the document’s content). When a new document needs to be categorized, one compares its vector to all the others and sees which one or ones it is closest to. Bayesian Classifiers, Support Vector Machines, Hidden Markov Models, and other advanced categorization techniques are all based on this foundation.

Text Mining

While automated classification focuses on classifying content, text mining focuses on turning text into analytics that can be processed to identify patterns and trends.
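The comparison step described under The Vector Space Model can be illustrated with a short sketch. It makes several simplifying assumptions that are not from the report: term vectors are raw word counts (production systems usually apply weighting such as TF-IDF), closeness is measured with cosine similarity, and the new document simply takes the category of its single nearest neighbor. The sample texts and category names are invented.

```python
import math
from collections import Counter
from typing import List, Tuple


def term_vector(text: str) -> Counter:
    """Represent a document as a sparse vector of word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity: 1.0 for identical direction, 0.0 for no shared terms."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def categorize(new_text: str, labeled: List[Tuple[str, str]]) -> str:
    """Return the category of the labeled document closest to new_text."""
    vector = term_vector(new_text)
    _, best_category = max(labeled, key=lambda pair: cosine(vector, term_vector(pair[0])))
    return best_category


# Hypothetical pre-categorized documents (a tiny training set).
TRAINING_SET = [
    ("bridge deck inspection and bridge repair", "Structures"),
    ("pavement resurfacing and pothole repair", "Pavements"),
]
print(categorize("pothole repair on pavement", TRAINING_SET))  # Pavements
```

The word "repair" appears in both training documents, so it contributes to both scores; the new document is assigned to Pavements because it shares more terms with that example.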
Text mining tasks may include categorization, entity extraction, summarization, taxonomy generation, and sentiment analysis. However, text mining focuses more on the collection than on the individual content item. This section presents some important uses of the output of automated classification in the context of text mining.

Identifying Candidate Categories and Synonyms. A simple form of text mining is keyword extraction, where a string of text is parsed by POS to identify nouns and noun phrases. A de-duplicated list of nouns and noun phrases, with the exception of a list of excluded words and phrases, is what is usually provided as a raw keyword extraction result. As described above, keywords are the raw material on which entity extractors work. After keyword extraction results are processed against an authority file, anything left that has not been matched can be considered candidate categories and/or synonyms to be reviewed by the authority file editor.

Collection Analysis to Identify Trends and Anomalies. The categorizations of content collections, whether by content type, function, or topic, tend to exhibit a Zipf distribution,20 where roughly 80% of the content falls into 20% of the categories, and where there is a steep drop-off, called “the long tail,” which contains roughly 80% of the categories and 20% of the content. Any content collection can be analyzed to identify (1) which categories fall within which part of the Zipf distribution and (2) any divergences from the expected Zipf distribution. For example, the distribution of types of content should not be even across a collection; the types should

20 Zipf’s law states that the frequency of any word is inversely proportional to its frequency ranking. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.

follow a Zipf distribution. Some types of content will dominate, while other types will occur much less frequently, and categorization schemes should be designed and maintained to preserve this occurrence behavior.

From a classification management perspective, all types of classification schemes when applied to a collection should exhibit a Zipf distribution. If they do not exhibit the Zipf distribution, then the categorization scheme needs to be revised to split categories occurring too frequently and to merge categories that are too sparse in the collection. From a collection management perspective, the distribution of content in the collection should exhibit a Zipf distribution. If not, then content in over-represented categories should be reviewed for deletion, and content in sparsely represented categories should be targeted for acquisition or development of more content.

Taxonomy Management Tools

Taxonomy management systems (TMS) enable an organization’s users to view, apply, and modify a common set of controlled vocabulary lists to classify enterprise content. TMS tools are commonly associated with functions such as metadata management, indexing, and search.

TRB uses an in-house system developed specifically to maintain and publish the TRT. Reengineering the TRT requires more microthesaurus functionality, including mapping and publishing capabilities that support more complex and localized versions of the Thesaurus.

Core Functionality

Exhibit 1-20 provides a typology of taxonomy tool functions, which are further described in this section. A TMS generally provides the following basic activities around taxonomies and individual categories within them:
• Adding: The ability to easily add taxonomies and categories, including batch adding.
• Editing: The ability to easily edit taxonomies and categories, including batch editing.
Functional Area Functions Taxonomy Development Create a taxonomy User roles and permissions Taxonomy Maintenance Add, edit, move, delete items Assign or modify privileges to one or a group of items Activity logging Taxonomy Governance Approval workflow for additions and changes Metadata Controlled Vocabulary Assign attributes to a category Associate controlled vocabulary with metadata field Thesaurus capabilities User Interface Search and browse Drag and drop Multiple windows Reporting Alphabetical, hierarchical and other views Visualizations Importing and exporting taxonomies Application Integration APIs (WSDL, scripts, Java, etc.) Application integration (CMS, DMS, search engine, etc.) Exhibit 1-20. Typology of taxonomy tool functions.

Part 1—Terminology and Categorization Standardization 37 • Deleting: The ability to easily delete taxonomies and categories, including batch deleting. • Mapping. The ability to easily and automatically map taxonomies and categories from source taxonomies, such as topics from the TRT and a DOTs applications. • Importing/Exporting. The ability to easily import and export taxonomies from and to other source taxonomies, including batch importing and exporting, while maintaining semantic and structural integrity. Common file formats such as database APIs and XML must also be supported. Additional capabilities must include the ability to easily configure the software to work with common data structures (e.g., hierarchical and polyhierarchical taxonomies, controlled vocabularies, and bibliographic catalogs). A TMS tool should enable users to • Store multiple taxonomies in a standard format, accessible through a common interface. • Browse a taxonomy to explore categories and view information about classification terms, including definitions, relationships, explanations, and examples. • Search the taxonomy and related vocabulary lists, for example, to identify the authorized term synonymous with the search term. • Edit the properties of elements within the taxonomy, such as relationship types, facets, terms, and notes. • Represent multidisciplinarity through polyhierarchies with multiple “parent” relationships. Other potentially valuable features in a TMS are • Support for multiple simultaneous users. • Role-based security (e.g., varying access control for taxonomy owners, authors, and gen- eral users). Roles must be well-defined and enforced within the TMS. Roles should include managers (approvers), curators (editors), and coordinators and support who help manage and stage workflow. End users must be able to submit change requests. Varying degrees of permissions for end users should also be possible. 
• Audit trail and tracking of taxonomy editing and revision activities, as well as capabilities for analysts to resolve and reconcile multiple footnotes, allowing them to provide descriptions of data exceptions. This functionality must also include the ability to search time series and audit trails to find justifications for footnoting. There is also a requirement to represent taxonomy items as they existed in specific time periods and to be able to track, represent, and report their changes over time.
• Editorial support tools such as task prioritization, because numerous change requests are anticipated.
• Integration with other workflow and collaboration software.
• Online collaboration for taxonomy creation and change (e.g., voting).

TMS Benefits to DOTs

An examination of the typical features of a TMS shows how such a system can benefit DOTs:
• Central repository for the TRT or a successor global transportation taxonomy. A TMS would constitute a single resource where all DOT data managers, analysts, librarians, and other users could find any transportation taxonomy and view that taxonomy, its history, and its relationship to other related taxonomies.
• Version control and role-based permissions. Any changes to a given taxonomy would be made only by an authorized taxonomy "custodian" assigned responsibility for maintaining the authoritative version of that taxonomy. When the custodian makes a change, the details of the change and the reasoning behind it could be noted for future reference. Taxonomy custodians (who could be survey managers, publication managers, or dataset managers) would have local control over their own taxonomies, but all appropriate staff could access the taxonomies and use them in their own work products.
• Taxonomy harmonization and standards of practice. The TMS would require users to follow specific procedures to make modifications to a given taxonomy and could prompt taxonomy owners to make changes when necessary (e.g., when a taxonomy is deemed to be outdated). This would encourage greater rigor in taxonomy development and enhance the quality and consistency of survey taxonomies. The TMS can also link local taxonomies and taxonomy practices to the systems architecture and information management policies of the entire enterprise. For example, FHWA could implement a TMS to ensure compliance with overall information and data policies.
• Taxonomy crosswalks and comparisons. The TMS would store crosswalks between taxonomies, which would be especially useful to DOT users who need to integrate multiple datasets (for analysis and data integration). Robust TMS tools issue each taxonomy category a unique identifier and preserve that unique identifier to allow for comparisons between taxonomies (e.g., tracking terms with identical labels but from different taxonomy sources).
• Voting and collaborative editing. In cases where changes to a taxonomy would affect multiple DOT stakeholders, the TMS could provide a platform for facilitating a collective decision on those changes.

Taxonomy Tools Vendors

Taxonomy management systems constitute a very small market within the enterprise software environment and are not covered by any technical analysts such as Forrester, Gartner, or IDC. A few TMS vendors compete for a very small market share (estimated to be between $16.5 million and $66 million in 2012).

Conclusions

Part 1 has summarized the current transportation information management landscape in the context of broader information management trends and developments. Part 1 provides many examples of practices that may be adapted in order to improve DOT management of transportation information (e.g., guidance for file formats, naming conventions, and information preservation strategies). We have discussed the convergence of data and content in this landscape, manifested most visibly in the combination of data and content from multiple sources on the web. Although data.gov is leading to wider publication of and access to raw datasets, a rich categorization scheme such as the TRT that gathers and maps the semantics of transportation information should facilitate the interchange, combination, and use of heterogeneous sources. The semantic strengths of the current TRT should be used and the weaknesses addressed to make the TRT more flexible and easier to use for DOTs. The primary recommendation is to develop a way to enable DOTs to identify and generate TRT subsets or microthesauri easily. At the same time, DOTs need incentives to be more engaged in TRT governance.
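As a rough illustration of the primary recommendation, the sketch below shows one way a subset (microthesaurus) might be generated from a polyhierarchical thesaurus. It is not the TRT's or any TMS product's implementation: the Term record, its field names, the extract_microthesaurus function, and the sample identifiers and labels are all hypothetical. The sketch assumes only that each term carries a stable unique identifier and a list of broader ("parent") terms.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Term:
    """One thesaurus entry. All field names are illustrative assumptions."""
    term_id: str   # stable unique identifier, preserved in every subset
    label: str     # preferred (authorized) label
    # Identifiers of broader terms; more than one entry means polyhierarchy.
    broader: list[str] = field(default_factory=list)


def extract_microthesaurus(terms: dict[str, Term], root_ids: set[str]) -> dict[str, Term]:
    """Return the chosen root terms plus every narrower term beneath them."""
    # Invert the broader links once so the walk can move downward.
    narrower: dict[str, list[str]] = {}
    for term in terms.values():
        for parent_id in term.broader:
            narrower.setdefault(parent_id, []).append(term.term_id)

    # Breadth-first walk from the roots; the 'keep' set guards against
    # visiting a term twice when it has several parents.
    keep: set[str] = set()
    queue = deque(r for r in root_ids if r in terms)
    while queue:
        term_id = queue.popleft()
        if term_id in keep:
            continue
        keep.add(term_id)
        queue.extend(narrower.get(term_id, []))

    # Copy each kept term, dropping links to parents outside the subset.
    return {
        tid: Term(tid, terms[tid].label, [p for p in terms[tid].broader if p in keep])
        for tid in sorted(keep)
    }


# Made-up sample data; these are not real TRT identifiers or terms.
terms = {t.term_id: t for t in [
    Term("T1", "Pavements"),
    Term("T2", "Maintenance"),
    Term("T3", "Pavement maintenance", ["T1", "T2"]),  # two parents
    Term("T4", "Crack sealing", ["T3"]),
    Term("T5", "Bridges"),
]}

subset = extract_microthesaurus(terms, {"T1"})
for term in subset.values():
    print(term.term_id, term.label, term.broader)
```

In this made-up example, selecting T1 as the root yields T1, T3, and T4. T3 keeps only the parent that falls inside the subset, and the source thesaurus is left unchanged. Because identifiers are preserved, a local subset could still be mapped back to the source thesaurus; a working tool would also need to carry over synonyms, notes, and related-term links.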


TRB’s National Cooperative Highway Research Program (NCHRP) Report 754: Improving Management of Transportation Information is a selective review of practices of state departments of transportation (DOTs) and other agencies that collect, store, and use transportation data and information. The report also includes potential guidance on strategies and actions a DOT might implement to help improve information capture, preservation, search, retrieval, and governance.
