2.0 Washington State DOT Tests

2.1 Planning and Scoping

Round 1

The WSDOT findability test was designed to build on a separate, ongoing effort at WSDOT to improve the management and findability of engineering manual content. This effort was originally inspired by another state DOT's implementation of a wiki combining content from its various manuals. Because WSDOT's manuals are produced in Portable Document Format (PDF), there is no easy mechanism to search across the body of manuals for content related to a particular topic. WSDOT's vision was to enable search and navigation to the content of interest across the various manuals. Preliminary work had examined the explicit cross-references built into the manuals to gain an understanding of the interrelationships across the different manuals. WSDOT was interested in further exploring connections across the manuals to inform the development of a pilot demonstration system within its web platform (Drupal). WSDOT had selected eight manuals to include in this pilot demonstration; each had content related in some manner to stormwater management.

The initial NCHRP 20-97 test was scoped to:

• Assemble content from a set of engineering manuals;
• Conduct an analysis using an unsupervised machine learning technique (cluster analysis) to identify common topics or themes across the manuals and understand how these topics were distributed across manual chapters;
• Demonstrate a computational approach to partially automate the process of splitting manual sections into subsections or "chunks" of text to be displayed on individual web pages;
• Create a set of facets to serve as filters on manual content, based on the ways manual users want to search and navigate the consolidated body of content;
• Develop a sample ontology (a set of terms with associated semantic relationships) based on the selected facets; and
• Demonstrate auto-classification of individual manual sections based on selected terms in the ontology.

Round 2

The second round of testing built on the work completed in round 1. The scope included the following activities:

• Extend the ontology developed in round 1 to include additional engineering concepts and terms, and use the extended ontology to further classify the content in WSDOT's online manuals;
• Work with subject matter experts to identify compelling search/discovery use cases for the manuals website and validate the extended ontology; and
• Create a video demo that walks through several use case scenarios, demonstrating how the ontology adds value for search and discovery.

2.2 Content Collection and Analysis

Round 1

Content Harvesting and Processing

WSDOT provided 18 of its engineering manuals in PDF format for the content analysis. A description of each manual included in the content analysis is provided in Table 1. The first eight of these manuals were included in the pilot, which focused on stormwater. The other ten manuals were added to provide a larger footprint for the text analysis and to lay the groundwork for potential future expansion of the initial pilot.

Table 1. Description of the Manuals Included in the Analysis

Design Manual (DES)
The Design Manual provides policies, procedures, and methods for developing and documenting the design of improvements to the transportation network in Washington. It has been developed for state facilities and may not be appropriate for all county roads or city streets that are not state highways.

Environmental Manual (EVM)
The Environmental Manual is a compilation of environmental policies and processes that are used as a guidance resource for the Washington State Department of Transportation (WSDOT) and its environmental consultants. The Environmental Manual outlines WSDOT's legal requirements related to environmental, cultural, historic, and social resources and is a keystone of WSDOT's environmental compliance strategy.

Highway Runoff Manual (HRM)
The Highway Runoff Manual guides the planning and design of stormwater management facilities for existing and new Washington State highways, rest areas, park-and-ride lots, ferry terminals, and highway maintenance facilities throughout the state. The HRM establishes minimum requirements and provides uniform technical guidance.

Hydraulics Manual (HDM)
The Hydraulics Manual provides detailed information on hydrologic and hydraulic analysis related to highway design. This manual should be used in conjunction with the WSDOT Highway Runoff Manual and the WSDOT Design Manual, specifically Section 1210.

Roadside Manual (RSM)
The Roadside Manual supplements the Roadside Policy Manual by explaining how to implement the policies found in the RPM. The Roadside Manual links partners working on WSDOT roadsides. Chapters include laws and policies, visual functions, wetlands, wildlife, safety rest areas, soil amendments, contour grading, soil bioengineering, vegetation, restoration, and design enhancement.

Roadside Policy Manual (RPM)
The Roadside Policy Manual provides practical roadside restoration policies and guidance, which are based on minimizing life cycle costs while providing operational and environmental functions. It promotes ecological context, environmental preservation, and maintainability. The manual is intended for use in project planning, scoping, and environmental permitting, and by engineering designers, landscape architects, and construction and maintenance personnel.

Temporary Erosion and Sediment Control Manual (TESC)
The Temporary Erosion and Sediment Control Manual replaces Chapter 6 and Appendix 6A of the WSDOT Highway Runoff Manual. It outlines WSDOT's policies for meeting the National Pollutant Discharge Elimination System Construction Stormwater General Permit requirements and the requirements in Volume II of the stormwater management manuals published by the Washington State Department of Ecology.

Utilities Manual (UTL)
The Utilities Manual provides guidance on accommodating utilities within the state right of way in a manner that does not interfere with the free and safe flow of traffic or impair the highway's visual quality. Information is provided about the preparation of utility agreements and utility service agreements.

Consultant Services Manual (CSM)
This manual provides guidance concerning the authorization, selection, and use of consultants for Personal Services and Architectural and Engineering contracts and/or supplements.

Cost Estimating Manual for WSDOT Projects (CEM)
The Cost Estimating Manual for WSDOT Projects provides a consistent approach to cost estimating, estimate reviews, estimate documentation, and management of estimate data. It provides guidance on how to treat common and recurring challenges encountered in the cost estimating process. This guidance should be used as a tool in the project delivery process.

Development Services Manual (DSM)
The Development Services Manual is a major component of the department's overall strategy to promote a consistent statewide development review process and the application of mitigation policies. This manual provides policies and procedures for reviewing proposed developments; assessing development impacts to the state highway system; determining appropriate improvements and/or shared contributions to mitigate impacts; writing interlocal agreements and other agreements with local agencies and public and private parties; and considering access to the state highway system.

Highway Surveying Manual (HSM)
The Highway Surveying Manual presents surveyors' methods and departmental rules that apply to highway surveying operations. The manual is intended to help standardize surveying practices throughout the department and to be a useful tool for department surveying crews.

Maintenance Manual (MNT)
The Maintenance Manual provides maintenance personnel with procedures and guidelines for maintaining the state highway system. It focuses on equipment, materials, facilities, techniques, and other information needed to carry out the maintenance activities of the department. This manual does not establish absolute standards but provides uniform operating procedures and performance guidelines.

Plans Preparation Manual (PPM)
This manual provides instructions and guidance for the preparation of right of way plans, contract plans, special provisions, and estimate packages for highway construction projects. It also provides the standards used in the preparation of these plans using Computer Aided Drafting and Design.

Right of Way Manual (ROW)
The Right of Way Manual provides guidance on real estate acquisition, title, appraisal, relocation, and property management.

Techniques of Right of Way Plans Preparation (RWP)
Right of way plans, when approved by the State Design Engineer, become the official documents used to acquire real estate and other property rights (both temporary and permanent).

Traffic Manual (TRM)
The Traffic Manual is a guide for department personnel in traffic operations and design. It does not establish absolute standards but establishes uniform guidelines and procedures for the use of traffic control devices.

Utilities Accommodation Policy (UAP)
The Utilities Accommodation Policy was established in cooperation with the utility industry. The Utilities Accommodation Policy addresses:
• AASHTO policy guidelines on accommodating utilities within highway rights of way;
• State laws and regulations governing the accommodation of utility facilities; and
• Compliance with Federal-aid policies and procedures.

Performing text analysis on this corpus required preprocessing the text in the manuals before examining it with the text analytics engine. Preprocessing filters out content that does not add value to the meaning of the text. The following sections describe the preprocessing work steps.

Converting PDF Files to Text

WSDOT's technical manuals are written in Microsoft Word. Graphics are prepared in Adobe Photoshop, Illustrator, or InDesign and then imported into the Word documents. To distribute the manuals electronically, the Microsoft Word documents are converted to PDF. Each PDF file encapsulates a complete visual description of a fixed-layout, flat document.

Converting a PDF file back to a text file is difficult because a PDF contains information on how to present a document visually; the format is not designed for editing. Converting PDFs to text requires removing all of the presentation instructions so that only the words remain, without the formatting. Converting Microsoft Word documents to text is more straightforward. A mixture of Microsoft Word and PDF documents was made available for conversion, since not all of the original Word files could be located. Once the documents are converted to text, they are ready to be ingested by text analysis applications.

Deconstructing Manuals into Sections and Subsections

Answering user queries based on content from the manuals requires the search engine to direct users to precisely the place in an engineering manual that contains the answer to the question. To find the section, subsection, or paragraph where the answer resides, the content must be deconstructed into text or HTML files that each address a single topic. The use of consistent, standardized styles in the original Microsoft Word documents to indicate the different section levels facilitates the deconstruction process. The use of bookmarks in PDFs to designate different sections is also very helpful. Figure 1 presents an example of a deconstructed page from the Temporary Erosion and Sediment Control (TESC) Manual.

Figure 1. Example section from the Temporary Erosion and Sediment Control Manual.

A user's search for "turbidity samples" would find the answer in this subsection of the TESC Manual. If other manuals contain similar relevant information, that content would be presented in the result set as well, with equal precision. Figure 2 contains a small snippet of the code that was used to deconstruct the manuals.

Figure 2. PHP script designed to convert a PDF document to a txt file or an html file.
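The PHP snippet referenced in Figure 2 does not survive in this extracted text. As a rough illustration of the chunking step only, the following is a minimal Python sketch, not WSDOT's actual script; the heading pattern and file name are assumptions for demonstration.

import re

# Illustrative pattern for numbered manual headings such as
# "6-2.1 Turbidity Sampling"; an assumption, not WSDOT's actual format.
HEADING = re.compile(r"^(\d+(?:-\d+)?(?:\.\d+)*)\s+(\S.*)$")

def deconstruct(path):
    """Split a converted manual text file into (title, body) chunks,
    one per section or subsection, suitable for individual web pages."""
    chunks, title, body = [], "Front matter", []
    with open(path, encoding="utf-8") as f:
        for line in f:
            m = HEADING.match(line.strip())
            if m:  # a new section starts here; flush the previous one
                if body:
                    chunks.append((title, "\n".join(body)))
                title, body = f"{m.group(1)} {m.group(2)}", []
            else:
                body.append(line.rstrip())
    if body:
        chunks.append((title, "\n".join(body)))
    return chunks

# Each (title, text) chunk would become one HTML page in the pilot site.
for title, text in deconstruct("tesc_manual.txt"):  # hypothetical file name
    print(title, "-", len(text), "characters")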

Removing "Noise"

Preparing text files for text analysis requires removing words that do not add value to understanding what the content is about. These words or groups of words are considered "noise." Three types of noise need to be removed from a corpus before analyzing the meaning of the content:

• Stop words;
• Lists of words that do not help characterize the content, such as a table of contents; and
• Boilerplate language.

Stop Words

When analyzing text files with text analysis applications, some extremely common words that provide little value to the understanding of a document are excluded from analysis. These words are called "stop words." Common stop words include: a, an, and, are, as, at, be, by, for, from, has, he, in, is, it, its, of, on, that, the, to, was, were, will, with.

Tables of Contents and Glossaries

Tables of contents (TOCs) and glossaries are also removed, because counting the frequency of words in these sections adds unnecessary weight to the term frequency counts. Word counts from TOCs or glossaries do not reflect how words are used in context, so these sections should be excluded when analyzing the meaning of a document.

Boilerplate Language

Boilerplate language is often used to communicate how a document can be used or to convey copyright information. Again, the words in sections of a document that contain boilerplate language do not add value to understanding the meaning of the content. An example of boilerplate text from the Design Manual is shown in Figure 3.

Figure 3. Boilerplate example.
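To make the preprocessing steps concrete, here is a minimal Python sketch that lowercases text, strips punctuation and numbers, and drops NLTK's standard English stop word list. It is a simplified stand-in for the pipeline described above, not the study's actual code.

import re
from nltk.corpus import stopwords  # requires a one-time nltk.download("stopwords")

STOP = set(stopwords.words("english"))

def preprocess(text):
    # Keep only alphabetic tokens and drop stop words so that frequency
    # counts reflect content-bearing terms.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP]

print(preprocess("The Hydraulics Manual provides detailed information "
                 "on hydrologic and hydraulic analysis."))
# ['hydraulics', 'manual', 'provides', 'detailed', 'information',
#  'hydrologic', 'hydraulic', 'analysis']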

Content Analysis

Content analysis is the task that identifies the words and phrases that will be critical for finding and discovering content in the manuals. These words and phrases provide the foundational elements of the taxonomy, ontology, search facets, and controlled vocabularies. Text mining and cluster analysis were the two text analysis techniques chosen to analyze the corpus of engineering manuals. These techniques and their application to the WSDOT manuals corpus are described below.

Text Mining

Text mining is the process of deriving information from text by identifying and counting words and phrases, and analyzing patterns and trends. The words and phrases that frequently occur in a document can provide insight into the meaning of the text. For the WSDOT test, the purpose of the text mining exercise was twofold: (1) to identify terms to be included in the ontology, and (2) to demonstrate a technique WSDOT could use in future efforts to build and enhance its thesaurus.

Many different text analysis algorithms and software applications can perform text mining. We used a Natural Language Processing (NLP) module in Python. Our analysis determined that there are over 47,000 distinct one-word and two-word phrases within the 18 manuals¹, with 1,139,634 occurrences of these 47,000 words or phrases in the corpus. A Pareto analysis (illustrated in Figure 4) found that 3,495 words or phrases accounted for 80% of all term occurrences, with 911,000 occurrences of these 3,495 words or phrases in the engineering manuals corpus. This set of 3,495 words can be used to help develop the ontology. The frequency counts are used to decide whether a word or phrase appears often enough to justify inclusion in the ontology. Words used only once, in only one of the manuals, may not be salient enough to include.

Figure 4. Pareto analysis of word frequency in the WSDOT engineering manuals. [Chart plots the number of occurrences of each one- or two-word phrase in the corpus (n=47,000) against cumulative percentage; 3,495 phrases account for 80% (911,000 of 1,139,634) of cumulative occurrences.]

¹ Hyphenated words can be preprocessed so that they are treated either as single words or as multiple words.
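A minimal sketch of this frequency analysis follows, using scikit-learn's CountVectorizer to count one- and two-word phrases and locate the 80% cumulative cutoff. The specific library call is an assumption (the study only says it used a Python NLP module), and the tiny corpus here is a placeholder for the preprocessed manual texts.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder corpus; in the actual analysis this would be the
# preprocessed text of the 18 manuals.
docs = [
    "stormwater runoff treatment and flow control design",
    "temporary erosion and sediment control for highway construction",
    "highway runoff treatment facilities and stormwater permits",
]

# Count one- and two-word phrases across the corpus.
vec = CountVectorizer(ngram_range=(1, 2), stop_words="english")
counts = np.asarray(vec.fit_transform(docs).sum(axis=0)).ravel()

# Sort terms by descending frequency and find how many account for
# 80% of all occurrences (the Pareto cutoff).
order = np.argsort(counts)[::-1]
cumulative = np.cumsum(counts[order]) / counts.sum()
n80 = int(np.searchsorted(cumulative, 0.80)) + 1
print(f"{n80} of {len(counts)} phrases cover 80% of occurrences")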

Table 2 lists the 80 most frequently used single words, which together account for 20% of all word occurrences. While text mining can provide evidence of words and themes that are representative of the corpus's meaning, manual effort is required to distinguish the words that are meaningful and are candidates for inclusion in a taxonomy or ontology describing the content in the manuals. The meaningful terms are highlighted in the table.

Table 2. List of Frequently Used Terms - Single Words (18 Engineering Manuals)

Single Word Term   # Manuals   Frequency Count   Cumulative %
projected          18          8358              0.7%
uses               18          6663              1.3%
designed           18          6614              1.9%
manual             18          6243              2.5%
stated             18          6052              3.0%
planned            18          5486              3.3%
utility            18          5466              3.9%
pages              18          5176              4.5%
highway            18          4955              4.8%
areas              18          4610              5.2%
required           18          4278              5.6%
rights             18          4267              6.0%
included           18          4177              6.4%
controls           18          4142              6.7%
provided           18          4074              7.1%
sectioned          18          3791              7.4%
information        18          3677              7.7%
properties         18          3501              8.0%
work               18          3489              8.3%
ways               18          3469              8.6%
saw                18          3184              8.9%
right              17          3128              9.2%
construction       18          3019              9.5%
serviced           18          2909              9.7%
accessed           18          2817              10.0%
costs              18          2815              10.2%
needed             18          2806              10.5%
facility           18          2692              10.7%
traffic            18          2673              10.9%
agency             18          2579              11.2%
permit             18          2531              11.4%
management         16          2525              11.6%
regions            18          2495              11.8%
flows              17          2479              12.0%
typed              18          2458              12.3%
agreements         18          2448              12.5%
requirement        18          2426              12.7%
water              17          2398              12.9%
standard           18          2331              13.1%
determined         18          2328              13.3%
reviewed           17          2312              13.5%
processed          18          2275              13.7%
siting             18          2267              13.9%
environmental      18          2235              14.1%
maintenance        18          2207              14.3%
within             18          2175              14.5%
system             18          2139              14.7%
department         18          2134              14.9%
contracted         17          2060              15.0%
datums             18          2038              15.2%
documented         18          2036              15.4%
existed            18          1996              15.6%
offices            18          1953              15.7%
impacting          18          1951              15.9%
making             18          1920              16.1%
transportation     18          1872              16.2%
soil               17          1857              16.4%
lined              18          1842              16.6%
material           18          1825              16.7%
times              18          1823              16.9%
limits             18          1782              17.0%
runoff             14          1773              17.2%
showing            18          1771              17.4%
locations          18          1756              17.5%
cities             17          1732              17.7%
following          18          1722              17.8%
conditioned        18          1710              18.0%
locals             18          1691              18.1%
pipes              13          1686              18.3%
developments       18          1674              18.4%
reported           16          1661              18.6%
foot               18          1641              18.7%
roadways           18          1637              18.8%
sloping            16          1623              19.0%
also               18          1616              19.1%
appendix           16          1610              19.3%
considered         18          1576              19.4%
formed             18          1575              19.5%
approvals          18          1570              19.7%
lands              17          1568              19.8%
surveys            16          1568              20.0%

Each of the highlighted terms is used in 14 to 18 manuals, indicating that many terms are used widely across the manuals. Table 3 includes two-word phrases. Some of the two-word phrases are meaningful, while others, like "way manual," are not. The meaningful terms are highlighted.

Table 3. List of Frequently Used Terms - Two-Word Terms (18 Engineering Manuals)

2-Word Term              # Manuals   Frequency Count
right of                 17          3128
type of                  17          1020
needs to                 18          969
within the               18          969
based on                 18          882
state highway            16          854
prior to                 18          820
use to                   18          786
way manual               4           735
referred to              18          669
Washington state         18          665
highway runoff           11          649
limited access           14          607
service manual           8           597
accordance with          15          591
number of                18          547
traffic controls         13          540
used for                 16          528
less than                17          522
required to              17          507
local agencies           15          499
included in              17          476
states department        18          455
meet the                 16          444
responsible for          17          442
use in                   17          441
necessary to             18          438
utility accommodations   7           427
displaced person         1           425
consultant services      3           424
creek near               1           421
standard specification   14          420
development services     7           419
management practices     13          418
due to                   17          416
use the                  16          416
applied to               18          413
cost of                  17          397
real estate              13          396
plan and                 18          394
related to               17          387
provided the             17          385
reviewed and             17          381
shown in                 14          380
best management          11          378
runoff treatment         4           378
land use                 14          373
way plan                 10          355
real properties          9           354
flow control             5           353
contact the              16          351
area of                  17          346
plan sheets              12          345
described in             16          341
information on           18          339
information management   1           336
orders to                18          330
personal property        4           328
highway rights           13          319
subject to               16          311
wsdot environmental      9           309
see the                  15          308
standard plan            11          308
purpose of               17          300
changes in               18          299
associated with          16          298
relocation assistance    5           297
cost estimates           14          295
access control           12          293
compliance with          15          293
complied with            18          290
sediment control         9           284

Cluster Analysis

Cluster analysis divides textual data into conceptually meaningful groups. In the context of understanding and classifying unstructured text, cluster analysis identifies conceptual classes that can be used for classification of content. The terms identified in a cluster can thus be used as the basis for identifying categories and subcategories of concepts that, when arranged hierarchically, can help a user navigate and find content that meets their requirements.

We selected the open-source programming language Python 3.6, with its various components, to perform the analysis. We primarily used the NLTK and SCIKIT-Learn modules within Python. NLTK (Natural Language Toolkit) provides a suite of text processing libraries for tokenization, classification, stemming, tagging, parsing, and semantic reasoning. These terms are defined in Table 4.

Table 4. Definitions of Natural Language Processing Functions in Python

Tokenization: In lexical analysis, tokenization is the process of segmenting running text into words, phrases, and sentences. Tokenization is required before text processing can be done by a computer. Segmentation of words and phrases is done to detect meaningful patterns of words that will be used by other natural language algorithms to compute the meaning of the text.

Classification: Classifiers label tokens with category labels. In NLTK, classification is accomplished by manually building a training set that represents content in specific categories. NLTK defines several classifier classes that contain various algorithms for labeling tokens and classifying content. These classes include:
• ConditionalExponentialClassifier
• DecisionTreeClassifier
• MaxentClassifier
• NaiveBayesClassifier
• WekaClassifier

Stemming: For grammatical reasons, documents can contain different forms of a word, such as drive, drives, and driving. There are also related words with similar meanings, such as nation, national, and nationality. The goal of stemming is to reduce words to a common base form.

Tagging: Tagging in the NLTK library refers to the process of classifying words into parts of speech. The labeling of each word is known as tagging, or part-of-speech tagging.

Parsing: Parsing in NLP is the process of determining the syntactic structure of text by analyzing its constituent words based on the underlying grammar. The output of the parsing process is a parse tree in which the sentence is the root and the intermediate nodes are noun phrases and verb phrases. This type of decomposition of sentences helps NLP programs determine the meaning of a sentence.

Semantic Reasoning: Semantic reasoning relies on explicit, human-understandable representations of the concepts, relationships, and rules that comprise a knowledge domain (like transportation engineering). These relationships are represented in a semantic model called an ontology. Using an ontology, the computer can infer or combine concepts to answer questions or draw conclusions; this is called semantic reasoning.

SCIKIT-Learn is a machine learning library that features classification, regression, and clustering algorithms, including support vector machines, random forests, gradient boosting, k-means, and DBSCAN. It is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy. To install most of the Python packages, we downloaded Anaconda 3, an open source distribution of Python and its various packages.

Two types of unsupervised learning models were tested, as described below:

• K-means clustering is an unsupervised machine learning model that is widely used for clustering documents into distinct groups. Each document (or, in WSDOT's case, document section) is assigned to a single cluster. The number of clusters is an input to the analysis; there are methods that help determine the optimal number of clusters for the data. Once the clusters are created, a "silhouette" metric can be calculated, providing a measure of how similar an object (a document) is to its assigned cluster versus other clusters. The silhouette value ranges from −1 to +1, where a high value indicates that the object is well matched to its cluster and poorly matched to neighboring clusters. For each cluster, the set of words that contribute most to that cluster can be reviewed to develop appropriate cluster labels. See reference (1) for a discussion of k-means clustering.

• Latent Dirichlet Allocation (LDA) is another unsupervised machine learning model, used for topic modeling. With this method, documents are assumed to be "about" a mixture of topics. Therefore, rather than sorting documents into distinct groups, multiple subject tags can be assigned. For each document in the corpus, the LDA model produces a probability that the document belongs to a given topic.
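To illustrate the topic-mixture output that LDA produces, here is a minimal sketch using scikit-learn's LatentDirichletAllocation. The study does not name the specific LDA implementation it used, and the tiny corpus below is a placeholder.

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "stormwater runoff treatment flow control",
    "right of way appraisal acquisition relocation",
    "culvert pipe channel stream hydraulics",
    "erosion sediment control stormwater permit",
]

X = CountVectorizer().fit_transform(docs)

# Each document is modeled as a mixture of topics; fit_transform()
# returns a per-document probability distribution over the topics.
lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topics = lda.fit_transform(X)
print(doc_topics.round(2))  # one row per document; each row sums to 1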

We decided to focus on the k-means clustering algorithm because our preliminary results from the LDA models showed that most documents belonged to a single topic. See references (2) and (3) for transportation-related example applications of LDA.

To test our machine learning approach, we first converted the PDF files to text files and then performed preprocessing tasks such as tokenization, removal of punctuation and stop words, and stemming to get the data ready for machine learning. To run a model, the unstructured text data must be represented in a numerical format for the algorithms. We do this by creating either a "bag of words" (term frequency) matrix or a "tf-idf" matrix. The bag-of-words approach creates a frequency count of all the words (terms) in a document. In the tf-idf matrix, "tf" is the term frequency and "idf" is the inverse document frequency of the term. The tf-idf approach weights the frequency of a term in a document by a factor that discounts its importance when the term appears in almost all documents. Terms that appear too rarely or too frequently are therefore ranked lower than the terms that make up the balance of the content, and hence are expected to contribute more to clustering results. (See reference (2).)

To test the potential value of making our models domain specific, we downloaded the Transportation Research Thesaurus (TRT) and imported it as a .csv file to be read by Python. We tokenized the TRT terms into multi-word phrases and looked for those multi-word phrases in the manuals; for example, "air carriers" would be one token rather than "air" and "carriers." The TRT served as our controlled vocabulary.

We ran the analysis on three different datasets:

1. The full set of 18 manuals divided into chapters (n=350);
2. The eight stormwater-related manuals divided into chapters (n=174); and
3. The eight stormwater-related manuals decomposed into further subsections (n=1,890).

In these datasets, boilerplate sections labeled TOC, Glossary, Foreword, Front, etc., were removed from the analysis. Two different sets of inputs were used to create the tf-idf matrix for the k-means models:

• All of the single-word terms in the manuals, or
• Multi-word terms from the TRT, augmented with additional synonyms for selected topic areas.

Based on this preliminary analysis, we observed that the results using all words appeared to be more robust than those limited to the TRT terms; there were many substantive WSDOT-specific terms that were not represented in the TRT. We therefore chose the all-terms model.

We tried different tuning parameters in our model to determine the optimal number of clusters (using the elbow method) and the cohesiveness of clusters (based on the silhouette coefficient). In our final model, we had 18 clusters using a maximum of 250 features as input to the model. We output the top 20 terms for each cluster to help determine a label for each cluster.
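As a sketch of the modeling steps just described (a tf-idf matrix capped at 250 features, 18 k-means clusters, silhouette scoring, and top terms per cluster), the following scikit-learn fragment mirrors the stated settings. The chunks variable is a placeholder for the preprocessed manual sections; parameters beyond those stated in the text are assumptions.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

chunks = [...]  # placeholder: the preprocessed manual chapters/sections

# tf-idf matrix, capped at 250 features as in the final model.
vec = TfidfVectorizer(max_features=250, stop_words="english")
X = vec.fit_transform(chunks)

km = KMeans(n_clusters=18, n_init=10, random_state=0).fit(X)
print("silhouette:", silhouette_score(X, km.labels_))

# Top terms per cluster (the study reviewed the top 20) to support
# manual assignment of cluster labels.
terms = np.array(vec.get_feature_names_out())
for c, center in enumerate(km.cluster_centers_):
    top = terms[np.argsort(center)[::-1][:20]]
    print(f"Cluster {c}:", ", ".join(top))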

The 18 clusters resulting from the analysis (full set of 18 manuals divided into chapters) are shown in Table 5. Cluster names were assigned based on the set of representative terms belonging to each cluster. These results are fairly good but could be further improved by removing additional words from the analysis (e.g., names of months and abbreviations like "NA") and by collapsing similar clusters (e.g., "Traffic Engineering" and "Traffic and Safety"; "Stormwater" and "Hydraulics").

Table 5. Results of Cluster Analysis - 18 Manuals (top 15 terms per cluster, shown in stemmed form)

1. Roadside: roadside, plant, function, veget, visual, soil, restor, community, environment, polici, mainten, landscap, tree, enhance, nativ
2. Structures/Geotechnical: bridg, structure, wall, barrier, railroad, geotechn, hq, nois, contour, slope, sheet, tabul, berm, onlin, clearanc
3. Environmental/Project Review: environment, commit, permit, nepa, impact, june, feder, agenc, approv, mainten, fhwa, analysi, review, resource, polici
4. Estimating: cost, contract, overtime, consult, premium, fee, negoti, indirect, rate, weight, labor, profit, hour, audit, payment
5. Traffic Engineering: citi, pedestrian, roundabout, curb, illumin, park, traffic, signal, light, street, path, width, cross, mainten, access
6. Consultant Services: cso, consult, contract, agreement, firm, negoti, acl, profession, task, solicit, competit, septemb, select, procur, request
7. Construction Contracts: estim, cost, item, price, bid, april, scope, review, history, specialti, contractor, phase, costbas, quantity, analysi
8. Traffic and Safety: traffic, sign, safeti, control, zone, lane, vehicl, speed, roadway, oper, intersect, element, mainten, instal, barrier
9. Stormwater: water, stormwat, eros, sediment, discharg, bmps, runoff, soil, control, flow, infiltr, tesc, april, surfac, prevent
10. Utilities: util, franchis, accommod, instal, agreement, reloc, shall, cost, right, approv, facil, applic, zone, permit, control
11. Plans: na, cid, lt, yes, septemb, appendix, vacat, sheet, rt, checklist, chart, remark, lb, flow, addendum
12. Hydraulics: hydraul, culvert, pipe, flow, channel, stream, fish, inlet, woodi, river, march, passag, wash, veloc, structur
13. Access Management: access, connect, permit, shall, control, rcw, septemb, driveway, author, approach, appendix, applic, hear, wac, properti
14. Survey: survey, monument, control, point, data, accuraci, januari, adjust, datum, map, instrument, gps, station, observ, rod
15. Development Review: sepa, gma, counti, local, land, appeal, los, agenc, impact, propos, environment, mitig, review, cipp, rtpo
16. Right of Way: right, properti, apprais, acquisit, parcel, res, februari, real, titl, certif, easement, acquir, owner, reloc, estat
17. Traffic Impact Mitigation: mitig, impact, local, agreement, traffic, payment, improv, review, los, interloc, wsdotwagov, propos, permit, agenc, shall
18. Geometrics: lane, curv, ft, speed, width, hov, ramp, exhibit, vehicl, cross, geometr, traffic, shoulder, intersect, facil

Figure 5 illustrates which of the 18 manuals include one or more chapters classified into each cluster, showing the degree to which certain topic areas cross the different manuals. A high degree of crossover supports the hypothesis that creating a consolidated, interactive body of manual content will be beneficial, enabling users to more easily discover related content across manuals. Seven of the 18 clusters are focused on just a couple of manuals (Estimating, Hydraulics, Access Management, Development Review, Consultant Services, Survey, and Traffic Impact Mitigation). Five of the clusters span five or more manuals (Roadside, Environmental/Project Review, Traffic Engineering, Traffic and Safety, and Right of Way). The remaining six clusters span three to four manuals.

Figure 5. Cross-referencing of clusters by manual. [Matrix of the 18 manuals against the 18 clusters, indicating which manuals include one or more chapters classified into each cluster.]

Round 2

The round 2 tests did not add content to the manuals site; activities focused on creating additional ways of navigating and searching the original set of eight manuals.

2.3 Solution Development and Testing

Round 1

Scope

Development of the WSDOT findability solution consisted of the following activities:

• Ingesting the manual "chunks" into Drupal (WSDOT's web content management system) and indexing this content using Solr (WSDOT's search engine);
• Selecting facets of interest for search and discovery of manual content;
• Selecting specific terms or tags from these facets to be used for a demonstration of auto-classification;
• Creating an ontology built around the selected terms, including synonymous (equivalent) terms and related terms;
• Demonstrating the use of the ontology for assigning tags to manual sections, based on the presence of ontology class names, equivalent terms, and related terms for each tag; and
• Demonstrating and describing how the ontology can be integrated with WSDOT's search engine (Solr).

Ingesting and Indexing the Content

The chunks of manual content were ingested into the Drupal content management system. The Drupal Book module was selected for this solution because it provides functionality to navigate across different manual sections. Each HTML page of content is stored as a record in a MySQL database.

Apache Solr is an open source search engine that is integrated with Drupal. Solr was used to crawl and index the full text of each of the HTML pages in the pilot corpus. Solr can achieve fast search responses because, instead of searching the text directly, it searches an index. The type of index created by Solr is called an inverted index because it transforms a page-centric data structure into a keyword-centric data structure. Solr stores this data structure in a directory called index within the data directory.

In Solr, a document is the unit of search and indexing. A document in the online engineering manuals corpus corresponds to a "chunk" of content, typically a section or subsection of a manual. Before documents are added to Solr, a schema is specified in a file called schema.xml. The schema declares:

• What kinds of fields (metadata elements) are in the document;
• Which field should be used as the unique/primary key;
• Which fields are required; and
• How to index and search each field.

The field types in Solr include:

• Float
• Long
• Double
• Date
• Text

A field declaration looks like this:

<field name="id" type="text" indexed="true" stored="true" multiValued="true"/>

The attributes are defined as follows:

• name: the name of the field;
• type: the field type;
• indexed: whether this field should be added to the inverted index;
• stored: whether the original value of this field should be stored; and
• multiValued: whether this field can have multiple values.

Solr stores all of these data in an inverted index. To construct an inverted index, the program isolates all the words in the documents and sorts them in ascending lexicographical order.

Term         Document IDs
buffer       Doc1
island       Doc2
curb         Doc3, Doc7, Doc12
roundabout   Doc1, Doc5, Doc7, Doc12, Doc15, Doc17
culvert      Doc3, Doc7, Doc8, Doc10, Doc12, Doc13, Doc15, Doc18, Doc19, Doc21

In the example shown above, if a user searched for curb AND roundabout, the results would include Doc7 and Doc12. Arranging the search index in this way makes search efficient and fast. Document metadata is also included in the index, which allows the search engine to weight documents tagged with a matching metadata term as more relevant than documents without the tag. In our example, if roundabout were a metadata tag, the program would boost documents carrying that tag above other results.
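A conceptual Python sketch of the inverted index idea follows (this is not Solr's actual implementation): each term maps to the set of documents containing it, and an AND query becomes a set intersection. The toy document contents are illustrative.

from collections import defaultdict

# Toy documents keyed by ID (contents are illustrative).
docs = {
    "Doc3":  "curb ramp culvert inlet",
    "Doc7":  "roundabout curb approach culvert",
    "Doc12": "roundabout curb illumination",
}

# Build the inverted index: term -> set of document IDs.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

# "curb AND roundabout" is the intersection of the two posting lists.
print(sorted(index["curb"] & index["roundabout"]))  # ['Doc12', 'Doc7']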

NCHRP Web-Only Document 279: Information Findability Implementation Pilots at 23 State Transportation Agencies • Asset type (e.g., material related to traffic signals or culverts), • Mode (e.g., material related to pedestrian and bicycle accommodations), • Project delivery task/deliverable (e.g., material related to producing different components of an Interchange Justification Report) - based on WSDOT’s Master Deliverables List (MDL), • Practical solutions life cycle stage (e.g., material related to scoping), and • Business function (e.g., material related to preventive maintenance, asset data collection or budgeting). Selection of facets was based on potential value for search as well as availability and maturity of controlled vocabularies at WSDOT representing the facet. To assess potential value for search, interviews with selected target users were conducted. Based on the interviews, it was determined that the asset and task-based facets were of most interest. Users were also interested in searching based on certain cross-cutting topics. Several existing WSDOT controlled vocabularies were reviewed to see if they could be used for the test: • Practical Solutions Thesaurus - a thesaurus developed as part of WSDOT’s Practical Solutions effort with the goals of creating a glossary of common terminology across the Department related to various stages of the Practical Solutions life cycle. • Engineering Publications Glossary - a consolidated glossary of terms defined in WSDOT’s engineering publications (including the set of manuals included in the NCHRP 20-97 test effort). Stormwater-related terms from this glossary were extracted to support WSDOT’s stormwater pilot. • Transportation Asset Classification Scheme and Thesaurus - A transportation asset taxonomy developed as part of a prior study by Kent State. • Asset Framework - Preliminary work done by WSDOT’s asset management group to define an asset inventory and component framework. • Engineering Content Management (ECM) Taxonomies – Taxonomies developed supporting a prior effort to implement an engineering content management system at WSDOT. • Drupal Taxonomies – Taxonomies developed supporting WSDOT’s web redesign efforts. • WSDOT’s Master Deliverables List (MDL) – A tool used to create specific project work breakdown structures including milestones and deliverables for construction project development. This review concluded that: • The MDL provides a robust basis for a project delivery/task-based facet. The first two levels of the MDL are shown in Table 6. • The Kent State asset thesaurus is a useful resource – in particular, the effort to build this resource included application of text mining to identify a rich body of asset-related terms. However, this thesaurus has not been sufficiently validated at WSDOT. Ongoing work to create an asset inventory and component framework is still in early stages. These two efforts will need to be harmonized at a future date.

• The Practical Solutions Thesaurus and the Engineering Publications Glossary also provide useful resources, but they primarily consist of terms and definitions and contain very few synonyms (equivalent terms) or relationships across terms.
• The ECM taxonomies defined "disciplines" that were similar in some respects to the clusters that came out of the cluster analysis. These disciplines could potentially be used to create a subject area taxonomy. However, there are varying levels of detail and approaches to classification within each discipline. See Table 7 for the ECM categories.
• The Drupal taxonomy included facets for regions, mountain passes, project categories, project phases, WSDOT organizational units, and modes. The modes facet can provide an initial starting point for manual content classification, though it is fairly high level, consisting of seven categories: Aviation, Bike, Highways, Ferries, Public Transportation, Rail, and Walk.

Table 6. WSDOT's Master Deliverables List (First Two Levels)

Each process group is listed with its component elements:

• Project Management: Project Management Plan Development & Maintenance; Consultant Administration; Community Engagement & Public Involvement; Cost Risk Estimating & Management; Value Engineering
• Project Scoping: Preliminary Estimates & Schedules for Scoping; Agency & Tribal Coordination for Planning Studies; Existing Conditions Inventory & Analysis for Planning Studies and Scoping; Improvement Options Development & Assessment for Planning Studies; Planning Report; Project Profile (Project Summary)

• Design: Project Delivery Method Selection; Project Data, Survey Data, and Base Map; Interchange Justification; Access Control; Materials (Roadway); Geotechnical; Bridge and Structures; Roadway Geometrics and Plans; Hydraulics/Drainage; Partnerships; Railroad Facilities Plans; Roadside Restoration and Site Development; Traffic Analysis; Traffic Design & Plans; Utilities; Work Zone Traffic Control - Design & Plans; Design Documentation; R/W Base Map and R/W Plans
• Environmental Review and Permitting: Endangered Species Act Compliance; Section 106 & EO 05-05 Compliance; Discipline Reports; NEPA/SEPA Compliance; Environmental Permits; Environmental Commitment File
• Design-Build Procurement: Design-Build Contract Package; Statement of Qualification Phase; Proposal Phase
• Plans, Specifications & Estimates: Contract Plan Sheets Preparation; Contract Specifications Development; Construction Estimate Development; Construction Permits; Constructability Reviews; PS&E Reviews; Project Shelf; Contract Ad & Award

• Real Estate Services: Appraisal/Administrative Offer Summary; Review & Determination of Value; Acquisition; Relocation/Relocation Review Board and/or Adjudicative Hearings; Property Management; Condemnation/Possession & Use; R/W Certification
• Construction: Construction Engineering; Construction Milestones

Table 7. WSDOT's ECM Taxonomies

Each discipline is listed with its categories:

• Agreements: Architectural & Engineering Services; Construction; Developer Services; Information Technology; Inter-Agency; Leases & Rentals; Personal Services; Purchased Services & Goods; Railroad; Rates; Specialty Group Internal Agreements; Utilities
• Buildings: Architectural; Electrical; Foundations; Mechanical; Superstructures
• Bridges & Structures: Design Documentation; Plans, Specifications and Estimates
• Construction Management: Construction Administration; Payroll and Other Confidential Information
• Environmental: Archaeological & Other Confidential Information; Cultural Resources; Endangered Species Act; Hazardous Materials; NEPA-SEPA; Permits; Public Lands - Section 4(f) & 6(f)

• NEPA-SEPA: Air; Climate Change; Coastal Areas & Shorelines; Ecosystems; Energy; Environmental Justice; Farmland; Floodplain; Geology & Soil; Groundwater; Hazardous Materials; Historic - Cultural - Archaeological Resources; Land Use; Noise; Public Services & Utilities; Social & Economic; Surface Water; Transportation; Visual Impacts; Water Resources; Wetlands; Wild & Scenic Rivers; Wildlife Fish & Vegetation
• Geotech: Geotech
• Hydraulics: Hydraulics
• Landscape Architecture: Irrigation; Mitigation & Roadside Design; Plan Establishment; Site Design; Visual Quality
• Materials: Materials
• Project Administration: Budgeting; Change Management; Contractor Payments; Cost Estimating; Cost Performance; Document & Content Management; Financial Planning; Funds Management; Legal Documents; Project Management Planning & Procedures; Project Reporting; Regional & Statewide Programming; Risk Management; Scheduling; Scoping; Trend Analysis; Vendor Payments (Accounts Payable); Work Order Accounting; Workforce Planning

• Project Design: Design Documentation; Plans, Specifications and Estimates
• Public Involvement: Administrative - Internal Communications; Informational Materials & Web; News & Media; Public Outreach & Responses
• Real Estate and Right of Way: Acquisition; Appraisal; Property Management; Relocation
• Survey Photogrammetry: Aerial Photography; Computer Aided Engineering; Photogrammetry & Remote Sensing; Survey
• Traffic Services: Analysis; Illumination; ITS; Signals; Signing; Pavement Marking
• Utilities and Railroads: Utilities; Railroads

Based on the input from target users and the review of available vocabulary resources, the research team selected three initial facets for auto-classification: asset, Master Deliverables List, and subject. Given the limited resources for the test and WSDOT's interest in producing a stormwater-related pilot, we decided to focus the asset classification on culverts and the subject classification on stormwater Best Management Practices (BMPs). Definitions of the selected facets and the scope of terms included are shown in Table 8.

The research team collaborated with WSDOT staff to select the three facets and the terms that describe the content in each facet. The terms were sourced, in part, from the results of the text mining and cluster analysis, the TRT Thesaurus, and the stormwater glossary. These terms were reviewed with subject matter experts to verify the relationships among terms and to add meaningful words and phrases to the vocabulary. WSDOT staff confirmed that the terms were applicable to the pilot.

Table 8. Selected Facets and Scope for Rule Development

• Asset (scope: Culvert): A pipe or concrete box structure that drains open channels, swales, or ditches under a roadway or embankment.
• Master Deliverables List (scope: Master Deliverables): The Master Deliverables List defines a three-level hierarchy of:
  ‒ Processes (level 1): actions or steps to achieve an objective
  ‒ Work Groups (level 2): specialty groups or major elements of work that will achieve milestones and produce deliverables
  ‒ Milestones (level 3): a major point in a timeline that serves as a reference point for major project decisions or events
  ‒ Deliverables (level 3): a product or service produced or offered as part of project development
• BMPs (scope: Stormwater Management and Treatment BMPs): The schedules of activities, prohibitions of practices, maintenance procedures, and structural and/or managerial practices that, when used singly or in combination, prevent or reduce the release of pollutants and other adverse impacts to state waters.

Using an Ontology for Auto-Classification

In the research phase of NCHRP 20-97, a commercial text analytics package was used to create a set of rules for assigning categories to documents in the corpus. Each commercial package uses a different proprietary language for creating rules. In this implementation phase, only open source products are being used, to demonstrate methods that can be easily adapted to different transportation agency environments. The GATE open source tool and its JAPE rule language were evaluated for this purpose. However, the research team decided to go in a different direction for the WSDOT test, one that would be easier to implement and integrate with search tools. Rather than writing rules, we created ontologies for each of the selected sets of terms, including equivalent terms (synonyms) and related terms. Rules for auto-classification were then generated automatically from the ontology entries.

Like a taxonomy, an ontology describes the relationships among terms. In a taxonomy, parent and child relationships are arranged hierarchically. The relationships among terms in a taxonomy are "kind of" relationships or "is a" relationships.

Thus, a culvert is a "kind of" drainage asset or, equivalently, a culvert "is a" drainage asset. These types of parent-child relationships occur throughout the hierarchy.

Ontologies, however, offer more ways to describe the relationships among terms. Example relationships from the ontology used for the culvert branch are documented in Table 9. These relationships further define the attributes of a culvert and have a different meaning than a parent-child relationship in a taxonomy.

Table 9. Relationships among Terms in an Ontology

• Has a (hasA): A culvert "has a" end treatment
• Has shape of (hasShapeof): A culvert "has shape of" circular culvert
• Is part of (isPartof): An apron "is part of" a culvert
• Is made of (isMadeof): A culvert "is made of" concrete; a culvert "is made of" metal

For the pilot application of the ontology, Protégé was selected as the authoring tool used to create the ontology and to define the relationships among terms. Protégé is a free, open source ontology editor that provides a framework for building semantically rich systems.
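The relationships in Table 9 are ultimately subject-predicate-object statements. As a minimal sketch (not the pilot's actual Protégé output), the snippet below expresses a few of them as RDF triples using Python's rdflib package; the namespace and term names are hypothetical.

from rdflib import Graph, Namespace, RDF, RDFS, OWL

# Hypothetical namespace for the culvert ontology; illustration only.
WSDOT = Namespace("http://example.org/wsdot/ontology#")

g = Graph()
g.bind("wsdot", WSDOT)

# Taxonomy-style "kind of" relationship: a culvert is a kind of drainage asset.
g.add((WSDOT.Culvert, RDF.type, OWL.Class))
g.add((WSDOT.Culvert, RDFS.subClassOf, WSDOT.DrainageAsset))

# Richer ontology relationships, mirroring Table 9.
g.add((WSDOT.Culvert, WSDOT.hasA, WSDOT.EndTreatment))
g.add((WSDOT.Culvert, WSDOT.hasShapeOf, WSDOT.CircularCulvert))
g.add((WSDOT.Apron, WSDOT.isPartOf, WSDOT.Culvert))
g.add((WSDOT.Culvert, WSDOT.isMadeOf, WSDOT.Concrete))
g.add((WSDOT.Culvert, WSDOT.isMadeOf, WSDOT.Metal))

# Serialize to Turtle to inspect the triples.
print(g.serialize(format="turtle"))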

Culvert Ontology

A culvert ontology was created based on material in the WSDOT Hydraulics Manual, the FHWA Hydraulic Design Series No. 5, and the WSDOT Engineering Publications Thesaurus. This ontology has four major branches: culvert shapes, culvert parts, culvert materials, and culvert end treatments. The initial version of the ontology includes hierarchical relationships across terms within these branches. Term definitions, equivalent terms, synonyms, and related terms were reviewed with subject matter experts. Figure 6 illustrates the top level of the ontology as displayed within Protégé.

Figure 6. Culvert ontology.

MDL Ontology

The MDL ontology was created by directly loading the MDL terms into Protégé. The relationships described in this node of the ontology are "kind of" relationships and equivalence relationships. For example, Design deliverables are a "kind of" Master Deliverable. Also, each Master Deliverable has an equivalent numeric designation; for example, Access Control is equivalent to D40. These equivalence relationships need to be documented in Protégé so that users can search by deliverable name or number. Once the parent-child relationships and the equivalent terms are entered, this node of the ontology can be used to classify subsections of the engineering manuals.
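Continuing the earlier hypothetical rdflib sketch, the name/number equivalence for deliverables could be recorded as shown below, so that a search for either "Access Control" or "D40" resolves to the same concept. The equivalentTerm property and the label choices are illustrative, not the pilot's actual encoding.

from rdflib import Graph, Literal, Namespace, RDFS

WSDOT = Namespace("http://example.org/wsdot/ontology#")  # hypothetical namespace

g = Graph()

# Access Control is a kind of Design deliverable, which is a kind of Master Deliverable.
g.add((WSDOT.AccessControl, RDFS.subClassOf, WSDOT.DesignDeliverable))
g.add((WSDOT.DesignDeliverable, RDFS.subClassOf, WSDOT.MasterDeliverable))

# Record the deliverable name and its equivalent numeric designation,
# so a search index built from the graph matches either form.
g.add((WSDOT.AccessControl, RDFS.label, Literal("Access Control")))
g.add((WSDOT.AccessControl, WSDOT.equivalentTerm, Literal("D40")))

# Collect every literal under which this deliverable should be findable.
labels = [str(o) for _, _, o in g.triples((WSDOT.AccessControl, None, None))
          if isinstance(o, Literal)]
print(labels)  # ['Access Control', 'D40']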

Stormwater BMP Ontology

A Stormwater BMP ontology was created based on material in the WSDOT Temporary Erosion and Sediment Control Manual, the WSDOT Highway Runoff Manual, and the Washington State Department of Ecology Stormwater Manual for Western Washington. This ontology has two major branches: temporary BMPs (used to manage impacts during construction) and permanent BMPs (used on an ongoing basis to manage stormwater impacts). A draft version of the Stormwater BMP ontology was reviewed with subject matter experts, who validated the structure and provided additional equivalent and related terms.

Tagging Process

The engineering manual content was automatically tagged using the terms in the ontologies described in the previous section. Each "chunk" or snippet of text was analyzed by examining the words in the chunk and comparing them to the words and concept relationships in the ontology.

Seven integrated technical components were used in the tagging process. Six of these components are open source tools that are freely available and supported by a community of users. One (Taggr) is a custom-built integration component. Table 10 describes the functionality of these components.

Table 10. Pilot Technical Components

• Drupal: Drupal is an open source web content management system. Drupal was selected for this pilot because it has a Book module. The Book module allows a book to remain intact so that users can continue to browse and read it via a table of contents. Also, each page is a separate object, allowing it to be indexed, categorized, and displayed as a unique page in the book.
• MySQL: MySQL is an open source relational database that uses the Structured Query Language (SQL) to communicate with a database. SQL is used to insert data into a database, delete data from a database, query a database to find data, retrieve data from a database, and manage access to a database.
• Protégé: Protégé is a free, open source platform used to construct knowledge domain models and knowledge-based applications with ontologies. Protégé is the tool that creates the ontologies by describing the logical relationships among terms.

• Apache Jena: Apache Jena is an open source application that uses the ontology created in Protégé to find and extract term relationships from the engineering manuals. When it finds a word or concept in an engineering manual, Jena creates and stores a triple that describes the relationship. Examples of triples are: a culvert "is a kind of" drainage asset, or a culvert "is made of" concrete. Jena stores these relationships in a triple store, a type of graph database that stores semantic facts. A graph database contains a network of semantic facts with links between them, which makes a triple store the preferred container for managing highly interconnected semantic data.
• Drupal Taggr Module: Taggr is a custom-built utility that integrates the various components of the application. It transfers data about the meaning of a piece of content, matches document content with the concept relationships described in the ontology, and writes the tags to the Drupal form that stores metadata about the document.
• Google Cloud NLP: Google Cloud NLP reveals the structure and meaning of text through pretrained machine learning models. The application contains prebuilt modules that extract facts from sentences, identify parts of speech, identify entities such as persons, locations, events, products, and media, and classify content into predefined or proprietary categories. In this application, Google NLP is used for lemmatization (reducing different forms of a word to a common base), removing duplicate words, and identifying meaningful terms for matching to the ontology.
• Solr: Solr is an open source enterprise search platform written in Java. Its major features include full-text search, hit highlighting, faceted search, real-time indexing, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling. Providing distributed search and index replication, Solr is widely used for enterprise search and analytics use cases and has an active development community with regular releases.

The process for tagging content in the WSDOT application is illustrated in Figure 7. Two process flows converge when a document is classified. The first process flow starts with the design of the ontology in Protégé. Authoring an ontology starts with building a taxonomy, which is represented by parent-child relationships of classes. In an ontology, hierarchical relationships among classes are the same as the parent-child relationships in a taxonomy.

The second process flow starts when a new content object is added to the corpus. The content is uploaded to the MySQL database, which triggers a workflow that applies tags to the document using the class relationships described in the ontology.

Figure 7. Content classification flowchart.

Apache Jena converts the word and concept relationships described in the ontology to subject-predicate-object triples. It stores the triples in a database called a triple store.
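Once the triples are in a store, candidate tags can be found with a graph query. The fragment below is a minimal sketch of such a lookup using rdflib's SPARQL support against the hypothetical graph built earlier; Jena itself is a Java framework, so this Python fragment stands in for the equivalent Jena query.

# Assumes the rdflib Graph `g` and hypothetical namespace from the earlier sketches.
query = """
PREFIX wsdot: <http://example.org/wsdot/ontology#>
SELECT ?predicate ?value
WHERE {
    wsdot:Culvert ?predicate ?value .
}
"""

# Each row is one semantic fact about culverts; these facts drive tagging.
for row in g.query(query):
    print(row.predicate, "->", row.value)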

To understand how these triples are used, it is necessary to understand the preprocessing of the text in each "chunk." Figure 8 shows an example of how preprocessing occurs.

Figure 8. Culvert Design Approach.

When a "chunk" of content is ready for classification, it is uploaded to the MySQL database. Uploading a chunk of content triggers a sequence of steps that preprocesses the textual data and eventually classifies the chunk with terms in the ontology.

Taggr is the application that processes the content and manages the flow of information as the chunk makes its way through the classification workflow.

The first processing step occurs in Taggr: Taggr changes all uppercase letters to lowercase letters. Using lowercase letters removes ambiguity about a term and presents all terms with equal weight.

Next, Taggr routes the "chunk" of content to the Google NLP engine. Google NLP lemmatizes all the words in the chunk to create word stems and scores each word according to its salience in the chunk. The output from Google NLP is a list of word stems that is passed back to Taggr. Taggr continues the processing by removing duplicate terms and dropping low-salience terms. The output of this step is a list of lowercase, high-salience words and phrases that represent the meaning of the chunk.

Taggr then passes the chunk of content to Jena, where the list of words and phrases is matched against triples from the ontology. Jena finds the matches with the triples in the ontology and routes these terms back to Taggr. The matched terms are the tags that are applied to the content in the MySQL database.

From the example in Figure 8, it is apparent that the key phrase in this document that drives the asset classification is "open bottom or full culverts." The triples from the ontology assert the following facts:

• Culvert "isA" Drainage Asset
• Drainage Asset "isA" Asset
• Culvert "hasShapeof" Open Bottom Culvert
• An equivalence relationship exists between Open Bottom Culvert and Bottomless Arch
• A "see also" (related term) relationship exists between Drainage Asset and Drainage Channel

Matching the terms with the triples in Jena creates the tags; matched terms become the tags that Taggr appends to the chunk of content in the MySQL database. In this case, the classification for Open Bottom Culvert is: Asset | Drainage Asset | Culvert | Shape | Open Bottom Culvert. In addition to these parent-child relationships, Protégé defines equivalence and see-also relationships. "Bottomless Arch" is equivalent to "Open Bottom Culvert" and is therefore added as a tag. Similarly, the related term "Drainage Channel" is a see-also reference for "Drainage Asset" and is added as a tag.
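The preprocessing and matching steps described above might look roughly like the sketch below, which uses the Google Cloud Natural Language Python client. This is a simplified stand-in for Taggr with a hypothetical flat term list; the salience cutoff and matching logic in the actual pilot are more involved.

from google.cloud import language_v1

# Hypothetical flat list of ontology terms; the pilot matches against Jena triples.
ONTOLOGY_TERMS = {"culvert", "open bottom culvert", "drainage asset"}
SALIENCE_CUTOFF = 0.01  # illustrative threshold for dropping low-salience terms


def extract_candidate_terms(text: str) -> list[str]:
    """Lowercase the chunk, extract salient entities, and drop duplicates."""
    client = language_v1.LanguageServiceClient()
    document = language_v1.Document(
        content=text.lower(),  # step 1: normalize case
        type_=language_v1.Document.Type.PLAIN_TEXT,
    )
    response = client.analyze_entities(request={"document": document})

    seen, candidates = set(), []
    for entity in response.entities:
        term = entity.name
        # steps 2-3: drop duplicates and low-salience terms
        if term not in seen and entity.salience >= SALIENCE_CUTOFF:
            seen.add(term)
            candidates.append(term)
    return candidates


def match_to_ontology(candidates: list[str]) -> set[str]:
    """Step 4: candidate terms that match the ontology become tags."""
    return {term for term in candidates if term in ONTOLOGY_TERMS}


tags = match_to_ontology(extract_candidate_terms("An open bottom culvert ..."))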

Testing the Tagging

As in any development project involving software, testing and confirming the accuracy of the automated tagging results is a critical part of the classification process. The results are validated using a process that compares manually classified documents with documents classified by the computer. Figure 9 illustrates the validation process.

Figure 9. Content tagging validation process.

Testing the ontology is done at the chunk level by comparing manually classified "chunks" of content with auto-classified content: the tags applied by the computer should match the tags assigned manually. This comparison of automated against manual classification should be conducted quarterly to ensure that the combination of tools required to tag content is doing so accurately.
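One simple way to quantify this comparison is to compute precision and recall of the automated tags against the manual "gold" tags. The sketch below is purely illustrative; the pilot's validation process was a manual review rather than a scripted metric.

def tag_accuracy(manual_tags: set[str], auto_tags: set[str]) -> tuple[float, float]:
    """Precision: share of automated tags that are correct.
    Recall: share of manual tags the tagger found."""
    agreed = manual_tags & auto_tags
    precision = len(agreed) / len(auto_tags) if auto_tags else 0.0
    recall = len(agreed) / len(manual_tags) if manual_tags else 0.0
    return precision, recall


# Hypothetical example: one manual section tagged two ways.
manual = {"culvert", "drainage asset", "open bottom culvert"}
auto = {"culvert", "drainage asset", "bottomless arch"}
print(tag_accuracy(manual, auto))  # (0.667, 0.667)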

Round 2 Ontology Modifications

In round 2, the original ontology was modified to add four new facets: drainage system elements, materials, project development topics, and Practical Solutions Lifecycle phase. In addition, the Master Deliverables List facet was modified to focus on a more limited set of deliverables, and relationships across terms were refined. Several other modifications were made to the ontology to add synonyms and further specify relationships across terms. The following types of relationships were included in the final ontology:

• IsKindOf: used generally to represent subclasses. Example: a Drainage Asset is a kind of Asset.
• IsMadeOf: used to represent material composition. Example: a culvert is made of concrete.
• IsEndTreatmentOf: used within the culverts facet to represent end treatments. Example: a Projecting End Section is an end treatment of a Culvert.
• IsPartOf: used within the culverts facet to represent part-whole relationships. Example: an Invert is part of a Culvert.
• IsElementOf: used to represent the different elements of a system. Example: a Biofiltration Swale is an element of the Collection and Conveyance portion of a Drainage System.
• IsWorkStepOf: used within the Practical Solutions Lifecycle facet to indicate that the different phases are work steps of the entire lifecycle.
• IsDeliverableOf: used within the Master Deliverables facet to indicate relationships between deliverables and their parent topics.

Figure 10 illustrates some of the relationships in the ontology.

Figure 10. Illustration of ontology relationships.

Automated tagging for the drainage system elements and materials facets was implemented using the techniques developed in round 1. Automated processes for tagging the Project Development Topic and Practical Solutions Lifecycle facets were not implemented; these facets contain terminology that is general in nature and would require a combination of rule-based, machine learning-based, and manual tagging to be effective. The final updated set of facets and terms is provided in Appendix A.

Drainage System Elements Facet

Based on interviews with WSDOT's stormwater experts, a new facet for drainage system elements was created. This new facet includes many of the same items as the original culvert asset and BMP facets, but provides a functional view of the drainage system. It consists of three top-level terms representing the key types of functional components of a drainage system:

• Collection and conveyance: drainage system elements that collect water and convey it to another location
• Energy dissipation: drainage system elements that help to limit erosion by reducing flow velocity
• Storage and dispersion: drainage system elements that provide temporary or permanent storage for water or cause water to be spread over a wide area

Materials Facet

In round 1, terms representing culvert materials were incorporated into the asset facet. In round 2, a separate materials facet was created, and relationships were created to identify which materials could be used for culverts (or parts of culverts). Synonyms for materials were created to reflect common abbreviations (e.g., PVC for polyvinyl chloride).

Project Development Topics Facet

One goal of the round 2 tests was to add a topic facet to the manuals set that leveraged the results of the cluster analysis conducted in round 1. Rather than using the clusters directly to create a list of topics, the research team reviewed existing WSDOT documents to see if there was a topic list already defined and in use. We reviewed the WSDOT Deliverables Expectation Matrix, which defines the deliverables expected in each project phase for different topic areas. As illustrated in Table 11, these topics aligned closely with the topics identified in the cluster analysis. We created the project development topics facet based on the topics already defined in the Deliverables Expectation Matrix.

Table 11. Match between Cluster Topics and Deliverables Expectation Matrix Topics

Each WSDOT Deliverables Expectation Matrix topic is listed with its corresponding NCHRP 20-97 cluster analysis topic(s):

• Access Management: 13. Access Management
• Channelization and Pavement Marking Plans: 5. Traffic Engineering; 8. Traffic and Safety
• Community Engagement: 3. Environmental/Project Review
• Cost Risk Estimating Management: 4. Estimating
• Environmental Review, Permitting, & Documentation: 3. Environmental/Project Review; 15. Development Review; 17. Traffic Impact Mitigation
• Geotechnical Recommendations: 2. Structures/Geotechnical
• Hydraulics - Water Quality: 9. Stormwater; 12. Hydraulics
• Illumination, Signals, ITS: 5. Traffic Engineering
• Pavement: 2. Structures/Geotechnical
• Right Of Way: 16. Right of Way

• Roadway Geometrics and Plans: 11. Plans; 18. Geometrics
• Safety Analysis: 8. Traffic and Safety
• Signing: 1. Roadside; 8. Traffic and Safety
• Specifications: 7. Construction Contracts
• Structures: 2. Structures/Geotechnical
• Survey & Mapping: 14. Survey
• Temporary Erosion and Sediment Control: 9. Stormwater
• Timelines Actions and Purpose: 6. Consultant Services
• Utilities and Railroad: 10. Utilities
• Work Zone Traffic Control: 5. Traffic Engineering; 8. Traffic and Safety

Because the topic names are quite general, initial testing of the term-matching method used for the other facets generally did not yield good results. Using the full multi-word phrases resulted in under-tagging of manual sections; matching manual content on individual words (e.g., work, zone, traffic, and control for "work zone traffic control") resulted in over-tagging. Using an ontology for classification works well when the content matches the words or concepts in the ontology, but the ontological approach is less accurate when the terms are general or ambiguous.
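This tradeoff can be seen in a toy sketch: exact phrase matching misses sections that paraphrase the topic, while any-word matching fires on unrelated sections. The function names and example sentences below are purely illustrative.

TOPIC = "work zone traffic control"

def phrase_match(section_text: str) -> bool:
    # Under-tags: misses sections that paraphrase the topic.
    return TOPIC in section_text.lower()

def any_word_match(section_text: str) -> bool:
    # Over-tags: fires on any section containing a common word like "work".
    words = set(section_text.lower().split())
    return bool(words & set(TOPIC.split()))

relevant = "Develop traffic control plans for work zones."
unrelated = "Control noxious weeds in the work area."
print(phrase_match(relevant))     # False: a relevant section is missed
print(any_word_match(unrelated))  # True: an unrelated section is tagged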

Several other approaches to tagging could be considered for the future:

• Manual tagging: this would likely be the most efficient approach for the WSDOT application, given the size of the corpus and the number of terms to be assigned. It would be a relatively straightforward task to go through each manual chapter and scan for the 20 project development topics.
• Tagging based on clusters: this approach could leverage the cluster analysis from round 1. One could simply tag the manual sections based on the mappings shown in Table 11; for example, any section assigned to the "Utilities" cluster could be tagged with the "Utilities and Railroad" topic. A limitation of this approach, however, is that the cluster analysis assigns each section to only a single cluster, when in fact a given section could include content relevant to multiple topics.
• Tagging based on rules: one could develop a set of rules that check for words and phrases signaling each topic. The results from the cluster analysis indicating common words could provide a starting point.
• Tagging based on a machine learning model: one could create a training set of representative documents for each of the 20 topics and then build a model that checks for similar content. This would not be an efficient approach for the WSDOT manuals application, but it might be worth exploring if WSDOT elected to expand the contents of the site. Another scenario that could make this approach worthwhile is if different state DOTs collaborated to create models for different topics and then shared these models across agencies. Under this scenario, documents with agency-specific language would need to be excluded from the training sets.

Practical Solutions Lifecycle Facet

The Practical Solutions Lifecycle facet includes the list of lifecycle stages provided by WSDOT: Establish Policy Framework, Identify Needs, Assess Alternative Strategies, Refine Solutions, Assign Resources, Develop Funded Solutions, Implement Solutions, and Manage System Assets. As was the case for the project development topics, the Practical Solutions Lifecycle phases are too general and abstract for the ontology-based tagging approach. WSDOT agreed that a manual tagging approach for this facet is the most appropriate, at least in the short term.

Use Case Analysis

The final task of the round 2 test at WSDOT was creating a video that demonstrates the value of the site. This video is provided under separate cover. Background work to identify user types and user scenarios provided the groundwork for the video. This work is described below.

The approach to identifying use cases began with identifying the different categories of users who might take advantage of the integrated manuals site. The following user types were identified through review of WSDOT's website personas and feedback from WSDOT manual stewards, manual users, and members of the Manual Modernization project team:

• Expert Design Engineer: a senior engineer, working in a region or at Headquarters, fielding questions and responding to public disclosure requests
• Early Career Design Engineer: working on projects, seeking information on design options, requirements, and constraints
• WSDOT Litigation Specialist: researching policies and guidance in place related to legal actions
• WSDOT Policy Developer: exploring development of new policies, seeking to understand what is already in place
• Business Partner (Rule Seeker): developer or local agency staff, seeking to understand WSDOT requirements

Representatives of each user type were identified, and interviews were conducted to obtain examples of search scenarios as well as feedback on the site. The following scenarios were offered by the users.

• Expert Design Engineer
  ‒ Advising a Design-Build project team about how steep they could make the side slope of a ditch to fit within the available right of way. (Design, Highway Runoff, Hydraulics)
  ‒ Advising a design engineer on the design method for ditch lining; the engineer would have looked in the Hydraulics Manual, but the most appropriate formula for erosive shear calculation was in the TESC Manual.
  ‒ Advising a Design-Build project team seeking information on floodplain mitigation requirements: Hydraulics Manual, information on jurisdictional floodways; Environmental Manual, floodplain analysis requirements and permits/approvals; Highway Runoff Manual, mitigating loss of hydrologic storage.
• Early Career Engineer
  ‒ Designing a paving project through an environmentally sensitive area, seeking information on barrier selection/design. (Roadside, Design, Highway Runoff)
  ‒ Seeking information on underdrain design. (Design, Hydraulics)
  ‒ Seeking information on interchange design; needed a protection barrier and a way to channelize runoff. (Design, Hydraulics)
  ‒ Seeking information on drainage design for a project with multiple inlets flowing into a stream; needed to calculate flows and evaluate the need for ditches and BMPs. (Design, Highway Runoff, Hydraulics)
• Litigation Specialist
  ‒ Responding to an inquiry following a crash in which a vehicle rear-ended another vehicle that was stopped waiting to turn left; seeking information on WSDOT policies/guidance on when to install left turn bays. (Design Manual, MUTCD)
  ‒ Responding to a suit related to utility damage; seeking information on policies that specify when and where a franchise utility can bury its lines.
  ‒ Responding to a suit related to crop damage from WSDOT maintenance herbicide spray; seeking information on policies related to herbicide spraying operations. (Maintenance and Roadside Policy)
  ‒ Responding to a lawsuit related to property flooding in an area where a highway project was recently completed; seeking information on policies and procedures related to hydraulic design and erosion control. (Environmental, Hydraulics, Highway Runoff, Maintenance)
• Policy Developer (Active Transportation)
  ‒ Investigating creating policies to make crossing the street safer as part of an FHWA Every Day Counts effort; looking for current policies related to the use of street trees for traffic calming, and seeking to identify which manual(s) have relevant existing policy/guidance. (Traffic, Roadside)
  ‒ Investigating current WSDOT policies related to setting target speed, to determine potential changes for improving pedestrian and bicycle safety. (Design and Traffic Manuals)
• Business Partner
  ‒ A developer planning a project to improve a state highway off-ramp for access to their site.

    They want to understand whether they would trigger any stormwater requirements and, if so, what the most cost-effective way to comply is. (Design, Highway Runoff, TESC)
  ‒ A local or federal agency seeking to make an improvement involving both its own roadway and a state highway wants to compare its design requirements against WSDOT's to see which is more stringent.
  ‒ A citizen sees a sign on the highway, "Bicycle must exit I-5," and wants to understand WSDOT policy about when to provide bicycle accommodations on limited access highways.

Example Scripts

The following scripts were developed based on the user scenarios identified. These scripts were used to create a video that demonstrates the value provided by the pilot site.

Scenario 1: Ditch Slope

During a Design-Build project meeting, the team is discussing the design of ditches for the project. They are wondering how steep they can make the side slope of the ditch to fit within the available right of way next to the roadway. They know they do not have room for a normal solution and are looking for design constraints on the maximum tolerable side slope.

a. They search for ditch and discover 1239.03 Side Slopes and Ditches in the Design Manual. This manual section references the Design Clear Zone policy, which provides a different, broader perspective on this design problem than what might have been found by looking in a stormwater-focused manual. The designers perform a clear zone analysis based on the Design Manual guidance and use it to constrain the design. This section also includes guidance on the maximum slope that maintenance prefers for mowing.
b. If they continue looking, the next option is Hydraulics Manual Section 4-3, Ditch Design Criteria, which discusses the maximum recommended ditch slope.
c. Further down the list, Hydraulics Manual Section 5-3.1, Downstream End of Bridge Drainage, discusses the use of check dams for very steep ditches to reduce flow velocities, prevent erosion of the soil, and help trap sediment from upstream sources.

Scenario 2: Ditch Lining

A design engineer is seeking guidance on how to determine the correct design method for ditch lining.

a. They conduct a search for shear.
b. The first result is from the TESC Manual, Section 5-1.1.12, Conveyance Channel Stabilization, and provides a formula for erosive shear calculation. The engineer would normally have consulted the Hydraulics Manual for this question and would have spent a lot of time working through the Hydraulics materials (not well suited to the task at hand) without the ability to do a quick search across the manuals.

Scenario 3: Floodplain Mitigation

During a Design-Build project meeting, a question comes up about floodplain mitigation requirements. The team searches for floodplain.

a. The first page of results is primarily from the Environmental Manual, Chapter 432, Floodplains.
b. If we filter by the Hydraulics Manual, we get:
c. Section 10-6.6, with information on jurisdictional floodways. (Note that the title of Chapter 10 is "Large Woody Material," which provides little clue that one might find information on floodplain mitigation within this chapter.)
d. Now we filter by the Highway Runoff Manual.
e. We see sections from Chapter 2, Stormwater Planning and Runoff Mitigation.
f. So there are relevant results from three manuals (Environmental, Hydraulics, and Highway Runoff), which would have been harder to find without an integrated way to search.

Scenario 4: Underdrain Design

A project has an issue with groundwater and needs an underdrain; the engineer wants to understand policies and guidance related to underdrains.

a. Search for underdrain.
b. The first hit is from Chapter 8 of the Hydraulics Manual, on pipe classifications and materials, which provides basic information on the function, dimensions, and design life of underdrain pipes.
c. Section 6.9 discusses subsurface drainage in general and the different sizing methods for underdrains depending on the application. The second application involves installing underdrains in combination with a BMP or hydraulic feature such as media filter drains, swales, ditches, and infiltration trenches.
d. Use the side filters under BMPs to filter by infiltration trench. The results show specific Highway Runoff Manual sections providing further information on infiltration trenches. (The sketch following this scenario shows what such a facet-filtered request looks like.)
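Under the hood, the filtering used in Scenarios 3 and 4 corresponds to faceted search requests against the Solr index. The sketch below is illustrative only; the core name (manuals) and the facet field names (manual, bmp) are hypothetical.

import requests

# Hypothetical Solr core and facet field names, for illustration only.
SOLR_SELECT_URL = "http://localhost:8983/solr/manuals/select"

params = {
    "q": "text:floodplain",
    "fq": 'manual:"Hydraulics Manual"',  # filter query: restrict to one manual
    "facet": "true",
    "facet.field": ["manual", "bmp"],    # return counts for the side filters
    "fl": "id,title",
    "rows": 10,
}

response = requests.get(SOLR_SELECT_URL, params=params)
results = response.json()

for doc in results["response"]["docs"]:
    print(doc["id"], doc.get("title"))

# Facet counts drive the left-side navigation, e.g., how many hits per manual.
print(results["facet_counts"]["facet_fields"]["manual"])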

Observations

The user interviews and scenario development activity demonstrated that having an integrated, searchable site of content from the eight manuals adds value. Users indicated that a site like this would save them time searching for information. For users not familiar with the different manuals, the site provides a convenient place to find answers to questions without needing to know which manual to consult. For existing manual users, the site provides a broader set of results than they would otherwise have found, encouraging a more holistic perspective for addressing particular design issues.

The left-side navigation facets were not heavily used by the individuals who tested the site. In developing scripts to demonstrate how the site would be used to answer different user questions, we also found limited opportunity to add substantial value through use of the left-side facets. Despite these findings, it would be premature, based on the limited testing and interviewing conducted, to conclude that these facets don't provide value. We hypothesize that once users gain experience using the basic features of the site, they will begin to use the filtering options more frequently.

The fact that users didn't immediately begin using the filters underscores the importance of up-front design work involving target users. An iterative approach (such as the one WSDOT employed) that provides users with an opportunity to test a pilot system is extremely helpful for eliciting information needs and informing the design of a production product. WSDOT's implementation plan (below) identifies several aspects of the user interface design and the ontology tagging process that can be improved if WSDOT chooses to move forward with implementing the manuals site in production.

One suggested future enhancement is to implement the ability to navigate the hierarchy within each facet, which would likely encourage more use of these facets. The current site provides only a flattened-out version of the terms in each facet, allowing a user to select only the endpoint of a taxonomy branch rather than the parent terms. Search and navigation could be improved by allowing users to navigate up and down the hierarchy to find the most relevant result set for their purposes.

Several lessons about the ontology tagging process were learned through the testing. These are highlighted below:

• Particular care is needed with respect to incorporating synonyms within the ontology: when a synonym (e.g., an acronym) has more than one meaning, there may be unintended consequences, and irrelevant sections may be tagged.
• The Solr search engine can be configured to boost the relevance of sections that contain the query term in the section title and tag fields (see the sketch following this list). Synonyms can also be specified for use at search query time to supplement those included in the ontology and used in the tagging process.
• Term-matching algorithms need to balance the handling of multi-word phrases in the ontology: matching on the entire phrase may result in under-tagging, whereas matching on each word of the phrase results in over-tagging. For the WSDOT pilot, we erred on the side of under-tagging and required a match on the entire multi-word phrase.
• The tagging process implemented for the WSDOT site uses Google NLP to filter ontology matches based on extracted terms. The purpose is to limit the sections tagged to those that are most "about" the selected ontology term. However, the entities identified by Google NLP do not always match the ontology terms. For example, Google NLP may extract the term "storm sewer systems" but not "storm sewer," which is the ontology term. The result is under-tagging of sections. Use of Google NLP's salience scores further limits which terms are matched to the ontology; salience scores are assigned based on part-of-speech analysis and machine learning models. Because of this concern, salience cutoff values were not used for the final tagging process. While Google NLP was adequate for the purpose of creating a pilot, a production development effort should consider other available NLP tools and build in the effort needed for testing and configuration to work well with the specific body of content included.
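As a minimal sketch of the boosting configuration described above, the request below weights matches in the title and tag fields above matches in body text. The core name, field names, and weights are hypothetical; query-time synonyms would be configured separately in the Solr schema (e.g., via a synonym filter reading a synonyms list), so that a query for "bottomless arch" also matches "open bottom culvert."

import requests

# Hypothetical core, field names, and weights, for illustration only.
SOLR_SELECT_URL = "http://localhost:8983/solr/manuals/select"

params = {
    "defType": "edismax",
    "q": "open bottom culvert",
    # Matches in the section title count 3x, tag matches 2x, body text 1x.
    "qf": "title^3 tags^2 text^1",
    "fl": "id,title,score",
    "rows": 10,
}

response = requests.get(SOLR_SELECT_URL, params=params)
for doc in response.json()["response"]["docs"]:
    print(doc["score"], doc.get("title"))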

2.4 Washington State DOT Implementation Plan

The purpose of the implementation plan is to provide WSDOT with a roadmap for future development and application of the techniques demonstrated in this test.

Introduction

WSDOT participated as a test agency for NCHRP 20-97: Improving Findability and Relevance of Transportation Information. The WSDOT test was designed to add value to a parallel pilot at WSDOT to build an interactive, integrated body of engineering manual content for eight manuals with stormwater-related content. The NCHRP 20-97 test consisted of:

• Automated content preparation, or "chunking," of existing PDF manuals into HTML files representing manual sections and subsections;
• Text analysis, including text mining to identify common terms and cluster analysis to identify common themes and interconnections across different manual sections; and
• Demonstration of the use of an ontology (a semantic model representing the formal naming and definition of the categories, properties, and relations between the concepts, data, and entities in a manual) to tag manual sections and facilitate search and navigation. The ontology developed includes drainage assets, drainage system elements, master deliverables (from WSDOT's Master Deliverables List), and stormwater BMPs.

Text analysis was conducted on an expanded list of 18 manuals, 10 more than the eight included in the pilot, to build a broad sample of words and concepts in the engineering manuals.

The implementation plan is structured into four workstreams, summarized in Table 12 and described below.

Table 12. WSDOT Implementation Plan Overview

A. Continue to Advance WSDOT's Vocabulary, Metadata Management and Text Analytics Capabilities
• A1. Establish a vocabulary and metadata integration strategy: Determine WSDOT's future approach to integrating metadata and vocabulary management for databases, web content, and other content management systems.
• A2. Identify staff responsibilities and build expertise: Through training, coaching, and (where feasible) strategic hires, continue to develop and broaden staff capacity in vocabulary management and text analytics.

• A3. Select and implement vocabulary management and text analytics tool(s): Review available commercial and open source packages; develop requirements; select and procure a solution; train staff.
• A4. Create metadata schema for manual sections: Identify the metadata elements that will provide the standard information vocabulary used to describe the content. The schema ensures that the terms can be consistently applied and reused across an organization and interpreted by human users and other computer applications.

B. Test the Pilot Technology Solution and Specify Enhancements
• B1. Install pilot and update content: Obtain, install, and configure the pilot technologies and content database within WSDOT's environment. Update content for manuals that have changed since the pilot was created.
• B2. Gather user feedback: Continue to seek end-user feedback; compile user stories and reported benefits; compile enhancement suggestions.
• B3. Evaluate architecture and functionality: Test and evaluate each of the components of the solution, including those supporting tagging, search, and navigation.
• B4. Specify enhancements: Based on user feedback and the solution evaluation, identify corrections and enhancements to the pilot solution.

C. Develop and Roll Out Production Solution
• C1. Update requirements: Review and revise the original requirements established for the pilot site.
• C2. Establish development project: Create a project plan and secure development resources.
• C3. Build the production version: Develop and test the solution. Work with the developers to correct classification errors and to enhance the ontology to meet user requirements.
• C4. Establish tagging process: Put a process in place for periodic updates to content tags.

D. Add Content and Adjust Ontology and Facets
• D1. Identify candidate content: Select the next set of manuals (and related documents) to be included in the solution.
• D2. Select and prepare content: Assemble source files; update and apply conversion scripts; convert content to HTML pages.
• D3. Import and validate content: Ingest HTML pages; proofread pages to validate the import; edit files.
• D4. Update text analysis to reflect expanded corpus: Update the text analysis to confirm the meaningful terms and concepts that occur in the corpus; rerun text mining for word and phrase frequency; update the cluster analysis; review results and refine as needed; compare with the TRT and other established controlled vocabularies.
• D5. Expand controlled vocabulary/ontology: Incorporate terms from text mining and cluster analysis; document term relationships; review with subject matter experts. Update the vocabulary management system with the additional terms.

Implementation Activities

A. Continue to Advance WSDOT's Vocabulary, Metadata Management and Text Analytics Capabilities

Maturing toolsets and skillsets for text analytics and establishing an integration strategy will provide a strong foundation for successful implementation and sustainability of the manual modernization effort and other findability improvements.

A.1 Establish a vocabulary and metadata integration strategy

WSDOT has published the report "Words Matter: Managing Vocabulary Resources to Support Productivity." This report recommends implementation of WSDOT's core metadata, continued development of the agency's glossary, taxonomies, and thesaurus, and development of methods to integrate the agency's thesaurus into agency search tools. WSDOT can move forward with these recommendations and specifically address where the "master" source for agency vocabulary (including glossary definitions and related terms) will be maintained, and how this master source will be kept in sync with other systems (e.g., the data catalog, locally maintained taxonomies, synonym lists used for search, and glossaries on the website and in individual manuals). This will involve technical solutions (e.g., new thesaurus management software and integration components) as well as governance and workflow elements.

One option to consider is using Protégé as the master source for agency terminology. This tool can store glossary definitions, synonyms, and other term relationships. A governance process would be established for controlling additions and modifications to the master terminology source. Synchronization processes would be created to push updates to other systems that use or publish terminology, including WSDOT's data catalog and WSDOT's glossary (published on the internet site).

A.2 Identify staff responsibilities and build expertise

Identify the key staff who will lead and support vocabulary management and text analytics activities. Staff with the requisite skills include librarians, content managers, documentation specialists, data managers, records managers, web site managers, and other information technology professionals. Develop training and mentoring plans to allow them to perform these tasks effectively and efficiently. Specific areas of knowledge and expertise include:

• Fundamentals of information retrieval methods and architectures
• Information architecture, including user information interaction patterns, information-seeking behaviors, usability, and design of information organization and navigation systems
• Understanding of the differences between web search and enterprise search
• Search user interface design
• Information classification and taxonomy, thesaurus, and ontology development and uses
• Types and uses of metadata
• Search engine components and mechanics, including crawlers and connectors
• Understanding of relevancy ranking, search tuning, and search engine optimization
• Understanding of indexing methods and the use of inverted indexes
• Familiarity with text analytics and machine learning methods and tools
• Awareness of commercial and open source search and taxonomy management products and features; ability to evaluate their applicability for specific purposes

A.3 Select and implement vocabulary management and text analytics tool(s)

WSDOT is currently using an inexpensive thesaurus management tool called MultiTes. MultiTes is ISO-compliant and fit for purpose but has limitations, particularly concerning exports and integration with other systems. Since MultiTes was adopted as an interim solution, the next steps are to develop a set of requirements for vocabulary management and evaluate alternative solutions. Because available solutions for vocabulary management are increasingly included as part of more comprehensive text analytics packages, it makes sense to broaden the evaluation of possible new tools to include capabilities for text mining, NLP text classification, and semantic search. There are many options, both commercial and open source, at different price points and with different feature sets. A list of text analytics tools was compiled in NCHRP Report 846, Appendix E. NCHRP Project 20-97 tested two commercial products (Smartlogic and Google NLP) and two open source products (Protégé for ontology authoring and the Python Natural Language Toolkit).

The Transportation Research Board has selected PoolParty for managing the TRT, following an evaluation of products conducted as part of NCHRP Project 20-109 and documented in NCHRP Report 874, "The TRT: Capabilities and Enhancements." Appendix C (Dimension E) of that report provides a set of requirements for thesaurus management software, which may also be a useful resource.

A build-versus-buy analysis can be conducted based on the evaluation, leading to selection of a solution (or combination of solutions, as appropriate). Modifications, upgrades, and interfaces to WSDOT's current data catalog should be considered as part of this analysis. If multiple solutions are selected, it will be important to specify and implement integration across the different tools. Following tool selection, members of the vocabulary team and others who will be responsible for maintaining controlled vocabularies and performing text analysis will need to be trained in how to use the tool(s).

A.4 Create metadata schema for manual sections

WSDOT has defined core and extended metadata elements for its content, which cover the full manuals themselves. However, there is a need to define additional metadata to be maintained at the manual section level, i.e., for each HTML page of content. Once identified and defined, these metadata elements can be built into the web content management system data structure. They may include:

Descriptive Metadata Elements
• Section or subsection title
• Author
• Content owner

Administrative Metadata Elements
• Date last modified
• Version number

Facets for Navigation and Filtering
• Topic
• Mode
• Asset
• Master Deliverables
• Stormwater Best Management Practices (BMPs)
• Drainage system elements

Once elements are selected, value domains can be established and documented. For example, if "mode" is selected as a facet, a controlled list of modes can be created. Note that creating and gaining agreement on value domains can be time consuming, and the level of effort will vary by facet. For example, WSDOT's Master Deliverables List is already well defined. On the other hand, further work is needed to establish value domains for assets, building on current and prior Department efforts in this area.
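To make the section-level schema concrete, a record for one manual section might look like the sketch below. The element names follow the lists above; all of the values are hypothetical.

# A minimal, hypothetical metadata record for one manual section (HTML page).
section_metadata = {
    # Descriptive elements
    "title": "1239.03 Side Slopes and Ditches",
    "author": "WSDOT Design Office",
    "content_owner": "Design Manual steward",
    # Administrative elements
    "date_last_modified": "2020-01-15",
    "version": "17",
    # Facets for navigation and filtering (values drawn from controlled lists)
    "topic": ["Roadway Geometrics and Plans"],
    "mode": ["Highways"],
    "asset": ["Ditch"],
    "master_deliverables": ["Roadway Geometrics and Plans"],
    "bmps": [],
    "drainage_system_elements": ["Collection and conveyance"],
}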

B. Test the Technology Solution and Specify Enhancements

B1. Install pilot and update content

Obtain and install the prototype application within WSDOT's environment. The pilot site requires the following technologies to run: Drupal 7.0, MySQL, the Drupal Book module, and Solr. WSDOT already has these technologies in place but may need to establish a separate Drupal database and a separate instance of Solr (or a separate index) with customized configuration options. Once the pilot site is installed, WSDOT will need to test it and resolve any technical issues encountered.

Once the software is installed, WSDOT may wish to update the content within the Drupal Book module so that users testing the pilot see up-to-date material. This is best accomplished by identifying the specific chapters that have been updated since the content was originally loaded and manually cutting and pasting the updated content into Drupal.

To test the tagging process used for the pilot, the following additional technologies are needed: Protégé, Apache Jena, the Drupal Taggr module, and Google Cloud Natural Language (Google NLP). Protégé and Apache Jena are no-cost open source tools. Google NLP is a for-fee service, with fees dependent on usage levels. The Taggr module is a custom software module that integrates Jena, Google NLP, and the Drupal Book module. Testing the tagging process involves updating the ontology in Protégé, exporting the ontology as an .OWL file, uploading the .OWL file into Apache Jena, and running Taggr to match ontology terms to the output of Google NLP and update the tags in Drupal.

B2. Gather user feedback

A limited set of user interviews was conducted as part of WSDOT's manuals pilot. As WSDOT moves beyond the pilot and seeks to expand the content to a new set of manuals, additional interviews can be conducted to better understand how different types of users discover and search for manual content and related guidance. These interviews can be used to prioritize content to be added to the site and to identify additional facets that would facilitate search and navigation. The interviews would cover:

• Use cases: describe situations in which the user seeks out content from one or more manuals.
• User stories: describe a software feature from an end-user perspective, i.e., the type of user, their goal, and why they want to achieve the goal.
• Current discovery and search methods: understand current methods for discovering and searching for content, including search terms.
• Desired discovery and search methods: test ideas for potential future ways of navigating through content and seek input.
• Cross-manual value: identify cases where the user would get value from searching across multiple manuals (as opposed to searching within an individual manual).

B2. Gather user feedback

A limited set of user interviews was conducted as part of WSDOT's manuals pilot. As WSDOT moves beyond the pilot and seeks to expand the content to a new set of manuals, additional interviews can be conducted to better understand how different types of users discover and search for manual content and related guidance. These interviews can be used to prioritize content to be added to the site and to identify additional facets that would facilitate search and navigation. These interviews would cover:

• Use cases – describe situations in which the user seeks out content from one or more manuals.
• User stories – describe a software feature from an end-user perspective, i.e., the type of user, their goal, and why they want to achieve that goal.
• Current discovery and search methods – understand current methods for discovering and searching for content, including search terms.
• Desired discovery and search methods – test ideas for potential future ways of navigating through content and seek input.
• Cross-manual searching – identify cases where the user would get value from searching across multiple manuals (as opposed to searching within an individual manual).

Summarize and validate findings

Summarize key findings of the interviews, and answer the following questions:

• Are there particular clusters of manuals that users want to search horizontally?
• Are there search facets or terms that appear to be common across multiple users?
• Are there other issues that prevent users from finding the information they are seeking that need to be considered?

Review findings with manual stewards, vocabulary team members, and other stakeholders. Discuss:

• Priorities for adding content (based on common clusters)
• Options for adding search facets
• Options for expanding the controlled vocabulary (glossary terms and term relationships)
• Other enhancements to search and navigation

The user stories collected through this process should help WSDOT to document the potential benefits of such a site to the agency and assess whether building a production version of the site and maintaining it over time is worth the investment required. The discussion of new content and enhanced features can be used to create a list of enhancements to be considered for inclusion in the production version.

B3. Evaluate architecture and functionality

Evaluate both the navigation design and the tagging process and determine whether modifications are warranted. In evaluating the tagging process, the following should be noted:

• The entities identified by Google NLP do not always match the ontology terms – for example, Google NLP may extract the term "storm sewer systems" but not "storm sewer," which is the ontology term. Therefore, a full-text search for a given ontology term may find more instances of that term than those that are tagged.
• Google NLP's text analysis feature is not free – the cost of tagging content is calculated on a per-item basis.
• The current process does not take full advantage of the relationships within the ontology (e.g., tagging with parent terms if the text includes any child terms) because of the need to fine-tune logic for different ontology branches to avoid over-tagging. Additional customization of the tagging process could address this issue.
• Commercial solutions (such as PoolParty) are available that incorporate semantic search capabilities leveraging available ontologies. If WSDOT chooses to acquire one of these solutions, the provided capabilities could be leveraged.
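One way to reduce the entity/term mismatch noted above is to normalize both the ontology terms and the extracted entities before comparing them, so that, for example, "storm sewer systems" still matches the ontology term "storm sewer." The sketch below illustrates the idea using stemming and substring containment; the matching rule is a simplification that would need tuning to avoid false positives.

```python
# Illustrative normalization step; a production matcher would need tuning.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def normalize(phrase):
    """Lowercase and stem each token so plural/singular variants compare equal."""
    return " ".join(stemmer.stem(tok) for tok in phrase.lower().split())

def loose_match(ontology_term, entity):
    """Match if the normalized term appears inside the normalized entity."""
    return normalize(ontology_term) in normalize(entity)

print(loose_match("storm sewer", "storm sewer systems"))  # True
print(loose_match("culvert", "storm sewer systems"))      # False
```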

B4. Specify enhancements

Based on the user feedback and the results of the evaluation of the pilot architecture and functionality, identify needed corrections to existing functionality as well as desired enhancements. The following enhancements can be considered:

• Implement the ability to view and navigate parent-child relationships built into the ontology. Currently the facets along the left side of the pilot site do not show the hierarchy of terms. An enhanced navigation feature would enable WSDOT to better leverage the value of the ontology that has been developed.
• Implement "best bets" for common searches to make sure that preferred manual sections appear first in the search results.
• Modify the tagging process to eliminate use of Google NLP – or to replace Google NLP with a customized process for extracting relevant entities for matching.
• Modify the tagging process to better leverage the ontology relationships (see the sketch following this list). For example:
− Tag all "kinds of" drainage assets with "drainage asset"
− Tag all "kinds of" metal with "metal"
− Do not tag "parts of" culverts with "culvert" (since these parts do not necessarily signify that the text is talking about culverts)
• Develop new tagging processes for facets that are not amenable to tagging based on matching of literal ontology terms. For example, the Project Development Topic and Practical Solutions Lifecycle facets contain terms that are too general and too common for the term-matching approach to work. Rule-based or machine learning-based methods can be considered for creating candidate tags for manual sections. These automatically generated tags should be validated by a subject matter expert.

Rule-based auto-categorization can add refinements beyond identifying particular terms and synonyms – such as considering the location of terms within the text, applying different weightings to different terms (reflecting stronger or weaker evidence of a concept), and considering negative evidence as well as positive evidence. Machine learning methods may become applicable as the size of the overall corpus grows; these methods use training sets to train models to look for documents with similar content.
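A minimal sketch of the relationship-aware propagation rules follows. The ontology fragment is hypothetical, and a production version would read the relationships from the ontology itself rather than from a hard-coded table.

```python
# Hypothetical ontology fragment: child term -> (parent term, relationship type).
RELATIONS = {
    "bridge drain": ("drainage asset", "kindOf"),
    "culvert": ("drainage asset", "kindOf"),
    "galvanized steel": ("metal", "kindOf"),
    "end treatment": ("culvert", "partOf"),
}

def propagate_tags(tags):
    """Add parent tags for 'kindOf' children; do not propagate 'partOf'."""
    expanded = set(tags)
    for tag in tags:
        relation = RELATIONS.get(tag)
        if relation:
            parent, rel = relation
            if rel == "kindOf":
                # A kind of X implies the text is about X.
                expanded.add(parent)
            # 'partOf' relations are skipped: mention of a part does not
            # necessarily mean the section is about the whole.
    return expanded

print(propagate_tags({"bridge drain", "end treatment"}))
# -> {'bridge drain', 'end treatment', 'drainage asset'} (set order may vary)
```

This single-level propagation mirrors the bullet examples above; per-branch exceptions could be expressed by extending the relationship table rather than by rewriting the logic.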

C. Develop and Roll Out Production Solution

C1. Update requirements

Update the requirements created for the pilot to incorporate the desired enhancements. Prioritize the functionality to be implemented in the initial production version of the site.

C2. Establish development project

Determine how this functionality will be built (in-house, contract support, or a combination). Procure development services as needed.

C3. Build the production version

Modify the pilot version to add new features. Modify the tagging process to incorporate the modified technology components (if applicable). Conduct system testing and user testing. Migrate the database and application to a production environment. Develop a communication plan that informs users about the new site and demonstrates its capabilities. Identify a contact person to receive questions and comments.

C4. Establish tagging process

Tags will need to be refreshed whenever new content is added or adjustments to the ontology are made. If content is being added continuously, re-tagging should occur nightly or weekly. If content is being added relatively infrequently in large batches, re-tagging can be timed to coincide with the content additions.

Whenever the ontology or the automated tagging processes are modified, the tagging process should be validated by asking subject matter experts to tag selected content and then comparing the auto-generated tags with those assigned by the experts. Based on the results of this process, make adjustments to correct systematic issues. Retest the tagging process to verify that the changes behave as expected. Supplement the automated tagging with manual tagging as needed, with assistance from subject matter experts.
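One simple way to quantify this comparison is to compute precision and recall of the auto-generated tags against the expert-assigned tags for each sampled section, as sketched below with hypothetical tag sets.

```python
def tag_quality(auto_tags, expert_tags):
    """Precision/recall of auto-generated tags against expert-assigned tags."""
    auto, expert = set(auto_tags), set(expert_tags)
    agreed = len(auto & expert)
    precision = agreed / len(auto) if auto else 0.0   # share of auto tags that were right
    recall = agreed / len(expert) if expert else 0.0  # share of expert tags that were found
    return precision, recall

# Hypothetical example for one manual section:
p, r = tag_quality(
    auto_tags={"culvert", "storm sewer", "concrete"},
    expert_tags={"culvert", "storm sewer", "drainage asset"},
)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.67 recall=0.67
```

Low precision suggests over-tagging (e.g., propagating too many parent terms); low recall suggests terms or synonyms missing from the ontology.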

D. Add Content and Adjust Ontology and Facets

These activities can initially be conducted in conjunction with the creation of the production site, and then repeated periodically. Alternatively, the production site can be rolled out with the existing content, and new content can then be added as resources allow.

D1. Identify candidate content

The current pilot includes content for eight of WSDOT's manuals. In conjunction with development of the production site, an expanded set of manuals should be prepared and loaded into the system. WSDOT will need to identify which manuals should be added. Candidates identified through user interviews in the pilot include the traffic manual, the maintenance manual, and the standard plans and specifications manuals.

D2. Select and prepare content

This activity involves the preparation of content for each of the manuals identified for inclusion in the manuals site.

Record metadata for each manual

Initially, many of the metadata elements (e.g., author, content owner, date of the last update) will be the same for each section of a given manual. These metadata elements should be recorded for each manual to be included so that they can be populated within the web content management system.

Prepare the content for conversion

Content will need to be prepared before conversion:

• Find the latest Microsoft Word document(s) for the manuals. If Microsoft Word source files are not available, find the latest Adobe PDF files. Microsoft Word files are preferable, since transforming PDFs to HTML creates additional formatting issues that require manual correction.
• Accept tracked changes in the documents.
• If Adobe PDF files are used, create bookmarks for each subsection to be included.

Decompose the manuals into "chunks"

Chunks are intact sections and subsections of manual content. A script was created as part of the NCHRP 20-97 test to decompose the manuals and output HTML pages. This script should be evaluated, tested, and refined as needed, and the refined script should then be applied to create a set of HTML files from the manuals. Manual "chunking" of some content may be required if the structure of the table of contents, section headings, or other structural components of the document differs from other manuals ingested into the web content management system. Commercial alternatives to the chunking script can also be considered, such as: http://wordflow.info/services/pdf-to-web-conversion/.

Stage the HTML files to be imported into the Drupal content management system. Exclude any chunks that do not need to be included (e.g., boilerplate or comment pages that make sense in the context of a separate manual but not for a body of web content).

D3. Import and validate content

Import the HTML files into the Drupal database

Import the HTML files created from the manual content into WSDOT's Drupal web content management system (CMS) database.

Validate and edit the HTML files

Review each HTML file to make sure that it includes the same text and graphics as the manual source files, and that it presents correctly within Drupal. Edit the files as needed. Note any systematic issues with the import that may be corrected through the initial content preparation steps or further updates to the chunking script.

D4. Update text analysis to reflect expanded corpus

The NCHRP 20-97 test involved using text mining to identify commonly used terms and cluster analysis to identify common themes. This analysis included 18 manuals and can be updated as the content is expanded beyond the initial sample. This analytical step can be accomplished in parallel with the content preparation and import activities (D2 and D3). The following steps provide a high-level overview of this process.

Prepare content for text analysis

Text analysis is best done at the manual or manual chapter level. Each manual chapter must be converted to text format using the Python tools PDFMiner (for PDFs) and docx2txt (for Microsoft Word), or other commercially available tools.

Update the text mining (term extraction)

Use the CountVectorizer class in the Python scikit-learn module to extract features from the text files. Prepare a frequency distribution to identify the most common terms. Review the list to identify the terms that are candidates to be added to a controlled vocabulary.
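A minimal sketch of this term extraction step follows; the file path is hypothetical, and the n-gram range and size of the review list would be tuned during analysis.

```python
import glob
from sklearn.feature_extraction.text import CountVectorizer

# Chapter-level text files produced in the preparation step (hypothetical path).
paths = sorted(glob.glob("chapters/*.txt"))
docs = [open(p, encoding="utf-8").read() for p in paths]

# Extract single words and two-word phrases, ignoring English stop words.
vectorizer = CountVectorizer(stop_words="english", ngram_range=(1, 2))
counts = vectorizer.fit_transform(docs)

# Frequency distribution: total occurrences of each term across the corpus.
totals = counts.sum(axis=0).A1
terms = vectorizer.get_feature_names_out()
for term, n in sorted(zip(terms, totals), key=lambda t: -t[1])[:25]:
    print(f"{n:6d}  {term}")  # review these as controlled-vocabulary candidates
```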

Update the cluster analysis

Starting with the prepared text files, apply tools in the Python Natural Language Toolkit (NLTK) to remove punctuation and stop words and to perform stemming (converting words to their root form). Then set up the model parameters, including the number of desired clusters and the cutoffs that determine which terms will be ignored based on appearing in either too few or too many manuals/chapters. To replicate the model used in the NCHRP 20-97 test, use the k-means algorithm with a tf-idf (term frequency–inverse document frequency) matrix. Run the model, review the results, and refine as needed.
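The sketch below shows one way to reproduce this setup with scikit-learn, assuming the NLTK cleaning and stemming have already been applied to the text files. The chapters/*.txt path is hypothetical, the min_df/max_df parameters stand in for the cutoffs described above, and the number of clusters is a tuning choice rather than a value prescribed by the test.

```python
import glob
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Prepared (cleaned and stemmed) chapter-level text files; path is hypothetical.
docs = [open(p, encoding="utf-8").read() for p in sorted(glob.glob("chapters/*.txt"))]

# min_df / max_df implement the cutoffs: ignore terms appearing in too few
# or too many chapters.
vectorizer = TfidfVectorizer(stop_words="english", min_df=2, max_df=0.8)
X = vectorizer.fit_transform(docs)

# Number of clusters is a modeling choice to be refined on review.
km = KMeans(n_clusters=8, n_init=10, random_state=0)
labels = km.fit_predict(X)

# Show the highest-weighted terms for each cluster centroid to help name the themes.
terms = vectorizer.get_feature_names_out()
for i, centroid in enumerate(km.cluster_centers_):
    top = centroid.argsort()[::-1][:10]
    print(f"Cluster {i}: " + ", ".join(terms[j] for j in top))
```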

D5. Expand controlled vocabulary/ontology

The NCHRP 20-97 test created a small ontology to demonstrate how a controlled vocabulary of terms and relationships can be used to facilitate search and navigation. The test also examined the integration of glossary terms to provide easy access to definitions as users read the manual content (e.g., by bringing up definitions when users hover over a term). Over time, WSDOT should continue to build its controlled vocabularies and integrate them into its web search environment for manuals (and other resources).

Determine the need for new or modified search facets

Use the cluster analysis to identify related terms for "See Also" references, and use the text mining results to evaluate the importance of each term as it is used across all manuals. Select facets from the list of metadata items for development. Review and confirm the selection of new facets with business representatives.

Build the controlled vocabulary/ontology

Use the manual content, supplemental technical sources (e.g., from AASHTO or FHWA), and other semantic resources (e.g., the TRT) to create a draft taxonomy of terms with hierarchical relationships. Describe the relationships – for example:

• Culvert "isMadeOf" concrete
• Culvert "controls" stormwater runoff
• End treatments are "partOf" a culvert

Incorporate terms from the text mining results and the current WSDOT glossary/thesaurus. Identify equivalent terms (synonyms) and related terms. Populate definitions from the WSDOT glossary and other available authoritative sources. Record sources for each term and definition. Review, validate, and update the draft with subject matter experts. Submit proposed additions to WSDOT's glossary through the established governance process.

Update terms and relationships in the vocabulary management system

The NCHRP 20-97 test used the open source product Protégé to build an ontology. Use this or an equivalent product to describe the new terms and relationships. Add the parent-child relationships first. Then enter labels for each of the classes (terms) created. Next, add equivalent terms (synonyms), acronyms, numeric codes, and related terms.
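Protégé is an interactive editor; where scripted or batch additions are useful, an equivalent structure can be built with a Python library such as rdflib, as sketched below. The namespace URI, the "cross drain" synonym, and the definition text are illustrative assumptions, and SKOS properties are used here as a stand-in for whatever relationship scheme WSDOT adopts.

```python
# Illustrative sketch: batch-building vocabulary entries with rdflib.
# Namespace, labels, and definitions are hypothetical examples.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import SKOS

WSDOT = Namespace("http://example.org/wsdot/vocab/")  # hypothetical namespace

g = Graph()
g.bind("skos", SKOS)
g.bind("wsdot", WSDOT)

# Parent-child (hierarchical) relationships first.
g.add((WSDOT.Culvert, SKOS.broader, WSDOT.DrainageAsset))

# Other relationship types, mirroring the report's examples.
g.add((WSDOT.Culvert, WSDOT.isMadeOf, WSDOT.Concrete))
g.add((WSDOT.EndTreatment, WSDOT.isPartOf, WSDOT.Culvert))

# Labels, synonyms, and definitions.
g.add((WSDOT.Culvert, SKOS.prefLabel, Literal("culvert", lang="en")))
g.add((WSDOT.Culvert, SKOS.altLabel, Literal("cross drain", lang="en")))
g.add((WSDOT.Culvert, SKOS.definition,
       Literal("A conduit that conveys water under a roadway (illustrative).")))

g.serialize(destination="wsdot_vocab.ttl", format="turtle")
```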

Resourcing and Phasing

The manuals pilot project helped to create a long-range vision for improving the findability of content at WSDOT. This vision cannot be achieved overnight; it must be implemented iteratively, in phases, over time. The timeframe and phasing of implementation will depend on the level of available funding and human resources. Possibilities for phasing of the interactive manuals site are:

• Roll out by subject area – based on defined subject domains (possibly informed by the cluster analysis);
• Roll out by content type – starting with manuals, moving on to Executive Orders and other web content; and
• Roll out by business priority – e.g., asset management, the One Washington Enterprise Resource Planning (ERP) project.

Regardless of the phasing strategy, it will be important to assess and document the value added or return on investment from the initial implementation. This can be accomplished by conducting surveys and interviews to quantify time savings and to gather anecdotes about benefits. Documented benefits may include reduced complaints about intranet search, reduced time spent searching, the discovery of new resources, and reduced redundancy in content.

Roles and Responsibilities

The following outline of roles and responsibilities can be used as a starting point for assigning implementation tasks to groups and individuals.

Knowledge Services
• Provide leadership and staff support for building WSDOT's vocabulary resources and using these resources to enhance search and discovery.
• Steward development and implementation of text analysis methods as an emergent practice at WSDOT.
• Collaborate with the WSDOT Library to transfer glossary management once it is operational.
• Work with business users to identify requirements.

Manuals Project Team
• Serve as a steering committee for implementation of the production manuals web site.
• Set priorities for phasing of manual content additions.
• Engage manual stewards and the manual user community to provide input on design decisions and validate testing activities.

Vocabulary Team
• Serve as a governance group for new term additions.
• Oversee text analysis and controlled vocabulary development.
• Oversee development of requirements for a vocabulary management tool.
• Oversee repurposing of controlled vocabularies from other WSDOT business units or external sources.

Communications/Web Team
• Lead responsibility for the look and feel of web pages and implementation of new facets (in collaboration with a metadata and taxonomy analyst).
• Ingest new web content.
• Implement the ontology-based tagging feature in production.
• Configure and tune search based on search log analysis (in collaboration with a metadata and taxonomy analyst).

Data Management
• Lead the definition of a metadata schema for manual pages (WSDOT's metadata and taxonomy analyst).
• Manage the requirements and selection process for a vocabulary management tool.
• Lead adoption and application of text analytics software (Python libraries and/or new commercial tools).
• Lead adoption and updating of the text chunking utility and/or investigate commercial applications and services that do this more robustly.
• Lead the conversion of manual content into an HTML format ready for ingestion into Drupal.

Content Owners/Authors
• Perform manual tagging of content.

WSDOT Business Units
• Provide business ownership of value domains for different facets – for example, the Master Deliverables List.
• Sign off on web content additions – e.g., the Assistant Secretary for Multimodal Development and Delivery is responsible for manual content.
• Perform validation testing of the manual search and navigation interface.
