Suggested Citation: "Glossary." National Academies of Sciences, Engineering, and Medicine. 2020. Implementing Information Findability Improvements in State Transportation Agencies. Washington, DC: The National Academies Press. doi: 10.17226/25884.

Glossary

The following glossary defines technical terms that are used in this report or the companion technical memorandum. Note: NCHRP Research Report 846 also includes an extensive glossary of terms related to findability.

Auto-classification. Techniques for automating classification of content based on content type or subject area using rule-based or machine learning-based methods.

Cluster analysis. A family of unsupervised machine learning techniques that organize a set of documents into groups based on the words they contain.

Content. Information that has been packaged in a format suitable for retrieval, re-use, and publication. Content includes documents, data sets, web pages, image files, email, social media posts, video files, audio files, and other rich media assets.

Content type. A way of classifying content from a functional standpoint, independent of file format.

Controlled vocabulary. A list of terms that have been enumerated explicitly. This list is controlled by and available from a controlled vocabulary registration authority.

Entity extraction. A process of identifying and classifying elements from text into pre-defined categories (e.g., people, places, dates, and project numbers).

Faceted classification. A system for organizing content into categories based on a systematic combination of mutually exclusive and collectively exhaustive characteristics of the materials (facets) and displaying the characteristics in a manner that shows their relationships.

Faceted navigation. A technique for accessing content based on a faceted classification system.

Findability. The degree to which relevant information is easy to find when needed; findability is improved through application of metadata, taxonomies, other organizing tools, and search technologies.

Full-text search. A capability to retrieve a set of documents containing a search term or phrase based on comprehensively scanning the full content of a set of documents or databases.

Lemmatization. The process of reducing the different inflected forms of a word to one single form (e.g., “playing,” “played,” and “plays” would be reduced to the lemma “play”).

Machine learning. A branch of artificial intelligence (AI) involving creating computer algorithms that learn from and make predictions about data. Supervised machine learning techniques require data sets that people have labelled (training sets). Unsupervised machine learning algorithms identify patterns in data automatically, without human input.
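The entity extraction entry above can be illustrated with a minimal rule-based sketch. The date format (MM/DD/YYYY), the project-number format ("PRJ-" followed by five digits), and the sample sentence are all assumptions made for this illustration; they are not taken from the report, and real agencies will have their own conventions.

```python
import re

# Hypothetical patterns: both formats below are assumptions for this sketch.
PATTERNS = {
    "date": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
    "project_number": re.compile(r"\bPRJ-\d{5}\b"),
}

def extract_entities(text):
    """Return (category, matched_text) pairs in order of appearance."""
    found = []
    for category, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), category, match.group()))
    found.sort()  # order by position in the text
    return [(category, value) for _, category, value in found]

sample = "Inspection for PRJ-00421 was completed on 3/14/2019."
entities = extract_entities(sample)
print(entities)
# [('project_number', 'PRJ-00421'), ('date', '3/14/2019')]
```

Rules of this kind work for entities with a predictable shape, such as dates and identifiers. Categories such as people and places generally call for dictionaries or machine learning-based methods instead.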

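As a rough illustration of faceted navigation, the sketch below filters a small catalog by selected facet values and then counts the remaining documents under each value of another facet. The catalog, the facet names ("content_type", "district"), and the function names are hypothetical and chosen only for this example.

```python
from collections import Counter

# Hypothetical catalog: documents and facet names are invented for illustration.
DOCUMENTS = [
    {"title": "Bridge Inspection Manual", "content_type": "manual", "district": "north"},
    {"title": "Pavement Design Guide", "content_type": "manual", "district": "south"},
    {"title": "Culvert Survey Results", "content_type": "data set", "district": "north"},
]

def filter_by_facets(documents, selections):
    """Keep documents whose facet values match every selected value."""
    return [
        doc for doc in documents
        if all(doc.get(facet) == value for facet, value in selections.items())
    ]

def facet_counts(documents, facet):
    """Count how many documents fall under each value of one facet."""
    return Counter(doc[facet] for doc in documents)

manuals = filter_by_facets(DOCUMENTS, {"content_type": "manual"})
print([doc["title"] for doc in manuals])
# ['Bridge Inspection Manual', 'Pavement Design Guide']
print(facet_counts(manuals, "district"))
```

The counts are what a search interface would typically show next to each facet value so that users can narrow the result set further.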
Metadata. Data describing context, content, and structure of documents and records, and the management of such documents and records over time. Literally, data about data.

Natural language processing (NLP). Techniques that enable computers to analyze and understand human language; includes processes to identify parts of speech (nouns, verbs, adjectives, etc.) using a set of lexicon rules.

Ontology. A type of controlled vocabulary that describes objects and the relations between them in a formal way and has a grammar for using the vocabulary terms to express something meaningful within a specified domain of interest.

Parsing. The process of determining the syntactic structure of text by analyzing its constituent words based on the underlying grammar. The output of the parsing process is a parse tree in which the sentence is the root and intermediate nodes are noun phrases and verb phrases. This decomposition of sentences helps NLP programs determine the meaning of a sentence.

Solr. An open source enterprise search platform, written in Java. Its major features include full-text search, hit highlighting, faceted navigation, real-time indexing, dynamic clustering, database integration, and rich document (e.g., Word and PDF) handling. Providing distributed search and index replication, Solr is widely used for enterprise search and analytics use cases and has an active development community and regular releases.

Stemming. Stemming reduces different forms of a word to a common root. For example, the root form of “driving,” “drives,” and “driven” might be “driv.” It is similar in concept to lemmatization. However, lemmatization results in an actual language word, whereas stemming creates a root that may or may not be a word.

Tagging. The process of assigning a label to a document or other unit of content (e.g., web page) to facilitate search or understanding.

Taxonomy. A type of controlled vocabulary consisting of categories and subcategories, used for classifying information.

Term frequency – inverse document frequency (tf-idf). A statistic used in text mining that represents the relative importance of different words in a body of content. The term frequency is calculated as the number of occurrences of a term within a document divided by the total number of words in the document. The inverse document frequency for a term is the total number of documents divided by the number of documents containing the term (in practice, the logarithm of this ratio is often used). The tf-idf statistic is calculated by multiplying the term frequency by the inverse document frequency.

Text analytics. Techniques that utilize software and semantic resources to add structure to text-based content objects (e.g., text files, Word documents, and websites). The main capabilities of text analytics include text mining, sentiment analysis, entity or noun phrase extraction, auto-summarization, and auto-categorization.

Text mining. A set of techniques used to extract information from text; includes tokenization, lemmatization and stemming, removal of stop words and punctuation, part of speech tagging, mapping of word frequencies, analysis of word co-occurrences, and cluster analysis.

Tokenization. The process of segmenting running text into words, phrases, and sentences. Tokenization is required before text processing can be done by a computer. Segmentation of words and phrases is done to detect meaningful patterns of words that will be used by other natural language algorithms to compute the meaning of the text.
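The tokenization and tf-idf entries above can be sketched together in a few lines. This is a minimal illustration that follows the glossary's definitions literally: the inverse document frequency is the plain ratio, without the logarithm that many implementations apply. The tokenizer, function names, and sample documents are assumptions made for this example.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (a simplification)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def term_frequency(term, tokens):
    """Occurrences of the term divided by the total number of words."""
    return tokens.count(term) / len(tokens)

def inverse_document_frequency(term, tokenized_docs):
    """Total number of documents divided by the number containing the term."""
    containing = sum(1 for tokens in tokenized_docs if term in tokens)
    if containing == 0:
        return 0.0  # assumption: score unseen terms as zero rather than fail
    return len(tokenized_docs) / containing

def tf_idf(term, tokens, tokenized_docs):
    """Term frequency multiplied by inverse document frequency."""
    return term_frequency(term, tokens) * inverse_document_frequency(term, tokenized_docs)

# Hypothetical four-document collection, four words each.
docs = [tokenize(text) for text in [
    "Annual bridge inspection report",
    "Bridge design manual update",
    "Pavement design manual update",
    "Annual traffic count data",
]]

print(tf_idf("bridge", docs[0], docs))      # 0.25 * (4 / 2) = 0.5
print(tf_idf("inspection", docs[0], docs))  # 0.25 * (4 / 1) = 1.0
```

In the first document, "inspection" scores higher than "bridge" because it appears in only one document of the collection, which is the sense in which tf-idf captures relative importance.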


With a quick search online, you can discover the answers to all kinds of questions. Findability within large volumes of information and data has become almost as important as the answers themselves. Without being able to search various types of media ranging from print reports to video, efforts are duplicated and productivity and effectiveness suffer.

The TRB National Cooperative Highway Research Program's NCHRP Research Report 947: Implementing Information Findability Improvements in State Transportation Agencies identifies key opportunities and challenges that departments of transportation (DOTs) face with respect to information findability and provides practical guidance for agencies wishing to tackle this problem. It describes four specific techniques piloted within three State DOTs.

Additional resources with the document include NCHRP Web-Only Document 279: Information Findability Implementation Pilots at State Transportation Agencies and three videos on the Washington State DOT Manual Modernization Pilot.