Glossary

The following glossary defines technical terms that are used in this report or the companion technical memorandum. Note: NCHRP Research Report 846 also includes an extensive glossary of terms related to findability.

Auto-classification. Techniques for automating classification of content based on content type or subject area using rule-based or machine learning-based methods.

Cluster analysis. A family of unsupervised machine learning techniques that organize a set of documents into groups based on the words they contain.

Content. Information that has been packaged in a format suitable for retrieval, re-use, and publication. Content includes documents, data sets, web pages, image files, email, social media posts, video files, audio files, and other rich media assets.

Content type. A way of classifying content from a functional standpoint, independent of file format.

Controlled vocabulary. A list of terms that have been enumerated explicitly. This list is controlled by and available from a controlled vocabulary registration authority.

Entity extraction. A process of identifying and classifying elements from text into pre-defined categories (e.g., people, places, dates, and project numbers).

Faceted classification. A system for organizing content into categories based on a systematic combination of mutually exclusive and collectively exhaustive characteristics of the materials (facets) and displaying the characteristics in a manner that shows their relationships.

Faceted navigation. Technique for accessing content based on a faceted classification system.

Findability. The degree to which relevant information is easy to find when needed; findability is improved through application of metadata, taxonomies, and other organizing tools, and search technologies.

Full-text search. A capability to retrieve a set of documents containing a search term or phrase based on comprehensively scanning the full content of a set of documents or databases.

Lemmatization.
The process of reducing the different forms of a word to one single form (e.g., "playing," "player," and "plays" would be reduced to the lemma "play").

Machine learning. A branch of artificial intelligence (AI) involving creating computer algorithms that learn from and make predictions about data. Supervised machine learning techniques require data sets that people have labeled (training sets). Unsupervised machine learning algorithms identify patterns in data automatically, without human input.
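The lemmatization entry above can be sketched in a few lines of code. Production lemmatizers (e.g., those in NLTK or spaCy) rely on full vocabularies and part-of-speech information; the small lookup table below is only an illustration, and the names `LEMMA_TABLE` and `lemmatize` are invented for this sketch. It uses the report's own example of reducing "playing," "player," and "plays" to the lemma "play."

```python
# Illustrative lemmatizer: map different surface forms of a word back to
# a single dictionary form. The table is a toy stand-in for the full
# vocabulary a real lemmatizer would consult.
LEMMA_TABLE = {
    "playing": "play",
    "player": "play",
    "plays": "play",
}

def lemmatize(token: str) -> str:
    """Return the lemma for a token, falling back to the token itself."""
    return LEMMA_TABLE.get(token.lower(), token.lower())

print([lemmatize(w) for w in ["Playing", "player", "plays", "game"]])
# prints ['play', 'play', 'play', 'game']
```

Words not in the table pass through unchanged, which mirrors how a lemmatizer leaves unknown tokens (or tokens already in base form) alone.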
Metadata. Data describing the context, content, and structure of documents and records, and the management of such documents and records over time. Literally, data about data.

Natural language processing (NLP). Techniques that enable computers to analyze and understand human language; includes processes to identify parts of speech (nouns, verbs, adjectives, etc.) using a set of lexicon rules.

Ontology. A type of controlled vocabulary that describes objects and the relations between them in a formal way and has a grammar for using the vocabulary terms to express something meaningful within a specified domain of interest.

Parsing. The process of determining the syntactic structure of text by analyzing its constituent words based on the underlying grammar. The output of the parsing process is a parse tree in which the sentence is the root and intermediate nodes are noun phrases and verb phrases. This decomposition of sentences helps NLP programs determine the meaning of a sentence.

Solr. An open-source enterprise search platform written in Java. Its major features include full-text search, hit highlighting, faceted navigation, real-time indexing, dynamic clustering, database integration, and rich document (e.g., Word and PDF) handling. Providing distributed search and index replication, Solr is widely used for enterprise search and analytics use cases and has an active development community and regular releases.

Stemming. Stemming reduces different forms of a word to a common root. For example, the root form of "driving," "drives," and "driven" might be "driv." It is similar in concept to lemmatization; however, lemmatization results in an actual language word, whereas stemming creates a root that may or may not be a word.

Tagging. The process of assigning a label to a document or other unit of content (e.g., a web page) to facilitate search or understanding.

Taxonomy.
A type of controlled vocabulary consisting of categories and subcategories, used for classifying information.

Term frequency-inverse document frequency (tf-idf). A statistic used in text mining that represents the relative importance of different words in a body of content. The term frequency is calculated as the number of occurrences of a term within a document divided by the total number of words in the document. The inverse document frequency for a term is the total number of documents divided by the number of documents containing the term. The tf-idf statistic is calculated by multiplying the term frequency by the inverse document frequency.

Text analytics. Techniques that utilize software and semantic resources to add structure to text-based content objects (e.g., text files, Word documents, and websites). The main capabilities of text analytics include text mining, sentiment analysis, entity or noun phrase extraction, auto-summarization, and auto-categorization.

Text mining. A set of techniques used to extract information from text; includes tokenization, lemmatization and stemming, removal of stop words and punctuation, part-of-speech tagging, mapping of word frequencies, analysis of word co-occurrences, and cluster analysis.

Tokenization. The process of segmenting running text into words, phrases, and sentences. Tokenization is required before text processing can be done by a computer. Segmentation of words and phrases is done to detect meaningful patterns of words that will be used by other natural language algorithms to compute the meaning of the text.
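The tf-idf definition above can be computed directly as stated: term frequency is a term's count in a document divided by the document's word count, inverse document frequency is the total number of documents divided by the number of documents containing the term, and the two are multiplied. (Many implementations additionally take the logarithm of the idf; this sketch follows the simpler definition given in the glossary.) The function names and the sample documents are invented for illustration.

```python
# tf-idf exactly as defined in the glossary entry above.
def tf(term: str, doc: list[str]) -> float:
    """Occurrences of term in doc, divided by the doc's word count."""
    return doc.count(term) / len(doc)

def idf(term: str, docs: list[list[str]]) -> float:
    """Total documents divided by documents containing the term.
    Assumes the term appears in at least one document."""
    containing = sum(1 for d in docs if term in d)
    return len(docs) / containing

def tf_idf(term: str, doc: list[str], docs: list[list[str]]) -> float:
    return tf(term, doc) * idf(term, docs)

docs = [
    ["bridge", "deck", "repair", "bridge"],
    ["pavement", "repair", "schedule"],
    ["bridge", "inspection", "report"],
]
# "bridge" occurs twice in a four-word document (tf = 0.5) and appears in
# two of the three documents (idf = 3/2 = 1.5), so tf-idf = 0.75.
print(tf_idf("bridge", docs[0], docs))
# prints 0.75
```

Note that a term appearing in every document gets the minimum idf of 1.0, while a rarer term gets a larger multiplier, which is how the statistic captures "relative importance."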