Integration of Speech with Natural Language Understanding
Pages 254-272

From page 254...
... I then look at how systems cope with errors in speech recognition and at attempts to use natural language information to reduce recognition errors. Finally, I discuss how prosodic information in the speech signal might be used to improve understanding.
From page 255...
... This naive approach is less than ideal for a number of reasons, the most important being the following:

· Spontaneous spoken language differs in a number of ways from standard written language, so that even if a speech recognizer were able to deliver a perfect transcription to a natural-language-understanding system, performance would still suffer if the natural language system were not adapted to the characteristics of spoken language.

· Current speech recognition systems are far from perfect transcribers of spoken language, which raises questions about how to make natural-language-understanding systems robust to recognition errors and whether higher overall performance can be achieved by a tighter integration of speech recognition and natural language understanding.
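To make the naive serial architecture concrete, here is a minimal sketch of it; the function names and behaviors are hypothetical stand-ins, not any actual system's interface:

```python
def recognize(audio):
    """Hypothetical stand-in for a recognizer's one-best transcription."""
    return "i'd like a return flight from denver to atlanta evening flights"

def understand(words):
    """Hypothetical stand-in for an understanding component developed for
    written language: it has no rule for the trailing fragment
    'evening flights', so analysis fails outright."""
    raise ValueError("no complete parse for: " + words)

def naive_pipeline(audio):
    # The one-best transcription is handed over with no feedback path, so
    # recognition errors and spoken-language phenomena propagate unchecked.
    return understand(recognize(audio))

try:
    naive_pipeline(audio=b"")
except ValueError as err:
    print("understanding failed:", err)
```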
From page 256...
... Among the most common types of nonstandard utterances in the data are sentence fragments, sequences of fragments, or fragments combined with complete sentences:

six thirty a m from atlanta to san francisco
what type of aircraft on the delta flight number ninety eight
what type of aircraft
i would like information on ground transportation city of boston between airport and downtown

A particular subclass of these utterances might be dubbed "afterthoughts." These consist of an otherwise well-formed sentence followed by a fragment that further restricts the initial request:

i'd like a return flight from denver to atlanta evening flights
i need the cost of a ticket going from denver to baltimore a first class ticket on united airlines
what kind of airplane goes from philadelphia to san francisco monday stopping in dallas in the afternoon first class flight

Another important group of nonstandard utterances can be classified as verbal repairs or self-corrections, in which the speaker intends that one or more words be replaced by subsequently uttered words.
From page 257...
... Finally, some utterances are simply ungrammatical:

what kinds of ground transportation is available in dallas fort worth
okay what type of aircraft is used on a flight between san francisco to atlanta
what types of aircraft can i get a first class ticket from philadelphia to dallas
from those show me that serve lunch

The first example in this list is a case of lack of number agreement between subject and verb; the subject is plural, the verb singular. The second example seems to confuse two different ways of expressing the same constraint: "between san francisco and atlanta" and "from san francisco to atlanta" have been combined to produce "between san francisco to atlanta".
From page 258...
... Strategies for Handling Spontaneous Speech Phenomena

Spoken-language-understanding systems use various strategies to deal with the nonstandard language found in spontaneous speech. Many of these phenomena, although regarded as nonstandard, are just as regular in their patterns of use as standard language, so they can be incorporated into the linguistic rules of a natural language system.
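As a minimal illustration of that idea, the toy grammar below (all rules and category names are invented for this sketch) licenses the "afterthought" pattern seen on page 256 with an explicit rule, a sentence followed by a restricting fragment, rather than treating such utterances as noise:

```python
# Toy grammar, invented for illustration: the "afterthought" pattern is
# regular enough to license with an explicit rule, UTTERANCE -> SENTENCE
# FRAGMENT, alongside the standard UTTERANCE -> SENTENCE rule.
GRAMMAR = {
    "UTTERANCE": [["SENTENCE"], ["SENTENCE", "FRAGMENT"]],  # afterthought rule
    "SENTENCE": [["i'd like a return flight from denver to atlanta"]],
    "FRAGMENT": [["evening flights"]],
}

def derives(symbol, text):
    """Can `symbol` expand to exactly `text`? Naive top-down matching;
    symbols not in GRAMMAR are treated as literal phrases."""
    if symbol not in GRAMMAR:
        return text == symbol
    return any(matches(rhs, text) for rhs in GRAMMAR[symbol])

def matches(symbols, text):
    """Try every split of `text` across the sequence of `symbols`."""
    if not symbols:
        return text == ""
    head, *rest = symbols
    return any(
        derives(head, text[:i].rstrip()) and matches(rest, text[i:].lstrip())
        for i in range(len(text) + 1)
    )

print(derives("UTTERANCE",
              "i'd like a return flight from denver to atlanta evening flights"))  # True
```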
From page 259...
... In the case of SRI's systems, the combination of detailed linguistic analysis and robust processing seems to perform better than robust processing alone, with the combined Gemini+TM system having about four points better performance than the Template Matcher system alone for both speech and text input in the November 1992 ATIS evaluation, according to the weighted understanding error metric (Pallett et al., 1993). It should be noted, however, that the best-performing system in the November 1992 ATIS evaluation, the CMU Phoenix system, uses only robust interpretation methods with no attempt to account for every word of an utterance. The robust processing strategies discussed above are fairly general and are not specifically targeted at any particular form of disfluency.
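To illustrate the robust, keyword-driven style of interpretation attributed to the Template Matcher and Phoenix, here is a toy sketch (the patterns and slot names are invented, not either system's actual templates): scan for key phrases, fill slots in a frame, and simply ignore whatever fails to match.

```python
import re

# Invented toy patterns; a real system would use a lexicon of multiword
# city names and many more slot types.
SLOT_PATTERNS = {
    "origin": re.compile(r"\bfrom (\w+)\b"),
    "destination": re.compile(r"\bto (\w+)\b"),
    "class": re.compile(r"\b(first class|coach)\b"),
}

def template_match(words):
    """Fill frame slots from key phrases; ignore all unmatched words."""
    frame = {}
    for slot, pattern in SLOT_PATTERNS.items():
        match = pattern.search(words)
        if match:
            frame[slot] = match.group(1)
    return frame

print(template_match(
    "i need the cost of a ticket going from denver to baltimore "
    "a first class ticket on united airlines"))
# {'origin': 'denver', 'destination': 'baltimore', 'class': 'first class'}
```

Because unmatched words are ignored rather than parsed, fragments, afterthoughts, and ungrammatical stretches do not block interpretation; the cost is that no attempt is made to account for every word of the utterance.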
From page 260...
... from pittsburgh to san francisco on monday. The section "from san francisco no from pittsburgh" matches a pattern of a cue word, "no", followed by a word ("from") that is a repetition of an earlier word.
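A minimal sketch of that cue-word pattern follows; it is a deliberate simplification of published pattern sets, and the full utterance used in the demonstration is an illustrative reconstruction, not a corpus transcript:

```python
CUE_WORDS = {"no", "sorry"}  # illustrative cue-word inventory

def edit_repair(tokens):
    """If a cue word is followed by a word that repeats an earlier word,
    delete everything from that earlier word through the cue word.
    Simplification: only the first earlier occurrence is considered."""
    for i, tok in enumerate(tokens):
        if tok in CUE_WORDS and i + 1 < len(tokens):
            repeated = tokens[i + 1]
            if repeated in tokens[:i]:
                start = tokens.index(repeated)
                return tokens[:start] + tokens[i + 1:]
    return tokens

utterance = "show flights from san francisco no from pittsburgh to san francisco on monday"
print(" ".join(edit_repair(utterance.split())))
# show flights from pittsburgh to san francisco on monday
```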
From page 261...
... It may be conjectured that this is because most of the critical key words and phrases are very common in the training data for the task and are therefore well modeled both acoustically and in the statistical language models used by the recognizers. The degree of robustness of current ATIS systems to speech recognition errors can be seen by examining Table 1.
From page 262...
... NATURAL LANGUAGE CONSTRAINTS IN RECOGNITION

Models for Integration

Despite the surprising degree of robustness of current ATIS systems in coping with speech recognition errors, Table 1 also reveals that rates for understanding errors are still substantially higher with speech input than with text input, ranging from 1.26 to 1.76 times higher, depending on the system. One possible way to try to close this gap is to use information from the natural language processor as an additional source of constraint for the speech recognizer.
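In the standard formulation of this idea (a textbook noisy-channel decomposition, not notation drawn from this chapter), natural language knowledge enters the recognizer as the prior over word sequences:

```latex
\hat{W} \;=\; \operatorname*{arg\,max}_{W} \, P(W \mid A)
        \;=\; \operatorname*{arg\,max}_{W} \,
              \underbrace{P(A \mid W)}_{\text{acoustic model}} \;
              \underbrace{P(W)}_{\text{language model}}
```

Tightening the prior P(W) with grammatical constraints narrows the recognizer's search, which is exactly where the overconstraint dilemma discussed next arises.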
From page 263...
... The dilemma can be seen in terms of the kind of natural language system, discussed in the section "Strategies for Handling Spontaneous Speech Phenomena," that first attempts a complete linguistic analysis of the input and falls back on robust processing methods if that fails. If the grammar used in attempting the complete linguistic analysis is incorporated into the speech recognizer according to the standard model, the recognizer will be overconstrained, and the robust processor will never be invoked, because only recognition hypotheses that can be completely analyzed linguistically will ever be selected.
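The two-stage strategy, and the way the standard integration model defeats it, can be seen as simple control flow. In this sketch, full_parse and robust_parse are hypothetical stand-ins for the detailed and robust processors:

```python
def full_parse(words):
    """Hypothetical detailed linguistic analysis; returns None when no
    complete parse of the utterance exists."""
    return None  # stub for illustration

def robust_parse(words):
    """Hypothetical robust (template- or fragment-based) processing."""
    return {"robust_interpretation_of": words}

def interpret(hypothesis):
    # Two-stage strategy: attempt a complete linguistic analysis first,
    # and fall back on robust methods if it fails.
    analysis = full_parse(hypothesis)
    return analysis if analysis is not None else robust_parse(hypothesis)

# If the detailed grammar is compiled into the recognizer itself, only
# hypotheses for which full_parse would succeed can ever reach this code,
# so the robust_parse branch becomes dead code: the recognizer is
# overconstrained exactly as described above.
print(interpret("what types of aircraft can i get a first class ticket"))
```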
From page 264...
... If probabilistic models of natural language are not constructed in such a way that lexical association probabilities are captured, those models will likely be of little benefit in improving recognition accuracy.

Architectures for Integration

Whatever model is used for integration of natural language constraints into speech recognition, a potentially serious search problem
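Returning to the point about lexical association: one standard way to capture it, offered here as a generic illustration rather than the chapter's own formulation, is pointwise mutual information, which measures how much more often two words co-occur than their separate frequencies predict:

```python
import math
from collections import Counter

def pmi(pairs):
    """Pointwise mutual information for co-occurring word pairs. A model
    whose probabilities ignore word identity (e.g., rule probabilities
    conditioned only on syntactic categories) scores plausible and
    implausible word combinations alike; PMI makes the word-to-word
    attraction explicit."""
    n = len(pairs)
    pair_c = Counter(pairs)
    left_c = Counter(w1 for w1, _ in pairs)
    right_c = Counter(w2 for _, w2 in pairs)
    return {
        p: math.log2((c / n) / ((left_c[p[0]] / n) * (right_c[p[1]] / n)))
        for p, c in pair_c.items()
    }

# Toy counts, invented for illustration: the attested combinations score
# above zero, while the rare cross-combination scores below zero.
pairs = ([("flights", "boston")] * 3 + [("fares", "dallas")] * 2
         + [("flights", "dallas")])
print(pmi(pairs))
```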
From page 265...
... The natural language processor similarly has to consider multiple recognition hypotheses, rather than a single determinate input string. Over the past 5 years, three principal integration architectures for coping with this search problem have been explored within the ARPA Spoken Language Program: word lattice parsing, dynamic grammar networks, and N-best filtering or rescoring.
From page 266...
... speech recognition architecture. In an HMM speech recognizer, a finite-state grammar is used to predict what words can start in a particular recognition state and to specify what recognition state the system should go into when a particular word is recognized in a given predecessor state.
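A minimal sketch of such a finite-state grammar follows; the states and vocabulary are invented for illustration. Each state lists the words that may start there, together with the successor state entered when each word is recognized:

```python
# Invented toy word-transition network: state -> {word: next_state}.
GRAMMAR = {
    "S0": {"show": "S1", "list": "S1"},
    "S1": {"flights": "S2"},
    "S2": {"to": "S3", "from": "S3"},
    "S3": {"boston": "FINAL", "denver": "FINAL"},
}

def accepts(words, state="S0"):
    """Follow word transitions; reject any sequence the grammar does not
    predict from the current recognition state."""
    for w in words:
        state = GRAMMAR.get(state, {}).get(w)
        if state is None:
            return False
    return state == "FINAL"

print(accepts("show flights to boston".split()))  # True
print(accepts("show flights boston".split()))     # False: no transition from S2
```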
From page 267...
... The standard model of speech and natural language integration can be implemented by N-best filtering, in which the recognizer simply produces an ordered list of hypotheses, and the natural language processor chooses the first one on the list that can be completely parsed and interpreted. More sophisticated models can be implemented by N-best rescoring, in which the recognition score for each of the N-best recognition hypotheses is combined with a score from the natural language processor, and the hypothesis with the best overall score is selected.
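Both variants fit in a few lines. In this sketch, parses and nl_score are hypothetical stand-ins for the natural language processor, and the weighted linear combination used for rescoring is an illustrative assumption rather than the chapter's specific scheme:

```python
def nbest_filter(nbest, parses):
    """Standard model: return the highest-ranked hypothesis that the
    natural language processor can completely parse and interpret."""
    for words, _score in nbest:          # nbest is ordered best-first
        if parses(words):
            return words
    return None                          # no hypothesis parsed

def nbest_rescore(nbest, nl_score, weight=1.0):
    """Rescoring: combine recognition and natural language scores
    (higher is better) and return the best overall hypothesis."""
    return max(nbest, key=lambda h: h[1] + weight * nl_score(h[0]))[0]

nbest = [("show flights to boston on monday", -11.2),
         ("show flights to austin on monday", -11.5)]
# With a toy "parser" that rejects the top hypothesis, filtering falls
# through to the second-ranked one.
print(nbest_filter(nbest, parses=lambda w: "austin" in w))
```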
From page 268...
... SPEECH CONSTRAINTS IN NATURAL LANGUAGE UNDERSTANDING

While most efforts toward integration of speech and natural language processing have focused on the use of natural language constraints to improve recognition, there is much information in spoken language beyond the simple word sequences produced by current recognizers that would be useful for interpreting utterances if it could be made available. Prosodic information in the speech signal can have important effects on utterance meaning.
From page 269...
... Perhaps the most significant conclusion is that natural-language-understanding systems for the ATIS task have proved surprisingly robust to recognition errors. It might have been thought a priori that spoken language utterance understanding would be significantly worse than utterance recognition, since recognition errors would be compounded by understanding errors that would occur even when the recognition was perfect.
From page 270...
... , "Integrating Speech and Natural-Language Processing," in Proceedings of the Speech and Natural Language Workshop, Philadelphia, Pa., pp. 243-247, Morgan Kaufman Publishers, San Mateo, Calif.
From page 271...
... (1993) , "Benchmark Tests for the DARPA Spoken Language Program," Proceedings of the ARPA Workshop on Human Language Technology, Plainsboro, N.J., pp.

