A. Speech Understanding Systems

Overview

A major objective of current computer science and engineering efforts is to obtain more comfortable, efficient, and natural interfaces between people and their computers. Since speech is our most natural communication modality, using spoken language for access to computers has become an important research goal. There are several specific advantages of speech as an input medium: Computer users, especially "casual" users, would need less training before interacting with a complex system; interactions with the system would be quicker, since speech is our fastest mode of communication (about twice the speed of the average typist); and the computer user's hands are free to point, manipulate the display, etc. Speech input capability is particularly important in environments that place many simultaneous demands on the user, as in aircraft or space flight operations.

The speech understanding task is best described in terms of the types of information at various "levels" that are used in processing. From the "bottom" up, these are: (a) signal, the utterance created by the speaker at a microphone and recorded in digital form for storage in the computer; (b) templates, the representations of acoustic patterns to be matched against the signal; (c) phonetics, representations of the sounds in all of the words in the vocabulary, including variations in pronunciation that appear when words are spoken together in sentences (coarticulation across word boundaries, prosodic fluctuation in stress and intonation across the sentence, etc.); (d) lexicon, the dictionary of words in the system's vocabulary; (e) syntax, the grammar or rules of sentence formation, resulting in important constraints on the number of sentences—not all combinations of words in the vocabulary are legal sentences; (f) semantics, the "meaning" of sentences, which can also be viewed as a constraint on the speech understander—not all grammatically legal sentences have a meaning, e.g., "The snow was loud"; and (g) pragmatics, the rules of conversation—in a dialogue a speaker's response must not only be a meaningful sentence, but must also be a reasonable reply to what was said to him.

Most speech understanding systems (SUS) can be viewed in terms of a "bottom end" and a "top end" (Klatt, 1977). The task of the bottom end is to use lexical and phonetic knowledge to recognize pieces of the speech signal by comparing the signal with pre-stored patterns. The top end aids in recognition by building expectations about which words the speaker is likely to have said, using syntactic, semantic, and pragmatic constraints. In some systems, the top end is also responsible for deciding what the utterance means, once it is recognized, and for responding appropriately. Top-down processing, based on predicting what words will be in the utterance (from the context and other words that have already been recognized), is an important feature of some systems, which can actually respond without understanding every word that was said.

The Problem: Understanding Continuous Speech

Several early isolated-word recognition systems in the 1960s preceded the work on speech understanding systems of the early 1970s. The technique used by these isolated-word systems was to compare the acoustical representation of each word in a relatively small vocabulary to the speech signal and select the best match, using a variety of "distance" metrics.
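A minimal Python sketch of this template-matching idea follows, assuming each stored word template and the incoming utterance have already been reduced to fixed-length feature vectors. The vocabulary, feature values, and the choice of Euclidean distance are all illustrative assumptions; real systems of the period matched frame-by-frame spectral patterns and used a variety of more elaborate distance metrics.

    import math

    # Hypothetical acoustic templates: one feature vector per vocabulary word.
    # (Real systems stored frame-by-frame spectral patterns, not single vectors.)
    TEMPLATES = {
        "yes":  [0.82, 0.11, 0.43],
        "no":   [0.15, 0.77, 0.30],
        "stop": [0.55, 0.60, 0.91],
    }

    def distance(a, b):
        """Euclidean distance -- one of many possible 'distance' metrics."""
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def recognize_isolated_word(signal_features):
        """Compare the utterance against every template; return the best match."""
        return min(TEMPLATES, key=lambda w: distance(TEMPLATES[w], signal_features))

    # An utterance whose features lie closest to the "no" template.
    print(recognize_isolated_word([0.20, 0.70, 0.35]))  # -> "no"

Note that this exhaustive comparison is feasible only because the vocabulary is small and each utterance is known to be exactly one isolated word; as discussed next, both assumptions fail for continuous speech.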
The Vicens-Reddy system was one of the first isolated-word systems to be successfully demonstrated (Vicens, 1969). Until quite recently, these systems cost in the tens of thousands of dollars and offered about 95% accuracy on a small vocabulary. This methodology has recently been refined to produce commercially available products for isolated recognition of up to 100 words, costing under $1,000, although the utility of these products has yet to be demonstrated.

Unfortunately, continuous speech signals could not be handled with the same techniques: Ambiguities in human speech and erroneous or missing data in the voice signal preclude obtaining a simple, complete set of steps for the direct transformation of acoustic signals into sentences. Non-uniformities are introduced into the acoustic signal in several ways: First, the microphone and the background noise of the environment introduce interference in the recording of the spoken utterance. Second, a given speaker does not pronounce the same words quite the same way every time he speaks: Even if the program is "tuned" to one speaker, the matching process between the phonetic characteristics of words (the templates) and the actual utterance is inherently inexact. Third, the pronunciation of individual words changes when they are juxtaposed to form a sentence. Lastly, the words are not separated in the voiced sentence. On the contrary, whole syllables are often "swallowed" at word boundaries. Taken together, these factors imply that the basic acoustic signal, which is the foundation for the rest of the processing, does not look at all like a concatenation of the signals of individual words.

The difficulties introduced in recognizing "continuous" speech required a new viewpoint on the methodology. Researchers speculated that expectations about the form of the utterances could be gleaned from contextual information, such as the grammar and the current topic of conversation, and that these expectations could be used to help identify the actual content of the signal. Thus, the task came to be viewed as one of interpretation of acoustic signals, in light of knowledge about syllables, words, subject matter, and dialogues, and it carried with it the problem of organizing large amounts of diverse types of knowledge. The way that the speech systems enable communication between the "sources" of these various types of knowledge is one of the most interesting aspects of their design.

The ARPA Speech Understanding Research Program

In the early 1970s, the Advanced Research Projects Agency of the US Department of Defense decided to fund a five-year program in speech understanding research aimed at obtaining a breakthrough in speech understanding capability. A study group met in 1971 to set guidelines for the project (Newell et al., 1971). This group of scientists set specific performance criteria for each dimension of system inputs and outputs: The systems were to accept normally spoken sentences (connected speech) in a constrained domain with a 1000-word vocabulary and were to respond reasonably fast ("a few times real-time" on current high-speed computers) with less than 10% error. This was one of the few times that AI programs had had design objectives set before their development.
Setting these standards was important since they closely approximated the level of minimal requirements for a practical connected speech understanding system in a highly constrained domain (although producing a practical system was notably not a goal of the ARPA program). The final performance of the ARPA SUS projects will be discussed later in this article.

A number of different designs were explored during the ARPA effort, from 1971 to 1976, including alternative control structures and knowledge representations for transforming the speech signal into a representation of the meaning of the sentence, from which an intelligent response could be generated. These systems employed several types of knowledge, including a vocabulary and syntax tailored to the task area. In a typical system, a series of complicated transformations is applied to the analog signal, which represents the intensity of the utterance as recorded by the microphone, resulting in a compact digital encoding of the signal.

Some Important Design Ideas

Further processing of this signal was the task of the systems developed in the ARPA program. These systems will be described in more detail later, but first some general design considerations should be illuminated. In many ways these ideas, which are ubiquitous in AI research, appear especially clearly in the context of the large speech understanding systems.

Top-down and Bottom-up Processing. As mentioned above, the ARPA systems all used "top-end" knowledge about "likely" utterances in the domain to help identify the contents of the speech signal. Knowledge about the "grammatical" sentences (syntax), the "meaningful" sentences (semantics), and the "appropriate" sentences at each point in the dialogue (pragmatics) was used. For example, consider the HEARSAY-I speech system (article B1), which played "voice chess" with the speaker by responding to the moves he speaks into the microphone, using a chess program (see Search, Heuristics in Games) to figure out the best response. Not only did HEARSAY-I use syntactic knowledge about the specific format of chess moves (e.g., "pawn to king-4") to anticipate the form of incoming utterances, but it even used its chess program's legal-move generator to suggest moves that were likely to be tried by the opponent, and then examined the speech signal for those particular moves (a sketch of this predictive scheme appears below).

The importance of top-down, predictive processing has also been pointed out by workers in natural language (NL) understanding research (see Natural Language, especially the articles on MARGIE and SAM). In NL research, the problem of word recognition is avoided, since the words are typed into the system; but in order to determine the meaning of the input, so that an appropriate response can be evoked, much knowledge about the world must be used by the system. Similarly, in vision research, where, as in speech, the task is one of recognition as well as understanding, a strong model of the physical world, as well as knowledge about what things the camera is likely to find, is typically used to help figure out what is in the scene (see Vision). It is generally agreed that this constraining knowledge was necessary for adequate performance in the speech understanding task: Without expectations about what to look for in the input, the task of identifying what is there is impossible.
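To make the predictive idea concrete, here is a loose Python sketch of how a top end can constrain a bottom end; it is not HEARSAY-I's actual architecture, and the function names, the toy word-overlap scorer, and the example data are all illustrative assumptions. The top end (standing in for a chess program's legal-move generator) proposes candidate utterances, and the bottom end scores the signal only against those candidates, so an otherwise ambiguous signal is resolved by expectation.

    def legal_moves():
        """Hypothetical stand-in for the chess program's legal-move generator."""
        return ["pawn to king-4", "pawn to queen-4", "knight to king-bishop-3"]

    def acoustic_score(signal, phrase):
        """Toy bottom-end matcher: fraction of the phrase's words found in the
        (already word-segmented) signal. A real system would compare the
        phrase's phonetic templates against the raw acoustic signal instead."""
        recognized = set(signal)
        phrase_words = phrase.split()
        return sum(w in recognized for w in phrase_words) / len(phrase_words)

    def recognize_move(signal):
        """Top-down recognition: score the signal only against the utterances
        the top end expects, rather than against the whole vocabulary."""
        return max(legal_moves(), key=lambda m: acoustic_score(signal, m))

    # A noisy "utterance" in which only some words were recognized bottom-up.
    print(recognize_move(["pawn", "???", "king-4"]))  # -> "pawn to king-4"

The design point is that expectation does the disambiguating: the bottom end never has to consider phrases that the current chess position rules out, which is precisely why constraining knowledge makes an otherwise impossible matching problem tractable.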