
A. Speech Understanding Systems Overview

A major objective of current computer science and engineering efforts is to obtain more comfortable, efficient, and natural interfaces between people and their computers. Since speech is our most natural communication modality, using spoken language for access to computers has become an important research goal. There are several specific advantages of speech as an input medium: Computer users, especially "casual" users, would need less training before interacting with a complex system; interactions with the system would be quicker, since speech is our fastest mode of communication (about twice the speed of the average typist); and the computer user's hands are free to point, manipulate the display, etc. Speech input capability is particularly important in environments that place many simultaneous demands on the user, as in aircraft and space flight operations.

The speech understanding task is best described in terms of the types of information at various "levels" that are used in processing. From the "bottom" up, these are:

(a) signal, the digitized representation of the acoustic waveform produced by the speaker;

(b) templates, the representations of acoustic patterns to be matched against the signal;

(c) phonetics, representations of the sounds in all of the words in the vocabulary, including variations in pronunciation that appear when words are spoken together in sentences (coarticulation across word boundaries, prosodic fluctuations in stress and intonation across the sentence, etc.);

(d) lexicon, the dictionary of words in the system's vocabulary;

(e) syntax, the grammar or rules of sentence formation, which provide important constraints on the number of possible sentences, since not all combinations of words in the vocabulary are legal sentences;

(f) semantics, the "meaning" of sentences, which can also be viewed as a constraint on the speech understander, since not all grammatically legal sentences have a meaning (e.g., "The snow was loud"); and

(g) pragmatics, the rules of conversation: in a dialogue, a speaker's response must not only be a meaningful sentence but must also be a reasonable reply to what was said to him.

Most speech understanding systems (SUS) can be viewed in terms of a "bottom end" and a "top end" (Klatt, 1977). The task of the bottom end is to use lexical and phonetic knowledge to recognize pieces of the speech signal by comparing the signal with pre-stored patterns. The top end aids in recognition by building expectations about which words the speaker is likely to have said, using syntactic, semantic, and pragmatic constraints. In some systems, the top end is also responsible for deciding what the utterance means, once it is recognized, and for responding appropriately. Top-down processing, based on predicting what words will be in the utterance (from the context and other words that have already been recognized), is an important feature of some systems, which can actually respond without understanding every word that was said.

The Problem: Understanding Continuous Speech

Several early isolated-word recognition systems in the 1960s preceded the work on speech understanding systems of the early 1970s. The technique used by these isolated-word systems was to compare the acoustic representation of each word in a relatively small vocabulary to the speech signal and select the best match, using a variety of "distance" metrics. The Vincens-Reddy system was one of the first isolated-word systems to be successfully demonstrated (Vincens, 1969). Until quite recently, these systems cost in the tens of thousands of dollars and offered about 95% accuracy on a small vocabulary. This methodology has recently been refined to produce commercially available products for isolated recognition of up to 100 words, costing under $1,000, although the utility of these products has yet to be demonstrated.
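To make the template-matching idea concrete, the following is a minimal sketch rather than any particular system's implementation: it assumes that each vocabulary word and each incoming utterance have already been reduced to equal-length sequences of acoustic feature vectors, and it uses a plain frame-by-frame Euclidean distance, whereas real isolated-word recognizers used more elaborate metrics and time-alignment procedures.

    import numpy as np

    def utterance_distance(utterance, template):
        """Frame-by-frame Euclidean distance between two equal-length
        sequences of acoustic feature vectors (a deliberately crude metric)."""
        return float(np.sum(np.linalg.norm(utterance - template, axis=1)))

    def recognize_isolated_word(utterance, templates):
        """Return the vocabulary word whose stored template is closest
        to the incoming utterance, along with its distance score."""
        best_word, best_score = None, float("inf")
        for word, template in templates.items():
            score = utterance_distance(utterance, template)
            if score < best_score:
                best_word, best_score = word, score
        return best_word, best_score

    # Toy example: 10 frames of 8 spectral coefficients per word.
    rng = np.random.default_rng(0)
    templates = {w: rng.normal(size=(10, 8)) for w in ["yes", "no", "stop"]}
    spoken = templates["no"] + rng.normal(scale=0.1, size=(10, 8))  # a noisy "no"
    print(recognize_isolated_word(spoken, templates))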

Unfortunately, continuous speech signals could not be handled with the same techniques: Ambiguities in human speech and erroneous or missing data in the voice signal preclude obtaining a simple, complete sequence of steps for the direct transformation of acoustic signals into sentences. Non-uniformities are introduced into the acoustic signal in several ways: First, the microphone and the background noise of the environment introduce interference in the recording of the spoken utterance. Second, a given speaker does not pronounce the same words quite the same way every time he speaks: Even if the program is "tuned" to one speaker, the matching process between the phonetic characteristics of words (the templates) and the actual utterance is inherently inexact. Third, the pronunciation of individual words changes when they are juxtaposed to form a sentence. Lastly, the words are not separated in the voiced sentence; on the contrary, whole syllables are often "swallowed" at word boundaries. Taken together, these factors imply that the basic acoustic signal, which is the foundation for the rest of the processing, does not look at all like a concatenation of the signals of individual words.

The difficulties introduced in recognizing "continuous" speech required a new viewpoint on the methodology. Researchers speculated that expectations about the form of the utterances could be gleaned from contextual information, such as the grammar and the current topic of conversation, and that these expectations could be used to help identify the actual content of the signal. Thus, the task came to be viewed as one of interpretation of acoustic signals, in light of knowledge about syllables, words, subject matter, and dialogue, and it carried with it the problem of organizing large amounts of diverse types of knowledge. The way that the speech systems enable communication among the "sources" of these various types of knowledge is one of the most interesting aspects of their design.

The ARPA Speech Understanding Research Program

In the early 1970s, the Advanced Research Projects Agency of the US Department of Defense decided to fund a five-year program in speech understanding research aimed at obtaining a breakthrough in speech understanding capability. A study group met in 1971 to set guidelines for the project (Newell et al., 1971). This group of scientists set specific performance criteria for each dimension of system inputs and outputs: The systems were to accept normally spoken sentences (connected speech), in a constrained domain with a 1000-word vocabulary, and were to respond reasonably fast ("a few times real-time" on current high-speed computers) with less than 10% error. This was one of the few times that AI programs had had design objectives set before their development. Setting these standards was important, since they closely approximated the minimal requirements for a practical connected speech understanding system in a highly constrained domain (although producing a practical system was notably not a goal of the ARPA program). The final performance of the ARPA SUS projects is discussed later in this article.

A number of different designs were explored during the ARPA effort, from 1971 to 1976, including alternative control structures and knowledge representations for transforming the speech signal into a representation of the meaning of the sentence, from which an intelligent response could be generated. These systems employed several types of knowledge, including a vocabulary and syntax tailored to the task area. In a typical system, a series of complicated transformations is applied to the analog signal, which represents the intensity of the utterance as recorded by the microphone, resulting in a compact digital encoding of the signal.
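As an illustration of what such a front end might compute, here is a small sketch; the particular parameters (short-time log energy and zero-crossing counts per 10-millisecond frame) are assumptions chosen for simplicity and are not the encoding used by any of the ARPA systems.

    import numpy as np

    def parameterize(waveform, sample_rate=10_000, frame_ms=10):
        """Reduce a sampled waveform to one parameter vector per frame:
        log short-time energy and a zero-crossing count (illustrative only)."""
        frame_len = int(sample_rate * frame_ms / 1000)
        n_frames = len(waveform) // frame_len
        params = []
        for i in range(n_frames):
            frame = waveform[i * frame_len:(i + 1) * frame_len]
            energy = np.log(np.sum(frame ** 2) + 1e-10)
            zero_crossings = int(np.sum(np.abs(np.diff(np.sign(frame))) > 0))
            params.append((energy, zero_crossings))
        return np.array(params)

    # One second of a synthetic 200-Hz tone in noise, cut into 10-ms frames.
    t = np.linspace(0, 1, 10_000, endpoint=False)
    signal = np.sin(2 * np.pi * 200 * t) + 0.05 * np.random.default_rng(1).normal(size=t.size)
    print(parameterize(signal).shape)   # -> (100, 2): 100 frames, 2 parameters each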

Some Important Design Ideas

Further processing of this signal was the task of the systems developed in the ARPA program. These systems will be described in more detail later, but first some general design considerations should be illuminated. In many ways these ideas, which are ubiquitous in AI research, appear especially clearly in the context of the large speech understanding systems.

Top-down and Bottom-up Processing. As mentioned above, the ARPA systems all used "top-end" knowledge about "likely" utterances in the domain to help identify the contents of the speech signal. Knowledge about the "grammatical" sentences (syntax), the "meaningful" sentences (semantics), and the "appropriate" sentences at each point in the dialogue (pragmatics) was used. For example, consider the HEARSAY-I speech system (Article B1), which played "voice chess" with a human opponent by responding to the moves he spoke into the microphone, using a chess program (see Search, Heuristics in Games) to figure out the best response. Not only did HEARSAY-I use syntactic knowledge about the specific format of chess moves (e.g., "pawn to king-4") to anticipate the form of incoming utterances, but it even used its chess program's legal-move generator to suggest moves that were likely to be tried by the opponent and then examined the speech signal for those particular moves (a toy sketch of this kind of prediction appears at the end of this subsection). The importance of top-down, predictive processing has also been pointed out by workers in natural language (NL) understanding research (see Natural Language, especially the articles on MARGIE and SAM). In NL research, the problem of word recognition is avoided, since the words are typed into the system; but to determine the meaning of the input, so that an appropriate response can be made, much knowledge about the world must be used by the system. Similarly, in vision research, where, as in speech, the task is one of recognition as well as understanding, a strong model of the physical world, as well as knowledge about what things the camera is likely to find, is typically used to help figure out what is in the scene (see Vision).

It is generally agreed that this constraining knowledge is necessary for adequate performance in the speech understanding task: Without expectations about what to look for in the input, the task of identifying what is there is impossible. "Experiments" with several systems demonstrated the effect of removing syntactic and semantic constraints on the processing of the signal. The HARPY system, which combined all of the phonetic, syntactic, and semantic knowledge into one integrated network, was 97% accurate in identifying the words in the utterance, even though it showed only 42% accuracy in phonetic segmentation. In other words, since "top end" knowledge about what words are allowed to follow others was incorporated in the network, HARPY could often guess the right words even when it didn't have an accurate phonetic interpretation of the signal. In the HEARSAY-I system, where the phonetic, syntactic, semantic, etc., knowledge was separated into independent subsystems (called knowledge sources), a more convincing kind of experiment could be performed. The system was designed so that it could run with only some of the knowledge sources "plugged in." The performance of HEARSAY-I improved by 25% with the addition of the syntactic knowledge source and by another 25% with the addition of the semantic knowledge source (Lea, 1978).
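The flavor of this kind of top-down constraint can be conveyed in a small sketch; the function names and data below are invented for illustration and do not reflect HEARSAY-I's actual interfaces. A predictor, standing in for the chess program's legal-move generator, proposes the utterances the speaker is likely to produce next, and only those candidates are scored against the signal.

    def predict_candidates(game_state):
        """Stand-in for a legal-move generator: returns the utterances
        the speaker could plausibly say next (hypothetical example data)."""
        return ["pawn to king four", "knight to bishop three"]

    def acoustic_score(heard_words, phrase):
        """Stand-in for the bottom-end matcher: higher is a better match.
        Here we simply count words of the phrase found in a crude
        'transcription' of the signal, to keep the sketch self-contained."""
        return sum(word in heard_words for word in phrase.split())

    def understand(heard_words, game_state):
        """Score only the top-down predictions against the signal and
        return the best-matching one."""
        candidates = predict_candidates(game_state)
        return max(candidates, key=lambda phrase: acoustic_score(heard_words, phrase))

    # Garbled bottom-end output: two of the four words were missed entirely.
    heard = {"pawn", "four"}
    print(understand(heard, game_state=None))   # -> "pawn to king four"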

Generality vs. Power. The way that top-down processes are used to constrain the content of expected sentences reflects an important universal issue in AI systems design. The top end of the speech systems contains knowledge about a specific language (grammar) and a specific domain. In the development of all of the speech understanding systems, general grammatical knowledge gave way to grammars that were very specific to the task requirements (called performance grammars), making use of the structure of the stereotypical phrases used in a particular task domain (Robinson, 1975). The trade-off between domain independence and the power needed to constrain hypothesis formation is an important idea in AI research.

Cooperating Knowledge Sources. Another major design idea was to have the various types of knowledge (phonetics, syntax, semantics, etc.) separated into independent knowledge sources. These, in turn, communicate with each other about the possible interpretations of the incoming signal. Typical of this design were the HEARSAY systems developed at Carnegie-Mellon University (CMU). In the HEARSAY model, the knowledge sources were theoretically to know nothing about each other, not even of each other's existence. They were thought of as independent processes that looked at the signal and generated hypotheses about words, phrases, etc., on the blackboard, a global data structure to which all of the knowledge sources had access.
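A minimal sketch of the blackboard idea follows; the class and method names are invented for illustration, and HEARSAY's actual hypothesis structure and scheduling were considerably richer. The point is that the knowledge sources share no state except the blackboard, on which each posts and reads hypotheses.

    class Blackboard:
        """Global data structure holding hypotheses at every level
        (word, phrase, etc.); the only channel between knowledge sources."""
        def __init__(self):
            self.hypotheses = []          # (level, content, score) triples

        def post(self, level, content, score):
            self.hypotheses.append((level, content, score))

        def read(self, level):
            return [h for h in self.hypotheses if h[0] == level]

    def lexical_ks(blackboard):
        """Bottom-end knowledge source: posts word hypotheses from the signal."""
        blackboard.post("word", "pawn", 0.9)
        blackboard.post("word", "prawn", 0.6)

    def syntactic_ks(blackboard):
        """Top-end knowledge source: reads word hypotheses and posts the
        phrase hypotheses that the grammar allows."""
        for _, word, score in blackboard.read("word"):
            if word == "pawn":                       # only "pawn ..." is grammatical here
                blackboard.post("phrase", f"{word} to king four", score * 0.95)

    board = Blackboard()
    for knowledge_source in (lexical_ks, syntactic_ks):   # scheduling is trivialized here
        knowledge_source(board)
    print(board.read("phrase"))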

The advantages of the HEARSAY model were those generally associated with modularization of knowledge (see Representation): adding, modifying, or removing a knowledge source could theoretically be accomplished without necessitating changes in the other knowledge sources. In a multiprocessor environment, where the different knowledge sources are running as processes on different machines, the system might be less sensitive to transient failures of processors and communication links (graceful degradation, as it is called).

****************************************

There must be more to say about this. More advantages? Maybe pull in stuff from DESIGN article. Any particular ref. for multi-processor HEARSAY?

****************************************

Compiled Knowledge. The other principal type of speech processing system is one in which the knowledge about all of the sentences that are meaningful in the task domain is precompiled into one network. The nodes in the network are phonetic templates that are to be matched against the voice signal. The links in the network are used to control the matching process: After a successful match at node N, only the nodes that are linked in the net to node N need be tried next; no other sound is "legal" at that point. The processing of the input signal starts at the beginning of the utterance, matching the first sound against the "start" nodes of the net, and proceeds left-to-right through the network, trying to find the "best path" according to some matching metric. Since all paths through the net are legal sentences, the best path corresponds to the system's best interpretation of the utterance. The DRAGON system and its successor, HARPY, both developed at CMU, are examples of the compiled-knowledge type of system.
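The left-to-right best-path search over such a compiled network can be sketched as a simple dynamic-programming procedure. The sketch below assumes one acoustic segment per network step and an externally supplied matching cost; this is a simplification of what the CMU systems actually did (HARPY, for instance, used a beam search over a far larger network, with segmentation and acoustic scoring folded into the search).

    def best_path(network, start_nodes, observed, match_cost):
        """Find the lowest-cost path through a compiled sentence network.

        network:     dict mapping node -> list of successor nodes
        start_nodes: nodes that may match the first observed segment
        observed:    sequence of acoustic segments, one per network step
                     (a simplifying assumption)
        match_cost:  match_cost(node, segment) -> nonnegative cost
        """
        # costs[node] = (total cost of best path ending at node, the path itself)
        costs = {n: (match_cost(n, observed[0]), [n]) for n in start_nodes}
        for segment in observed[1:]:
            new_costs = {}
            for node, (cost, path) in costs.items():
                for nxt in network.get(node, []):        # only linked nodes are "legal"
                    c = cost + match_cost(nxt, segment)
                    if nxt not in new_costs or c < new_costs[nxt][0]:
                        new_costs[nxt] = (c, path + [nxt])
            costs = new_costs
        return min(costs.values())                        # (best cost, best path)

    # Toy network whose two complete paths spell two legal "sentences".
    network = {"show": ["me", "us"], "me": ["files"], "us": ["files"]}
    templates = {"show": "sh", "me": "m", "us": "u", "files": "f"}
    cost = lambda node, seg: 0 if templates[node] == seg else 1
    print(best_path(network, ["show"], ["sh", "m", "f"], cost))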

****************************************

Larry indicates that focus of attention (island driving) and the use of pragmatic knowledge should be included in this discussion of design issues.

He also points out that the CMU systems get mentioned to the exclusion of the BBN and SRI systems.

The Status of Speech Understanding Research

The HARPY system was the best performer at the end of the ARPA program. The performance requirements established by the working committee and the final results of the HARPY system are compared below (after Klatt, 1977):

    GOAL (November, 1971)                         HARPY (November, 1976)

    Accept connected speech                       yes
    from many                                     5 speakers (3 male, 2 female)
    cooperative speakers                          yes
    in a quiet room                               computer terminal room
    using a good microphone                       close-talking microphone
    with slight tuning/speaker                    20-30 sentences per talker
    accepting 1000 words                          1011-word vocabulary
    using an artificial syntax                    average branching factor = 33
    in a constraining task                        document retrieval
    yielding < 10 percent semantic error          5 percent semantic error
    requiring approximately 300 million           requiring 28 MIPSS,
      instructions per second of speech             using 256K of 36-bit words,
      (MIPSS)*                                      costing $5.00 per sentence

    * The actual specification stated "a few times real-time" on a 100-MIPSS machine.

Comparing the performance of the various systems at the termination of the ARPA project (September, 1976) is complicated by several factors. A standard set of test utterances was not prepared, since each system used a different task domain. The task domains, which included document retrieval (HARPY, HEARSAY-II), answering questions from a database (the SRI system and BBN's HWIM), and voice chess (HEARSAY-I), had a range of difficulties (as measured by branching factor) from 33 for HARPY to 196 for HWIM. Systems also varied in the number of speakers and the amount of room noise that could be accommodated, and in the amount of tuning required for each new speaker.

For example, in a variety of tasks, DRAGON recognized from 63-94% of the words and from 17-68% of the utterances with up to a 194-word vocabulary. This variation of results across domains displayed by DRAGON demonstrates the difficulty of specifying how well a system performs. It appears that the number of words in the lexicon alone is an inadequate measure of the complexity of the understanding task. The CMU speech group proposes a measurement termed average branching factor, based on the number of words that can typically follow at each point in each legal sentence. For example, DRAGON's performance was better with a particular 194-word vocabulary than with a 37-word vocabulary consisting of just the alphabet plus numbers, which has a higher branching factor because of the similarity in phonemic structure of the 26 letters.
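A naive version of this measure can be computed by averaging, over the states of a finite-state grammar, the number of words that may legally follow; the sketch below uses that simplification on an invented toy grammar, whereas the CMU measure was defined more carefully.

    def average_branching_factor(grammar):
        """grammar: dict mapping each state of a finite-state grammar to the
        list of (word, next_state) transitions allowed from it.  Returns the
        mean number of words selectable at a state (a naive estimate)."""
        fan_outs = [len(transitions) for transitions in grammar.values() if transitions]
        return sum(fan_outs) / len(fan_outs)

    # Toy command grammar: "show|list" then "all|new" then "files|messages".
    toy_grammar = {
        "S0": [("show", "S1"), ("list", "S1")],
        "S1": [("all", "S2"), ("new", "S2")],
        "S2": [("files", "END"), ("messages", "END")],
        "END": [],
    }
    print(average_branching_factor(toy_grammar))   # -> 2.0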

The complexity of the speech understanding task is also demonstrated by the impact of the problem on other areas of AI. During the recent speech efforts, new research in natural language, representation of knowledge, search, and control strategies has been required to deal with the objective of recognizing continuous speech in the presence of ambiguity. For a more detailed comparison and discussion, see Lea & Shoup (1978b) and Klatt (1977).

Summary

Considerable progress toward practical speech understanding systems was made during the 1970s, and work in this area has generated ideas that influenced work in other areas of AI, such as vision and natural language understanding. Following is a summary of the conclusions of the same study group that established the requirements for the ARPA project at the beginning of the decade:

The gains go beyond empirical knowledge and engineering technique to basic scientific questions and analyses. A few examples: Network representations for speech knowledge at several levels have been created that have substantial generality. A uniform network representation for the recognition process has been developed. Rule schemes have been created to express certain phonological and acoustic-phonetic regularities.... Techniques have been found for measuring the difficulty and complexity of the recognition task. The problem of parsing (syntactic analysis) with unreliable or possibly missing words (so that one cannot rely on parsing left-to-right, but must be able to parse in either direction or middle-out from good word matches) has been successfully analyzed. New paradigms have been developed for many of the component analysis tasks and for the control structure of intelligent systems. Substantial progress has been made on understanding how to score performance in a multi-component system, how to combine those scores, and how to order search priorities (Medress et al., 1977).

Most of the principal speech understanding systems are described in the articles in this chapter. Many of the ARPA contractors produced multiple systems during this period. Work at Bolt, Beranek and Newman, Inc. (BBN) produced first the SPEECHLIS and then the HWIM system, building on previous BBN research in natural language understanding (see Natural Language, LUNAR). Carnegie-Mellon University produced the HEARSAY-I and DRAGON systems in the early development phase (1971-1973) and the HARPY and HEARSAY-II programs in the later stage. SRI International also developed several speech understanding programs, partly in collaboration with the System Development Corporation. Additional systems, not reported in this chapter, include work at Lincoln Labs, Univac, and IBM. The current IBM system utilizes the dynamic programming approach explored in the DRAGON system and is the most active speech understanding project under development since the end of the ARPA program (Bahl et al., 1978).

References

The recent book by Lea (1978) contains the best comparative overview of the ARPA speech systems, as well as detailed articles on the systems themselves written by their designers. Other good summary articles are Lea & Shoup (1978b) and Klatt (1977). For a popular account of the ambiguities inherent at the phonetic level of encoding, see Cole (1979). Descriptions of early speech research and the goals of the ARPA program are in Newell (1973) and Reddy (1975).


IEEE Transactions on Acoustics, Speech, and Signal Processing, February 1975 (a special issue on speech understanding).

Bahl, L. R., Baker, J. K., Cohen, P. S., Cole, A. G., Jelinek, F., Lewis, B. L., & Mercer, R. L. Automatic recognition of continuously spoken sentences from a finite state grammar. Proceedings of the 1978 IEEE International Conference on Acoustics, Speech, and Signal Processing, Tulsa, Oklahoma, 1978, pp. 418-421.

Baker, J. K. The DRAGON system: An overview. IEEE ASSP, February 1975, ASSP-23(1), 24-29.

Cole, R. A. Navigating the slippery stream of speech. Psychology Today, April 1979, 12(11), 77-87.

de Mori, R. On speech recognition and understanding. In K. S. Fu & A. B. Whinston (Eds.), Pattern recognition, theory and application. Leyden: Noordhoff, 1977. Pp. 289-330.

de Mori, R. Recent advances in automatic speech recognition. Proc. of the 4th Int. Joint Conf. on Pattern Recognition, Kyoto, Japan, November 1978.

Erman, L. D. A functional description of the Hearsay-II speech understanding system. In Speech Understanding Systems: Summary of Results of the Five-year Research Effort at Carnegie-Mellon University, CMU Computer Science Tech. Report, August 1977. (Also: IEEE Conf. on Acoustics, Speech, and Signal Processing, Hartford, Conn., May 1977.)

Erman, L. D., & Lesser, V. R. The HEARSAY-II speech understanding system: A tutorial. In W. A. Lea (Ed.), Trends in Speech Recognition. Englewood Cliffs, N.J.: Prentice-Hall. In press, 1978. (a)

Erman, L. D., & Lesser, V. R. System engineering techniques for artificial intelligence systems. In A. Hanson & E. Riseman (Eds.), Computer vision systems. New York: Academic Press. In press, 1978. (b)

Hayes-Roth, F., & Lesser, V. Focus of Attention in the Hearsay-II Speech Understanding System. CMU Computer Science Tech. Report, Carnegie-Mellon University, January 1977.

Klatt, D. H. Review of the ARPA Speech Understanding Project. Journal of the Acoustical Society of America, 1977, 62, 1345-1366.

Lea, W. A. (Ed.) Trends in Speech Recognition. Englewood Cliffs, N.J.: Prentice-Hall. In press, 1978.

Lea, W. A., & Shoup, J. E. Review of the ARPA SUR Project and survey of the speech understanding (Final report on ONR Contract No. N00014-77-C-0570). Speech Communication Research Laboratory, Santa Barbara, CA, 1978. (a)

Lea, W. A., & Shoup, J. E. Specific contributions of the ARPA SUR Project. In W. A. Lea (Ed.), Trends in Speech Recognition. Englewood Cliffs, N.J.: Prentice-Hall. In press, 1978. (b)

Lesser, V. R., & Erman, L. D. A retrospective view of the HEARSAY-II architecture. IJCAI 5, 1977, 790-800.

Lowerre, B., & Reddy, R. The HARPY speech understanding system. In W. A. Lea (Ed.), Trends in Speech Recognition. Englewood Cliffs, N.J.: Prentice-Hall, 1978.

Lowerre, B. The HARPY Speech Recognition System. Doctoral dissertation, CMU Computer Science Tech. Report, Carnegie-Mellon University, 1976.

Medress, M. F., et al. Speech understanding systems: Report of a steering committee. SIGART Newsletter, 1977, 62 (April), 4-8. (Also in Artificial Intelligence, 1977, 9, 307-316.)

Nash-Webber, B. L. The role of semantics in automatic speech understanding. In D. Bobrow & A. Collins (Eds.), Representation and Understanding: Studies in Cognitive Science. New York: Academic Press, 1975.

Newell, A. A tutorial on speech understanding systems. In D. R. Reddy (Ed.), Speech Recognition: Invited Papers Presented at the 1974 IEEE Symposium. New York: Academic Press, 1975. Pp. 3-54.

Newell, A. HARPY, Production Systems and Human Cognition. CMU-CS-78-140, Dept. of Computer Science, Carnegie-Mellon University, 1978.

Newell, A., Barnett, J., Forgie, J., Green, C., Klatt, D. H., Licklider, J. C. R., Munson, J., Reddy, D. R., & Woods, W. A. Speech Understanding Systems: Final Report of a Study Group. Carnegie-Mellon University, 1971 (reprinted by North-Holland/American Elsevier, Amsterdam, 1973).

Newell, A., Barnett, J., Forgie, J., Green, C., Klatt, D., Licklider, J. C. R., Munson, J., Reddy, R., & Woods, W. Speech Understanding Systems: Final Report of a Study Group. Amsterdam: North-Holland/American Elsevier, 1973 (originally published in 1971).

Paxton, W. H. A Framework for Language Understanding. SRI Tech. Note 131, AI Center, SRI International, Inc., Menlo Park, Calif., June 1976.

Reddy, R., Erman, L., Fennell, R., & Neely, R. The HEARSAY speech understanding system: An example of the recognition process. IJCAI 3, 1973, 185-193. (Reprinted in IEEE Transactions on Computers, 1976, C-25, 427-431.)

Reddy, R., et al. Speech Understanding Systems: Summary of Results of the Five-year Research Effort at Carnegie-Mellon University. CMU Computer Science Tech. Report, Carnegie-Mellon University, August 1977.

Reddy, R. (Ed.) Speech Recognition: Invited Papers of the IEEE Symposium. New York: Academic Press, 1975.

Reddy, R. Speech recognition by machine: A review. Proceedings of the IEEE, 1976, 64, 501-531.

Robinson, J. J. Performance grammars. In D. R. Reddy (Ed.), Speech Recognition: Invited Papers of the IEEE Symposium. New York: Academic Press, 1975. Pp. 401-427.

Smith, A. R. Word hypothesization for large-vocabulary speech understanding systems. Doctoral dissertation, Carnegie-Mellon University (also available as Tech. Rep.), 1977.

Vincens, P. Aspects of Speech Recognition by Computer. Ph.D. thesis, Computer Science Department, Stanford University, 1969.

Walker, D. E. (Ed.) Understanding Spoken Language. New York: North-Holland, 1978.

Woods, W. A. SPEECHLIS: An experimental prototype for speech understanding research. IEEE ASSP, February 1975, ASSP-23(1), 2-10.

Woods, W. A., et al. Speech Understanding Systems: Final Report (BBN Report 3438, Vols. 1-5). Cambridge, Mass.: Bolt, Beranek and Newman, November 1974 - October 1976.
