SPEECH UNDERSTANDING IN OPEN TASKS

Wayne Ward, Sunil Issar
Xuedong Huang, Hsiao-Wuen Hon, Mei-Yuh Hwang
Sheryl Young, Mike Matessa
Fu-Hua Liu, Richard Stern

School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213

ABSTRACT

The Air Traffic Information Service task is currently used by DARPA as a common evaluation task for Spoken Language Systems. This task is an example of an open task: subjects are given a task and allowed to interact spontaneously with the system by voice. There is no fixed lexicon or grammar, and subjects are likely to exceed those used by any given system. In order to evaluate system performance on such tasks, a common corpus of training data has been gathered and annotated. An independent test corpus was created in a similar fashion. This paper explains the techniques used in our system and the performance results on the standard set of tests used to evaluate systems.

1. SYSTEM OVERVIEW

Our Spoken Language System uses a speech recognizer which is loosely coupled to a natural language understanding system. The SPHINX-II system produces a single best hypothesis for the input. It uses a backed-off class bigram language model in decoding the input. This type of smoothed stochastic language model provides some flexibility when presented with unusual grammatical constructions. The single best hypothesis is passed to the natural language understanding system, which uses flexible parsing techniques to cope with novel phrasings and misrecognitions. In addition to the basic speech recognition and natural language understanding modules, we have developed techniques to enhance the performance of each. We have developed an environmental robustness module to minimize the effects of changing environments on recognition. We have also developed a system that uses a knowledge base to assess and correct the parses produced by our natural language parser.

We present each of the modules separately and discuss their evaluation results in order to understand how well the techniques perform. The authors on each line of the paper heading reflect the people who worked on the corresponding module.

2. FLEXIBLE PARSING

Our NL understanding system (Phoenix) is flexible at several levels. It uses a simple frame mechanism to represent task semantics. Frames are associated with the various types of actions that can be taken by the system. Slots in a frame represent the various pieces of information relevant to the action that may be specified by the subject. For example, the most frequently used frame is the one corresponding to a request to display some type of flight information. Slots in the frame specify what information is to be displayed (flights, fares, times, airlines, etc.), how it is to be tabulated (a list, a count, etc.) and the constraints that are to be used (date ranges, time ranges, price ranges, etc.).

The Phoenix system uses recursive transition networks to specify word patterns (sequences of words) which correspond to semantic tokens understood by the system. A subset of the tokens are considered top-level tokens, which means they can be recognized independently of surrounding context. Nets call other nets to produce a semantic parse tree. The top-level tokens appear as slots in frame structures. The frames serve to associate a set of semantic tokens with a function. Information is often represented redundantly in different nets: some nets represent more complex bindings between tokens, while others represent simple stand-alone values. In our system, slots (pattern specifications) can be at different levels in a hierarchy. Higher level slots can contain the information specified in several lower level slots. These higher level forms allow more specific relations between the lower level slots to be specified. For example, "from denver arriving in dallas after two pm" will have two parses:

    [DEPART_LOC] from [depart_loc] [city] denver
    [ARRIVE_LOC] arriving in [arrive_loc] [city] dallas
    [DEPART_TIME] [depart_time_range] after [start_time] [time] two pm

    [DEPART_LOC] from [depart_loc] [city] denver
    [ARRIVE] arriving in [arrive_loc] [city] dallas
             [arrive_time_range] after [start_time] [time] two pm

The existence of the higher level slot [ARRIVE] allows this ambiguity to be resolved: it allows the two lower level nets [arrive_loc] and [arrive_time_range] to be specifically associated. The second parse, which has [arrive_loc] and [arrive_time_range] as subnets of the slot [ARRIVE], is the preferred interpretation. In picking which interpretation is correct, higher level slots are preferred to lower level ones because the associations between concepts are more tightly bound; thus the second (correct) interpretation is picked here. The simple heuristic of selecting the interpretation with fewer slots (given the same number of words accounted for) resolves this situation correctly.

The parser operates by matching the word patterns for tokens against the input text. A set of possible interpretations is pursued simultaneously. A subsumption algorithm is used to find the longest version of a phrase for efficiency purposes. As tokens (phrases) are recognized, they are added to the frames to which they apply. The algorithm is basically a dynamic programming beam search: many different frames, and several different versions of a frame, are pursued simultaneously. The score for each frame hypothesis is the number of words that it accounts for. At the end of an utterance the parser picks the best scoring frame as the result.
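To make the scoring and tie-breaking just described concrete, here is a minimal sketch (ours, not the Phoenix implementation; the data structures and names are purely illustrative): candidate interpretations are scored by the number of input words they account for, and ties are broken in favor of the interpretation with fewer top-level slots.

    # Illustrative sketch of interpretation selection (not the Phoenix code).
    # Each candidate interpretation maps top-level slots to the input words
    # matched by that slot's pattern.

    def words_accounted_for(interpretation):
        """Total number of input words covered by the interpretation's slots."""
        return sum(len(words) for words in interpretation.values())

    def pick_interpretation(candidates):
        """Prefer the candidate covering the most words; on ties, prefer the
        candidate with fewer top-level slots (tighter concept bindings)."""
        return max(candidates, key=lambda c: (words_accounted_for(c), -len(c)))

    # The two parses of "from denver arriving in dallas after two pm" shown
    # above: both cover the same eight words, but the second binds the arrival
    # city and time together under the higher level [ARRIVE] slot.
    flat = {
        "DEPART_LOC":  ["from", "denver"],
        "ARRIVE_LOC":  ["arriving", "in", "dallas"],
        "DEPART_TIME": ["after", "two", "pm"],
    }
    nested = {
        "DEPART_LOC": ["from", "denver"],
        "ARRIVE":     ["arriving", "in", "dallas", "after", "two", "pm"],
    }

    assert pick_interpretation([flat, nested]) is nested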

The parse is flexible at the slot level in that it allows slots to be filled independent of order. It is not necessary to represent all of the different orders in which the slot patterns could occur. Grammatical restarts and repeats are handled by overwriting a slot if the same slot is subsequently recognized again.

The pattern matches are also flexible because of the way the grammars are written. The patterns for a semantic token consist of mandatory words or tokens which are necessary to the meaning of the token, plus optional elements. The patterns are also written to overgenerate in ways that do not change the semantics. This overgeneration not only makes the pattern matches more flexible but also serves to make the networks smaller. For example, the nets are collapsed at points such that tense, number and case restrictions are not enforced. The articles A and AN are treated identically.

The slots in the best scoring frame are then used to build objects. In this process, all dates, times, names, etc. are mapped into a standard form for the routines that build the database query. The objects represent the information that was extracted from the utterance. There is also a currently active set of objects which represent constraints from previous utterances. The new objects created from the frame are merged with the current set of objects. At this step ellipsis and anaphora are resolved. Resolution of ellipsis and anaphora is relatively simple in this system: the slots in frames are semantic, thus we know the type of object needed for the resolution. For ellipsis, we add the new objects. For anaphora, we simply have to check that an object of that type already exists.

Each frame has an associated function. After the information is extracted and objects built, the frame function is executed. This function takes the action appropriate for the frame. It builds a database query (if appropriate) from the objects, sends it to SYBASE (the DataBase Management System we use) and displays output to the user. This system has been described in previous papers. [1] [2]

2.1. Natural Language Training Data

The frame structures and patterns for the Recursive Transition Networks were developed by processing transcripts of subjects performing scenarios of the ATIS task. The data were gathered by several sites using Wizard paradigms, in which the subjects are told that they are using a speech recognition system in the task, but an unseen experimenter is actually controlling the responses to the subject's screen. The data were submitted to NIST and released by them. There have been three sets of training data released by NIST: ATIS0, ATIS1 and ATIS2. We used only data from these releases in developing our system. A subset of this data (approximately 5000 utterances) has been annotated with reference answers. We have used only a subset of the ATIS2 data, including all of the annotated data. The development test sets (for ATIS0 and ATIS1) were not included in the training.

2.2. Natural Language Processing Results

A set of 980 utterances comprising 123 sessions from 37 speakers was set aside as a test set. Transcripts of these utterances were processed by the systems to evaluate the performance of the Natural Language Understanding modules. This provides an upper bound on the performance of the Spoken Language Systems, i.e. it represents the performance given perfect recognition. The utterances for sessions provided dialog interaction with a system, not just the processing of isolated utterances. All of the utterances were processed by the systems as dialogs. For result reporting purposes, the utterances were divided into three classes:

• Class A - utterances requiring no context for interpretation

• Class D - utterances that can be interpreted only in the context of previous utterances

• Class X - utterances that for one reason or another were not considered answerable.

Our results for processing the test set transcripts are shown in Table 1. There were 402 utterances in Class A and 285 utterances in Class D, for a combined total of 687 utterances. The remainder of the 980 utterances were Class X and thus were not scored. The database output of the system is scored. The percent correct figure is the percent of the utterances for which the system returned the (exactly) correct output from the database. The percent wrong is the percent of the utterances for which the system returned an answer from the database, but the answer was not correct. The percent NO_ANS is the percentage of the utterances that the system did not attempt to answer. The Weighted Error measure is computed as (2 * %Wrong) + %NO_ANS. These NL results (both percent correct and weighted error) were the best of any site reporting.

Class   % Correct   % Wrong   % NO_ANS   Weighted Error
A+D     84.7        14.8      0.4        30.1
A       88.6        11.4      0.0        22.9
D       79.3        19.6      1.1        40.4

Table 1: NL results from processing test set transcripts.
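To make the Weighted Error computation concrete, the Class A row of Table 1 gives

    Weighted Error (Class A) = (2 * %Wrong) + %NO_ANS = (2 * 11.4) + 0.0 = 22.8

which matches the tabulated 22.9 up to rounding of the underlying per-utterance counts.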

2.3. Comparison to February 1991 system

The purpose of evaluations is not only to measure current performance, but also to measure progress over time. A similar evaluation was conducted in February 1991.

For Class A data, our percent correct performance increased from 80.7 to 88.6. This means that the percentage of errors decreased from 19.3 to 11.4, representing a decrease in errors of 41 percent. The weighted error decreased from 36.0 to 22.9.

For Class D data, our percent correct increased from 60.5 to 79.3. This represents a decrease in errors of 48 percent. The weighted error was reduced from 115.8 to 40.4.

The basic algorithms used are the same as for previous versions of the system. The increase in performance came primarily from:

• Bug fixes (primarily to the SQL generation code)

• Extension of the semantics, grammar and lexicon from processing part of the ATIS2 training data

• An improved context mechanism

2.4. Partial Understanding

In our system, we use the NO_ANSWER response differently than other sites; if our results are compared to others, we output far fewer NO_ANSWER responses. This is because we use a different criterion for choosing not to answer. In order to optimize the weighted error measure, one would want to choose not to answer an utterance if the system believed that the input was not completely understood correctly, i.e. if it thought that the answer would not be completely correct. However, if the system chooses not to answer, it should ignore all information in the utterance. Since our goal is to build interactive spoken language understanding systems, we prefer a strategy that shows the user what is understood and engages in a clarification dialog with the user to get missing information or correct misunderstandings. For this procedure we need to retain the information that was understood from the utterance for dialog purposes. The user must also be clearly shown what was understood. Therefore, we only output a NO_ANSWER response when the system did not arrive at even a partial understanding of the utterance.

3. SPEECH PROCESSING

For our recognizer, we use the SPHINX-II speech recognition system. In comparison with the SPHINX system, the SPHINX-II system incorporates multiple dynamic features (extended from three codebooks to four), a speaker-normalized front end, sex-dependent semi-continuous hidden Markov models (which replace discrete models), and the shared-distribution representation (which replaces generalized between-word triphones). [3] [4] For the February 1992 ATIS evaluation, we used SPHINX-II (without the speaker normalization component) to construct vocabulary-independent models and adapted the vocabulary-independent models with ATIS training data. The system used a backoff class bigram language model and a Viterbi beam search.

3.1. Acoustic Training

In order to efficiently share parameters across word models, the SPHINX-II system uses shared-distribution models. [5] The states in the phonetic HMMs are treated as the basic unit for modeling and are referred to as senones. [4] There were 6500 senones in the systems. Vocabulary-independent acoustic models were trained on approximately 12,000 general English utterances. These models were used to initialize vocabulary-specific models (the vocabulary-independent mapping table was used), which were then trained on the task-specific data. Approximately 10,000 utterances from the ATIS0, ATIS1 and ATIS2 training sets were used in the adaptation training. The original vocabulary-independent models were then interpolated with the vocabulary-dependent models to give the adapted models used in the recognition.

3.2. Lexicon and Language Model

A backoff class bigram grammar was trained on a total of approximately 12,000 utterances from the same three NIST ATIS distributions. The grammar used a lexicon of 1389 words with 914 word classes defined. The system used seven models for non-speech events.
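As an illustration of how such a backed-off class bigram can be organized, the sketch below computes a word bigram probability from a class transition probability and a within-class word probability, backing off to a discounted class unigram when the class bigram is unseen. The word classes, probabilities and backoff weight are invented for the example; they are not the values trained for the ATIS system.

    # Minimal sketch of a backed-off class bigram language model
    # (illustrative values only).
    word_class = {"boston": "CITY", "denver": "CITY", "from": "FROM", "to": "TO"}
    p_word_given_class = {"boston": 0.5, "denver": 0.5, "from": 1.0, "to": 1.0}
    p_class_bigram = {("FROM", "CITY"): 0.6, ("TO", "CITY"): 0.7}
    p_class_unigram = {"CITY": 0.3, "FROM": 0.2, "TO": 0.2}
    BACKOFF_WEIGHT = 0.1   # a real model estimates this per history

    def bigram_prob(w1, w2):
        """P(w2 | w1) under the class bigram, backing off to the class unigram."""
        c1, c2 = word_class[w1], word_class[w2]
        p_class = p_class_bigram.get((c1, c2))
        if p_class is None:                       # unseen class bigram: back off
            p_class = BACKOFF_WEIGHT * p_class_unigram[c2]
        return p_class * p_word_given_class[w2]

    print(bigram_prob("from", "denver"))   # seen:   0.6 * 0.5       = 0.30
    print(bigram_prob("to", "from"))       # unseen: 0.1 * 0.2 * 1.0 = 0.02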

Class    Correct   Sub    Deletions   Insertions   Error
A+D+X    88.2      9.7    2.1         4.4          16.2
A+D      91.9      6.5    1.6         3.7          11.8
A        92.8      5.7    1.6         3.2          10.4
D        90.3      8.2    1.5         4.8          14.5
X        78.9      17.6   3.4         6.1          27.2

Table 2: SPHINX-II speech recognition results.
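As a check on the Error column (the sum of the substitution, deletion and insertion rates, as noted in Section 3.3), the Class A+D row gives 6.5 + 1.6 + 3.7 = 11.8; rows where the sum is off by 0.1 presumably reflect rounding of the underlying counts.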

Class   % Correct   % Wrong   % NO_ANS   Weighted Error
A+D     66.7        32.9      0.4        66.2
A       74.1        25.9      0.0        51.7
D       56.1        42.8      1.1        86.7

Table 3: SLS results from processing test set speech input.

3.3. Speech Processing Results

The speech recognition results for the test set are shown in Table 2. The Error column is the sum of Substitutions, Insertions and Deletions. The output from the recognizer was then sent to the NL system to get the complete Spoken Language System results. These are shown in Table 3.

3.4. Comparison to February 1991 system

For Class A data, our word error percentage was reduced from 28.7 to 10.4, representing a decrease in errors of 64 percent. The overall SLS error is a function of both the speech recognition and natural language errors. Our percentage of errors in SLS output decreased from 39 to 26, representing a decrease in errors of 33 percent. The weighted error decreased from 65.5 to 51.7.

For Class D data, our word error percentage was reduced from 26.9 to 14.5, representing a decrease in errors of 46 percent. Our percentage of errors in SLS output decreased from 61 to 44, representing a decrease in errors of 28 percent. The weighted error decreased from 116 to 87.

The increase in speech recognition performance came from using the SPHINX-II system where we used SPHINX in 1991. The primary differences are:

• Semi-continuous shared-distribution HMMs replaced discrete HMM generalized triphones

• Sex-dependent models were added

• A second order difference cepstrum codebook was added

4. KNOWLEDGE BASED CORRECTION

The MINDS-II SLS system is a back-end module which applies constraints derived from syntax, semantics, pragmatics, and applicable discourse context and discourse structure to detect and correct erroneous parses, skipped or overlooked information and out of domain requests. The MINDS-II transcript processor is composed of a dialog module, an utterance analyzer and a domain constraints model. Input to the CMU MINDS-II NL system is the transcribed string, the parse produced by the PHOENIX caseframe parser and the parse matrix. The system first looks for out of domain requests by looking for otherwise reasonable domain objects and relations among objects not included in this application database. Second, it tries to detect and correct all misparses by searching for alternate interpretations of both strings and relations among identified domain concepts. Further unanswerable queries are detected in this phase, although the system cannot determine whether the queries are unanswerable because the speaker misspoke or intentionally requested extra-domain information. Third, the system evaluates all word strings not contained in the parsed representation to assess their potential importance and attempts to account for the information. Unaccounted-for information detected includes interjections, regions with inadequate grammatical coverage and regions where the parser does not have the knowledge to include the information in the overall utterance interpretation. All regions containing interjections or on-line edits and corrections are deemed unimportant and passed over. When the system finds utterances with important unaccounted-for information, it searches through the parse matrix to find all matches performed in the region. It then applies abductive reasoning and constraint satisfaction techniques to form a new interpretation of the utterance.

Semantic and pragmatic knowledge is represented with multi-layered hierarchies of frames. Each knowledge layer contains multiple hierarchies and relations to other layers. Semantic information of similar granularity is represented in a single layer. The knowledge base contains knowledge of objects, attributes, values, actions, events, complex events, plans and goals. Syntactic knowledge is represented as a set of rules. The discourse model makes use of the current focus stack, inferred speaker goals and plans, and dialog principles which constrain "what can come next" in a variety of contexts. Goal and plan inference and tracking are performed. Constraints are derived by first applying syntactic constraints, constraining these by utterance level semantic and pragmatic constraints, followed by discourse level constraints when applicable. The system outputs either semantically interpreted utterances, represented as variables and bindings for the database interface, or error codes for "No_Answer" items.

The system was trained using 115 dialogs, approximately 1000 of the utterances from the MADCOW ATIS-2 training. Previously, the system had been trained on the ATIS-0 training set. This system incorporates the SOUL utterance analysis system as well as a dialog module for the Feb92 benchmark tests.
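The layering of constraints can be pictured with the following sketch (ours, not MINDS-II code; the predicate names and the representation of candidates as a simple list are assumptions made for illustration): candidate interpretations drawn from the parse matrix are filtered through successive constraint layers, with discourse constraints applied only when a dialog context is available.

    # Illustrative sketch of layered constraint filtering over candidate
    # interpretations (not the MINDS-II implementation).
    def filter_candidates(candidates, syntactic_ok, semantic_ok, discourse_ok=None):
        """Return the interpretations surviving successive constraint layers.
        An empty result corresponds to emitting a No_Answer error code."""
        survivors = [c for c in candidates if syntactic_ok(c)]
        survivors = [c for c in survivors if semantic_ok(c)]
        if discourse_ok is not None:        # discourse constraints when applicable
            survivors = [c for c in survivors if discourse_ok(c)]
        return survivors

    # Toy usage: candidates are (slot, value) relations proposed for a region.
    candidates = [("arrive_loc", "dallas"), ("depart_loc", "dallas")]
    print(filter_candidates(
        candidates,
        syntactic_ok=lambda c: True,                         # word order plausible
        semantic_ok=lambda c: c[1] in {"dallas", "denver"},  # value is a known city
        discourse_ok=lambda c: c[0] == "arrive_loc",         # dialog focus: arrival
    ))                                                       # [('arrive_loc', 'dallas')]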

System     Class   % Correct   % Wrong   % NO_ANS   Weighted Error
Phoenix    A+D     66.7        32.9      0.4        66.2
MINDS-II   A+D     64.3        25.3      10.3       61.0

Table 4: UNOFFICIAL comparison of MINDS-II and Phoenix results from processing test set speech input.

4.1. Knowledge Based Processing Results

Due to mechanical problems, the results from this test were submitted to NIST after the deadline for official submissions. Therefore, they were not scored by NIST and are not official benchmark results. However, the results were generated observing all procedures for benchmark tests. They were run on the official test set, without looking at the data first. One version control bug was fixed when the system crashed while running the test. No code was changed; we realized that the wrong version (an obsolete one) of one function was used, and we substituted the correct one. The results were scored using the most recent comparator software released by NIST and the official answers (after adjudication).

5. ENVIRONMENTAL ROBUSTNESS

This year we incorporated the Code-Word Dependent Cepstral Normalization (CDCN) procedure developed by Acero into the ATIS system. For the official ATIS evaluations we used the original version of this algorithm, as described in [6]. (Recent progress on this and similar algorithms for acoustical pre-processing of speech signals is described elsewhere in these proceedings [7].)

The recognition system used for the robust speech evaluation was identical to that with which the baseline results were obtained, except that the CDCN algorithm was used to transform the cepstral coefficients in the test data so that they would most closely approximate the statistics of the ensemble of cepstra observed in the training environment. All incoming speech was processed with the CDCN algorithm, regardless of whether the testing environment was actually the standard Sennheiser close-talking microphone or the desktop Crown PCC-160 microphone; the algorithm does not have explicit knowledge of the identity of the environment within which it is operating.

Because of time constraints, we did not train the system used for the official robust-speech evaluations as thoroughly as the baseline system was trained. Specifically, the robust-speech system was trained on only 10,000 sentences from the ATIS domain, while the baseline system was trained on an additional 12,000 general English utterances as well. The acoustic models for the robust-speech system using CDCN were created by initializing the HMM training process with the models used in the baseline SPHINX-II system. The official evaluations were performed after only a single iteration through training data that was processed with the CDCN algorithm.

The official speech recognition scores using the CDCN algorithm and the Sennheiser HMD-414 and Crown PCC-160 microphones are summarized in Table 5.

System            Microphone   % Error
SPHINX-II         HMD-414      13.9
SPHINX-II+CDCN    HMD-414      16.6
SPHINX-II+CDCN    PCC-160      21.7

Table 5: Comparison of speech recognition performance of SPHINX-II with and without the CDCN algorithm on the 447 A+D+X sentences in the test set which were recorded using the PCC-160 microphone as well as the Sennheiser HMD-414.

We summarize the word error scores for all 447 utterances that were recorded using both the Sennheiser HMD-414 and Crown PCC-160 microphones. For comparison purposes, we include figures for the baseline system on this subset of utterances, as well as figures for the system using the CDCN algorithm for the same sentences. We believe that the degradation in performance from 13.9% to 16.6% for these sentences using the close-talking Sennheiser HMD-414 microphone is at least in part a consequence of the more limited training of the system with the CDCN algorithm. We note that the change from the HMD-414 to the PCC-160 produces only a 30% degradation in error rate. Only two sites submitted data for the present robust speech evaluation, and CMU's percentage degradation in error rate in changing to the new testing environment, as well as the absolute error rate in that environment, were the better of the results from these two sites.
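The environment-normalization idea used here can be illustrated in a deliberately simplified form. The sketch below is not the CDCN algorithm (which estimates codeword-dependent corrections); it only shifts and scales the test cepstra so that their global mean and variance match statistics observed in the training environment.

    # Simplified illustration of normalizing test cepstra toward
    # training-environment statistics (NOT the CDCN algorithm).
    import numpy as np

    def normalize_toward_training(cepstra, train_mean, train_std):
        """cepstra: array of shape (frames, coefficients) from the test environment."""
        test_mean = cepstra.mean(axis=0)
        test_std = cepstra.std(axis=0) + 1e-8   # guard against zero variance
        return (cepstra - test_mean) / test_std * train_std + train_mean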

System            Microphone   % Correct   % Wrong   % NO_ANS   Weighted Error
SPHINX-II+CDCN    HMD-414      69.0        31.0      0.0        62.0
SPHINX-II+CDCN    PCC-160      56.6        43.1      0.3        86.4

Table 6: Comparison of SLS performance of SPHINX-II with the CDCN algorithm on the 332 A+D sentences in the test set which were recorded using the PCC-160 microphone as well as the Sennheiser HMD-414.

Summary results for the corresponding SLS scores for the 332 Class A+D utterances that were recorded using the Crown PCC-160 microphone are provided in Table 6. Switching the testing environment from the Sennheiser HMD-414 to the Crown PCC-160 degraded the number of correct SQL queries by only 21.8%, which corresponds to a degradation of 39.3% for the weighted error score. CMU was the only site to submit SLS data using the PCC-160 microphone for the official evaluation.

REFERENCES

1. Ward, W., "The CMU Air Travel Information Service: Understanding Spontaneous Speech", Proceedings of the DARPA Speech and Natural Language Workshop, June 1990, pp. 127-129.

2. Ward, W., "Evaluation of the CMU ATIS System", Proceedings of the DARPA Speech and Natural Language Workshop, February 1991, pp. 101-105.

3. Huang, Lee, Hon, and Hwang, "Improved Acoustic Modeling for the SPHINX Speech Recognition System", ICASSP, 1991, pp. 345-348.

4. Hwang and Huang, "Subphonetic Modeling with Markov States - Senone", ICASSP, 1992.

5. Hwang, M. and Huang, X., "Acoustic Classification of Phonetic Hidden Markov Models", Eurospeech Proceedings, 1991.

6. Acero, A. and Stern, R. M., "Environmental Robustness in Automatic Speech Recognition", ICASSP-90, April 1990, pp. 849-852.

7. Stern, R. M., Liu, F.-H., Ohshima, Y., Sullivan, T. M., and Acero, A., "Multiple Approaches to Robust Speech Recognition", DARPA Speech and Natural Language Workshop, February 1992.
