Using Language Modelling to Integrate Speech Recognition with a Flat Semantic Analysis

Dirk Bühler, Wolfgang Minker, Artha Elciyanti
Department of Information Technology, University of Ulm, Germany
{buehler,minker,artha}@it.e-technik.uni-ulm.de

Abstract

One-stage decoding as an integration of speech recognition and linguistic analysis into one probabilistic process is an interesting trend in speech research. In this paper, we present a simple one-stage decoding scheme that can be realised without the implementation of a specialised decoder or the use of complex language models. Instead, we reduce an HMM-based semantic analysis to the problem of deriving annotated versions of the conventional language model, while the acoustic model remains unchanged. We present experiments with the ATIS corpus (Price, 1990) in which the performance of the one-stage method is shown to be comparable with the traditional two-stage approach, while requiring a significantly smaller increase in language model size.

1 Introduction

In a spoken dialogue system, speech recognition and linguistic analysis play a decisive role for the overall performance of the system. Traditionally, word hypotheses produced by the automatic speech recognition (ASR) component are fed into a separate natural language understanding (NLU) module for deriving a semantic meaning representation. These semantic representations are the system's understanding of the user's intentions. Based on this knowledge the dialogue manager has to decide on the system reaction. Because speech recognition is a probabilistic pattern matching problem that usually does not generate one single possible result, hard decisions taken after the speech recognition process could cause significant loss of information that could be important for the NLU and other subsequent processing steps, and may thus lead to avoidable system failures. One common way of avoiding this problem is the use of N-best lists or word lattices as output representations, but these may require more complex NLU processing and/or increased processing times. In this paper, we follow an alternative approach: integrating flat HMM-based semantic analysis with the speech recognition process, resulting in a one-stage recognition system that avoids hard decisions between ASR and NLU. The resulting system produces word hypotheses where each word is annotated with a semantic label, from which a frame-based semantic representation may easily be constructed. Fig. 1 sketches the individual processes involved in our integrative approach. The shaded portions in the figure indicate the models and processing steps that are modified by versions using semantic labels. This leads to an overall architecture in which a separate semantic decoding step (5) becomes dispensable.

One contribution of this work is to show that, compared to other one-stage approaches (Thomae et al., 2003), such an integrated recognition system requires neither a specialised decoder nor complex language model support. Instead, basic bi-gram language models may be used.

We achieve the integration by "reducing" the NLU part to language modelling whilst enriching the lexicon and language model with semantic information. Conventional basic language modelling techniques are capable of representing this information. We redefine the units used in the language model: instead of using "plain" words, these are annotated with additional information, which may consist of semantic labels and context information. For each of these annotated variants of a word, the phonetic transcription of the "plain" word is used. Consequently, the ASR cannot decide which variant to choose on the basis of the acoustic model, and no retraining of the acoustic model is necessary. The speech recogniser produces word hypotheses enriched with semantic labels.

The remainder of this paper is structured as follows: In the next section we give a brief overview of the Cambridge HTK software we used for our experiments with the ATIS corpus. In Section 3 we outline the HMM-based parsing method. The basic approach for adding information to the speech recogniser language model is described in Section 4. In Section 5 we discuss our experiments and present speech recognition results. Finally, we conclude by pointing out further possible improvements.

[Figure 1: Principal knowledge sources and models of speech recognition and semantic analysis. Shaded parts constitute the changes when using a one-stage approach. The numbers indicate the following computational steps: (1) acoustic model parameter estimation, (2) language modelling, (3) Viterbi acoustic decoding, (4) semantic model parameter estimation, (5) Viterbi semantic decoding.]

2 Acoustic Modelling and Speech Recognition Using HTK

Speech recognition may be formulated as an optimisation problem: given a sequence of observations O consisting of acoustic feature vectors, determine the sequence of words W that maximises the conditional probability P(W|O). Bayes' rule is used to replace this conditional probability, which is not directly computable, by the product of two components: P(O|W), the acoustic model, and P(W), the language model:

    [W]opt = argmax_W {P(W) P(O|W)}    (1)

The Cambridge Hidden Markov Model Toolkit (HTK) (Young et al., 2004) can be used to build robust speaker-independent speech recognition systems. The tied acoustic model parameters are estimated by the forward-backward algorithm. The HTK Viterbi decoder can be used together with a probabilistic word network that may be computed from a finite state grammar or from the bi-gram statistics of a text corpus. The decoder's token passing algorithm is able to produce word hypothesis lattices or N-best lists of recognition results. Internally, this word network is combined with a phonetic transcription dictionary to produce an expanded network of phoneme states. Usually, one phoneme or triphone is represented by five states.

For our experiments with the ATIS corpus, the acoustic model is constructed in the conventional way. We use 4500 utterances to train a triphone recogniser with 8 Gaussian mixtures. A count of 5929 physical triphones expands to 27619 logical ones. The acoustic model is used for both the two-stage and the one-stage experiments.

3 HMM-Based Semantic Case Frame Analysis

In the domain of spoken language information retrieval, spontaneous effects in speech are very important (Minker, 1999). These include false starts, repetitions and ill-formed utterances. Thus it would be improvident to base the semantic extraction exclusively on a syntactic analysis of the input utterance. Parsing failures due to ungrammatical syntactic constructs may be reduced if those phrases containing important semantic information can be extracted whilst ignoring the non-essential or redundant parts of the utterance. Restarts and repeats frequently occur between the phrases, and poorly formed syntactic constructs often consist of well-formed phrases which are semantically meaningful.

One approach to extracting semantic information is based on case frames. The original concept of a case frame as described by Fillmore (Fillmore, 1968) is based on a set of universally applicable cases or case values, which express the relationship between a verb and its nouns. Bruce (Bruce, 1975) extended the Fillmore theory to any concept-based system and defined an appropriate semantic grammar, whose formalism is given in Fig. 2.

[Figure 2: Semantic case grammar formalism.
  reference word: case frame or concept identifier
  case frame: set of cases related to a concept
  case: attribute of a concept
  case marker: surface structure indicator of a case
  case system: complete set of cases of the application]

In the example query

    could you give me a ticket price
    on [uh] [throat clear] a flight first class
    from San Francisco to Dallas please

a typical semantic case grammar would instantiate the following terminals:

• price: this reference word identifies the concept airfare (other concepts may be: book, flight, ...)

• from: case marker of the case from-city corresponding to the departure city San Francisco

• to: case marker of the case to-city corresponding to the arrival city Dallas

• class: case marker of the case flight-class corresponding to first

• case system: from, to, class, ...

The parsing process based on a semantic case grammar typically considers less than 50% of the example query to be semantically meaningful. The hesitations and false starts are ignored. The approach therefore appears well suited for natural language understanding components where the need for semantic guidance in parsing is especially relevant.

Case frame analysis may be used in a rule-based case grammar. Here, we apply HMM-based modelling instead (Pieraccini et al., 1992; Minker et al., 1999). An HMM-based parsing module may be conceived as a probabilistic finite state transducer that translates a sequence of words into a sequence of semantic labels; the semantic labels denote a word's function in the semantic representation. In the frame-based representation, the semantic labelling does not consider all the words of the utterance, but only those related to the concept and its cases. However, in order to estimate the model parameters, each word of the utterance must have a corresponding semantic label. Thus, the additional label (null) is assigned to those words not used by the case frame analyser for the specific application. A semantic sequence thus consists of a concept label together with the basic labels (m:case), (v:case) and (null), corresponding respectively to the reference words, case markers, values and irrelevant words.

Relative occurrences of model states and observations are used to establish the Markov model, whose topology needs to be fixed prior to training and decoding. The semantic labels are defined as the states sj. All states, such as the examples (v:at-city) and (null) shown below, can follow each other; thus the model is ergodic. In direct analogy to the speech recognition problem (equation 1), the decoding consists of maximising the conditional probability P(S|W) of some state sequence S given the observation sequence W:

    [S]opt = argmax_S {P(S) P(W|S)}    (2)

Given the dimensionality of the sequence W, the exact computation of the likelihood P(W|S) is intractable. Again, bi-grams are a common approximation in order to robustly estimate the Markov model parameters: the state transition probabilities P(sj|si) and the observation symbol probabilities P(wm|sj) in state j. In contrast to speech recognition, the computation of the model parameters can be achieved through maximum likelihood estimation, i.e. by counting event occurrences. Usually a back-off and discounting strategy is applied in order to improve robustness in the face of unseen events.

Although the flat semantic model has known limitations with respect to the representation of long-term dependencies, for practical applications it is often sufficient. It has been shown that several methods, such as contextual observations and garbage models, exist that enhance the performance of HMM-based stochastic parsing models (Beuschel et al., 2004).

4 Adding Information to the Language Model

As mentioned above, the language model P(W) represents the probability of a word sequence. With the bi-gram approximation P(W) ≈ ∏ P(wi|wi−1), this probability becomes a transition probability between words in a word network.

By adding information to the language model (cf. shaded parts of Fig. 1) we modify the word network so that, instead of "plain" words as nodes, the network now contains "annotated" variants of these original words. The annotation encodes additional information that is relevant to the further processing in the dialogue system, but does not affect the pronunciation of the word. By introducing such labelled word variants, it is possible to encode additional relations that exist between the labels rather than between the words.

Consider the following utterance and a corresponding labelling of each word with additional information:

    Show-(null)
    me-(null)
    ground-
    transportation-
    for-(null)
    Dallas-(v:at-city)

The word network computed from utterances of the latter form instead of plain texts will represent the fact that after a word labelled as (null), a city name labelled as (v:at-city) is much more likely than one labelled as (v:from-city) or (v:to-city). In order to compute the modified version of the network, it is only necessary to replace the words by their labelled variants in the training corpus and to compute the bi-gram statistics from this modified corpus (cf. step (2) in Fig. 1).

For expanding the word network into a network of phoneme states as required by the speech recognition, it is necessary to modify the phonetic transcription dictionary accordingly: for each labelled variant of a word appearing in the labelled training texts, the respective unlabelled word entry is copied. The Viterbi decoder will then output sequences of annotated words (step (3)).

The language model may not only be enriched with semantic labels; other information, such as the context of the word, may also be used. A language model labelled with a context that consists of one word on the left is essentially a tri-gram model. There is a trade-off between what the network can express and its size: using too many different labels for each word may quickly result in word networks impractical for real-time use.

5 Speech Recognition Experiments

For our experiments with the ATIS corpus, the stochastic parsing model is computed from 1500 utterances, manually annotated in a bootstrapping process. We use 13 semantic word classes (e.g. /weekday/, /number/, /month/, /cityname/). The semantic representation consists of 70 different labels. Splitting sequences of identical labels into numbered sub-labels results in 174 numbered labels. The semantic representation focuses on the identification of key slots, such as origin, destination and stop-over locations, as well as airline names and flight numbers. Word sequences containing temporal information, such as constraints on the departure or arrival time, are not analysed in detail. Instead, all these words are marked with (v:arrive-time) or (v:depart-time), respectively. The test corpus consists of 418 utterances which were manually annotated with semantic labels.

Table 1: Word network sizes for different labelling techniques. "Expanded" refers to the phoneme state network. t denotes the estimated per-utterance processing time.
Method     Words            Expanded            t [s]
           Nodes    Arcs    Nodes     Arcs
ASR          465   2,939     5,985   15,482     14.1
ASR/Cl       709   4,382     8,775   21,360     67.8
ASR/Co     3,603  14,126    46,043   84,464     15.9
ASR/CC     6,632  18,117    83,207  115,703    117.0
ASR+       1,243   7,108    16,210   38,314     13.4
ASR+Co     2,269  10,272    29,535   59,209     20.8
ASR+N      1,556   7,966    19,516   42,996     14.7

For the two-stage approach, different word-based language models (plain, class-based, context, combined) were used (cf. Section 4). An N-best decoding was performed and 20 hypotheses were subsequently labelled by the stand-alone stochastic parser. After that, the result with the maximum combined probability value was chosen. In the one-stage approach, two refinements (context and numbering) were applied to the basic semantic language model.
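The core corpus transformation behind the semantically labelled language models (Section 4) is simply re-tokenisation followed by ordinary bi-gram counting: each word is fused with its semantic label into a single language-model token, and maximum likelihood estimates are obtained by counting event occurrences. A minimal Python sketch with toy data; the `word|label` token format and the miniature corpus are illustrative assumptions, not the paper's actual data or discounting scheme:

```python
from collections import Counter

def annotate(corpus):
    """Fuse each (word, label) pair into one LM token, e.g. 'dallas|(v:at-city)'."""
    return [[f"{w}|{l}" for w, l in utt] for utt in corpus]

def bigram_lm(sentences):
    """Maximum-likelihood bi-gram estimates obtained by counting event occurrences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens[:-1])                  # history counts
        bigrams.update(zip(tokens[:-1], tokens[1:]))  # transition counts
    return {pair: n / unigrams[pair[0]] for pair, n in bigrams.items()}

# Toy labelled training corpus in the style of the paper's example utterance.
corpus = [
    [("show", "(null)"), ("me", "(null)"), ("flights", "(null)"),
     ("to", "(m:to-city)"), ("dallas", "(v:to-city)")],
    [("ground", "(null)"), ("transportation", "(null)"),
     ("for", "(null)"), ("dallas", "(v:at-city)")],
]
lm = bigram_lm(annotate(corpus))
# In this tiny corpus the (v:to-city) variant of 'dallas' always follows the marker:
print(lm[("to|(m:to-city)", "dallas|(v:to-city)")])  # → 1.0
```

In a real system the resulting bi-gram statistics would be smoothed with back-off and discounting before being compiled into the HTK word network.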

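The matching dictionary modification (Section 4) is a pure copy operation, since annotation never affects pronunciation: every labelled variant receives the transcription of the underlying plain word. A sketch under the same hypothetical `word|label` token convention, with an invented phone set:

```python
def expand_dictionary(pron_dict, labelled_vocab):
    """For each labelled variant seen in the labelled training texts,
    copy the pronunciation entry of the underlying plain word."""
    expanded = {}
    for token in labelled_vocab:
        word, _, _label = token.partition("|")  # split 'dallas|(v:at-city)'
        expanded[token] = pron_dict[word]       # identical phone sequence
    return expanded

# Hypothetical pronunciation entry; the real dictionary is HTK-formatted.
pron = {"dallas": ["d", "ae", "l", "ax", "s"]}
vocab = {"dallas|(v:at-city)", "dallas|(v:to-city)"}
expanded = expand_dictionary(pron, vocab)
# Both variants share the plain word's transcription, so the acoustic model
# cannot distinguish them; only the language model can.
```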
Table 1 summarises the word network sizes for the different labelling methods. Here, "ASR" refers to the original baseline unlabelled language model. "ASR/Cl" is a simple class-based language model with manually defined classes. A left context of one word was used in "ASR/Co", and combined with classes in "ASR/CC". These labelled versions may be used in the two-stage approach to improve the speech recognition results. "ASR+", "ASR+Co" and "ASR+N" refer to semantically labelled language models: "ASR+" is directly trained on the semantically labelled training texts, "ASR+Co" furthermore includes a left context of one semantic label, and "ASR+N" includes sub-label numbering.

As can be seen from the numbers, word classes as well as the semantic methods incur only a modest increase in network size. The word-based methods, however, significantly inflate the model. Although we have not systematically recorded the time necessary for recognising the test set with these networks, it is fair to say that it escalates from minutes to hours.

Table 2: Word correctness results.

Method    Correct [%]   Accuracy [%]   Sentence [%]
ASR          82.66         67.20          20.56
ASR/Co       85.53         72.74          26.61
ASR/Cl       84.33         70.96          24.60
ASR/CC       85.43         72.68          27.42
ASR+         85.04         72.03          25.60
ASR+Co       85.02         71.90          25.60
ASR+N        85.13         72.16          25.81

Tables 2 and 3 present the results of these experiments, based on word recognition and concept recognition performance, respectively. The columns titled "Correct" and "Accuracy" refer to the word correct rate and word accuracy, as well as to their concept-level equivalents. The "Sentence" column lists the percentage of completely correctly decoded sentences. For the two-stage approach, the numbers in Table 2 denote the performance of the speech recognition system alone (step (3) in Fig. 1).
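The two-stage selection procedure, labelling each N-best hypothesis with the stand-alone stochastic parser and keeping the hypothesis with the maximum combined probability, can be sketched as follows. The toy parser and all scores are invented for illustration; the real system uses the trained semantic HMM:

```python
import math

def rescore_nbest(hypotheses, parser):
    """Two-stage selection: label each ASR hypothesis with the stand-alone
    stochastic parser and keep the best combined log score."""
    best, best_score = None, -math.inf
    for words, asr_logprob in hypotheses:
        labels, parse_logprob = parser(words)
        combined = asr_logprob + parse_logprob  # combined probability, log domain
        if combined > best_score:
            best, best_score = (words, labels), combined
    return best

def toy_parser(words):
    """Hypothetical parser that favours a city name right after the marker 'to'."""
    labels, logp = [], 0.0
    for prev, w in zip(["<s>"] + words[:-1], words):
        if w == "dallas" and prev == "to":
            labels.append("(v:to-city)"); logp += math.log(0.9)
        else:
            labels.append("(null)"); logp += math.log(0.5)
    return labels, logp

# The semantically implausible hypothesis loses despite a better ASR score.
nbest = [(["flights", "to", "dallas"], -10.0),
         (["flights", "two", "dallas"], -9.8)]
words, labels = rescore_nbest(nbest, toy_parser)
```

This also shows why hard decisions after ASR lose information: had only the single best ASR hypothesis been kept, the parser could never have recovered the marker reading.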
The last column in Table 1 gives the estimated average per-utterance processing time in seconds. The numbers were obtained on a Pentium 4 with 2.6 GHz and 1 GB of RAM running Linux.

For the one-stage approach, the semantic labels were removed after decoding in order to obtain the plain word sequences. It can be seen that the word-based recognition benefits from word-based additions to the language model and from semantic labels at about the same rate.

Table 3: Concept correctness results.

Method    Correct [%]   Accuracy [%]   Sentence [%]
NLU          96.97         96.97          85.69
ASR          76.67         60.73          18.18
ASR/Co       78.62         65.92          24.80
ASR/Cl       78.69         66.97          24.60
ASR/CC       78.02         63.77          13.31
ASR+         77.72         64.80          21.57
ASR+Co       77.74         64.69          21.77
ASR+N        77.44         64.58          22.18

Table 3 summarises the concept-level results. Here, the semantic labels are also compared against the reference; numbers in sub-labels are ignored, however. The "NLU" row denotes the performance on perfectly recognised data, i.e. on the training transcriptions.
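The stand-alone stochastic parser evaluated in these tables maximises P(S)P(W|S) over label sequences (equation 2) by Viterbi decoding. A minimal sketch with an invented two-state ergodic toy model; all probabilities and the unseen-word floor of 1e-6 are illustrative assumptions:

```python
import math

def viterbi_parse(words, states, trans, emit):
    """Viterbi decoding over semantic states: bi-gram state transitions
    trans[si][sj] and observation probabilities emit[s][w] (equation 2)."""
    V = [{s: math.log(trans["<s>"][s]) + math.log(emit[s].get(words[0], 1e-6))
          for s in states}]
    back = []
    for w in words[1:]:
        scores, ptrs = {}, {}
        for s in states:
            prev = max(states, key=lambda p: V[-1][p] + math.log(trans[p][s]))
            scores[s] = (V[-1][prev] + math.log(trans[prev][s])
                         + math.log(emit[s].get(w, 1e-6)))
            ptrs[s] = prev
        V.append(scores); back.append(ptrs)
    # Backtrack from the best final state to recover the label sequence.
    seq = [max(states, key=lambda s: V[-1][s])]
    for ptrs in reversed(back):
        seq.append(ptrs[seq[-1]])
    return list(reversed(seq))

states = ["(null)", "(v:to-city)"]
trans = {"<s>":         {"(null)": 0.9, "(v:to-city)": 0.1},
         "(null)":      {"(null)": 0.7, "(v:to-city)": 0.3},
         "(v:to-city)": {"(null)": 0.9, "(v:to-city)": 0.1}}
emit = {"(null)":      {"flights": 0.4, "to": 0.5, "dallas": 0.01},
        "(v:to-city)": {"dallas": 0.8}}
print(viterbi_parse(["flights", "to", "dallas"], states, trans, emit))
# → ['(null)', '(null)', '(v:to-city)']
```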
One-stage integrated recognition produces competitive recognition rates when compared to the two-stage approach, even though in the two-stage approach each stage's representation can be fine-tuned separately.

It is interesting to note a subtle difference between the decoding procedures of the two-stage and the one-stage architectures. In a stand-alone stochastic parser, Viterbi decoding is used to establish the word-to-label correspondences. The probability of a transition from semantic state si to sj is thus defined as the product P(wj|sj) P(sj|si), where P(wj|sj) is the probability of observing wj in state sj. In contrast, if a labelled language model is used, the transition probability is P(wj|wi), where wi and wj are pairs consisting of the actual words and their associated labels, so the surface form of the last word influences the transition as well, not only its label.

6 Conclusions and Future Work

It can be shown that a flat HMM-based semantic analysis does not require a separate decoding stage. Instead it seems possible to use the speech recogniser's language model to represent the semantic state model, without compromising recognition in terms of word or slot error rate.

For a stand-alone speech recognition component, it seems advantageous to use a class-based or context-based language model, since it improves the word recognition score. For the stochastic parsing, numbered sub-labels provide the best results. With N-best decoding, the stochastic parser can be used to select the best overall hypothesis.

A number of improvements and extensions may be considered for the different processing stages. Firstly, instead of representing compound airport and city names such as "New York City" as word sequences, they could be entered in the dictionary as single words, which should avoid certain recognition errors. In addition, an equivalent of a class-based language model should be defined for semantically annotated language models. Also, contextual observations, i.e. the use of a class of manually defined context words, could help the stochastic parser to address long-term dependencies that have so far proved difficult. Finally, the ATIS task results in relatively simple semantic structures and a limited vocabulary size. It would be interesting to apply our proposed techniques to a more complex domain, such as an appointment scheduling task (Minker et al., 1999), implying a more natural speech-based interaction. This would enable us to validate our approach on larger vocabulary sizes.

References

C. Beuschel, W. Minker, and D. Bühler. 2004. Strategies for Optimizing a Stochastic Spoken Natural Language Parser. In Proceedings of the International Conference on Speech and Language Processing, ICSLP, pages 2177–2180, Jeju Island (Korea), October.

B. Bruce. 1975. Case Systems for Natural Language. Artificial Intelligence, 6:327–360.

Ch. J. Fillmore. 1968. The Case for Case. In Emmon Bach and Robert T. Harms, editors, Universals in Linguistic Theory, pages 1–90. Holt, Rinehart and Winston Inc.

W. Minker, M. Gavaldà, and A. Waibel. 1999. Stochastically-based Semantic Analysis for Machine Translation. Computer Speech and Language, 13(2):177–194.

W. Minker. 1999. Stochastically-based Semantic Analysis for ARISE - Automatic Railway Information Systems for Europe. Grammars, 2(2):127–147.

R. Pieraccini, E. Tzoukermann, Z. Gorelov, J.-L. Gauvain, E. Levin, C.-H. Lee, and J.G. Wilpon. 1992. A Speech Understanding System Based on Statistical Representation of Semantics. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, ICASSP, pages 193–196, March.

P. Price. 1990. Evaluation of Spoken Language Systems: The ATIS Domain. In Proceedings of the DARPA Speech and Natural Language Workshop, pages 91–95, June.

M. Thomae, T. Fabian, R. Lieb, and G. Ruske. 2003. A One-Stage Decoder for the Interpretation of Natural Speech. In Proceedings of the International Conference on Speech and Language Processing, ICSLP, pages 56–64.

S. Young, G. Evermann, D. Kershaw, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland. 2004. The HTK Book (for HTK Version 3.2). Cambridge University Engineering Department.