Using Dialogue Representations for Concept-To-Speech Generation

Christine H. Nakatani, Jennifer Chu-Carroll
Bell Laboratories, Lucent Technologies
600 Mountain Avenue, Murray Hill, NJ 07974 USA
{chn|jencc}@research.bell-labs.com

Abstract

We present an implemented concept-to-speech (CTS) system that offers original proposals for certain couplings of dialogue computation with prosodic computation. Specifically, the semantic interpretation, task modeling and dialogue strategy modules in a working spoken dialogue system are used to generate prosodic features to better convey the meaning of system replies. The new CTS system embodies and extends theoretical work on intonational meaning in a more general, robust and rigorous way than earlier approaches, by reflecting compositional aspects of both dialogue and intonation interpretation in an original computational framework for prosodic generation.

1 Introduction

Conversational systems that use speech as the input and output modality are often realized by architectures that decouple speech processing components from language processing components. In this paper, we show how speech generation can be more closely coupled with the dialogue manager of a working mixed-initiative spoken dialogue system. In particular, we use representations from the semantic interpretation, task model and dialogue strategy modules to better communicate the meaning of system replies through prosodically appropriate synthetic speech.

While dialogue prosody has been a topic of much study, our implemented concept-to-speech (CTS) system offers original proposals for specific couplings of dialogue computation with prosodic computation. Further, it embodies and extends theoretical work on intonational meaning in a more general, robust and rigorous way than earlier CTS systems, in an architecture that reflects compositional aspects of dialogue and intonation interpretation.

2 Theoretical Foundations

In this work, we implement and extend the compositional theory of intonational meaning proposed by Pierrehumbert and Hirschberg (1986; 1990), who sought to identify correspondences between the Grosz and Sidner (1986) computational model of discourse interpretation and Pierrehumbert's prosodic grammar for American English (1980).

In the present work, certain aspects of the original theories are modified and adapted to the architecture of the dialogue system in which the CTS component is embedded. Below, we present the important fundamental definitions and principles of intonation underlying our CTS system.

2.1 Intonational System

In our CTS system, the prosodic elements that are computed are based on the intonational system of Pierrehumbert (1980), who defined a formal language for describing American English intonation using the following regular grammar:

Inton Phrase → (Interm Phrase)+ Bndry Tone
Interm Phrase → (Pitch Acc)+ Phrase Acc

Major phrases, or intonational phrases, are made up of one or more minor phrases, or intermediate phrases. Melodic movements in intermediate and intonational phrases are in turn expressed by three kinds of tonal elements. These include six pitch accents: a low pitch excursion (L*), a high pitch excursion (H*), or a combination of both low and high excursions (L*+H, L+H*, H*+L, H+L*); two phrase accents: a high (H-) or low (L-) tonal target that guides the interpolation of the melodic contour from final pitch accent to intermediate phrase ending; and two boundary tones: a high (H%) or low (L%) tonal target that guides interpolation from phrase accent to intonational phrase ending.

2.2 Intonational Meaning

Theoretical work on intonational meaning has attempted to relate the grammatical elements of Pierrehumbert's system (pitch accent, phrase accent and boundary tone) to interpretive processes at different levels of discourse and dialogue structure. Hirschberg and Pierrehumbert (1986) conjectured that the absence or presence of accentuation conveys discourse focus status, while the tonal properties of the accent itself (i.e. pitch accent type) convey semantic focus information. In later work, pitch accent type was said to express whether the accented information was intended by the speaker to be "predicated" or not by the hearer (Pierrehumbert and Hirschberg, 1990). Non-predicated information was said to bear low-star accentuation (L*, L*+H, H+L*), while predicated information would be marked by high-star accents (H*, L+H*, H*+L). The theory further stated that L*+H conveys uncertainty or lack of speaker commitment to the expressed propositional content, while L+H* marks correction or contrast. The complex accent, H*+L, was said to convey that an inference path was required to support the predication; usage of H+L* similarly was said to imply an inference path, but did not suggest a predication of a mutual belief. Finally, phrase accents and boundary tones were said to reflect aspects of discourse structure.

3 Systems Foundations

Our task is to improve the communicative competence of a spoken dialogue agent, by making recourse to our knowledge of intonational meaning, dialogue processing and relations between the two. Of course, a worthwhile CTS system must also outperform out-of-the-box text-to-speech (TTS) systems that may determine prosodic mark-up in linguistically sophisticated ways. As in (Nakatani, 1998), we take the prosodic output of an advanced research system that implements the Pierrehumbert theory of intonation, namely the Bell Labs TTS system, as our baseline experimental system to be enhanced by CTS algorithms. We embed the CTS system in MIMIC, a working spoken dialogue system representing state-of-the-art dialogue management practices, to develop CTS algorithms that can be eventually realistically evaluated using task-based performance metrics.

3.1 Dialogue System: Mixed-Initiative Movie Information Consultant (MIMIC)

The dialogue system whose baseline speech generation capabilities we enhance is the Mixed-Initiative Movie Information Consultant (MIMIC) (Chu-Carroll, 2000). MIMIC provides movie listing information involving knowledge about towns, theaters, movies and showtimes, as demonstrated in Figure 1. MIMIC currently utilizes template-driven text generation, and passes on text strings to a stand-alone TTS system. In the version of MIMIC enhanced with concept-to-speech capabilities, MIMIC-CTS, contextual knowledge is used to modify the prosodic features of the slot and filler material in the templates; we are currently integrating the algorithms in MIMIC-CTS with a grammar-driven generation system. Further details of MIMIC are presented in the relevant sections below, but see (Chu-Carroll, 2000) for a complete overview.

MIMIC: hello this is mimic the movie information system how can I help you
User: where in hoboken is october sky playing
MIMIC: october sky is playing at hoboken cinema in hoboken can I help you with anything else
User: when is it playing there
MIMIC: october sky is playing at hoboken cinema in hoboken at 3:45pm, 5:50pm, 7:10pm, and 10pm can i help you with anything else

Figure 1: A MIMIC dialogue.

3.2 TTS: The Bell Labs System

For default prosodic processing and speech synthesis realization, we use a research version of the Bell Labs TTS System, circa 1992 (Sproat, 1997), that generates intonational contours based on Pierrehumbert's intonation theory (1980), as described in (Pierrehumbert, 1981). Of relevance is the fact that various pitch accent types, phrase accent and boundary tones in Pierrehumbert's theory are directly implemented in this system, so that by generating a Pierrehumbert-style prosodic transcription, the work of the CTS system is done. More precisely, MIMIC-CTS computes prosodic annotations that override the default prosodic processing that is performed by the Bell Labs TTS system.

To our knowledge, the intonation component of the Bell Labs TTS system utilizes more linguistic knowledge to compute prosodic annotations than any other unrestricted TTS system, so it is reasonable to assume that improvements upon it are meaningful in practice as well as in theory.

4 MIMIC's Concept-to-Speech Component (MIMIC-CTS)

In MIMIC-CTS, the MIMIC dialogue system is enhanced with a CTS component to better communicate the meaning of system replies through contextually conditioned prosodic features. MIMIC-CTS makes use of three distinct levels of dialogue representations to convey meaning through intonation. MIMIC's semantic representations allow MIMIC-CTS to decide which information to prosodically highlight. MIMIC's task model in turn determines how to prosodically highlight selected information, based on the pragmatic properties of the system reply. MIMIC's dialogue strategy selection process informs various choices in prosodic contour and accenting that convey logico-semantic aspects of meaning, such as contradiction.

4.1 Highlighting Information using Semantic Representations

MIMIC employs a statistically-driven semantic interpretation engine to "spot" values for key attributes that make up a valid MIMIC

Even such minimal use of dialogue information can make a difference. For example, changing the default accent for the following utterance highlights the kind of information that the system is seeking, instead of highlighting the semantically vacuous main verb, like:

Default TTS: what movie would you LIKE
MIMIC-CTS: what MOVIE would you like

4.2 Conveying Information Status using the Task Model
