Using Dialogue Representations for Concept-to-Speech Generation

Christine H. Nakatani
Jennifer Chu-Carroll
Bell Laboratories, Lucent Technologies
600 Mountain Avenue
Murray Hill, NJ 07974 USA
{chn|jencc}@research.bell-labs.com

Abstract

We present an implemented concept-to-speech (CTS) system that offers original proposals for certain couplings of dialogue computation with prosodic computation. Specifically, the semantic interpretation, task modeling and dialogue strategy modules in a working spoken dialogue system are used to generate prosodic features to better convey the meaning of system replies. The new CTS system embodies and extends theoretical work on intonational meaning in a more general, robust and rigorous way than earlier approaches, by reflecting compositional aspects of both dialogue and intonation interpretation in an original computational framework for prosodic generation.

1 Introduction

Conversational systems that use speech as the input and output modality are often realized by architectures that decouple speech processing components from language processing components. In this paper, we show how speech generation can be more closely coupled with the dialogue manager of a working mixed-initiative spoken dialogue system. In particular, we use representations from the semantic interpretation, task model and dialogue strategy modules to better communicate the meaning of system replies through prosodically appropriate synthetic speech.

While dialogue prosody has been a topic of much study, our implemented concept-to-speech (CTS) system offers original proposals for specific couplings of dialogue computation with prosodic computation. Further, it embodies and extends theoretical work on intonational meaning in a more general, robust and rigorous way than earlier CTS systems, in an architecture that reflects compositional aspects of dialogue and intonation interpretation.

2 Theoretical Foundations

In this work, we implement and extend the compositional theory of intonational meaning proposed by Pierrehumbert and Hirschberg (1986; 1990), who sought to identify correspondences between the Grosz and Sidner (1986) computational model of discourse interpretation and Pierrehumbert's prosodic grammar for American English (1980).

In the present work, certain aspects of the original theories are modified and adapted to the architecture of the dialogue system in which the CTS component is embedded. Below, we present the important fundamental definitions and principles of intonation underlying our CTS system.

2.1 Intonational System

In our CTS system, the prosodic elements that are computed are based on the intonational system of Pierrehumbert (1980), who defined a formal language for describing American English intonation using the following regular grammar:

    Inton Phrase  →  (Interm Phrase)+ Bndry Tone
    Interm Phrase →  (Pitch Acc)+ Phrase Acc

Major phrases, or intonational phrases, are made up of one or more minor phrases, or intermediate phrases. Melodic movements in intermediate and intonational phrases are in turn expressed by three kinds of tonal elements. These include six pitch accents: a low pitch excursion (L*), a high pitch excursion (H*), or a combination of both low and high excursions (L*+H, L+H*, H*+L, H+L*); two phrase accents: a high (H-) or low (L-) tonal target that guides the interpolation of the melodic contour from final pitch accent to intermediate phrase ending; and two boundary tones: a high (H%) or low (L%) tonal target that guides interpolation from phrase accent to intonational phrase ending.

2.2 Intonational Meaning

Theoretical work on intonational meaning has attempted to relate the grammatical elements of Pierrehumbert's system (pitch accent, phrase accent and boundary tone) to interpretive processes at different levels of discourse and dialogue structure. Hirschberg and Pierrehumbert (1986) conjectured that the absence or presence of accentuation conveys discourse focus status, while the tonal properties of the accent itself (i.e. pitch accent type) convey semantic focus information.
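The regular grammar of Section 2.1 lends itself to a mechanical well-formedness check on prosodic transcriptions. The following is a minimal sketch of such a check, not part of MIMIC-CTS or the Bell Labs TTS system. The function names are hypothetical, and two points are assumptions: transcriptions are taken to be whitespace-separated tone labels, and a fused label such as "L-H%" is read as a phrase accent followed by a boundary tone. Diacritics such as downstep are not handled.

```python
import re

# Tonal inventory from Section 2.1.
PITCH_ACCENTS = {"L*", "H*", "L*+H", "L+H*", "H*+L", "H+L*"}
PHRASE_ACCENTS = {"H-", "L-"}
BOUNDARY_TONES = {"H%", "L%"}


def tokenize(transcription):
    """Split on whitespace; a fused label like 'L-H%' becomes 'L-', 'H%'."""
    tokens = []
    for raw in transcription.split():
        fused = re.fullmatch(r"([HL]-)([HL]%)", raw)
        if fused:
            tokens.extend(fused.groups())
        else:
            tokens.append(raw)
    return tokens


def is_intonational_phrase(transcription):
    """Return True if the tone sequence matches
    Inton Phrase -> (Interm Phrase)+ Bndry Tone
    Interm Phrase -> (Pitch Acc)+ Phrase Acc
    """
    symbols = []
    for token in tokenize(transcription):
        if token in PITCH_ACCENTS:
            symbols.append("a")   # pitch accent
        elif token in PHRASE_ACCENTS:
            symbols.append("p")   # phrase accent
        elif token in BOUNDARY_TONES:
            symbols.append("b")   # boundary tone
        else:
            return False          # unknown label
    # One or more intermediate phrases (a+ p), then one boundary tone.
    return re.fullmatch(r"(ap|a+p)+b", "".join(symbols)) is not None
```

For instance, `is_intonational_phrase("L+H* L- H* L-H%")` accepts a phrase made of two intermediate phrases, while a sequence with no phrase accent before the boundary tone is rejected.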

MIMIC: hello this is mimic the movie information system how can I help you
User:  where in hoboken is october sky playing
MIMIC: october sky is playing at hoboken cinema in hoboken can I help you with anything else
User:  when is it playing there
MIMIC: october sky is playing at hoboken cinema in hoboken at 3:45pm, 5:50pm, 7:10pm, and 10pm can i help you with anything else

Figure 1: A MIMIC dialogue.

In later work, pitch accent type was said to express whether the accented information was intended by the speaker to be "predicated" or not by the hearer (Pierrehumbert and Hirschberg, 1990). Non-predicated information was said to bear low-star accentuation (L*, L*+H, H+L*), while predicated information would be marked by high-star accents (H*, L+H*, H*+L). The theory further stated that L*+H conveys uncertainty or lack of speaker commitment to the expressed propositional content, while L+H* marks correction or contrast. The complex accent, H*+L, was said to convey that an inference path was required to support the predication; usage of H+L* similarly was said to imply an inference path, but did not suggest a predication of a mutual belief. Finally, phrase accents and boundary tones were said to reflect aspects of discourse structure.

3 Systems Foundations

Our task is to improve the communicative competence of a spoken dialogue agent, by making recourse to our knowledge of intonational meaning, dialogue processing and relations between the two. Of course, a worthwhile CTS system must also outperform out-of-the-box text-to-speech (TTS) systems that may determine prosodic mark-up in linguistically sophisticated ways. As in (Nakatani, 1998), we take the prosodic output of an advanced research system that implements the Pierrehumbert theory of intonation, namely the Bell Labs TTS system, as our baseline experimental system to be enhanced by CTS algorithms. We embed the CTS system in MIMIC, a working spoken dialogue system representing state-of-the-art dialogue management practices, to develop CTS algorithms that can eventually be evaluated realistically using task-based performance metrics.

3.1 Dialogue System: Mixed-Initiative Movie Information Consultant (MIMIC)

The dialogue system whose baseline speech generation capabilities we enhance is the Mixed-Initiative Movie Information Consultant (MIMIC) (Chu-Carroll, 2000). MIMIC provides movie listing information involving knowledge about towns, theaters, movies and showtimes, as demonstrated in Figure 1. MIMIC currently utilizes template-driven text generation, and passes on text strings to a stand-alone TTS system. In the version of MIMIC enhanced with concept-to-speech capabilities, MIMIC-CTS, contextual knowledge is used to modify the prosodic features of the slot and filler material in the templates; we are currently integrating the algorithms in MIMIC-CTS with a grammar-driven generation system. Further details of MIMIC are presented in the relevant sections below, but see (Chu-Carroll, 2000) for a complete overview.

3.2 TTS: The System

For default prosodic processing and speech synthesis realization, we use a research version of the Bell Labs TTS System, circa 1992 (Sproat, 1997), that generates intonational contours based on Pierrehumbert's intonation theory (1980), as described in (Pierrehumbert, 1981). Of relevance is the fact that various pitch accent types, phrase accent and boundary tones in Pierrehumbert's theory are directly implemented in this system, so that by generating a Pierrehumbert-style prosodic transcription, the work of the CTS system is done. More precisely, MIMIC-CTS computes prosodic annotations that override the default prosodic processing that is performed by the Bell Labs TTS system.

To our knowledge, the intonation component of the Bell Labs TTS system utilizes more linguistic knowledge to compute prosodic annotations than any other unrestricted TTS system, so it is reasonable to assume that improvements upon it are meaningful in practice as well as in theory.

4 MIMIC's Concept-to-Speech Component (MIMIC-CTS)

In MIMIC-CTS, the MIMIC dialogue system is enhanced with a CTS component to better communicate the meaning of system replies through contextually conditioned prosodic features. MIMIC-CTS makes use of three distinct levels of dialogue representations to convey meaning through intonation. MIMIC's semantic representations allow MIMIC-CTS to decide which information to prosodically highlight. MIMIC's task model in turn determines how to prosodically highlight selected information, based on the pragmatic properties of the system reply. MIMIC's dialogue strategy selection process informs various choices in prosodic contour and accenting that convey logico-semantic aspects of meaning, such as contradiction.

4.1 Highlighting Information using Semantic Representations

MIMIC employs a statistically-driven semantic interpretation engine to "spot" values for key attributes that make up a valid MIMIC query in a robust fashion [1]. To simplify matters, for each utterance, MIMIC computes an attribute-value matrix (AVM) representation, identifying important pieces of information for accomplishing a given set of tasks. The AVM created from the following utterance, "When is October Sky playing at Hoboken Cinema in Hoboken?", for example, is given in Figure 2.

Footnote 1: Specifically, MIMIC uses an n-dimensional call router front-end (Chu-Carroll, 2000), which is a generalization of the vector-based call-routing paradigm of semantic interpretation (Chu-Carroll and Carpenter, 1999); that is, instead of detecting one concept per utterance, MIMIC's semantic interpretation engine detects multiple (n) concepts or classes conveyed by a single utterance, by using n call routers in parallel.

    Attribute    Value
    Task         when
    Movie        October Sky
    Theatre      Hoboken Cinema
    Town         Hoboken
    Time

Figure 2: Attribute Value Matrix (AVM), computed by MIMIC's semantic interpreter.

Attribute names and attribute values are critical to the task at hand. In MIMIC-CTS, attribute names and values that occur in templates are typed, so that MIMIC-CTS can highlight these items in the following way:

1. All lexical items realizing attribute values are accented.
2. Attribute values are synthesized at a slower speaking rate.
3. Attribute values are set off by phrase boundaries.
4. Attribute names are always accented.

These modifications are entirely rule-based, given a list of attribute names and typed attribute values.

Even such minimal use of dialogue information can make a difference. For example, changing the default accent for the following utterance highlights the kind of information that the system is seeking, instead of highlighting the semantically vacuous main verb, like [2]:

    Default TTS: what movie would you LIKE
    MIMIC-CTS:   what MOVIE would you like

Footnote 2: In the examples, small capitalization denotes a word is accented.

4.2 Conveying Information Status using the Task Model

MIMIC performs a set of information-giving tasks, i.e. what, where, when, location, that are concisely defined by a task model. MIMIC processes the AVM for each utterance and then evaluates whether it should perform a database query based on the task specifications given in Figure 3. The task model defines which attribute values must be filled in (Y), must not be filled in (N), or may optionally be filled in (-), to "license" a database query action. If no task is "specified" by the current AVM state, MIMIC employs various strategies to progress toward a complete and valid task specification.

[Figure 3 body not recoverable from the source; surviving column headers: Task, Movie, Theater, Town.]

Figure 3: Task Specifications for MIMIC.

For example, in response to the following user utterance, MIMIC initiates an information-seeking subdialogue to instantiate the theater attribute value to accomplish a when task:

    User:      when is october sky playing in hoboken
    MIMIC-CTS: what THEATER would you like

To better convey the structure of the task model, which is learned by the user through interaction with the system, we define four information statuses based on properties of the task model, which align on a scale of given and new in the following order:

    OLD    INFERRABLE    KEY    HEARER-NEW
    [given]                          [new]

KEY information is that which is necessary to formulate a valid database query, and is exchanged and (implicitly or explicitly) confirmed between the system and user. INFERRABLE information is not explicitly exchanged between the system and

    Task Specification Status    Information Status    Pitch Accent
    Required (Y)                 KEY                   L+H*
    Optional (-)                 INFERRABLE/OLD        L*+H/L*
    Not allowed (N)              HEARER-NEW            H*

Table 1: Highlighting relevance of information based on task model (and discourse history).
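The mapping in Table 1 can be sketched as a small lookup, shown below. This is an illustrative sketch, not MIMIC-CTS code. All identifiers are hypothetical. The `inferred_attributes` argument is an assumption standing in for whatever the inference engine and discourse history report when an optional attribute must be classed as INFERRABLE rather than OLD. The `WHERE_TASK` specification is likewise an assumption, chosen to agree with the accents shown in Figure 4 rather than taken from Figure 3.

```python
from enum import Enum


class InfoStatus(Enum):
    """Information statuses of Section 4.2, ordered from given to new."""
    OLD = 0
    INFERRABLE = 1
    KEY = 2
    HEARER_NEW = 3


# Pitch accent scale aligned with the information status scale (Table 1).
PITCH_ACCENT = {
    InfoStatus.OLD: "L*",
    InfoStatus.INFERRABLE: "L*+H",
    InfoStatus.KEY: "L+H*",
    InfoStatus.HEARER_NEW: "H*",
}


def information_status(spec, inferred=False):
    """Map a task specification status ('Y', '-', 'N') to an information status.

    `inferred` is an assumed flag that separates INFERRABLE from OLD
    for optional attributes.
    """
    if spec == "Y":
        return InfoStatus.KEY
    if spec == "N":
        return InfoStatus.HEARER_NEW
    if spec == "-":
        return InfoStatus.INFERRABLE if inferred else InfoStatus.OLD
    raise ValueError(f"unknown task specification status: {spec!r}")


def pitch_accents(task_spec, inferred_attributes=()):
    """Return the pitch accent type for each attribute in a task specification."""
    return {
        attribute: PITCH_ACCENT[
            information_status(spec, attribute in inferred_attributes)
        ]
        for attribute, spec in task_spec.items()
    }


# Hypothetical specification for a where task (assumed, see lead-in).
WHERE_TASK = {"movie": "Y", "town": "Y", "theater": "N"}
```

Under these assumptions, `pitch_accents(WHERE_TASK)` assigns L+H* to the movie and town values and H* to the theater value, which is the pattern Figure 4 shows for the movie title, the town and the theater names.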

User:  where in montclair is analyze this playing
MIMIC: analyze this is playing at wellmont theatre and clearviews screening zone in montclair

    ANALYZE THIS is PLAYING at WELLMONT THEATER and
    L+H* L+H* L* - H* H* L-H%
    CLEARVIEWS SCREENING ZONE in MONTCLAIR
    H* H* H* L-H% L+H* L-L%

Figure 4: Above, dialogue excerpt of MIMIC performing a where task. Below, the modified version of the bold-faced reply string, generated by MIMIC-CTS.

user, but is derived by MIMIC's limited inference engine that seeks to instantiate as many attribute values as possible. For instance, a theater name may be inferred given a town name, if there is only one theater in the given town. OLD information is inherited from the discourse history, based on updating rules relying on confidence scores for attribute values. HEARER-NEW information (c.f. (Prince, 1988)) is that which is requested by the user, and constitutes the only new information on the scale. But note that KEY information, while given, is still clearly in discourse focus, along with HEARER-NEW information.

The next step is to map the information statuses, ordered from given to new, to a scale of pitch accents, or accent melodies, ordered from given to new as follows:

    L*    L*+H    L+H*    H*
    [given]            [new]

Table 1 summarizes this original mapping of information statuses to pitch accent melodies, and Figure 4 illustrates the use of this mapping in an example. It obeys the general principle of Pierrehumbert and Hirschberg's work, that low tonality signifies discourse givenness and high tonality signifies discourse newness, but extends this principle beyond its vague definition in terms of predication of mutual beliefs. Instead, the principle is operationalized here in a practically motivated manner that is consistent with and perhaps illuminating of the theory.

4.3 Assigning "Dialogue Prosody" using Dialogue Strategies

As in earlier CTS systems, special logico-semantic relations, such as contrast or correction, are effectively conveyed in MIMIC-CTS by prosodic cues. In MIMIC-CTS, however, these situations are not stipulated in an ad hoc manner, but can be determined to a large degree by MIMIC's dialogue strategy selection process that identifies appropriate dialogue acts to realize a dialogue goal [3].

Footnote 3: Importantly, MIMIC's adaptive dialogue strategy selection algorithm takes into account the outcome of an initiative tracking module that we do not discuss here (see (Chu-Carroll, 2000)).

For example, the dialogue act ANSWER may be selected to achieve the dialogue goal of providing an answer to a successful user query, while the dialogue act NOTIFYFAILURE may be performed to achieve the dialogue goal of providing an answer in situations where no movie listing in the database matches the user query. The template associated with the dialogue act NOTIFYFAILURE, when compared with that for ANSWER, contains an additional negative auxiliary associated with the key attribute responsible for the query failure, in an utterance conveying a contradiction in beliefs between the user and system (namely, the presupposition on the part of the user that the query can be satisfied).

Theoretical work on intonational interpretation leads us to prosodically mark the negative auxiliary, as well as the associated focus position (Rooth, 1985). We choose to mark the negative auxiliary "not" with the L+H* pitch accent to convey correction, while marking the material in the associated focus position with the L*+H pitch accent to convey (the

User:  where is the corruptor playing in cranford
MIMIC: the corruptor is not playing in cranford
       the corruptor is playing at lincoln cinemas in arlington

    THE CORRUPTOR is NOT playing in CRANFORD
    L+H* L+H* L+H* !H* L*+H L-H%

Figure 5: Above, dialogue excerpt of MIMIC performing a NOTIFYFAILURE dialogue act. Below, the modified version of the bold-faced reply string, generated by MIMIC-CTS. Note the diacritic "!" denotes a downstepped accent (see (Pierrehumbert, 1980)).
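The marking scheme that Section 4.3 describes for NOTIFYFAILURE replies, and that Figure 5 illustrates, can be sketched as follows. This is a minimal sketch under stated assumptions, not the MIMIC-CTS implementation. The function name and its arguments are hypothetical, the reply is assumed to arrive as a list of lower-case words with the negative auxiliary, focus material and KEY attribute words already identified, and the downstepped !H* that Figure 5 shows on "playing" is not modelled.

```python
def annotate_notify_failure(words, negative_aux, focus_words, key_words):
    """Assign pitch accents and a final contour to a NOTIFYFAILURE reply.

    words        -- reply as a list of words
    negative_aux -- the negative auxiliary, marked L+H* (correction)
    focus_words  -- words in the associated focus position, marked L*+H
                    (lack of commitment to the user's presupposition)
    key_words    -- remaining attribute-value words treated as KEY
                    information by the task model, marked L+H*
    Returns (list of (word, accent-or-None), phrase-final tones).
    """
    annotated = []
    for word in words:
        if word == negative_aux:
            accent = "L+H*"
        elif word in focus_words:
            accent = "L*+H"
        elif word in key_words:
            accent = "L+H*"
        else:
            accent = None  # left to default TTS processing in this sketch
        annotated.append((word, accent))
    # L*+H on the focus item followed by L-H% yields the
    # rise-fall-rise contradiction contour.
    return annotated, "L-H%"


reply = "the corruptor is not playing in cranford".split()
accents, final_tones = annotate_notify_failure(
    reply,
    negative_aux="not",
    focus_words={"cranford"},
    key_words={"the", "corruptor"},
)
```

Keeping the dialogue-act rule separate from the task-model lookup mirrors the compositional design argued for in the text: the dialogue act contributes the accent on the negative auxiliary and the final contour, while the task model contributes accent types for the remaining attribute values.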

system's) lack of commitment to the (user's) presupposition at hand. Finally, the NOTIFYFAILURE dialogue act is conveyed by assigning the so-called rise-fall-rise contradiction contour, L*+H L-H%, to the utterance at large (c.f. (Hirschberg and Ward, 1991)). An example generated by MIMIC-CTS appears in Figure 5. Note that pitch accent types for the remaining attribute values are assigned using the task model, as described in section 4.2. Thus in Figure 5, the movie title is treated as KEY information, marked by the L+H* pitch accent.

MIMIC-CTS contains additional prosodic rules for logical connectives, and clarification and confirmation subdialogues.

5 Related Work

Although a number of earlier CTS systems have captured linguistic phenomena that we address in our work, the computation of prosody from dialogue representations is often not as rigorous, detailed or complete as in MIMIC-CTS. Further, while several systems use given/new information status to decide whether to accent or deaccent a lexical item, no system has directly implemented general rules for pitch accent type assignment. Together, MIMIC-CTS's computation of accentuation, pitch accent type and dialogue prosody constitutes the most general and complete implementation of a compositional theory of intonational meaning in a CTS system to date.

Nevertheless, elements of a handful of previous CTS systems support the approaches taken in MIMIC-CTS toward conveying semantic, task and dialogue level meaning. For example, the Direction Assistant system (Davis and Hirschberg, 1988) mapped a hand-crafted route grammar to a discourse structure for generated directions. The discourse structure determined accentuation, with deaccenting of discourse-old entities realized (by lexically identical morphs) in the current or previous discourse segment. Other material was assigned accentuation based on lexical category information, with the exception that certain contrastive cases of accenting, such as left versus right, were stipulated for the domain.

Accent assignment in the SUNDIAL travel information system (House and Youd, 1990) also relied on discourse and task models. Mutually known entities, said to be in negative focus, were deaccented; entities in the current task space, in referring focus, received (possibly contrastive) accenting; and entities of the same type as a previously mentioned object were classified as in either referring or emphatic focus, depending on the dialogue act; in the cases of corrective situations or repeated system-initiated queries, the contrasting or corrective items were emphatically accented.

The BRIDGE project on speech generation (Zacharski et al., 1992) identified four main factors affecting accentability: linear order, lexical category, semantic weight and givenness. In related work (Monaghan, 1994), word accentability was quantitatively scored by hand-crafted rules based on information status, semantic focus and word class. The givenness hierarchy of Gundel and colleagues (1989), which associates lexical forms of expression with information statuses, was divided into four intervals, with scores assigned to each. A binary semantic focus score was based on whether the word occurred in the topic or comment of a sentence. Finally, lexical categories determined word class scores. These scores were combined, and metrical phonological rules then referred to final accentability scores to assign a final accenting pattern.

To summarize, all of the above CTS systems employ either hand-crafted or heuristic techniques for representing semantic and discourse focus information. Further, only SUNDIAL makes use of dialogue acts.

6 Conclusion and Future Work

We are presently carrying out evaluations of MIMIC-CTS. An initial corpus-based analysis compares the prosodic annotations assigned to three actual MIMIC dialogues, which were previously collected during an overall system evaluation (Chu-Carroll and Nickerson, 2000). The corpus of dialogues is made up of 37 system/user turns, including 40 system-generated sentences. Three versions of the MIMIC dialogues are being analysed, with prosodic features arising from three different sources: MIMIC-CTS, MIMIC operating with default Bell Labs TTS, and a professional voice talent who read the dialogue scripts in context. This corpus-based assessment, comparing the prosody of CTS-generated, TTS-generated, and human speech, will enable more domain-dependent tuning of the MIMIC-CTS algorithms, as well as the refinement of general prosodic patterns for linguistic structures, such as lists and conjunctive phrases. Ultimately, the value of MIMIC-CTS must be measured based on its contribution to overall task performance by real MIMIC users. Such a study is under design, following (Chu-Carroll and Nickerson, 2000).

In conclusion, we have shown how prosodic computation can be conditioned on various dialogue representations, for robust and domain-independent CTS synthesis. While some rules for prosody assignment depend on the task model, others must be tied closely to the particular choices of content in the replies, at the level of dialogue goals and dialogue acts. At this level as well, however, linguistic principles of intonation interpretation can be applied to determine the mappings. In sum, the lesson learned is that a unitary notion of "concept", from which we generate a unitary prosodic structure, does not apply to state-of-the-art spoken dialogue generation. Instead, the representation of dialogue meaning in experimental architectures, such as MIMIC's, is compositional to some degree, and we take advantage of this fact to implement a compositional theory of intonational meaning in a new concept-to-speech system, MIMIC-CTS.

References

Jennifer Chu-Carroll and Bob Carpenter. 1999. Vector-based natural language call routing. Computational Linguistics, 25(3):361-388.

Jennifer Chu-Carroll and Jill S. Nickerson. 2000. Evaluating automatic dialogue strategy adaptation for a spoken dialogue system. In Proceedings of the 1st Conference of the North American Chapter of the Association for Computational Linguistics, Seattle.

Jennifer Chu-Carroll. 2000. Mimic: an adaptive mixed initiative spoken dialogue system for information queries. In Proceedings of the 6th Conference on Applied Natural Language Processing, Seattle.

J. R. Davis and J. Hirschberg. 1988. Assigning intonational features in synthesized spoken directions. In Proceedings of the 26th Annual Meeting of the Association for Computational Linguistics, pages 187-193, Buffalo.

Barbara Grosz and Candace Sidner. 1986. Attention, intentions, and the structure of discourse. Computational Linguistics, 12(3):175-204.

J. Gundel, N. Hedberg, and R. Zacharski. 1989. Givenness, implicature and demonstrative expressions in English discourse. In Proceedings of CLS-25, Parasession on Language in Context, pages 89-103. Chicago Linguistics Society.

Julia Hirschberg and Janet Pierrehumbert. 1986. The intonational structuring of discourse. In Proceedings of the 24th Annual Meeting of the Association for Computational Linguistics, New York.

J. Hirschberg and G. Ward. 1991. The influence of pitch range, duration, amplitude, and spectral features on the interpretation of L*+H L H%. Journal of Phonetics.

Jill House and Nick Youd. 1990. Contextually appropriate intonation in speech synthesis. In Proceedings of the European Speech Communication Association Workshop on Speech Synthesis, pages 185-188, Autrans.

A. I. C. Monaghan. 1994. Intonation accent placement in a concept-to-dialogue system. In Proceedings of the ESCA/IEEE Workshop on Speech Synthesis, pages 171-174, New Paltz, NY.

C. H. Nakatani. 1998. Constituent-based accent prediction. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics, Montreal.

J. Pierrehumbert and J. Hirschberg. 1990. The meaning of intonational contours in the interpretation of discourse. In Intentions in Communication. MIT Press, Cambridge, MA.

Janet Pierrehumbert. 1980. The Phonology and Phonetics of English Intonation. Ph.D. thesis, Massachusetts Institute of Technology, September. Distributed by the Indiana University Linguistics Club.

J. Pierrehumbert. 1981. Synthesising intonation. Journal of the Acoustical Society of America, 70(4):985-995.

Ellen Prince. 1988. The ZPG letter: subjects, definiteness, and information status. In S. Thompson and W. Mann, editors, Discourse Description: Diverse Analyses of a Fund Raising Text. Elsevier Science Publishers, Amsterdam.

Mats Rooth. 1985. Association with Focus. Ph.D. thesis, University of Massachusetts, Amherst MA.

Richard Sproat, editor. 1997. Multilingual Text-to-Speech Synthesis: The Bell Labs Approach. Kluwer Academic, Boston.

Ron Zacharski, A. I. C. Monaghan, D. R. Ladd, and Judy Delin. 1992. BRIDGE: Basic research on intonation for dialogue generation. Technical report, University of Edinburgh.
