
Multimed Tools Appl DOI 10.1007/s11042-014-2161-5

User behavior fusion in dialog management with multi-modal history cues

Minghao Yang & Jianhua Tao & Linlin Chao & Hao Li & Dawei Zhang & Hao Che & Tingli Gao & Bin Liu

Received: 20 January 2014 / Revised: 26 May 2014 / Accepted: 23 June 2014
© Springer Science+Business Media New York 2014

Abstract Making the talking avatar sensitive to user behaviors enhances the user experience in human computer interaction (HCI). In this study, we combine the user's multi-modal behaviors with their historical information in dialog management (DM) to improve the avatar's sensitivity not only to the user's explicit behavior (speech commands) but also to the user's supporting expressions (emotion, gesture, etc.). In the dialog management, according to the different contributions of facial expression, gesture and head motion to speech comprehension, we divide the user's multi-modal behaviors into three categories: complementation, conflict and independence. The behavior category is first obtained automatically from a short-term and time-dynamic (STTD) fusion model with audio-visual input. Different behavior categories lead to different avatar responses in later dialog turns. Usually, a conflict behavior reflects an ambiguous user intention (for example, the user says "no" while smiling). In this case, a trial-and-error scheme is adopted to eliminate the conversational ambiguity. For the later dialog process, we divide the avatar dialog states into four types: "Ask", "Answer", "Chat" and "Forget". With the detection of complementation and independence behaviors, the user's supporting expressions as well as his (her) explicit behaviors serve as triggers for topic maintenance or transfer among the four dialog states. In the first part of the experiments, we discuss the reliability of the STTD model for user behavior classification. Based on the proposed dialog management and the STTD model, we then construct a drive route information query system by connecting the user behavior sensitive dialog management (BSDM) to a 3D talking avatar. Practical conversation records between the avatar and different users show that the BSDM makes the avatar able to understand and be sensitive to the users' facial expressions, emotional voice and gestures, which improves the user experience of multi-modal human computer conversation.

M. Yang (*) · J. Tao · L. Chao · H. Li · D. Zhang · H. Che · T. Gao · B. Liu
The National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
e-mail: [email protected]
J. Tao, e-mail: [email protected]
L. Chao, e-mail: [email protected]
H. Li, e-mail: [email protected]
D. Zhang, e-mail: [email protected]
H. Che, e-mail: [email protected]
T. Gao, e-mail: [email protected]
B. Liu, e-mail: [email protected]

Keywords Dialog management (DM) · Multi-modal data fusion · Human computer interaction (HCI) · Emotion detection

1 Introduction

Natural human-like talking avatars, as an exciting modality for human-computer interaction (HCI), have been studied intensively for the last 10 years. Since the pioneering work of the web woman newscaster Ananova [1], remarkable progress has been achieved in human-like talking avatars [8, 13, 18, 19, 24, 27, 34, 39]. With the improvement of speech recognition and natural language processing, topic-centric human-computer interactions have made great progress and been widely applied in call center services, web sales services, ticket and hotel booking [7, 10, 14], etc. However, in spite of the rapid development of user behavior analysis techniques, it is still a big challenge to make virtual avatars converse and respond like a real person in human-computer interactions. One important cause is that the avatar's responses to user behaviors, especially to the users' supporting expressions (emotion, gesture, etc.), do not meet their expectations. It is therefore important to make the avatar responsive to user behaviors with a flexible feedback mechanism in the DM, especially in face to face human-computer conversation applications.

Traditional DM techniques can be divided into two types according to their computational method: rule based DM and probability statistic DM. The former designs all the dialog paths in advance, with a knowledge database and possible response rules behind the dialog. The rules can be planned with slot-filling [11, 40, 41] or finite-state machine (FSM) [20] techniques. These DMs can accurately answer users' queries in applications such as telephone shopping, airline ticket and hotel booking. Probability statistic DMs were developed to construct the dialog workflow based on statistics and machine learning, such as Bayesian networks [26], graphical models [31], reinforcement learning [28] and partially observable Markov decision processes (POMDP) [43]. A probability statistic DM generates the optimal response by searching the time-variant state spaces of the interaction environment. This helps the avatar generate satisfactory responses in specific domain applications, such as virtual museum guides and virtual tours on web services [20, 21, 41]. However, whether rule based or probability statistic, most traditional DMs ignore behavior sensitive mechanisms, because speech is the main interaction modality in spoken dialog.

Recently, researchers and engineers have considered emotion recognition and behavior comprehension in DM. Some DMs produce feedback sentences at the user's emotion peak points or when obvious emotion changes are detected. However, this point-wise feedback mechanism is essentially incompatible with the user's continuous interaction in the dialog process. A small number of discrete emotion and behavior events on a single modality may not reflect the subtlety and complexity of the behaviors conveyed by rich multi-modal information. It leads to superficial avatar responses to a single user modality, against the user's spontaneous multi-modal interaction intention.

In the cognitive sciences, researchers argue that the dynamics of multi-modal human behavior are crucial for its interpretation [30]. A number of recent studies [2, 4, 16, 29, 38] demonstrate this point of view. These works imply that combining historical audio-visual information can improve user behavior comprehension. However, it is still a big challenge to construct a user behavior sensitive talking avatar. The challenge mainly comes from the difficulty of representing, organizing and interpreting the multi-modal behavior information in the conversation process. In spite of recent developments in human emotion and behavior comprehension, the existing methods for dealing with emotion, behavior, gesture and speech information are in essence psycholinguistic and discourse-analytic methods applied in a strictly formal manner. An effective multi-modal behavior fusion model and a flexible behavior sensitive DM are necessary for practical human computer dialog systems.

In this study, we adopt the following two ways to improve the avatar's communication performance as well as its sensitivity to user behavior: (1) instead of detecting the user's emotion peak points, we record the user's historical behaviors and obtain a behavior interpretation with a short-term and time-dynamic (STTD) fusion model before the topic moves to the next phase; (2) according to the different contributions of facial expression, gesture and head motion to automatic speech recognition (ASR), we divide the multi-modal behaviors into three categories: complementation, conflict and independence. Different user behavior categories lead to different avatar responses. As the STTD model can easily be combined with traditional DM techniques, the combined DM provides a flexible, user behavior sensitive topic management framework for multi-modal human computer conversation. Finally, we construct a drive route information query system by connecting the user behavior sensitive DM (BSDM) to a 3D talking avatar. The records of the dialog history between the talking avatar and different users show that the proposed DM makes the avatar able to understand and be sensitive to the users' facial expressions, emotional voice and gestures, which improves the user experience of multi-modal human computer conversation.

The rest of this paper is organized as follows: an overview of our DM framework, including its structure, data flow and the function of each module, is given in section 2; how the STTD model is constructed and how its features are selected is described in detail in section 3; section 4 introduces how the user's multi-modal behaviors contribute to dialog topic maintenance or transfer according to the user behavior classification results obtained from the STTD model; the experiments on the BSDM used in a traffic route query system and the discussions are given in section 5; and section 6 concludes this paper.

2 System structure

The structure of the proposed user behavior sensitive dialog management (BSDM) is presented in Fig. 1. It consists of three components: the input module, the dialog management (the green rectangle) and the output module. Every module in our DM receives and deals with multi-modal information, including speech, facial expression, head motion, prosody and gesture information. According to the contributions of the different modalities to speech comprehension, user behaviors fall into two main categories in our dialog management. The first one is "explicit behavior", which expresses the user's obvious positive or negative interaction intention, such as speech commands ("yes" or "no"), head nodding or shaking, "ok" or "victory" gestures, etc. The other category is "supporting expression". A supporting expression is the user's nonverbal behavior, which usually helps to enhance the user's expression while speaking (such as facial expression, prosody and body movement, etc.).

Fig. 1 User behavior sensitive dialog management (BSDM)

User multi-modal behavior features are captured and extracted from the camera and microphone, and are fused in the front part of the DM module with the STTD model. For the user's explicit behaviors, we adopt a key word list to detect whether the user's verbal description conflicts with his (her) other behavior modalities. Once a conflict behavior happens, the DM asks the user to clarify his (her) intention with the trial-and-error scheme. Besides the explicit behaviors, the user's supporting expressions help to enhance the user's expression in the dialog. To make the talking avatar sensitive to user supporting expressions in human computer interaction, we further define two other user behavior categories (complementary and independence) to describe the relationship among the user's supporting expressions. According to the user's behavior type (complementary or independence), the DM then determines whether the conversation topic is maintained or transferred. The detailed definitions of the user behavior categories are given in section 4.1. As the user's conflicting intention has been clarified with the trial-and-error scheme in the front part of the DM, in the following dialog phase we divide the avatar dialog states into four types: "Ask", "Answer", "Chat" and "Forget" for users' complementation and independence behaviors (section 4.3). With the help of topic management techniques, the user's supporting expressions as well as his (her) explicit behaviors serve as triggers for topic maintenance or transfer among these four dialog states. Unlike traditional DM, based on the STTD model, all user behaviors, including emotion, gesture and body movement, are considered as important as speech information for the talking avatar. A variety of traditional DM techniques (such as slot-filling, finite-state machine (FSM), Bayesian networks and POMDP, etc.) can be effectively embedded in our proposed DM to manage the topic transfer. In section 5.3, we adopt the slot-filling technique in our DM and construct a practical traffic route query system. The experiments show that our DM is effectively sensitive to user behavior in human computer conversation.
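To make this data flow concrete, the following Python sketch shows one way the front part of the BSDM could route the fused behavior decision to the dialog manager. The class and function names (the STTD model interface, the dialog manager interface, BehaviorCategory) are hypothetical illustrations, not the authors' implementation.

```python
from enum import Enum

class BehaviorCategory(Enum):
    COMPLEMENTATION = "complementation"
    CONFLICT = "conflict"
    INDEPENDENCE = "independence"

def dm_front_end(sttd_model, dialog_manager, audio_frames, video_frames, asr_text):
    """Hypothetical front end of the BSDM data flow sketched in Fig. 1.

    The STTD model fuses the multi-modal cues into one of the three
    behavior categories; the dialog manager then decides how to respond.
    """
    category = sttd_model.classify(audio_frames, video_frames, asr_text)

    if category == BehaviorCategory.CONFLICT:
        # Ambiguous intention: ask the user to clarify (trial-and-error scheme).
        return dialog_manager.ask_clarification(asr_text)

    # Complementation and independence behaviors drive topic maintenance
    # or transfer among the four dialog states ("Ask", "Answer", "Chat", "Forget").
    return dialog_manager.step(asr_text, category)
```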

3 User behavior estimation

In terms of multi-modal behavior recognition, the Long Short-Term Memory (LSTM) method has proved to be a successful model for continuous audio-visual affect recognition [16, 23].

It adopts recurrent neural network architectures which take past observations into account through cyclic connections in the network's hidden layer. Different from the original recurrent neural network, besides the "input" and "output" gates, the LSTM model introduces a "forget" gate to allow the memory cells to reset themselves when the network needs to forget past inputs. In spite of its high performance in continuous emotion classification, LSTM calculates a total of 2,094,210 weights with 128 memory blocks over 30 to 60 epochs. In order to provide effective human-computer interaction, the system's computational complexity needs to be reduced when the dynamic features are of high quality per unit time [44]. On this point, the original LSTM meets challenges in practical applications. Moreover, the influences of gesture, head motion and ASR results on user emotion, and how the "forget" gate is triggered, were not discussed for LSTM. There are other effective multi-modal emotion recognition methods, such as MHMM (mixed hidden Markov model) [35], HCRF (hidden conditional random field) [9] and UBM-MAP (universal background model and maximum a posteriori probability) [6], etc. However, these methods mainly discuss facial and audio expression classification, and they do not consider how gesture, linguistic and non-linguistic features are combined with audio-visual features for behavior estimation.

Our DM determines the state transfer time by considering whether the emotional tendency is consistent with the topic direction. Bell and Gustafson [3] propose that a user's positive attitudes contribute to the dialog going into deeper turns, while negative attitudes easily cause a topic transfer or lead to dialog breakdown. Inspired by the work proposed in [37], we develop a short-term and time-dynamic (STTD) fusion model to obtain the user's conversation attitude by combining his (her) multi-modal behaviors. The attitudes are further recorded and used to determine the topic transfer and the "forget" time in later topic management. When the user's attitudes go against the topic direction or the user has not been interested in the current topic for a while, the "forget" or "chat" state is triggered. At this time, the dialog history is reset and the topic transfers to a new one (see section 4.3 for details).

3.1 Short-term and time-dynamic (STTD) fusion model

The fusion model proposed in [37] is constructed from two stages of classifiers (Fig. 2(a)). An SVM is utilized as the first stage classifier to determine the current epoch's emotion state. At the second stage, the authors take Bayesian inference or SVM to combine the results from the first stage for the final emotion classification. Tao et al. [37] discussed and compared the recognition results with those of [16] on the audio-visual sub-challenge of the AVEC-2012 database. They claimed that the LSTM model obtains the best classification results on the emotion valence dimension, SVM-Bayesian obtains the best results on the power dimension, while SVM-SVM gets the best results on arousal and expectancy.

In LSTM and in the model proposed in [37], the short-term features are combined into long-term features by a recurrent neural network or a two-layer classifier (Fig. 2(a)). These features are low-level features (local binary patterns for facial expression and MFCC for audio, with shift windows). However, as users' time-dynamic behaviors, such as head motion, speech prosody and verbal description, contribute to topic maintenance and transfer in the dialog process as well as to short-term emotion change, we modified the fusion model of [37] by importing time-dynamic behaviors at the second stage classifier on top of the original two-layer classifier model. Figure 2(b) presents the structure of the STTD fusion model.

In Fig. 2(a), ft denotes the short-term features at frame t, which contain 5,999 facial features (local binary patterns (LBP)) and 1,941 audio features per frame [16, 37]. dt and ot are the first stage classifier results and the second stage results respectively. In Fig. 2(b), we adopt high-level visual facial features instead of the numerous low-level visual features of Fig. 2(a) [16, 37].

Fig. 2 a The original two layer classifier model proposed in [37]; b The proposed short-term and time-dynamic (STTD) fusion model (time-dynamic features are marked in gray circles at the second stage classifier)

On the topic of high-level facial feature selection, [17] proved that the facial point distribution model (PDM) is an effective description method for emotional classification. Xin et al. [42] further adopted CART and boosting algorithms to pick out the dominant high-level features in audio-visual emotion recognition. In this work, we hope to obtain the following two advantages, as well as to reduce the computation and storage cost of the STTD model, with high-level features: (1) high-level features provide an effective explanation and classification of the user's time-dynamic behaviors, such as agreement (nodding), refusal (shaking the head) and leaving (turning the body to a different direction). In these cases, besides the facial emotion estimation, we adopt the shift of facial point positions between frames as high-level features for gesture estimation; (2) the first stage classification results (d′t) obtained from the epoch features (f′t) can be naturally combined with the time-dynamic features (the mean value and variance of the points' shift distance along the x or y coordinate, ASR results, etc.) at the second stage for the final emotion classification. These time-dynamic features are labeled pt and qt in Fig. 2(b). Here, pt can be the facial point shift between frames or the speech prosody features presented as MFCC; qt is linguistic information, head motion or ASR results which contain obvious user behavior over a period of time.
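The sketch below illustrates, under our reading of Fig. 2(b), how the per-frame short-term features and the time-dynamic features could be routed through the two classification stages. The array layout and the classifier interface are illustrative assumptions, not the paper's implementation; the actual classifiers are described in section 3.5.

```python
import numpy as np

def sttd_predict(stage1_clf, stage2_clf, frame_features, p_feats, q_feats):
    """Two-stage STTD fusion over one fusion window of n frames.

    frame_features: (n, d) short-term features f't per frame (V1-V18 plus frame MFCCs).
    p_feats:        (n, k) frame-level time-dynamic features pt (e.g. V19-V20, prosody).
    q_feats:        (m,)   window-level time-dynamic features qt (e.g. V21-V22, A40-A51, keywords).
    """
    # Stage 1: one decision d't per frame from the short-term features.
    d = stage1_clf.predict(frame_features)              # shape (n,)

    # Stage 2: concatenate the frame decisions, the frame-level dynamics and
    # the window-level dynamics into a single vector describing the window.
    fused = np.concatenate([d.astype(float), p_feats.reshape(-1), q_feats])
    return stage2_clf.predict(fused.reshape(1, -1))[0]  # 0/1 behavior decision
```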

3.2 Visual features

Traditional tracking algorithms, such as the Active Shape Model (ASM), the Active Appearance Model (AAM) and the Viola–Jones detector [15], track feature points with low-level image characteristics in a local range, and the tracking results are often noisy due to varying illumination or partial occlusion. High-level constraint models have been proved effective for emotional classification [17, 42]. Inspired by these studies, we construct a high-level visual feature set based on the facial tracking algorithm of [32]. A total of 103 feature points are obtained on the face with this method (Fig. 3(a)). Dobrisek et al. [9] suggest that the points close to the eyes, mouth and eyebrows have a dominant influence on visual emotion recognition. With this assumption, we select 29 points (Fig. 3(b)) and construct 22 high-level features as the short-term and time-dynamic visual features (Table 1).

In Table 1, the definition of feature Vi (1≤i≤22) is given in the second column, where α, β, θ, γ are feature angles around the eyes and eyebrows; Xi and Yi are the x and y coordinates of point pi (Fig. 3(b)); |Pi,Pj| is the distance between point pi and point pj; W and H are the lengths |X1−X7| and max(|Y12−Y4|, |Y9−Y4|), which represent the face width and height. The "Type" column gives the feature categories: the features marked "ST_F" are short-term features input at the first stage (the features f′t in Fig. 2(b)), and the items marked "TD_P" and "TD_Q" are time-dynamic features input at the second stage (pt and qt in Fig. 2(b)).

Fig. 3 a One hundred three face tracking points; b The selected 29 facial points used in STTD model

Table 1 The visual features in STTD model

Id Definition Type

V1 (θ+γ)/2π ST_F

V2 (α+β)/π ST_F

V3 (|P11,P16|+|P10,P18|)/H ST_F

V4 |P24,P26|/(|P24,P26|+|P27,P25|) ST_F

V5 |2*Y27−(Y22+Y23)|/H ST_F

V6 min(|Y27−Y25|, |Y26−Y24|)/max(|Y27−Y25|, |Y26−Y24|) ST_F

V7 |Y21−Y19|/H ST_F

V8 |Y17−Y15|/H ST_F

V9 |X18−X16|/W ST_F

V10 |Y13−Y14|/H ST_F

V11 |Y8−Y20|/H ST_F

V12 |Y14−Y24|/H ST_F

V13 |Y20−Y26|/H ST_F

V14 |Y29−Y28|/H ST_F

V15 |Y26−Y5|/H ST_F

V16 |Y24−Y9|/H ST_F

V17 min(|Y14−Y16|, |Y17−Y15|)/max(|Y14−Y16|, |Y17−Y15|) ST_F

V18 min(|Y18−Y20|, |Y21−Y19|)/max(|Y18−Y20|, |Y21−Y19|) ST_F

V19 (∑_{j=1}^{29} X_t^j − ∑_{j=1}^{29} X_{t−1}^j)/(29×W) TD_P

V20 (∑_{j=1}^{29} Y_t^j − ∑_{j=1}^{29} Y_{t−1}^j)/(29×H) TD_P

V21 ∑_{i=t−n}^{t} |X_i − X̄|/(n×W) TD_Q

V22 ∑_{i=t−n}^{t} |Y_i − Ȳ|/(n×H) TD_Q

The items V1–V6 have been adopted in [42] as visual features for emotion recognition. Based on these features, we add V7–V18 as further short-term features at frame t. These 12 features record the distances between the feature points around the mouth, eyes and eyebrows, and are input to the first stage classifier of the STTD model together with V1–V6 as f′t. The items V19–V20 represent the mean shift of all points between the current frame t and its previous frame along the x and y coordinates respectively. These two features record the head motion between frames, and they are input to the second stage classifier as pt. V21–V22 are time-dynamic features (input as qt). They represent the variance of the point shifts along the x and y coordinates between frame t−n and frame t, where n is the size of the fusion window. The features V1–V20 are calculated frame by frame, and V21–V22 are calculated only at the end of utterances. In a practical dialog system, V21–V22 can be calculated synchronously with speech recognition. For frames which are not utterance ending points, the values of V21–V22 can be set to small random values.
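As a concrete illustration, the following sketch computes the time-dynamic visual features V19–V22 from the tracked facial points, following the reading of Table 1 given above (mean point shift between consecutive frames for V19–V20, and the spread of point positions over the fusion window, described as the shift variance in the text, for V21–V22). The array layout and the interpretation of Xi as the per-frame mean x position are assumptions for illustration.

```python
import numpy as np

def head_motion_features(points_t, points_prev, face_w, face_h):
    """V19-V20: mean shift of the 29 tracked points between consecutive frames.

    points_t, points_prev: (29, 2) arrays of (x, y) coordinates.
    face_w, face_h: face width |X1-X7| and height max(|Y12-Y4|, |Y9-Y4|).
    """
    v19 = (points_t[:, 0].sum() - points_prev[:, 0].sum()) / (29.0 * face_w)
    v20 = (points_t[:, 1].sum() - points_prev[:, 1].sum()) / (29.0 * face_h)
    return v19, v20

def window_shift_features(window_points, face_w, face_h):
    """V21-V22: spread of the point positions over the fusion window (frames t-n .. t).

    window_points: (n+1, 29, 2) array of tracked points over the window.
    """
    x = window_points[:, :, 0].mean(axis=1)   # per-frame mean x position
    y = window_points[:, :, 1].mean(axis=1)   # per-frame mean y position
    n = len(window_points) - 1
    v21 = np.abs(x - x.mean()).sum() / (n * face_w)
    v22 = np.abs(y - y.mean()).sum() / (n * face_h)
    return v21, v22
```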

3.3 Audio features

For the audio features, similar to [16], we extract 13 Mel-Frequency Cepstral Coefficients (MFCC) together with their first and second order temporal derivatives every 25 ms with a 10 ms shift window. We denote these 39 features A1–A39. Combined with the visual short-term features V1–V18, they form a 57-dimension feature vector that is input into the first stage classifier. At the second stage classifier, the following features have been confirmed to be useful for emotional speech classification: utterance duration, F0 range, the maximum of F0, the minimum of F0, the mean of F0, the mean of energy, and the mean of durations [17]. Besides these parameters, we added some audio parameters in this work: the position of the maximum F0, the position of the minimum F0, the position of the duration peak, and the position of the minimum duration. Parameters describing laryngeal characteristics of voice quality were also taken into account. Kang and Tao [17] analyzed the importance of all these audio features for emotional speech. The results showed that the mean of F0, the maximum of F0, the F0 range, the mean of energy, the mean of durations and the position of the minimum F0 have much better "resolving power" for emotion perception than the other parameters. These features (denoted A40–A51) are selected as the time-dynamic audio features qt, and are input at the second stage classifier of the STTD model together with the visual items V21–V22.
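A minimal sketch of the frame-level audio feature extraction described above (13 MFCCs plus first and second order deltas on 25 ms windows with a 10 ms shift). librosa is used here as an assumed stand-in toolchain; the paper's experiments used the SPTK toolkit, and the utterance-level prosodic statistics A40–A51 are not shown.

```python
import numpy as np
import librosa

def frame_audio_features(wav_path, sr=16000):
    """A1-A39: 13 MFCCs with first and second order deltas, 25 ms window, 10 ms shift."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=int(0.025 * sr),       # 25 ms analysis window
                                hop_length=int(0.010 * sr))  # 10 ms shift
    d1 = librosa.feature.delta(mfcc)            # first order temporal derivative
    d2 = librosa.feature.delta(mfcc, order=2)   # second order temporal derivative
    return np.vstack([mfcc, d1, d2]).T          # shape: (num_frames, 39)
```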

3.4 Linguistic and non-linguistic features

In the conversation process, non-linguistic vocalizations (such as laughing, breathing and sighing) can be modeled by MFCC parameters [16], which coincide with the audio features A1–A39. For the linguistic features, key words are detected in every audio chunk in our DM. According to the contextual meaning of different key words for user behaviors, we list some typical key words for the three user behavior categories. Once a key word is detected in an audio chunk, the value of the corresponding time-dynamic linguistic feature qt is obtained. Different kinds of key words are assigned different values. These values are combined with the time-dynamic visual features (V21–V22) and audio features (A40–A51) at the second stage of the STTD model. For details, please refer to section 5.2.

3.5 Features fusion

Once all the features are obtained, they are input into the STTD model according to their categories and stages (Fig. 2(b)). At the first stage, an SVM is used to make a decision about the state of every single frame. Next, these isolated decision values are combined with the time-dynamic features and input into the second classifier to determine the user behavior from frame t−n to frame t.

The basic SVM method is extended to a non-linear classifier by using the kernel concept, where every dot product in the basic SVM formalism is replaced with a non-linear kernel function. A maximum margin classifier (hyper-plane) is chosen to maximize the separability of the two classes. With all the features having a clear physical meaning in terms of emotion, gesture and semantic expression, the data are easily labeled and can be effectively trained in the STTD model. In the STTD model, as the time-dynamic features qt (V21–V22 and A40–A51) are normalized to the range (0, 1), we label the output values of the first stage SVM (dt) as "0" or "1". Similarly, the output values of the second stage SVM (ot) are labeled "0" or "1". At the front part of the DM, this two-class classification scheme helps decide whether the user behavior is "conflict" or not, and it further helps the DM decide whether the user behavior is "complementary" or "independent" in the later phase of the DM. How the users' multi-modal behaviors are recorded, labeled and classified with the STTD model is described in detail in section 5.2.
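A minimal training sketch of the two-stage fusion described above, using scikit-learn SVMs as an assumed implementation (the paper does not name its SVM toolkit). Stage one is trained on frame-level features; its per-frame decisions over each fixed-size fusion window are then concatenated with the time-dynamic features to train the stage-two classifier.

```python
import numpy as np
from sklearn.svm import SVC

def train_sttd(frame_X, frame_y, windows, window_y):
    """frame_X: (N, d) short-term features; frame_y: (N,) 0/1 frame labels.
    windows:  list of (frame_indices, time_dynamic_vector) pairs, one per fusion
              window of fixed size (e.g. 22 frames, as in section 5.2).
    window_y: (M,) 0/1 behavior label per window.
    """
    stage1 = SVC(kernel="rbf").fit(frame_X, frame_y)

    # Stage 2 input: per-frame decisions d_t concatenated with the window's
    # time-dynamic features (p_t / q_t), one fused vector per window.
    fused = np.array([np.concatenate([stage1.predict(frame_X[idx]).astype(float), td])
                      for idx, td in windows])
    stage2 = SVC(kernel="rbf").fit(fused, window_y)
    return stage1, stage2
```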

4 Dialog management

An appropriate response mechanism helps the talking avatar to be sensitive to user interaction behavior. In this section, we present the detailed definitions of the three user behavior categories (complementation, conflict and independence) and the structure of the BSDM.

4.1 User behavior model

All user behaviors, not only the users' query information but also their emotions and gestures, need to be considered synchronously in a dialog system. As introduced before, according to the different contributions of prosody, facial expression and head motion to speech comprehension, we divide the multi-modal user behaviors into three categories: complementation, conflict and independence.

1) Conflict. In this category, the emotion recognition results conflict with those of ASR or gesture, for example: the user smiles while saying "no" or shaking his head. Such confusing behavior can happen when a person wants to refuse others politely. These ambiguous multi-modal behaviors easily lead to inexact behavior understanding by the computer. Table 2 lists several conflict behaviors which might happen in daily life. Once these situations are detected, the DM first adopts the trial-and-error scheme [5] to let the user clarify his (her) intention before the next dialog turn.

2) Complementary. In this category, the user's emotion and behavior usually support what he (she) is saying. For example: the user says "I want to go there" while pointing to a place; or, when the user does not understand the avatar's responses, he (she) asks the question again with an interrogative expression. In these situations, the DM needs to combine the behavior information of two or more channels to understand the user's intention.

3) Independent. In this category, the result of emotion recognition is independent of ASR. For example: the user smiles while listening. In this situation, the user's emotion and behaviors are mostly personal speech habits, totally independent of his (her) verbal description.

With the definitions of the three kinds of user behaviors, the user's conflicting behavior can first be clarified with the trial-and-error scheme. For the complementary and independence behaviors, the computer determines how to respond to the user with the help of the DM.

Table 2 Conflict user behaviors that might happen in daily life

Number | Visual-audio features | ASR | Head motion/gesture | Non-linguistic features

1 | Positive (happiness, smile…) | – | Shaking (head, hands) | –
2 | Positive (happiness, smile…) | "No"/"Will not (Won't)"/"Impossible"/"Don't"… | – | –
3 | Positive (happiness, smile…) | "No"/"Will not (Won't)"/"Impossible"/"Don't"… | Shaking (head, hands) | Laughing/Sighing
4 | Positive (happiness, smile…) | "Ok"/"Right"/"Sure"… | Shaking (head, hands) | –
5 | Surprise, anger | "Ok"/"Right"/"Sure"… | – | Interrogative/Rhetorical tone
6 | Negative (sadness, sorrow…) | "Yes"/"Know"/"Ok"/"Right"… | – | Laughing/Sighing
7 | Negative (sadness, sorrow…) | – | Nod | Laughing/Sighing
8 | Negative (sadness, sorrow…) | "Yes"/"Know"/"Ok"/"Right"… | Nod | Laughing/Sighing
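As an illustration of how the conflict cases in Table 2 could be operationalized, the sketch below checks whether the detected audio-visual emotion polarity contradicts the polarity implied by the ASR key words or the head gesture. The keyword sets and gesture labels are abbreviated, illustrative assumptions.

```python
POSITIVE_WORDS = {"ok", "right", "sure", "yes", "know"}   # illustrative keyword lists
NEGATIVE_WORDS = {"no", "won't", "impossible", "don't"}

def is_conflict(emotion_polarity, asr_text, head_gesture):
    """emotion_polarity: 'positive' or 'negative' from the STTD model.
    asr_text: recognized utterance (may be empty).
    head_gesture: 'nod', 'shake' or None.
    """
    words = set(asr_text.lower().split())
    speech_polarity = ("positive" if words & POSITIVE_WORDS
                       else "negative" if words & NEGATIVE_WORDS else None)
    gesture_polarity = {"nod": "positive", "shake": "negative"}.get(head_gesture)

    # Conflict: the audio-visual emotion contradicts the speech or gesture polarity,
    # e.g. the user smiles while saying "no" or shaking the head (Table 2, rows 1-4).
    return any(other is not None and other != emotion_polarity
               for other in (speech_polarity, gesture_polarity))
```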

4.2 Dialog management

Whether rule based or probability statistic based, one core function of a DM is to decide the appropriate time to respond to the user and to keep the conversation smooth. In the daily dialog process, most responses can be classified into three styles: "ask", "answer" and "chat". Answering can be viewed as continuing the dialog on the current topic; related techniques such as slot-filling [11, 40, 41], finite-state machines (FSM) [20] and POMDP [43] can be used to generate the answer. "Ask" and "chat" happen in both the topic maintenance and topic transfer phases according to the users' different attitudes. Based on this assumption, we propose a dialog management prototype with three dialog states: "Ask", "Answer" and "Chat" (Fig. 4). As introduced before, the "trial-and-error" strategy [5] is effective for confirming user conflict behaviors before the topic goes to the next turn. These questions are usually started by the DM, which belongs to the system-initiative dialog strategy. The "trial-and-error" time can be determined by the STTD model when user conflict behaviors are detected. At any time in the dialog process, once the user submits a question, the system tries to answer the user's query; this case belongs to user-initiative dialog. If the system cannot find an answer, it asks the user to offer more material for response generation; this case is the system-initiative dialog strategy.

Fig. 4 The three-state DM prototype

In the dialog process, when a user complementation or independence behavior is detected, the system generates a response according to the state transfer strategy in the DM (Figs. 4 and 5). For example, in an airline ticket booking application constructed with the slot-filling technique, supposing the slots are departure city, departure time, destination city and destination time, when the user submits a question and there are empty slots, the system goes to the "Ask" state and asks the user to fill the empty slots. Once all the slots are filled, the system transfers into the "Answer" state. If we use POMDP techniques in the ticket booking application, the system "answers" the user's question based on the maximal reward criterion. If the expected reward is not obtained, the system "asks" the user to provide more information and searches the state space [43]. Once a query is finished and there is no new query from the user, the system transfers to the "Chat" state and presents some news about the departure or destination city. Whether the slot-filling or the POMDP technique is in use, this DM prototype belongs to the mixed-initiative dialog strategy, where both the user and the system can start a topic freely.
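The sketch below illustrates the three-state prototype of Fig. 4 with a slot-filling back end for the booking/route example above. The slot names and the class interface are illustrative assumptions, not the system's actual implementation.

```python
SLOTS = ("departure_city", "departure_time", "destination_city", "destination_time")

class SlotFillingDM:
    """Three-state prototype (Fig. 4): Ask / Answer / Chat, driven by slot filling."""

    def __init__(self):
        self.slots = dict.fromkeys(SLOTS)   # all slots start empty
        self.state = "Chat"

    def step(self, parsed_slots):
        """parsed_slots: slot values extracted from the latest user utterance."""
        self.slots.update({k: v for k, v in parsed_slots.items() if v})
        missing = [k for k, v in self.slots.items() if v is None]
        if missing:
            self.state = "Ask"              # keep asking until every slot is filled
            return f"Please tell me your {missing[0].replace('_', ' ')}."
        self.state = "Answer"               # all slots filled: answer the query
        return (f"Searching routes from {self.slots['departure_city']} "
                f"to {self.slots['destination_city']}.")
```

For instance, calling step({"destination_city": "YiHeYuan"}) on a fresh instance would move the prototype to the "Ask" state and prompt for the departure city.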

4.3 Behavior sensitive DM (BSDM)

We continue to construct a user behavior sensitive DM by connecting the STTD model (section 3.1) to the DM prototype proposed in Fig. 4. The structure of the full BSDM is shown in Fig. 5. In Fig. 5, the left part in the yellow rectangle is the STTD model, and the right part in the light blue rectangle is the DM. The two parts are connected by blue lines, which represent the data flow between the two modules. If the user behaviors are conflicting, the system adopts the "trial-and-error" scheme to ask the user to clarify his (her) intention again (line A and line B in Fig. 5). With the "trial-and-error" scheme, the ambiguous behaviors are detected and eliminated before the topic continues. This method gives the system the ability to "feel" the user's ambiguous emotion states.

Like the process flow for the conflict type, the response to a user's complementary behavior may also go to the "Ask" state. The difference is that the complementary flow considers not only the ASR results but also the user's supporting expressions and body movements, while the response to a conflict behavior only enters the "Ask" state to let the user clarify his (her) intention, with sentences returned to the user directly and no state transfer in the DM.

When the user is talking, his (her) gestures and motions are mostly personal speech habits without special meaning. At these moments, user behaviors are often independent of the speech. If there is no distinct emotional expression or drastic gesture change while talking, the user behaviors can be viewed as totally independent of the ASR results. Therefore, for user independence behaviors, ASR has a more dominant influence on topic transfer than the other channels. In these cases, the dialog processes are similar to those of the DM prototype presented in section 4.2.

Fig. 5 The complete structure of the user behavior sensitive DM (BSDM)

After the STTD model is connected with the DM, we call this new DM the behavior sensitive DM (BSDM). The BSDM has the following advantages in comparison with traditional DMs:

(1) In the BSDM, user behavior and emotion changes contribute to dialog state transfer as well as the speech content, which improves the user experience of human computer conversation. Table 3 lists the events which trigger dialog transfers among the "Ask", "Answer", "Chat" and "Forget" states in the BSDM (the working mechanism of the "Forget" state is introduced in the next paragraph); a sketch encoding a subset of these triggers as a transition map is given after this list. The emotion and behavior events labeled in bold in Table 3 are the new triggers which do not exist in the original DM prototype presented in Fig. 4. For example, in the original DM, the dialog state goes from "Ask" to "Answer" only when all slots are filled or the reward is beyond a certain threshold, while in the BSDM, "Positive Emotion or Gesture" and "Independent Behaviors without Speaking" may lead to the topic transfer as well. For example, when the system asks the user questions with the "trial-and-error" scheme, a "Positive Emotion or Gesture" event (such as the user's head nodding, or an "ok" or "victory" hand gesture) makes the user's intention clear and the system will transfer from the "Ask" state to the "Answer" state. It is worth noting that "Independent Behaviors without Speaking" can trigger a system transfer from the "Answer" state to the "Ask" state. For example, when the system answers a question but the user presents an "unknown" or "incomprehension" expression without speaking for a while, the system will turn to the "Ask" state by clearing a recent slot or trimming a search path from the POMDP state space. The dialog process will begin a new turn by recalling the dialog branch and prompting the user to input again.

Table 3 The state transfer triggers in user Behavior Sensitive DM

Num | Triggers | Possible state transfer

1 | • Slots Complete (or Maximal Gain Found) • Positive Emotion or Gesture | Answer-Answer, Ask-Answer
2 | • Slots Empty (Insufficient Reward) • Negative Emotion or Negative Gesture • Independent Behaviors without Speaking | Answer-Ask
3 | • Slots Empty (Insufficient Reward) • Negative Emotion or Gesture • Complementary Behaviors | Ask-Ask
4 | • Answer Complete • Obvious Emotion Polarity Change (Negative–Positive) • Independent Behaviors without Speaking | Answer-Chat
5 | • New Questions | Chat-Answer
6 | • Questions From User • Topic Transfer | Chat-Chat
7 | • Obvious Emotion Polarity Change (Positive–Negative) | Ask-Chat
8 | • New Questions From User • Obvious Emotion Polarity Change | Chat-Ask
9 | • User's Going Away • User's Long Period Negative Emotion | Chat-Forget
10 | • Clear All Slots (Reset) | Forget-Ask

(2) The BSDM is able to "forget" dialog content at an appropriate time. This new "Forget" state makes the system more flexible with different users. As listed in the 9th and 10th rows of Table 3, when the system detects that there is no user in the view field for a period of time, or the user presents a long period of negative expression in front of the camera, the system starts a new dialog process by transferring to the "Forget" state and resetting the dialog history.

(3) The BSDM makes the "Chat" state responses feel more personal in HCI conversation. Different users have their own personal speech habits. For example, given the same system response in the "Chat" state, some users may show interest by asking new questions, while others may smile or nod without saying anything. For the former users, the system will transfer to the "Answer" state and respond to the users' questions; for the latter users, the system will continue to chat about something else to maintain the topic. During this period, if the user shows a "bored" expression for a while, the system will transfer to the "Ask" state or the "Forget" state and start a new topic. These different schemes keep the dialog process smooth and enhance the user's interaction experience.
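To make the trigger mechanism of Table 3 concrete, the sketch below encodes a subset of the trigger-to-transition rules as a lookup table that could sit on top of the slot-filling prototype sketched earlier. The trigger names are condensed from Table 3; the encoding itself is an illustrative assumption.

```python
# (current_state, trigger) -> next_state, condensed from Table 3 (illustrative subset).
TRANSITIONS = {
    ("Ask", "slots_complete"): "Answer",
    ("Ask", "positive_emotion_or_gesture"): "Answer",
    ("Answer", "slots_empty"): "Ask",
    ("Answer", "negative_emotion_or_gesture"): "Ask",
    ("Answer", "independent_behavior_without_speaking"): "Ask",
    ("Answer", "answer_complete"): "Chat",
    ("Ask", "emotion_polarity_positive_to_negative"): "Chat",
    ("Chat", "new_question"): "Answer",
    ("Chat", "user_going_away"): "Forget",
    ("Chat", "long_period_negative_emotion"): "Forget",
    ("Forget", "clear_all_slots"): "Ask",
}

def next_state(state, trigger):
    """Return the next dialog state; stay in the current state if no rule fires."""
    return TRANSITIONS.get((state, trigger), state)
```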

5 Experiments and discussions

In the experiments, we aim to verify that the BSDM is effectively sensitive to user behavior in the following two respects: (1) the proposed high-level features and the STTD model are effective for emotion and user behavior classification, especially for detecting the user's conflict behavior; (2) we construct a route query system by connecting the BSDM to a 3D talking avatar, and the dialog histories between different users and the talking avatar show that the BSDM really improves the user experience.

5.1 Emotion detection

We first check whether the selected behavior features are effective for emotion and behavior classification. Our selected audio and linguistic features are the same as those used in [16]. The main difference between our selected features and those of [16] and [37] is that we adopt high-level visual features instead of the low-level LBP features. Xin et al. [42] have proved that the features V1–V6 are effective high-level features for emotion detection. Thus, in this part, we mainly confirm that the high-level visual short-term features (V7–V18) are effective for emotion classification. We divide the visual short-term features into three sets, S1 (V1–V6), S2 (V7–V18) and S3 (V1–V18), and compare the emotion recognition precision on these three sets. The test database is the NLPR emotion database [25]. Figure 6 shows some samples from the NLPR emotional database. It contains 30 subjects (15 males and 15 females); each was asked to sit in a noise-reduced environment and talk for about 2 h with exaggerated expressions like drama actors/actresses.

Fig. 6 Samples selected from the NLPR emotional database

The original data was labeled by three annotators as "happiness", "sadness", "anger", "fear", "surprise" or "neutral", piece by piece. In our experiments, the emotion polarities are viewed as the key triggers for topic transfer in the BSDM. We take the "happiness" and "neutral" pieces as positive emotion and the "sadness", "anger" and "fear" pieces as negative emotion. We randomly pick 500 positive and 500 negative emotion pieces from [25]. As we aim to compare the classification results on S1, S2 and S3, in this test only the visual short-term features are input into a single-layer SVM with 5-fold cross validation on the picked pieces from the NLPR emotional database.

Figure 7(a) presents the positive and negative facial emotion classification results for S1, S2 and S3. We can see from Fig. 7(a) that the original visual set S1 obtains 77.02 and 71.69 % classification precision on positive and negative emotion, which is better than S2 (73.61 and 69.80 %). The precision of S3 (S1+S2) shows an obvious improvement over both S1 and S2, reaching 79.95 and 78.26 %. This shows that the proposed high-level feature set S2 (V7–V18) is effective for emotion classification and, combined with S1 (V1–V6), helps improve the emotion polarity classification.

Figure 7(b) presents the positive and negative audio-visual emotion classification results of our STTD model for different sizes of the fusion window. In Fig. 7(b), the values along the x axis are the fusion window sizes at the second stage of the STTD model, and the values along the y axis are the emotion recognition precision with 5-fold cross validation on the NLPR emotional database. The window size ranges from 1 to 32. When the fusion size is 1, the STTD model degrades into a single-layer SVM. As the sample window size increases, the emotion recognition precision increases; once the fusion size is larger than 22 for positive emotion and 19 for negative emotion, the precision growth is no longer obvious. Tao et al. [35] presented their audio-visual emotion recognition precision on the NLPR emotion database: about 82.0 % ("neutral"), 81.0 % ("happiness"), 80.5 % ("sadness"), 84.0 % ("anger"), 84.0 % ("fear") and 83.0 % ("surprise") respectively. We can see from Fig. 7(b) that the STTD model obtains 83.0 % precision on positive emotion recognition ("neutral" and "happiness"), which is higher than that of the MHMM algorithm proposed in [35]. For negative emotion classification (including "sadness", "anger" and "fear"), the STTD model obtains 82.0 %, which is slightly lower than the mean recognition value of MHMM on the negative emotions (about 82.88 %). Comparing our method with the method proposed in [35] on audio-visual emotion classification, we can see that the STTD model obtains competitive results on positive and negative emotion classification. This shows that the high-level features and the STTD model are effective for emotion and user behavior classification. In the next section, we further check whether the STTD model is effective for user behavior detection with the time-dynamic (gesture and linguistic) features input at the second stage.


Fig. 7 a The positive and negative emotion recognition results with the visual short-term features S1, S2 and S3 respectively; b The positive and negative audio-visual emotion classification results of our STTD model for different sizes of the fusion window

5.2 User behavior classification

In this part, we aim to check that the proposed STTD model is effective for user behavior classification. As the AVEC2012 database and the NLPR emotional database are labeled without head motion and gesture, to verify the effectiveness of our model on behavior analysis we construct a multi-modal user behavior database and verify the model on this database for behavior classification.

We invited 6 candidates (3 males and 3 females) to sit in an office environment with a microphone and web camera (resolution 640×480) in front of them. They were asked to display facial expressions, shake or nod their heads and speak in a natural way. Each candidate was asked to act out each behavior item listed in Table 4 at least 3 times. For the supporting expressions, such as the non-linguistic features in Table 4, candidates could choose whether to act them out. After all speakers' data had been collected, it was labeled by three annotators as "Conflict", "Complementation" or "Independence", piece by piece. If two or more annotator labels did not agree with the intended label marked by the actor, the piece was removed from the database. In this way, we obtained 120 behavior pieces out of a total of 150, with each piece lasting about 10–20 s. Figure 8 shows some samples from our database. The pieces not labeled as "Conflict" were additionally labeled as "Positive" or "Negative", and were used for user behavior attitude detection.

The SPTK toolkit [33] and the facial tracking algorithm [32] were used to obtain the audio and visual features. We validate the STTD model for behavior classification with 5-fold cross validation in the order complementation, independence, conflict. As the values of the visual features lie in [0, 1], we normalize all audio features to the range [0, 1] by their mean value and variance in each sample window. The sample size is set to 22 at the second stage of the STTD model. For the ASR results, there are four kinds of key words in Table 4: positive ("ok", "yes", …), negative ("no", "don't", "impossible", "won't", …), demonstrative pronouns ("there", "here", "this", "that", …) and "NULL". To keep the values of the linguistic features consistent with those of the visual-audio features and scattered over the range [0, 1], the values of these four types of key words obtained from ASR are set to 0.8 (positive), 0.6 (negative), 0.4 (demonstrative pronoun) and 0.2 ("NULL") respectively, as the time-dynamic feature qt at the second stage of the STTD model. We use the Google ASR API [12] for keyword recognition.
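A minimal sketch of the keyword-to-feature encoding described above; the keyword lists are abbreviated examples from Table 4 and the function name is illustrative.

```python
KEYWORD_VALUES = {
    "positive": 0.8,       # "ok", "yes", ...
    "negative": 0.6,       # "no", "don't", "impossible", "won't", ...
    "demonstrative": 0.4,  # "there", "here", "this", "that", ...
    "null": 0.2,           # no key word detected
}

KEYWORDS = {
    "positive": {"ok", "yes"},
    "negative": {"no", "don't", "impossible", "won't"},
    "demonstrative": {"there", "here", "this", "that"},
}

def linguistic_feature(asr_text):
    """Map the ASR result of an audio chunk to the time-dynamic feature q_t."""
    words = set(asr_text.lower().split())
    for kind, vocab in KEYWORDS.items():
        if words & vocab:
            return KEYWORD_VALUES[kind]
    return KEYWORD_VALUES["null"]
```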

Table 4 The user behaviors in our DM

Category | Visual features | Key words (ASR) | Head motion/gesture | Non-linguistic features

Conflict (Conflict_1) | Smile | – | Shake (head, hands) | (Sighing/Interrogative or rhetorical tone)
Conflict (Conflict_1) | Smile | "No"/"Will not (Won't)"/"Don't"/"Impossible" | – | (Sighing/Interrogative or rhetorical tone)
Conflict (Conflict_1) | Smile | "No"/"Will not (Won't)"/"Don't"/"Impossible" | Shake (head, hands) | (Sighing/Interrogative or rhetorical tone)
Conflict (Conflict_2) | Sad | "Ok"/"All right"/"Yes"/"Like"/"Sure" | – | (Laughing)
Conflict (Conflict_2) | Sad | – | Nod | (Laughing)
Complementation | Random facial expression | "There"/"Here"/"That place"/"This"/"That" | (Random head/gesture motion) | (Random non-linguistic vocalizations)
Independence | Random facial expression | – | (Random head/gesture motion) | (Random non-linguistic vocalizations)

Conflict_1 (smiling, shaking head without speech input)

Conflict_2 (sadness, nodding without speech input)

Conflict_1 (smiling, shaking head with speech input)

Conflict_2 (sadness, nodding with speech input)

Independence (positive emotion, random head motion with speech input)

Fig. 8 Some samples from the constructed multi-modal database

Tables 5 and 6 list the confusion matrices of user behavior classification when the ASR results are and are not used in the STTD model, where "Comp", "Ind" and "Conf" represent complementation, independence and conflict respectively. It is noticeable that "complementation" behaviors are similar to "independence" behaviors when there is no speech input in Table 4. First, we take "complementation" and "independence" as one term, "Comp(Ind)", and compare its recognition precision with that of "conflict". The comparison confusion matrix is given in Table 5. When there is no speech input, we can see from Table 5 that "Comp(Ind)" and "Conf" can be effectively detected by the STTD model. The correct classification ratios for "Comp(Ind)" and "Conf" are 78.38 and 83.33 %. This demonstrates that when there is no speech input, the user's conflict behaviors can be effectively distinguished from the complementation and independence behaviors.

Table 5 Confusion matrix of user behavior classification between "Comp(Ind)" and "Conf" without ASR input at the second stage of the STTD model

 | Comp(Ind) | Conf
Comp(Ind) | 78.38 % | 24.62 %
Conf | 16.67 % | 83.33 %

Table 6 Confusion matrix of user behavior classification for "complementation (Comp)", "independence (Ind)" and "conflict (Conf)" with ASR input at the second stage of the STTD model

 | Comp | Ind | Conf
Comp | 91.33 % | 4.92 % | 3.75 %
Ind | 0.12 % | 81.54 % | 18.34 %
Conf | 1.54 % | 13.17 % | 85.33 %

Table 6 presents the confusion matrix of recognition precision for "complementation (Comp)", "independence (Ind)" and "conflict (Conf)" respectively. In this experiment, the pieces of the other two behavior categories are taken as negative samples for the behavior type being detected in the training phase. After the ASR results are combined in the STTD model at the second stage, the classification precisions of "Comp" and "Ind" reach 91.33 and 83.54 %, which are obviously higher than the classification results of the combined term "Comp(Ind)" in Table 5. This shows that the ASR results clearly improve the classification precision for user behavior.

We further confirm the effectiveness of the STTD model for the user's inner conflict behavior classification (conflict_1 and conflict_2 in Table 4). Conflict_1 and conflict_2 behaviors are useful in HCI. For example, in daily human conversation, a conflict_1 behavior (the user says "no" or shakes the head while smiling) could be a polite refusal, and a conflict_2 behavior (the user says "yes" with a sad expression) may reflect the user's reluctant attitude in the conversation. Table 7 presents the inner conflict behavior classification results with the STTD model. We can see that, after the complementation and independence behaviors are detected in advance, the STTD model obtains convincing classification precision (92.48 and 89.33 %) on the two inner conflict behaviors. This shows that the STTD model is effective for user behavior analysis in the BSDM.

5.3 Dialog application

We construct a drive route information query system by connecting the proposed BSDM to a 3D talking avatar [36]. Figure 9 presents the user interface of the constructed drive route query system. There are six modules in total. The first module presents the facial feature point tracking results; the tracked face points are input into the STTD model as short-term or time-dynamic visual features according to the items listed in Table 1 (ft: V1–V18; pt: V19–V20; qt: V21–V22). The second module presents the time-variant information of the user's speech prosody, which is taken as the features pt in the STTD model. The third module is the speech recognition module; the ASR results are first checked against the keyword list and adopted as the time-dynamic features qt in the STTD model. The selected features obtained from modules 1, 2 and 3 are input into the BSDM and further fused for user behavior classification. The BSDM is an invisible module behind the user interface, which automatically records the dialog history between the user and the talking avatar. The fourth module shows the 3D talking avatar, which generates the emotional animation based on the DM output [36]. The sixth module presents the traffic query results to users with 3D map information, which is constructed with the Baidu map API [22].

Table 7 Confusion matrix of the inner conflict (conflict_1 and conflict_2) behavior classification with ASR, after the complementation and independence behaviors were detected and excluded in the STTD model

 | Conflict_1 | Conflict_2
Conflict_1 | 92.48 % | 7.52 %
Conflict_2 | 10.67 % | 89.33 %

Fig. 9 The user interface of our virtual avatar traffic query system

5.4 Dialog records and analysis

We invited ten people of different ages to interact with the multi-modal traffic route query system by talking with the 3D talking avatar (Fig. 9). We list three different users' dialog records in this section and analyze how the user behaviors influence the dialog process. Among these three users, one is a boy, whose dialog history is marked as A; the other two users (B and C) are middle-aged men. They all started the dialog by asking the avatar the same question, "How can I go to YiHeYuan", where YiHeYuan is a destination place. Once the conversation started, the users freely continued to interact with the avatar according to their own intentions in later turns. Tables 8, 9 and 10 list the three users' chat histories with the avatar respectively.

Table 8 The dialog history between user A and avatar

1 User A: How can I get to YiHeYuan? (Asks with calm expression)
2 Avatar: Where would you like to leave for YiHeYuan? (Independence behavior detected; "Ask" state; spreading hands with interrogative tone)
3 User A: XiZhiMen. (Answers the avatar's question)
4 Avatar: From XiZhiMen to YiHeYuan, you first go to ZhiChunLu, then go through YuanMingYuan, and you will arrive at YiHeYuan. (Independence behavior detected, "Answer" state, with right hand pointing to the map)
5 User A: Thanks, how could I go to Bird's Nest Stadium? (Smiles and asks)
6 Avatar: From XiZhiMen to Bird's Nest, please get to WuDaoKou at first and after China University of Geosciences, you will arrive at Bird's Nest. (Independence behavior detected, "Answer" state, with a hand pointing to the real-time 3D map)
7 User A: (Looks at the map for a while in silence)
8 Avatar: Now, there is heavy traffic around the North Fourth Ring Road of Beijing. You could go to Bird's Nest by subway instead. (Independence behavior detected, transfer to "Chat" state)
9 User A: Where is the subway transfer station? (Asks with a puzzled expression)
10 Avatar: ZiZhiMen on subway line 2, where you transfer to subway line 5 at YongHeGong station. (Complement behavior detected, transfer to "Answer" state)
11 User A: Ah, thanks. (Smiles, satisfied with the avatar's answer)
12 Avatar: I am glad to be at your service. Are there other things I can help you with? (Independence behavior detected, transfer to "Chat" state)
13 User A: No, this is ok. Thanks, bye.
14 Avatar: See you, have a nice day. (Independence behavior detected, "Forget" state).

Table 9 The dialog history between user B and avatar

1 User B: How can I get to YiHeYuan? (Asks with calm expression)
2 Avatar: Would you like to go to YiHeYuan from ZhiChunLu? (Independence behavior detected, "Ask" state, spreading hands with interrogative tone)
3 User B: Yes
4 Avatar: From ZhiChunLu to YiHeYuan, you first get to Tsinghua University, then you will arrive at YiHeYuan. (Independence behavior detected, "Answer" state, with left hand pointing to the route map)
5 User B: How could I get to TianAnMen? (Asks again)
6 Avatar: From ZhiChunLu to QianMen, you first go to XiZhiMen, then go through XiDan. At last, …… (Independence behavior detected, "Answer" state)
7 User B: (Smiles and shakes head)
8 Avatar: Ah, won't you go to QianMen? (Conflict behavior detected, turn to "Ask" state)
9 User B: No, I want to go to TianAnMen. (Smiles and answers)
10 Avatar: Yes, from ZhiChunLu to TianAnMen, you first go to XiZhiMen, then go through XiDan. At last, you will arrive at TianAnMen. (Independence behavior detected, "Answer" state)
11 User B: That's fine, thank you. (Satisfied with the avatar's answer)
12 Avatar: It's my pleasure. (Independence behavior detected, "Chat" state, makes a bow to the user)

In the tables, the underlined words are departure or destination places, and the sentences in bold are important user behaviors and avatar state transfers detected by the dialog management. The sentences in the 3rd, 5th and 7th rows of each table present the users' early exchanges with the avatar. For example, in the third row of Tables 8 and 9, user A provided the departure place directly and user B nodded his head; the direct speech command and the head gesture both led the avatar to give the detailed traffic route information in the next turn (the 4th row in Tables 8 and 9). In the third row of Table 10, user C shook his head, which caused the avatar to ask the user again about the departure place. With the user behavior analysis in the BSDM, the user's different gestures and speech commands lead to different avatar responses in later dialog turns. Similar situations happen in the 9th row of Table 9 and the 7th row of Table 10.

The user's "conflict" behavior happens in the 7th row of Table 9, after the avatar said "From ZhiChunLu to QianMen, …" in the previous turn (in this query, "TianAnMen" was recognized as "QianMen" by the ASR because of user B's spoken accent). The mistake was immediately noticed by the user, who smiled and shook his head at the avatar's answer. The conflicting user behavior was detected by the BSDM with a two or three second delay. The avatar then transferred to the "Ask" state and asked the user "Ah, won't you go to QianMen?" to let the user clarify his intention. In this case, the "trial-and-error" method kept the dialog process smooth and gave the user a natural interaction experience.

"Independence" behaviors were detected in most dialog turns. In these turns, the user's behavior can be viewed as independent of the dialog process. A complement behavior happened in the 9th row of Table 8, where "that place" was an obvious linguistic cue for detecting the complement type. In this case, the BSDM searched the slots with the ASR results. The key words "subway transfer" and "that place" led the avatar to answer the user's query by searching the subway route lines, where an undirected connectivity map of bus and subway routes supports the route information query given the departure and destination places.

5.5 Subjective evaluation on BSDM

Subjective evaluations of the user behavior sensitive multi-modal dialog management are given by mean opinion score (MOS) scores, which are collected for three different conversation styles: 1) pure speech conversation with no behavior analysis in the DM; 2) speech conversation with user behavior analysis in the DM, but with the "Forget" state ignored; 3) speech conversation with the proposed full BSDM (Fig. 5).

Table 10 The dialog history between user C and avatar

1 User C: How can I go to YiHeYuan? (Asks with a calm expression)
2 Avatar: Would you like to go to YiHeYuan from ZhiChunLu? (Independence behavior detected, "Ask" state, spreading hands with an interrogative tone)
3 User C: (Shakes head)
4 Avatar: Where would you like to leave for YiHeYuan? (Independence behavior detected, "Ask" state)
5 User C: Beijing Zoo.
6 Avatar: From Beijing Zoo to YiHeYuan, you go to ZhiChunLu at first, then go through Tsinghua University, ... (Independence behavior detected, "Answer" state, with a hand pointing to the route map)
7 User C: Ah, no, I want to go to YiHeYuan from Beijing West Railway Station. (User interrupts the avatar and changes the departure place)
8 Avatar: From Beijing West Railway Station to YiHeYuan, you first go to XiZhiMen, then go through ZhiChunLu, and at last you arrive at YiHeYuan. (Independence behavior detected, "Answer" state)
9 User C: (Smiles with a happy expression)
10 Avatar: There is a thunder shower this afternoon, please be careful on your drive. (Independence behavior detected, turn to "Chat" state)
11 User C: Yes, thank you, you are very nice. (Satisfied with the avatar's answer)
12 Avatar: I am very glad that you are satisfied with my service. (Independence behavior detected, "Chat" state)
13 User C: Ok, see you.
14 Avatar: Bye. (Independence behavior detected, "Forget" state)

As introduced in the previous section, ten participants in total were asked to score the expressivity of the HCI conversation system on a five-level MOS scale. The MOS score range is [0, 5] (the higher the score, the better the user experience). The MOS results are shown in Fig. 10. The average MOS scores of the three sessions are 2.2 (session 1), 2.9 (session 2) and 3.7 (session 3). Pure speech conversation without any behavior analysis in the DM is consistently judged unnatural. Speech conversation with behavior analysis connected to the original DM of Fig. 4 obtains a higher score than the first style. When the STTD model is connected to the full DM, where the "Forget" state is added for system-initiative topic transfer, the HCI dialog shows a natural and realistic conversation procedure and improves the MOS by 0.8 points over conversation style 2. This confirms that user behavior analysis in dialog management with multi-modal history cues improves the naturalness of human-computer conversation.
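As a small worked illustration of how the per-session MOS values are obtained, the snippet below averages ten participants' ratings for each conversation style; the listed scores are placeholders for illustration only, not the actual ratings collected in the experiment.

```python
# Placeholder ratings (0-5) from ten participants for each conversation
# style; these are illustrative values, not the experiment's real data.
ratings = {
    "style 1 (speech only)":            [2, 3, 2, 2, 3, 2, 2, 2, 3, 2],
    "style 2 (behavior, no Forget)":    [3, 3, 2, 3, 3, 3, 3, 2, 3, 3],
    "style 3 (full BSDM)":              [4, 4, 3, 4, 4, 3, 4, 4, 4, 3],
}

for name, scores in ratings.items():
    mos = sum(scores) / len(scores)  # mean opinion score for the style
    print(f"{name}: MOS = {mos:.1f}")
```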

5.6 Discussions

We mainly verify two points in the experiments: (1) the modified high-level facial features are effective for user behavior fusion and classification in the STTD model; (2) the proposed user behavior fusion model (complementation, conflict and independence) and the BSDM are effective for topic management in the dialog process. First, we appended 16 high-level visual features based on the features proposed in [42]. As the two-layer SVM fusion model [37] has been shown to be effective for emotion classification, we adopt single-layer SVMs to test emotion classification on the feature sets V1–V6, V7–V18 and V1–V18.

Fig. 10 MOS scores of users' interaction experience with three kinds of DM

The experiments show that the recognition precision of V7–V18 is similar to that of V1–V6, and combining V1–V6 and V7–V18 improves the recognition precision by about 2.5 % over either V1–V6 or V7–V18 alone on facial emotion polarity classification. The experiments prove that the proposed high-level features are effective for facial emotion detection. V21–V22 are proposed for head gesture estimation; we combine these two features with the prosody and linguistic features at the second stage of the STTD model. With the definition of the three user behavior categories, we constructed a multi-modal user behavior database. As the dialog process mainly considers the user's conflict behavior, and the dialog attitude for the other two categories, we verify the STTD model on the constructed multi-modal database. Since the classification precisions for conflict behavior and for positive and negative emotion reach about 75–85 %, the detected user behaviors can be taken as effective cues for later topic management with the STTD model. The records of three users' dialogs with the traffic 3D talking avatar provide direct evidence for behavior estimation and dialog state transfer. Starting from the same opening sentence, the users were asked to talk freely with the avatar in front of the microphone and camera. With the "trial-and-error" schema for conflict behavior and the estimation of the users' behavior attitude, the BSDM lets the avatar sense the users' behavior and emotion in real time and clarify user intention actively. An additional advantage of the BSDM is that it can reduce the influence of ASR errors in the dialog process (for example the case in the 7th and 8th rows of Table 9, where "TianAnMen" is wrongly recognized as "QianMen" because of the user's accent). With conflict behavior detection, the avatar presents a satisfactory answer in the third turn after clarification. Overall, the proposed BSDM helps to keep the human-computer dialog smooth and improves the user interaction experience in comparison with traditional DMs without user behavior sensitive techniques (see Section 5.4).
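A minimal sketch of this feature-subset comparison is shown below, assuming scikit-learn and random placeholder data in place of the real audio-visual corpus; the column layout (V1–V6 in the first six columns, V7–V18 in the next twelve) is an assumption made only for illustration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Placeholder data: 200 samples with the 18 high-level visual features
# V1-V18 and binary emotion-polarity labels (not the real corpus).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 18))
y = rng.integers(0, 2, size=200)

feature_sets = {
    "V1-V6":  slice(0, 6),    # original high-level features
    "V7-V18": slice(6, 18),   # appended high-level features
    "V1-V18": slice(0, 18),   # combined set
}

for name, cols in feature_sets.items():
    clf = SVC(kernel="rbf", C=1.0, gamma="scale")  # single-layer SVM
    acc = cross_val_score(clf, X[:, cols], y, cv=5).mean()
    print(f"{name}: cross-validated accuracy = {acc:.3f}")
```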

6Conclusion

In this study, we propose a multi-modal dialog management that is sensitive to user behavior. We first define three kinds of user behavior; each behavior type reflects the user's talking attitude in the conversation. User behavior can be detected from high-level facial features, prosody, speech and linguistic cues with the STTD model. The experiments verified that the STTD model benefits users' intention classification and helps the DM decide on topic maintenance or transfer among the "Ask", "Answer", "Chat" and "Forget" states with a mixed-initiative strategy, which contributes to the avatar's flexible response in user interaction. Moreover, with real-time user behavior detection, the moment to enter the "Forget" state can be determined when a new user arrives or the current user says goodbye. In the practical traffic route query system, we embedded the slot-filling technique in the DM to maintain the conversation information. With the user behavior analysis techniques, the DM submits the departure and destination places to the route information system and generates an appropriate response according to the user's positive or negative attitude. This keeps the dialog process sensitive to user behavior and improves the user's interaction experience. Based on the proposed BSDM framework, our future work focuses on the following points. First, beyond the positive and negative emotions in the BSDM, we plan to include a subtler response mechanism for more emotions (such as "happiness", "sadness", "anger" and "surprise") in the DM. Second, as user hand gestures and body movements also contain important interaction cues, we consider constructing a more comprehensive DM by importing body motion and gesture into the BSDM. Finally, since the current talking avatar application targets a traffic route system, the slot-filling technique is appropriate for information query; for other avatar dialog applications, such as virtual museum guides and language learning, it is important future work to apply and evaluate the effectiveness of different techniques in the BSDM, such as graphical models, reinforcement learning and POMDPs.
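As a concrete illustration of the slot-filling step described above, here is a minimal sketch; the place list, the `fill_slots` helper and the commented-out `query_route` back-end call are hypothetical stand-ins for the system's ASR output handling and route information service.

```python
KNOWN_PLACES = ["YiHeYuan", "Beijing Zoo", "TianAnMen", "ZhiChunLu"]  # illustrative

def fill_slots(slots, asr_text, known_places=KNOWN_PLACES):
    """Fill the departure/destination slots from a recognized utterance.
    Simple keyword matching stands in for the real language understanding."""
    text = asr_text.lower()
    for place in known_places:
        pos = text.find(place.lower())
        if pos < 0:
            continue
        before = text[:pos].rstrip()
        if before.endswith("from"):
            slots["departure"] = place
        elif before.endswith("to"):
            slots["destination"] = place
    return slots

slots = fill_slots({"departure": None, "destination": None},
                   "How can I go to YiHeYuan from Beijing Zoo")
if all(slots.values()):
    # route = query_route(slots["departure"], slots["destination"])  # hypothetical back end
    print(slots)  # {'departure': 'Beijing Zoo', 'destination': 'YiHeYuan'}
```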

Acknowledgments This work is supported by the National Natural Science Foundation of China (NSFC) (No. 61273288, No. 61233009, No. 61203258, No. 61530503, No. 61332017, No. 61375027), and the Major Program for the National Social Science Fund of China (13&ZD189).

References

1. Ananova. http://en.wikipedia.org/wiki/Ananova. Accessed 18 Jan 2014
2. Baltrušaitis T, Ramirez GA, Morency L-P (2011) Modeling latent discriminative dynamic of multi-dimensional affective signals. Affect Comput Intell Interact, p 396–406. Springer, Berlin
3. Bell L, Gustafson J (2000) Positive and negative user feedback in a spoken dialogue corpus. In: INTERSPEECH, p 589–592
4. Bianchi-Berthouze N, Meng H (2011) Naturalistic affective expression classification by a multi-stage approach based on hidden Markov models. Affect Comput Intell Interact 3975:378–387
5. Bohus D, Rudnicky A (2005) Sorry, I didn't catch that! - an investigation of non-understanding errors and recovery strategies. In: Proceedings of SIGdial. Lisbon, Portugal
6. Bousmalis K, Zafeiriou S, Morency L-P, Pantic M, Ghahramani Z (2013) Variational hidden conditional random fields with coupled Dirichlet process mixtures. In: European conference on machine learning and principles and practice of knowledge discovery in databases
7. Brustoloni JC (1991) Autonomous agents: characterization and requirements
8. Cerekovic TPA, Igor P (2009) RealActor: character animation and multimodal behavior realization system. In: Intelligent virtual agents, p 486–487

9. Dobrisek S, Gajsek R, Mihelic F, Pavesic N, Struc V (2013) Towards efficient multi-modal emotion recognition. Int J Adv Robot Syst 10:1–10
10. Engwall O, Balter O (2007) Pronunciation feedback from real and virtual language teachers. J Comput Assist Lang Learn 20(3):235–262
11. Goddeau HMD, Poliforni J, Seneff S, Busayapongchait S (1996) A form-based dialogue management for spoken language applications. In: International conference on spoken language processing. Pittsburgh, PA, p 701–704
12. GoogleAPI. www.google.com/intl/en/chrome/demos/speech.heml
13. Heloir A, Kipp M, Gebhard P, Schroeder M (2010) Realizing multimodal behavior: closing the gap between behavior planning and embodied agent presentation. In: Proceedings of the 10th international conference on intelligent virtual agents. Springer
14. Hjalmarsson A, Wik P (2009) Embodied conversational agents in computer assisted language learning. Speech Comm 51(10):1024–1037
15. Jones MJ, Viola PA (2004) Robust real-time face detection. Int J Comput Vis 57(2):137–154
16. Kaiser M, Willmer M, Eyben F, Schuller B (2012) LSTM-modeling of continuous emotions in an audiovisual affect recognition framework. Image Vis Comput 31:153–163
17. Kang Y, Tao J (2005) Features importance analysis for emotion speech classification. In: International conference on affective computing and intelligence interaction - ACII 2005, p 449–457
18. kth. http://www.speech.kth.se/multimodal/. Accessed 18 Jan 2014
19. Lee C, Jung S, Kim K, Lee D, Lee GG (2010) Recent approaches to dialog management for spoken dialog systems. J Comput Sci Eng 4(1):1–22
20. Levin RPE, Eckert W (2000) A stochastic model of human-machine interaction for learning dialog strategies. IEEE Trans Speech Audio Process 8(1):11–23
21. Litman DJ, Tetreault JR (2006) Comparing the utility of state features in spoken dialogue using reinforcement learning. In: North American Chapter of the Association for Computational Linguistics - NAACL. New York City
22. MapAPIBaidu. http://developer.baidu.com/map/webservice.htm
23. McKeown G, Valstar MF, Cowie R, Pantic M (2010) The SEMAINE corpus of emotionally coloured character interactions. In: Proc IEEE Int'l Conf Multimedia and Expo, p 1079–1084
24. mmdagent. http://www.mmdagent.jp/. Accessed 18 Jan 2014
25. nlprFace. http://www.cripac.ia.ac.cn/Databases/databases.html and http://www.idealtest.org/dbDetailForUser.do?id=9. Accessed 18 Jan 2014
26. Pietquin TDO (2006) A probabilistic framework for dialog simulation and optimal strategy. IEEE Trans Audio Speech Lang Process 14(2):589–599
27. Rebillat M, Courgeon M, Katz B, Clavel C, Martin J-C (2010) Life-sized audiovisual spatial social scenes with multiple characters: MARC SMART-I2. In: 5th meeting of the French association for virtual reality
28. Schatzmann KWJ, Stuttle M, Young S (2006) A survey of statistical user simulation techniques for reinforcement-learning of dialogue management strategies. Knowl Eng Rev 21(2):97–126
29. Schels M, Glodek M, Meudt S, Scherer S, Schmidt M, Layher G, Tschechne S, Brosch T, Hrabal D, Walter S, Traue HC, Palm G, Schwenker F, Campbell MR (2013) Multi-modal classifier-fusion for the recognition of emotions. Chapter in: Coverbal synchrony in human-machine interaction. CRC Press, Boca Raton, FL, USA
30. Scherer KR (1999) Appraisal theory. In: Dalgleish T, Power M (eds) Handbook of cognition and emotion. Wiley, Chichester
31. Schwarzlery SMS, Schenk J, Wallhoff F, Rigoll G (2009) Using graphical models for mixed-initiative dialog management systems with realtime policies. In: Annual conference of the International Speech Communication Association - INTERSPEECH, p 260–263
32. Shan S, Niu Z, Chen X (2009) Facial shape localization using probability gradient hints. IEEE Signal Process Lett 16(10):897–900
33. SPTK. http://sp-tk.sourceforge.jp. Accessed 18 Jan 2014
34. Steedman M, Badler N, Achorn B, Bechet T, Douville B, Prevost S, Cassell J, Pelachaud C, Stone M (1994) Animated conversation: rule-based generation of facial expression, gesture and spoken intonation for multiple conversation agents. In: Proceedings of SIGGRAPH, p 73–80
35. Tao J, Pan S, Yang M, Li Y, Mu K, Che J (2011) Utterance independent bimodal emotion recognition in spontaneous communication. EURASIP J Adv Signal Process 11(1):1–11
36. Tao J, Yang M, Mu K, Li Y, Che J (2012) A multimodal approach of generating 3D human-like talking agent. J Multimodal User Interfaces 5(1–2):61–68

37. Tao J, Yang M, Chao L (2013) Combining emotional history through multimodal fusion methods. In: Asia Pacific Signal and Information Processing Association (APSIPA 2013). Kaohsiung, Taiwan, China
38. Tschechne S, Glodek M, Layher G, Schels M, Brosch T, Scherer S, Schwenker F (2011) Multiple classifier systems for the classification of audio-visual emotion states. Affect Comput Intell Interact, p 378–387. Springer, Berlin
39. Van Reidsma D, Welbergen H, Ruttkay ZM, Zwiers Elckerlyc J (2010) A BML realizer for continuous, multimodal interaction with a virtual human. J Multimodal User Interfaces 3(4):271–284
40. Williams JD (2003) A probabilistic model of human/computer dialogue with application to a partially observable Markov decision process
41. Williams JD, Poupart P, Young S (2005) Partially observable Markov decision processes with continuous observations for dialogue management. In: Proceedings of the 6th SIGdial workshop on discourse and dialogue. Lisbon
42. Xin L, Huang L, Zhao L, Tao J (2007) Combining audio and video by dominance in bimodal emotion recognition. In: International conference on affective computing and intelligence interaction - ACII, p 729–730
43. Young S (2006) Using POMDPs for dialog management. In: IEEE workshop on spoken language technology - SLT
44. Zeng Z, Pantic M, Roisman GI, Huang TS (2009) A survey of affect recognition methods: audio, visual, and spontaneous expressions. IEEE Trans Pattern Anal Mach Intell 31(1):39–58

Minghao Yang received his PhD from the Institute of Automation, Chinese Academy of Sciences in 2010, and his MS degree from the Department of Computer Animation Technique, Communication University of China in 2006. He is currently working as an Assistant Professor in the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences. His current research interests include speech generation and visualization, multimodal human-computer interaction and virtual reality. He received the 2013 best paper award from the NCMMSC conference and has published over 20 papers in JMUI, JCAD, ICMI, ICASSP, ACII and other important international journals and conferences.

Jianhua Tao received his PhD from Tsinghua University in 2001 and his MS degree in 1996. He is currently a Professor in NLPR, Institute of Automation, Chinese Academy of Sciences. His current research interests include speech synthesis and coding methods, human-computer interaction, multimedia information processing and pattern recognition. He has published more than eighty papers in major journals and proceedings, including IEEE Trans. on ASLP, and has received several awards from important conferences such as Eurospeech and NCMMSC. He serves as chair or program committee member for several major conferences, including ICPR, ACII, ICMI, ISCSLP and NCMMSC. He also serves as a steering committee member for IEEE Transactions on Affective Computing, associate editor for the Journal on Multimodal User Interfaces and the International Journal of Synthetic Emotions, and Deputy Editor-in-Chief for the Chinese Journal of Phonetics.

Linlin Chao received his BS in electronic information engineering from Jilin University, China, in 2011. Currently, he is working toward the PhD in pattern recognition and intelligent systems at the Institute of Automation, Chinese Academy of Sciences, Beijing, China. His research interests include pattern recognition, machine learning and especially multimodal interaction and affective computing.

Hao Li received his BS in automation from Xidian University, China, in 2010. Currently, he is working toward the PhD degree in pattern recognition and intelligent systems at the Institute of Automation, Chinese Academy of Sciences, Beijing, China. His research interests include pattern recognition, machine learning and especially talking heads, visual speech synthesis and multimodal interaction.

Dawei Zhang received his BS from China University of Mining & Technology in 2012. He is currently pursuing the PhD in the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences. His current research interests include human-computer interaction, multi-modal information fusion and pattern recognition.

Hao Che received his Master's degree in software engineering from the University of Science and Technology of China in 2007. He received his BS in automatic control from Huazhong University of Science & Technology in 2004.

Currently, he is working toward the PhD degree in pattern recognition and intelligent systems at the Institute of Automation, Chinese Academy of Sciences, Beijing, China. His research interests include speech synthesis, speech recognition and especially speech prosody analysis.

Tingli Gao received her MS from the CASS Graduate School in 2012. She is currently an Assistant Engineer in NLPR, Institute of Automation, Chinese Academy of Sciences. Her current research interests include human-computer interaction and natural language processing. She has been involved in many project and system development efforts at NLPR, covering research directions such as speech synthesis, speech recognition and human-computer dialogue systems. She has published papers in the Journal of Chinese Information and participated in academic conferences.

Bin Liu received his BS and MS degrees from Beijing Institute of Technology (BIT), Beijing, in 2007 and 2009, respectively. He is currently pursuing the PhD in the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing. His current research interests include low bit rate speech coding and single channel speech enhancement.