
Multimed Tools Appl DOI 10.1007/s11042-014-2161-5

User behavior fusion in dialog management with multi-modal history cues

Minghao Yang & Jianhua Tao & Linlin Chao & Hao Li & Dawei Zhang & Hao Che & Tingli Gao & Bin Liu

Received: 20 January 2014 / Revised: 26 May 2014 / Accepted: 23 June 2014
© Springer Science+Business Media New York 2014

Abstract Making the talking avatar sensitive to user behaviors enhances the user experience in human computer interaction (HCI). In this study, we combine the user's multi-modal behaviors with their historical information in dialog management (DM) to improve the avatar's sensitivity not only to the user's explicit behavior (speech commands) but also to the user's supporting expressions (emotion, gesture, etc.). In the dialog management, according to the different contributions of facial expression, gesture and head motion to speech comprehension, we divide the user's multi-modal behaviors into three categories: complementation, conflict and independence. The behavior category is first obtained automatically from a short-term and time-dynamic (STTD) fusion model with audio-visual input. Different behavior categories lead to different avatar responses in later dialog turns. Usually, a conflict behavior reflects an ambiguous user intention (for example, the user says "no" while smiling). In this case, a trial-and-error scheme is adopted to eliminate the conversational ambiguity. For the later dialog process, we divide the avatar dialog states into four types: "Ask", "Answer", "Chat" and "Forget". With the detection of complementation and independence behaviors, the user's supporting expressions as well as his (her) explicit behaviors serve as triggers for topic maintenance or transfer among the four dialog states. In the first part of the experiments, we discuss the reliability of the STTD model for user behavior classification. Based on the proposed dialog management and the STTD model, we then construct a drive route information query system by connecting the user behavior sensitive dialog management (BSDM) to a 3D talking avatar. Practical conversation records between the avatar and different users show that the BSDM makes the avatar able to understand and be sensitive to the users' facial expressions, emotional voice and gestures, which improves the user experience of multi-modal human computer conversation.

M. Yang (*) · J. Tao · L. Chao · H. Li · D. Zhang · H. Che · T. Gao · B. Liu
The National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
e-mail: [email protected]
J. Tao, e-mail: [email protected]
L. Chao, e-mail: [email protected]
H. Li, e-mail: [email protected]
D. Zhang, e-mail: [email protected]
H. Che, e-mail: [email protected]
T. Gao, e-mail: [email protected]
B. Liu, e-mail: [email protected]

Keywords Dialog management (DM) · Multi-modal data fusion · Human computer interaction (HCI) · Emotion detection

1 Introduction

Natural human-like talking avatars, as an exciting modality for human-computer interaction (HCI), have been studied intensively for the last 10 years. Since the pioneering work of the web woman newscaster Ananova [1], remarkable progress has been achieved in human-like talking avatars [8, 13, 18, 19, 24, 27, 34, 39]. With the improvement of speech recognition and natural language processing, topic-centric human-computer interactions have made great progress and been widely applied in call center services, web sales services, ticket and hotel booking [7, 10, 14], etc. However, in spite of the rapid development of user behavior analysis techniques, it is still a big challenge to make virtual avatars converse and respond like a real person in human-computer interactions. One important cause is that the avatar's responses to user behaviors, especially to the users' supporting expressions (emotion, gesture, etc.), do not meet their expectations. It is therefore important to make the avatar responsive to user behaviors with a flexible feedback mechanism in the DM, especially in face to face human-computer conversation applications.

Traditional DM techniques can be divided into two types according to their computational method: rule based DM and probability statistic DM. The former designs all the dialog paths in advance, with a knowledge database and possible response rules behind the dialog. The rules can be planned with slot-filling [11, 40, 41] or finite-state machine (FSM) [20] techniques. These DMs can accurately answer users' queries in applications such as telephone shopping, airline ticket and hotel booking. Probability statistic DMs were developed to construct the dialog workflow based on statistics and machine learning, such as Bayesian networks [26], graphical models [31], reinforcement learning [28] and partially observable Markov decision processes (POMDP) [43]. A probability statistic DM generates the optimal response by searching the time-variant state spaces of the interaction environment. This helps the avatar generate satisfactory responses in specific domain applications, such as virtual museum guides and virtual tours on web services [20, 21, 41]. However, whether rule based or probability statistic, most traditional DMs ignore behavior sensitive mechanisms, because speech is the main interaction modality in spoken dialog.

Recently, researchers and engineers have considered emotion recognition and behavior comprehension in DM. Some DMs produce feedback sentences at the user's emotion peak points or when obvious emotion changes are detected. However, this point-wise feedback mechanism is essentially incompatible with the user's continuous interaction in the dialog process. A small number of discrete emotion and behavior events on a single modality may not reflect the subtlety and complexity of the behaviors conveyed by rich multi-modal information. It leads to superficial avatar responses to a single user modality, against the user's spontaneous multi-modal interaction intention.

In the cognitive sciences, researchers argue that the dynamics of multi-modal human behavior are crucial for its interpretation [30]. A number of recent studies [2, 4, 16, 29, 38] demonstrate this point of view. These works imply that combining historical audio-visual information can improve user behavior comprehension. However, it is still a big challenge to construct a user behavior sensitive talking avatar. The challenge mainly comes from the difficulty of representing, organizing and interpreting the multi-modal behavior information in the conversation process. In spite of recent developments in human emotion and behavior comprehension, the existing methods for dealing with emotion, behavior, gesture and speech information are in essence psycholinguistic and discourse-analytic methods applied in a strictly formal manner. An effective multi-modal behavior fusion model and a flexible behavior sensitive DM are necessary for practical human computer dialog systems.

In this study, we adopt the following two ways to improve the avatar's communication performance as well as its sensitivity to user behavior: (1) instead of detecting the user's emotion peak points, we record the user's historical behaviors and obtain a behavior interpretation with a short-term and time-dynamic (STTD) fusion model before the topic moves to the next phase; (2) according to the different contributions of facial expression, gesture and head motion to automatic speech recognition (ASR), we divide the multi-modal behaviors into three categories: complementation, conflict and independence. Different user behavior categories lead to different avatar responses. As the STTD model can easily be combined with traditional DM techniques, the combined DM provides a flexible, user behavior sensitive topic management framework for multi-modal human computer conversation. Finally, we construct a drive route information query system by connecting the user behavior sensitive DM (BSDM) to a 3D talking avatar. The records of the dialog history between the talking avatar and different users show that the proposed DM makes the avatar able to understand and be sensitive to the users' facial expressions, emotional voice and gestures, which improves the user experience of multi-modal human computer conversation.

The rest of this paper is organized as follows: an overview of our DM framework, including its structure, data flow and the function of each module, is given in section 2; how the STTD model is constructed and how its features are selected is described in detail in section 3; section 4 introduces how the user's multi-modal behaviors contribute to dialog topic maintenance or transfer according to the user behavior classification results obtained from the STTD model; the experiments on the BSDM used in a traffic route query system and the discussions are given in section 5; and section 6 concludes this paper.

2 System structure

The structure of the proposed user behavior sensitive dialog management (BSDM) is presented in Fig. 1. It consists of three components: the input module, the dialog management (the green rectangle) and the output module. Every module in our DM receives and deals with multi-modal information, including speech, facial expression, head motion, prosody and gesture information. According to the contributions of the different modalities to speech comprehension, user behaviors fall into two main categories in our dialog management. The first one is "explicit behavior", which expresses the user's obvious positive or negative interaction intention, such as speech commands ("yes" or "no"), head nodding or shaking, "ok" or "victory" gestures, etc. The other category is "supporting expression". A supporting expression is the user's nonverbal behavior, which usually helps to enhance the user's expression while speaking (such as facial expression, prosody and body movement, etc.).

Fig. 1 User behavior sensitive dialog management (BSDM)

User multi-modal behavior features are captured and extracted from the camera and microphone, and are fused in the front part of the DM module with the STTD model. For the user's explicit behaviors, we adopt a key word list to detect whether the user's verbal description conflicts with his (her) other behavior modalities. Once a conflict behavior happens, the DM asks the user to clarify his (her) intention with the trial-and-error scheme. Besides the explicit behaviors, the user's supporting expressions help to enhance the user's expression in the dialog. To make the talking avatar sensitive to user supporting expressions in human computer interaction, we further define two other user behavior categories (complementary and independence) to describe the relationship among the user's supporting expressions. According to the user's behavior type (complementary or independence), the DM then determines whether the conversation topic is maintained or transferred. The detailed definitions of the user behavior categories are given in section 4.1. As the user's conflicting intention has been clarified with the trial-and-error scheme in the front part of the DM, in the following dialog phase we divide the avatar dialog states into four types: "Ask", "Answer", "Chat" and "Forget" for users' complementation and independence behaviors (section 4.3). With the help of topic management techniques, the user's supporting expressions as well as his (her) explicit behaviors serve as triggers for topic maintenance or transfer among these four dialog states. Unlike traditional DM, based on the STTD model, all user behaviors, including emotion, gesture and body movement, are considered as important as speech information for the talking avatar. A variety of traditional DM techniques (such as slot-filling, finite-state machine (FSM), Bayesian networks and POMDP, etc.) can be effectively embedded in our proposed DM to manage the topic transfer. In section 5.3, we adopt the slot-filling technique in our DM and construct a practical traffic route query system. The experiments show that our DM is effectively sensitive to user behavior in human computer conversation.
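To make this data flow concrete, the following Python sketch shows one way the front part of the BSDM could route the fused behavior decision to the dialog manager. The class and function names (the STTD model interface, the dialog manager interface, BehaviorCategory) are hypothetical illustrations, not the authors' implementation.

```python
from enum import Enum

class BehaviorCategory(Enum):
    COMPLEMENTATION = "complementation"
    CONFLICT = "conflict"
    INDEPENDENCE = "independence"

def dm_front_end(sttd_model, dialog_manager, audio_frames, video_frames, asr_text):
    """Hypothetical front end of the BSDM data flow sketched in Fig. 1.

    The STTD model fuses the multi-modal cues into one of the three
    behavior categories; the dialog manager then decides how to respond.
    """
    category = sttd_model.classify(audio_frames, video_frames, asr_text)

    if category == BehaviorCategory.CONFLICT:
        # Ambiguous intention: ask the user to clarify (trial-and-error scheme).
        return dialog_manager.ask_clarification(asr_text)

    # Complementation and independence behaviors drive topic maintenance
    # or transfer among the four dialog states ("Ask", "Answer", "Chat", "Forget").
    return dialog_manager.step(asr_text, category)
```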

3 User behavior estimation

In terms of multi-modal behavior recognition, the Long Short-Term Memory (LSTM) method has proved to be a successful model for continuous audio-visual affect recognition [16, 23].

It adopts recurrent neural network architectures which take past observations into account through cyclic connections in the network's hidden layer. Different from the original recurrent neural network, besides the "input" and "output" gates, the LSTM model introduces a "forget" gate to allow the memory cells to reset themselves when the network needs to forget past inputs. In spite of its high performance in continuous emotion classification, LSTM calculates a total of 2,094,210 weights with 128 memory blocks over 30 to 60 epochs. In order to provide effective human-computer interaction, the system's computational complexity needs to be reduced when the dynamic features are of high quality per unit time [44]. On this point, the original LSTM meets challenges in practical applications. Moreover, the influences of gesture, head motion and ASR results on user emotion, and how the "forget" gate is triggered, were not discussed for LSTM. There are other effective multi-modal emotion recognition methods, such as MHMM (mixed hidden Markov model) [35], HCRF (hidden conditional random field) [9] and UBM-MAP (universal background model and maximum a posteriori probability) [6], etc. However, these methods mainly discuss facial and audio expression classification, and they do not consider how gesture, linguistic and non-linguistic features are combined with audio-visual features for behavior estimation.

Our DM determines the state transfer time by considering whether the emotional tendency is consistent with the topic direction. Bell and Gustafson [3] propose that a user's positive attitudes contribute to the dialog going into deeper turns, while negative attitudes easily cause a topic transfer or lead to dialog breakdown. Inspired by the work proposed in [37], we develop a short-term and time-dynamic (STTD) fusion model to obtain the user's conversation attitude by combining his (her) multi-modal behaviors. The attitudes are further recorded and used to determine the topic transfer and the "forget" time in later topic management. When the user's attitudes go against the topic direction or the user has not been interested in the current topic for a while, the "forget" or "chat" state is triggered. At this time, the dialog history is reset and the topic transfers to a new one (see section 4.3 for details).

3.1 Short-term and time-dynamic (STTD) fusion model

The fusion model proposed in [37] is constructed from two stages of classifiers (Fig. 2(a)). An SVM is utilized as the first stage classifier to determine the current epoch's emotion state. At the second stage, the authors take Bayesian inference or SVM to combine the results from the first stage for the final emotion classification. Tao et al. [37] discussed and compared the recognition results with those of [16] on the audio-visual sub-challenge of the AVEC-2012 database. They claimed that the LSTM model obtains the best classification results on the emotion valence dimension, SVM-Bayesian obtains the best results on the power dimension, while SVM-SVM gets the best results on arousal and expectancy.

In LSTM and in the model proposed in [37], the short-term features are combined into long-term features by a recurrent neural network or a two-layer classifier (Fig. 2(a)). These features are low-level features (local binary patterns for facial expression and MFCC for audio, with shift windows). However, as users' time-dynamic behaviors, such as head motion, speech prosody and verbal description, contribute to topic maintenance and transfer in the dialog process as well as to short-term emotion change, we modified the fusion model of [37] by importing time-dynamic behaviors at the second stage classifier on top of the original two-layer classifier model. Figure 2(b) presents the structure of the STTD fusion model.

In Fig. 2(a), ft denotes the short-term features at frame t, which contain 5,999 facial features (local binary patterns (LBP)) and 1,941 audio features per frame [16, 37]. dt and ot are the first stage classifier results and the second stage results respectively. In Fig. 2(b), we adopt high-level visual facial features instead of the numerous low-level visual features of Fig. 2(a) [16, 37].

Fig. 2 a The original two layer classifier model proposed in [37]; b The proposed short-term and time-dynamic (STTD) fusion model (time-dynamic features are marked in gray circles at the second stage classifier)

On the topic of high-level facial feature selection, [17] proved that the facial point distribution model (PDM) is an effective description method for emotional classification. Xin et al. [42] further adopted CART and boosting algorithms to pick out the dominant high-level features in audio-visual emotion recognition. In this work, we hope to obtain the following two advantages, as well as to reduce the computation and storage cost of the STTD model, with high-level features: (1) high-level features provide an effective explanation and classification of the user's time-dynamic behaviors, such as agreement (nodding), refusal (shaking the head) and leaving (turning the body to a different direction). In these cases, besides the facial emotion estimation, we adopt the shift of facial point positions between frames as high-level features for gesture estimation; (2) the first stage classification results (d′t) obtained from the epoch features (f′t) can be naturally combined with the time-dynamic features (the mean value and variance of the points' shift distance along the x or y coordinate, ASR results, etc.) at the second stage for the final emotion classification. These time-dynamic features are labeled pt and qt in Fig. 2(b). Here, pt can be the facial point shift between frames or the speech prosody features presented as MFCC; qt is linguistic information, head motion or ASR results which contain obvious user behavior over a period of time.
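The sketch below illustrates, under our reading of Fig. 2(b), how the per-frame short-term features and the time-dynamic features could be routed through the two classification stages. The array layout and the classifier interface are illustrative assumptions, not the paper's implementation; the actual classifiers are described in section 3.5.

```python
import numpy as np

def sttd_predict(stage1_clf, stage2_clf, frame_features, p_feats, q_feats):
    """Two-stage STTD fusion over one fusion window of n frames.

    frame_features: (n, d) short-term features f't per frame (V1-V18 plus frame MFCCs).
    p_feats:        (n, k) frame-level time-dynamic features pt (e.g. V19-V20, prosody).
    q_feats:        (m,)   window-level time-dynamic features qt (e.g. V21-V22, A40-A51, keywords).
    """
    # Stage 1: one decision d't per frame from the short-term features.
    d = stage1_clf.predict(frame_features)              # shape (n,)

    # Stage 2: concatenate the frame decisions, the frame-level dynamics and
    # the window-level dynamics into a single vector describing the window.
    fused = np.concatenate([d.astype(float), p_feats.reshape(-1), q_feats])
    return stage2_clf.predict(fused.reshape(1, -1))[0]  # 0/1 behavior decision
```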

3.2 Visual features

Traditional tracking algorithms, such as the Active Shape Model (ASM), the Active Appearance Model (AAM) and the Viola–Jones detector [15], track feature points with low-level image characteristics in a local range, and the tracking results are often noisy due to varying illumination or partial occlusion. High-level constraint models have been proved effective for emotional classification [17, 42]. Inspired by these studies, we construct a high-level visual feature set based on the facial tracking algorithm of [32]. A total of 103 feature points are obtained on the face with this method (Fig. 3(a)). Dobrisek et al. [9] suggest that the points close to the eyes, mouth and eyebrows have a dominant influence on visual emotion recognition. With this assumption, we select 29 points (Fig. 3(b)) and construct 22 high-level features as the short-term and time-dynamic visual features (Table 1).

In Table 1, the definition of feature Vi (1≤i≤22) is given in the second column, where α, β, θ, γ are feature angles around the eyes and eyebrows; Xi and Yi are the x and y coordinates of point pi (Fig. 3(b)); |Pi,Pj| is the distance between point pi and point pj; W and H are the lengths |X1−X7| and max(|Y12−Y4|, |Y9−Y4|), which represent the face width and height. The "Type" column gives the feature categories: the features marked "ST_F" are short-term features input at the first stage (the features f′t in Fig. 2(b)), and the items marked "TD_P" and "TD_Q" are time-dynamic features input at the second stage (pt and qt in Fig. 2(b)).

Fig. 3 a One hundred three face tracking points; b The selected 29 facial points used in STTD model

Table 1 The visual features in STTD model

Id Definition Type

V1 (θ+γ)/2π ST_F

V2 (α+β)/π ST_F

V3 (|P11,P16|+|P10,P18|)/H ST_F

V4 |P24,P26|/(|P24,P26|+|P27,P25|) ST_F

V5 |2*Y27−(Y22+Y23)|/H ST_F

V6 min(|Y27−Y25|, |Y26−Y24|)/max(|Y27−Y25|, |Y26−Y24|) ST_F

V7 |Y21−Y19|/H ST_F

V8 |Y17−Y15|/H ST_F

V9 |X18−X16|/W ST_F

V10 |Y13−Y14|/H ST_F

V11 |Y8−Y20|/H ST_F

V12 |Y14−Y24|/H ST_F

V13 |Y20−Y26|/H ST_F

V14 |Y29−Y28|/H ST_F

V15 |Y26−Y5|/H ST_F

V16 |Y24−Y9|/H ST_F

V17 min(|Y14−Y16|, |Y17−Y15|)/max(|Y14−Y16|, |Y17−Y15|) ST_F

V18 min(|Y18−Y20|, |Y21−Y19|)/max(|Y18−Y20|, |Y21−Y19|) ST_F

V19 (∑_{j=1}^{29} X_t^j − ∑_{j=1}^{29} X_{t−1}^j)/(29×W) TD_P

V20 (∑_{j=1}^{29} Y_t^j − ∑_{j=1}^{29} Y_{t−1}^j)/(29×H) TD_P

V21 ∑_{i=t−n}^{t} |X_i − X̄|/(n×W) TD_Q

V22 ∑_{i=t−n}^{t} |Y_i − Ȳ|/(n×H) TD_Q

The items V1–V6 have been adopted in [42] as visual features for emotion recognition. Based on these features, we add V7–V18 as further short-term features at frame t. These 12 features record the distances between the feature points around the mouth, eyes and eyebrows, and are input to the first stage classifier of the STTD model together with V1–V6 as f′t. The items V19–V20 represent the mean shift of all points between the current frame t and its previous frame along the x and y coordinates respectively. These two features record the head motion between frames, and they are input to the second stage classifier as pt. V21–V22 are time-dynamic features (input as qt). They represent the variance of the point shifts along the x and y coordinates between frame t−n and frame t, where n is the size of the fusion window. The features V1–V20 are calculated frame by frame, and V21–V22 are calculated only at the end of utterances. In a practical dialog system, V21–V22 can be calculated synchronously with speech recognition. For frames which are not utterance ending points, the values of V21–V22 can be set to small random values.
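As a concrete illustration, the following sketch computes the time-dynamic visual features V19–V22 from the tracked facial points, following the reading of Table 1 given above (mean point shift between consecutive frames for V19–V20, and the spread of point positions over the fusion window, described as the shift variance in the text, for V21–V22). The array layout and the interpretation of Xi as the per-frame mean x position are assumptions for illustration.

```python
import numpy as np

def head_motion_features(points_t, points_prev, face_w, face_h):
    """V19-V20: mean shift of the 29 tracked points between consecutive frames.

    points_t, points_prev: (29, 2) arrays of (x, y) coordinates.
    face_w, face_h: face width |X1-X7| and height max(|Y12-Y4|, |Y9-Y4|).
    """
    v19 = (points_t[:, 0].sum() - points_prev[:, 0].sum()) / (29.0 * face_w)
    v20 = (points_t[:, 1].sum() - points_prev[:, 1].sum()) / (29.0 * face_h)
    return v19, v20

def window_shift_features(window_points, face_w, face_h):
    """V21-V22: spread of the point positions over the fusion window (frames t-n .. t).

    window_points: (n+1, 29, 2) array of tracked points over the window.
    """
    x = window_points[:, :, 0].mean(axis=1)   # per-frame mean x position
    y = window_points[:, :, 1].mean(axis=1)   # per-frame mean y position
    n = len(window_points) - 1
    v21 = np.abs(x - x.mean()).sum() / (n * face_w)
    v22 = np.abs(y - y.mean()).sum() / (n * face_h)
    return v21, v22
```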

3.3 Audio features

For the audio features, similar to [16], we extract 13 Mel-Frequency Cepstral Coefficients (MFCC) together with their first and second order temporal derivatives every 25 ms with a 10 ms shift window. We denote these 39 features A1–A39. Combined with the visual short-term features V1–V18, they form a 57-dimension feature vector that is input into the first stage classifier. At the second stage classifier, the following features have been confirmed to be useful for emotional speech classification: utterance duration, F0 range, the maximum of F0, the minimum of F0, the mean of F0, the mean of energy, and the mean of durations [17]. Besides these parameters, we added some audio parameters in this work: the position of the maximum F0, the position of the minimum F0, the position of the duration peak, and the position of the minimum duration. Parameters describing laryngeal characteristics of voice quality were also taken into account. Kang and Tao [17] analyzed the importance of all these audio features for emotional speech. The results showed that the mean of F0, the maximum of F0, the F0 range, the mean of energy, the mean of durations and the position of the minimum F0 have much better "resolving power" for emotion perception than the other parameters. These features (denoted A40–A51) are selected as the time-dynamic audio features qt, and are input at the second stage classifier of the STTD model together with the visual items V21–V22.
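A minimal sketch of the frame-level audio feature extraction described above (13 MFCCs plus first and second order deltas on 25 ms windows with a 10 ms shift). librosa is used here as an assumed stand-in toolchain; the paper's experiments used the SPTK toolkit, and the utterance-level prosodic statistics A40–A51 are not shown.

```python
import numpy as np
import librosa

def frame_audio_features(wav_path, sr=16000):
    """A1-A39: 13 MFCCs with first and second order deltas, 25 ms window, 10 ms shift."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=int(0.025 * sr),       # 25 ms analysis window
                                hop_length=int(0.010 * sr))  # 10 ms shift
    d1 = librosa.feature.delta(mfcc)            # first order temporal derivative
    d2 = librosa.feature.delta(mfcc, order=2)   # second order temporal derivative
    return np.vstack([mfcc, d1, d2]).T          # shape: (num_frames, 39)
```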

3.4 Linguistic and non-linguistic features

In the conversation process, non-linguistic vocalizations (such as laughing, breathing and sighing) can be modeled by MFCC parameters [16], which coincide with the audio features A1–A39. For the linguistic features, key words are detected in every audio chunk in our DM. According to the contextual meaning of different key words for user behaviors, we list some typical key words for the three user behavior categories. Once a key word is detected in an audio chunk, the value of the corresponding time-dynamic linguistic feature qt is obtained. Different kinds of key words are assigned different values. These values are combined with the time-dynamic visual features (V21–V22) and audio features (A40–A51) at the second stage of the STTD model. For details, please refer to section 5.2.

3.5 Features fusion

Once all the features are obtained, they are input into the STTD model according to their categories and stages (Fig. 2(b)). At the first stage, an SVM is used to make a decision about the state of every single frame. Next, these isolated decision values are combined with the time-dynamic features and input into the second classifier to determine the user behavior from frame t−n to frame t.

The basic SVM method is extended to a non-linear classifier by using the kernel concept, where every dot product in the basic SVM formalism is replaced with a non-linear kernel function. A maximum margin classifier (hyper-plane) is chosen to maximize the separability of the two classes. With all the features having a clear physical meaning in terms of emotion, gesture and semantic expression, the data are easily labeled and can be effectively trained in the STTD model. In the STTD model, as the time-dynamic features qt (V21–V22 and A40–A51) are normalized to the range (0, 1), we label the output values of the first stage SVM (dt) as "0" or "1". Similarly, the output values of the second stage SVM (ot) are labeled "0" or "1". At the front part of the DM, this two-class classification scheme helps decide whether the user behavior is "conflict" or not, and it further helps the DM decide whether the user behavior is "complementary" or "independent" in the later phase of the DM. How the users' multi-modal behaviors are recorded, labeled and classified with the STTD model is described in detail in section 5.2.
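A minimal training sketch of the two-stage fusion described above, using scikit-learn SVMs as an assumed implementation (the paper does not name its SVM toolkit). Stage one is trained on frame-level features; its per-frame decisions over each fixed-size fusion window are then concatenated with the time-dynamic features to train the stage-two classifier.

```python
import numpy as np
from sklearn.svm import SVC

def train_sttd(frame_X, frame_y, windows, window_y):
    """frame_X: (N, d) short-term features; frame_y: (N,) 0/1 frame labels.
    windows:  list of (frame_indices, time_dynamic_vector) pairs, one per fusion
              window of fixed size (e.g. 22 frames, as in section 5.2).
    window_y: (M,) 0/1 behavior label per window.
    """
    stage1 = SVC(kernel="rbf").fit(frame_X, frame_y)

    # Stage 2 input: per-frame decisions d_t concatenated with the window's
    # time-dynamic features (p_t / q_t), one fused vector per window.
    fused = np.array([np.concatenate([stage1.predict(frame_X[idx]).astype(float), td])
                      for idx, td in windows])
    stage2 = SVC(kernel="rbf").fit(fused, window_y)
    return stage1, stage2
```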

4 Dialog management

An appropriate response mechanism helps the talking avatar to be sensitive to user interaction behavior. In this section, we present the detailed definitions of the three user behavior categories (complementation, conflict and independence) and the structure of the BSDM.

4.1 User behavior model

All user behaviors, not only the users' query information but also their emotions and gestures, need to be considered synchronously in a dialog system. As introduced before, according to the different contributions of prosody, facial expression and head motion to speech comprehension, we divide the multi-modal user behaviors into three categories: complementation, conflict and independence.

1) Conflict. In this category, the emotion recognition results conflict with those of ASR or gesture, for example: the user smiles while saying "no" or shaking his head. Such confusing behavior can happen when a person wants to refuse others politely. These ambiguous multi-modal behaviors easily lead to inexact behavior understanding by the computer. Table 2 lists several conflict behaviors which might happen in daily life. Once these situations are detected, the DM first adopts the trial-and-error scheme [5] to let the user clarify his (her) intention before the next dialog turn.

2) Complementary. In this category, the user's emotion and behavior usually support what he (she) is saying. For example: the user says "I want to go there" while pointing to a place; or, when the user does not understand the avatar's responses, he (she) asks the question again with an interrogative expression. In these situations, the DM needs to combine the behavior information of two or more channels to understand the user's intention.

3) Independent. In this category, the result of emotion recognition is independent of ASR. For example: the user smiles while listening. In this situation, the user's emotion and behaviors are mostly personal speech habits, totally independent of his (her) verbal description.

With the definitions of the three kinds of user behaviors, the user's conflicting behavior can first be clarified with the trial-and-error scheme. For the complementary and independence behaviors, the computer determines how to respond to the user with the help of the DM.

Table 2 Conflict user behaviors that might happen in daily life

Number | Visual-audio features | ASR | Head motion/gesture | Non-linguistic features

1 | Positive (happiness, smile…) | – | Shaking (head, hands) | –
2 | Positive (happiness, smile…) | "No"/"Will not (Won't)"/"Impossible"/"Don't"… | – | –
3 | Positive (happiness, smile…) | "No"/"Will not (Won't)"/"Impossible"/"Don't"… | Shaking (head, hands) | Laughing/Sighing
4 | Positive (happiness, smile…) | "Ok"/"Right"/"Sure"… | Shaking (head, hands) | –
5 | Surprise, anger | "Ok"/"Right"/"Sure"… | – | Interrogative/Rhetorical tone
6 | Negative (sadness, sorrow…) | "Yes"/"Know"/"Ok"/"Right"… | – | Laughing/Sighing
7 | Negative (sadness, sorrow…) | – | Nod | Laughing/Sighing
8 | Negative (sadness, sorrow…) | "Yes"/"Know"/"Ok"/"Right"… | Nod | Laughing/Sighing
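As an illustration of how the conflict cases in Table 2 could be operationalized, the sketch below checks whether the detected audio-visual emotion polarity contradicts the polarity implied by the ASR key words or the head gesture. The keyword sets and gesture labels are abbreviated, illustrative assumptions.

```python
POSITIVE_WORDS = {"ok", "right", "sure", "yes", "know"}   # illustrative keyword lists
NEGATIVE_WORDS = {"no", "won't", "impossible", "don't"}

def is_conflict(emotion_polarity, asr_text, head_gesture):
    """emotion_polarity: 'positive' or 'negative' from the STTD model.
    asr_text: recognized utterance (may be empty).
    head_gesture: 'nod', 'shake' or None.
    """
    words = set(asr_text.lower().split())
    speech_polarity = ("positive" if words & POSITIVE_WORDS
                       else "negative" if words & NEGATIVE_WORDS else None)
    gesture_polarity = {"nod": "positive", "shake": "negative"}.get(head_gesture)

    # Conflict: the audio-visual emotion contradicts the speech or gesture polarity,
    # e.g. the user smiles while saying "no" or shaking the head (Table 2, rows 1-4).
    return any(other is not None and other != emotion_polarity
               for other in (speech_polarity, gesture_polarity))
```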

4.2 Dialog management

Whether rule based or probability statistic based, one core function of a DM is to decide the appropriate time to respond to the user and to keep the conversation smooth. In the daily dialog process, most responses can be classified into three styles: "ask", "answer" and "chat". Answering can be viewed as continuing the dialog on the current topic; related techniques such as slot-filling [11, 40, 41], finite-state machines (FSM) [20] and POMDP [43] can be used to generate the answer. "Ask" and "chat" happen in both the topic maintenance and topic transfer phases according to the users' different attitudes. Based on this assumption, we propose a dialog management prototype with three dialog states: "Ask", "Answer" and "Chat" (Fig. 4). As introduced before, the "trial-and-error" strategy [5] is effective for confirming user conflict behaviors before the topic goes to the next turn. These questions are usually started by the DM, which belongs to the system-initiative dialog strategy. The "trial-and-error" time can be determined by the STTD model when user conflict behaviors are detected. At any time in the dialog process, once the user submits a question, the system tries to answer the user's query; this case belongs to user-initiative dialog. If the system cannot find an answer, it asks the user to offer more material for response generation; this case is the system-initiative dialog strategy.

Fig. 4 The three-state DM prototype

In the dialog process, when a user complementation or independence behavior is detected, the system generates a response according to the state transfer strategy in the DM (Figs. 4 and 5). For example, in an airline ticket booking application constructed with the slot-filling technique, supposing the slots are departure city, departure time, destination city and destination time, when the user submits a question and there are empty slots, the system goes to the "Ask" state and asks the user to fill the empty slots. Once all the slots are filled, the system transfers into the "Answer" state. If we use POMDP techniques in the ticket booking application, the system "answers" the user's question based on the maximal reward criterion. If the expected reward is not obtained, the system "asks" the user to provide more information and searches the state space [43]. Once a query is finished and there is no new query from the user, the system transfers to the "Chat" state and presents some news about the departure or destination city. Whether the slot-filling or the POMDP technique is in use, this DM prototype belongs to the mixed-initiative dialog strategy, where both the user and the system can start a topic freely.
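The sketch below illustrates the three-state prototype of Fig. 4 with a slot-filling back end for the booking/route example above. The slot names and the class interface are illustrative assumptions, not the system's actual implementation.

```python
SLOTS = ("departure_city", "departure_time", "destination_city", "destination_time")

class SlotFillingDM:
    """Three-state prototype (Fig. 4): Ask / Answer / Chat, driven by slot filling."""

    def __init__(self):
        self.slots = dict.fromkeys(SLOTS)   # all slots start empty
        self.state = "Chat"

    def step(self, parsed_slots):
        """parsed_slots: slot values extracted from the latest user utterance."""
        self.slots.update({k: v for k, v in parsed_slots.items() if v})
        missing = [k for k, v in self.slots.items() if v is None]
        if missing:
            self.state = "Ask"              # keep asking until every slot is filled
            return f"Please tell me your {missing[0].replace('_', ' ')}."
        self.state = "Answer"               # all slots filled: answer the query
        return (f"Searching routes from {self.slots['departure_city']} "
                f"to {self.slots['destination_city']}.")
```

For instance, calling step({"destination_city": "YiHeYuan"}) on a fresh instance would move the prototype to the "Ask" state and prompt for the departure city.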

4.3 Behavior sensitive DM (BSDM)

We continue to construct a user behavior sensitive DM by connecting the STTD model (section 3.1) to the DM prototype proposed in Fig. 4. The structure of the full BSDM is shown in Fig. 5. In Fig. 5, the left part in the yellow rectangle is the STTD model, and the right part in the light blue rectangle is the DM. The two parts are connected by blue lines, which represent the data flow between the two modules. If the user behaviors are conflicting, the system adopts the "trial-and-error" scheme to ask the user to clarify his (her) intention again (line A and line B in Fig. 5). With the "trial-and-error" scheme, the ambiguous behaviors are detected and eliminated before the topic continues. This method gives the system the ability to "feel" the user's ambiguous emotion states.

Like the process flow for the conflict type, the response to a user's complementary behavior may also go to the "Ask" state. The difference is that the complementary flow considers not only the ASR results but also the user's supporting expressions and body movements, while the response to a conflict behavior only enters the "Ask" state to let the user clarify his (her) intention, with sentences returned to the user directly and no state transfer in the DM.

When the user is talking, his (her) gestures and motions are mostly personal speech habits without special meaning. At these moments, user behaviors are often independent of the speech. If there is no distinct emotional expression or drastic gesture change while talking, the user behaviors can be viewed as totally independent of the ASR results. Therefore, for user independence behaviors, ASR has a more dominant influence on topic transfer than the other channels. In these cases, the dialog processes are similar to those of the DM prototype presented in section 4.2.

Fig. 5 The complete structure of the user behavior sensitive DM (BSDM)

After the STTD model is connected with the DM, we call this new DM the behavior sensitive DM (BSDM). The BSDM has the following advantages in comparison with traditional DMs:

(1) In the BSDM, user behavior and emotion changes contribute to dialog state transfer as well as the speech content, which improves the user experience of human computer conversation. Table 3 lists the events which trigger dialog transfers among the "Ask", "Answer", "Chat" and "Forget" states in the BSDM (the working mechanism of the "Forget" state is introduced in the next paragraph); a sketch encoding a subset of these triggers as a transition map is given after this list. The emotion and behavior events labeled in bold in Table 3 are the new triggers which do not exist in the original DM prototype presented in Fig. 4. For example, in the original DM, the dialog state goes from "Ask" to "Answer" only when all slots are filled or the reward is beyond a certain threshold, while in the BSDM, "Positive Emotion or Gesture" and "Independent Behaviors without Speaking" may lead to the topic transfer as well. For example, when the system asks the user questions with the "trial-and-error" scheme, a "Positive Emotion or Gesture" event (such as the user's head nodding, or an "ok" or "victory" hand gesture) makes the user's intention clear and the system will transfer from the "Ask" state to the "Answer" state. It is worth noting that "Independent Behaviors without Speaking" can trigger a system transfer from the "Answer" state to the "Ask" state. For example, when the system answers a question but the user presents an "unknown" or "incomprehension" expression without speaking for a while, the system will turn to the "Ask" state by clearing a recent slot or trimming a search path from the POMDP state space. The dialog process will begin a new turn by recalling the dialog branch and prompting the user to input again.

Table 3 The state transfer triggers in user Behavior Sensitive DM

Num | Triggers | Possible state transfer

1 | • Slots Complete (or Maximal Gain Found) • Positive Emotion or Gesture | Answer-Answer, Ask-Answer
2 | • Slots Empty (Insufficient Reward) • Negative Emotion or Negative Gesture • Independent Behaviors without Speaking | Answer-Ask
3 | • Slots Empty (Insufficient Reward) • Negative Emotion or Gesture • Complementary Behaviors | Ask-Ask
4 | • Answer Complete • Obvious Emotion Polarity Change (Negative–Positive) • Independent Behaviors without Speaking | Answer-Chat
5 | • New Questions | Chat-Answer
6 | • Questions From User • Topic Transfer | Chat-Chat
7 | • Obvious Emotion Polarity Change (Positive–Negative) | Ask-Chat
8 | • New Questions From User • Obvious Emotion Polarity Change | Chat-Ask
9 | • User's Going Away • User's Long Period Negative Emotion | Chat-Forget
10 | • Clear All Slots (Reset) | Forget-Ask

(2) The BSDM is able to "forget" dialog content at an appropriate time. This new "Forget" state makes the system more flexible with different users. As listed in the 9th and 10th rows of Table 3, when the system detects that there is no user in the view field for a period of time, or the user presents a long period of negative expression in front of the camera, the system starts a new dialog process by transferring to the "Forget" state and resetting the dialog history.

(3) The BSDM makes the "Chat" state responses feel more personal in HCI conversation. Different users have their own personal speech habits. For example, given the same system response in the "Chat" state, some users may show interest by asking new questions, while others may smile or nod without saying anything. For the former users, the system will transfer to the "Answer" state and respond to the users' questions; for the latter users, the system will continue to chat about something else to maintain the topic. During this period, if the user shows a "bored" expression for a while, the system will transfer to the "Ask" state or the "Forget" state and start a new topic. These different schemes keep the dialog process smooth and enhance the user's interaction experience.
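To make the trigger mechanism of Table 3 concrete, the sketch below encodes a subset of the trigger-to-transition rules as a lookup table that could sit on top of the slot-filling prototype sketched earlier. The trigger names are condensed from Table 3; the encoding itself is an illustrative assumption.

```python
# (current_state, trigger) -> next_state, condensed from Table 3 (illustrative subset).
TRANSITIONS = {
    ("Ask", "slots_complete"): "Answer",
    ("Ask", "positive_emotion_or_gesture"): "Answer",
    ("Answer", "slots_empty"): "Ask",
    ("Answer", "negative_emotion_or_gesture"): "Ask",
    ("Answer", "independent_behavior_without_speaking"): "Ask",
    ("Answer", "answer_complete"): "Chat",
    ("Ask", "emotion_polarity_positive_to_negative"): "Chat",
    ("Chat", "new_question"): "Answer",
    ("Chat", "user_going_away"): "Forget",
    ("Chat", "long_period_negative_emotion"): "Forget",
    ("Forget", "clear_all_slots"): "Ask",
}

def next_state(state, trigger):
    """Return the next dialog state; stay in the current state if no rule fires."""
    return TRANSITIONS.get((state, trigger), state)
```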

5 Experiments and discussions

In the experiments, we aim to verify that the BSDM is effectively sensitive to user behavior in the following two respects: (1) the proposed high-level features and the STTD model are effective for emotion and user behavior classification, especially for detecting the user's conflict behavior; (2) we construct a route query system by connecting the BSDM to a 3D talking avatar, and the dialog histories between different users and the talking avatar show that the BSDM really improves the user experience.

5.1 Emotion detection

We first check whether the selected behavior features are effective for emotion and behavior classification. Our selected audio and linguistic features are the same as those used in [16]. The main difference between our selected features and those of [16] and [37] is that we adopt high-level visual features instead of the low-level LBP features. Xin et al. [42] have proved that the features V1–V6 are effective high-level features for emotion detection. Thus, in this part, we mainly confirm that the high-level visual short-term features (V7–V18) are effective for emotion classification. We divide the visual short-term features into three sets, S1 (V1–V6), S2 (V7–V18) and S3 (V1–V18), and compare the emotion recognition precision on these three sets. The test database is the NLPR emotion database [25]. Figure 6 shows some samples from the NLPR emotional database. It contains 30 subjects (15 males and 15 females); each was asked to sit in a noise-reduced environment and talk for about 2 h with exaggerated expressions like drama actors/actresses.

Fig. 6 Samples selected from the NLPR emotional database

The original data was labeled by three annotators as "happiness", "sadness", "anger", "fear", "surprise" or "neutral", piece by piece. In our experiments, the emotion polarities are viewed as the key triggers for topic transfer in the BSDM. We take the "happiness" and "neutral" pieces as positive emotion and the "sadness", "anger" and "fear" pieces as negative emotion. We randomly pick 500 positive and 500 negative emotion pieces from [25]. As we aim to compare the classification results on S1, S2 and S3, in this test only the visual short-term features are input into a single-layer SVM with 5-fold cross validation on the picked pieces from the NLPR emotional database.

Figure 7(a) presents the positive and negative facial emotion classification results for S1, S2 and S3. We can see from Fig. 7(a) that the original visual set S1 obtains 77.02 and 71.69 % classification precision on positive and negative emotion, which is better than S2 (73.61 and 69.80 %). The precision of S3 (S1+S2) shows an obvious improvement over both S1 and S2, reaching 79.95 and 78.26 %. This shows that the proposed high-level feature set S2 (V7–V18) is effective for emotion classification and, combined with S1 (V1–V6), helps improve the emotion polarity classification.

Figure 7(b) presents the positive and negative audio-visual emotion classification results of our STTD model for different sizes of the fusion window. In Fig. 7(b), the values along the x axis are the fusion window sizes at the second stage of the STTD model, and the values along the y axis are the emotion recognition precision with 5-fold cross validation on the NLPR emotional database. The window size ranges from 1 to 32. When the fusion size is 1, the STTD model degrades into a single-layer SVM. As the sample window size increases, the emotion recognition precision increases; once the fusion size is larger than 22 for positive emotion and 19 for negative emotion, the precision growth is no longer obvious. Tao et al. [35] presented their audio-visual emotion recognition precision on the NLPR emotion database: about 82.0 % ("neutral"), 81.0 % ("happiness"), 80.5 % ("sadness"), 84.0 % ("anger"), 84.0 % ("fear") and 83.0 % ("surprise") respectively. We can see from Fig. 7(b) that the STTD model obtains 83.0 % precision on positive emotion recognition ("neutral" and "happiness"), which is higher than that of the MHMM algorithm proposed in [35]. For negative emotion classification (including "sadness", "anger" and "fear"), the STTD model obtains 82.0 %, which is slightly lower than the mean recognition value of MHMM on the negative emotions (about 82.88 %). Comparing our method with the method proposed in [35] on audio-visual emotion classification, we can see that the STTD model obtains competitive results on positive and negative emotion classification. This shows that the high-level features and the STTD model are effective for emotion and user behavior classification. In the next section, we further check whether the STTD model is effective for user behavior detection with the time-dynamic (gesture and linguistic) features input at the second stage.


Fig. 7 a The positive and negative emotion recognition results with the visual short-term features S1, S2 and S3 respectively; b The positive and negative audio-visual emotion classification results of our STTD model for different sizes of the fusion window

5.2 User behavior classification

In this part, we aim to check that the proposed STTD model is effective for user behavior classification. As the AVEC2012 database and the NLPR emotional database are labeled without head motion and gesture, to verify the effectiveness of our model on behavior analysis we construct a multi-modal user behavior database and verify the model on this database for behavior classification.

We invited 6 candidates (3 males and 3 females) to sit in an office environment with a microphone and web camera (resolution 640×480) in front of them. They were asked to display facial expressions, shake or nod their heads and speak in a natural way. Each candidate was asked to act out each behavior item listed in Table 4 at least 3 times. For the supporting expressions, such as the non-linguistic features in Table 4, candidates could choose whether to act them out. After all speakers' data had been collected, it was labeled by three annotators as "Conflict", "Complementation" or "Independence", piece by piece. If two or more annotator labels did not agree with the intended label marked by the actor, the piece was removed from the database. In this way, we obtained 120 behavior pieces out of a total of 150, with each piece lasting about 10–20 s. Figure 8 shows some samples from our database. The pieces not labeled as "Conflict" were additionally labeled as "Positive" or "Negative", and were used for user behavior attitude detection.

The SPTK toolkit [33] and the facial tracking algorithm [32] were used to obtain the audio and visual features. We validate the STTD model for behavior classification with 5-fold cross validation in the order complementation, independence, conflict. As the values of the visual features lie in [0, 1], we normalize all audio features to the range [0, 1] by their mean value and variance in each sample window. The sample size is set to 22 at the second stage of the STTD model. For the ASR results, there are four kinds of key words in Table 4: positive ("ok", "yes", …), negative ("no", "don't", "impossible", "won't", …), demonstrative pronouns ("there", "here", "this", "that", …) and "NULL". To keep the values of the linguistic features consistent with those of the visual-audio features and scattered over the range [0, 1], the values of these four types of key words obtained from ASR are set to 0.8 (positive), 0.6 (negative), 0.4 (demonstrative pronoun) and 0.2 ("NULL") respectively, as the time-dynamic feature qt at the second stage of the STTD model. We use the Google ASR API [12] for keyword recognition.
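A minimal sketch of the keyword-to-feature encoding described above; the keyword lists are abbreviated examples from Table 4 and the function name is illustrative.

```python
KEYWORD_VALUES = {
    "positive": 0.8,       # "ok", "yes", ...
    "negative": 0.6,       # "no", "don't", "impossible", "won't", ...
    "demonstrative": 0.4,  # "there", "here", "this", "that", ...
    "null": 0.2,           # no key word detected
}

KEYWORDS = {
    "positive": {"ok", "yes"},
    "negative": {"no", "don't", "impossible", "won't"},
    "demonstrative": {"there", "here", "this", "that"},
}

def linguistic_feature(asr_text):
    """Map the ASR result of an audio chunk to the time-dynamic feature q_t."""
    words = set(asr_text.lower().split())
    for kind, vocab in KEYWORDS.items():
        if words & vocab:
            return KEYWORD_VALUES[kind]
    return KEYWORD_VALUES["null"]
```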

Table 4 The user behaviors in our DM

Category | Visual features | Key words (ASR) | Head motion/gesture | Non-linguistic features

Conflict (Conflict_1) | Smile | – | Shake (head, hands) | (Sighing/Interrogative or rhetorical tone)
Conflict (Conflict_1) | Smile | "No"/"Will not (Won't)"/"Don't"/"Impossible" | – | (Sighing/Interrogative or rhetorical tone)
Conflict (Conflict_1) | Smile | "No"/"Will not (Won't)"/"Don't"/"Impossible" | Shake (head, hands) | (Sighing/Interrogative or rhetorical tone)
Conflict (Conflict_2) | Sad | "Ok"/"All right"/"Yes"/"Like"/"Sure" | – | (Laughing)
Conflict (Conflict_2) | Sad | – | Nod | (Laughing)
Complementation | Random facial expression | "There"/"Here"/"That place"/"This"/"That" | (Random head/gesture motion) | (Random non-linguistic vocalizations)
Independence | Random facial expression | – | (Random head/gesture motion) | (Random non-linguistic vocalizations)

Conflict_1 (smiling, shaking head without speech input)

Conflict_2 (sadness, nodding without speech input)

Conflict_1 (smiling, shaking head with speech input)

Conflict_2 (sadness, nodding with speech input)

Independence (positive emotion, random head motion with speech input)

Fig. 8 Some samples from the constructed multi-modal database

Tables 5 and 6 list the confusion matrices of user behavior classification when the ASR results are and are not used in the STTD model, where "Comp", "Ind" and "Conf" represent complementation, independence and conflict respectively. It is noticeable that "complementation" behaviors are similar to "independence" behaviors when there is no speech input in Table 4. First, we take "complementation" and "independence" as one term, "Comp(Ind)", and compare its recognition precision with that of "conflict". The comparison confusion matrix is given in Table 5. When there is no speech input, we can see from Table 5 that "Comp(Ind)" and "Conf" can be effectively detected by the STTD model. The correct classification ratios for "Comp(Ind)" and "Conf" are 78.38 and 83.33 %. This demonstrates that when there is no speech input, the user's conflict behaviors can be effectively distinguished from the complementation and independence behaviors.

Table 5 Confusion matrix of user behavior classification between "Comp(Ind)" and "Conf" without ASR input at the second stage of the STTD model

 | Comp(Ind) | Conf
Comp(Ind) | 78.38 % | 24.62 %
Conf | 16.67 % | 83.33 %

Table 6 Confusion matrix of user behavior classification for "complementation (Comp)", "independence (Ind)" and "conflict (Conf)" with ASR input at the second stage of the STTD model

 | Comp | Ind | Conf
Comp | 91.33 % | 4.92 % | 3.75 %
Ind | 0.12 % | 81.54 % | 18.34 %
Conf | 1.54 % | 13.17 % | 85.33 %

Table 6 presents the confusion matrix of recognition precision for "complementation (Comp)", "independence (Ind)" and "conflict (Conf)" respectively. In this experiment, the pieces of the other two behavior categories are taken as negative samples for the behavior type being detected in the training phase. After the ASR results are combined in the STTD model at the second stage, the classification precisions of "Comp" and "Ind" reach 91.33 and 83.54 %, which are obviously higher than the classification results of the combined term "Comp(Ind)" in Table 5. This shows that the ASR results clearly improve the classification precision for user behavior.

We further confirm the effectiveness of the STTD model for the user's inner conflict behavior classification (conflict_1 and conflict_2 in Table 4). Conflict_1 and conflict_2 behaviors are useful in HCI. For example, in daily human conversation, a conflict_1 behavior (the user says "no" or shakes the head while smiling) could be a polite refusal, and a conflict_2 behavior (the user says "yes" with a sad expression) may reflect the user's reluctant attitude in the conversation. Table 7 presents the inner conflict behavior classification results with the STTD model. We can see that, after the complementation and independence behaviors are detected in advance, the STTD model obtains convincing classification precision (92.48 and 89.33 %) on the two inner conflict behaviors. This shows that the STTD model is effective for user behavior analysis in the BSDM.

5.3 Dialog application

We construct a drive route information query system by connecting the proposed BSDM to a 3D talking avatar [36]. Figure 9 presents the user interface of the constructed drive route query system. There are six modules in total. The first module presents the facial feature point tracking results; the tracked face points are input into the STTD model as short-term or time-dynamic visual features according to the items listed in Table 1 (ft: V1–V18; pt: V19–V20; qt: V21–V22). The second module presents the time-variant information of the user's speech prosody, which is taken as the features pt in the STTD model. The third module is the speech recognition module; the ASR results are first checked against the keyword list and adopted as the time-dynamic features qt in the STTD model. The selected features obtained from modules 1, 2 and 3 are input into the BSDM and further fused for user behavior classification. The BSDM is an invisible module behind the user interface, which automatically records the dialog history between the user and the talking avatar. The fourth module shows the 3D talking avatar, which generates the emotional animation based on the DM output [36]. The sixth module presents the traffic query results to users with 3D map information, which is constructed with the Baidu map API [22].

Table 7 Confusion matrix of the inner conflict (conflict_1 and conflict_2) behavior classification with ASR, after the complementation and independence behaviors were detected and excluded in the STTD model

 | Conflict_1 | Conflict_2
Conflict_1 | 92.48 % | 7.52 %
Conflict_2 | 10.67 % | 89.33 %

Fig. 9 The user interface of our virtual avatar traffic query system

5.4 Dialog records and analysis

We invited ten people of different ages to interact with the multi-modal traffic route query system by talking with the 3D talking avatar (Fig. 9). We list three different users' dialog records in this section and analyze how the user behaviors influence the dialog process. Among these three users, one is a boy, whose dialog history is marked as A; the other two users (B and C) are middle-aged men. They all started the dialog by asking the avatar the same question, "How can I go to YiHeYuan", where YiHeYuan is a destination place. Once the conversation started, the users freely continued to interact with the avatar according to their own intentions in later turns. Tables 8, 9 and 10 list the three users' chat histories with the avatar respectively.

Table 8 The dialog history between user A and avatar

1 User A: How can I get to YiHeYuan? (Asks with calm expression)
2 Avatar: Where would you like to leave for YiHeYuan? (Independence behavior detected; "Ask" state; spreading hands with interrogative tone)
3 User A: XiZhiMen. (Answers the avatar's question)
4 Avatar: From XiZhiMen to YiHeYuan, you first go to ZhiChunLu, then go through YuanMingYuan, and you will arrive at YiHeYuan. (Independence behavior detected, "Answer" state, with right hand pointing to the map)
5 User A: Thanks, how could I go to Bird's Nest Stadium? (Smiles and asks)
6 Avatar: From XiZhiMen to Bird's Nest, please get to WuDaoKou at first and after China University of Geosciences, you will arrive at Bird's Nest. (Independence behavior detected, "Answer" state, with a hand pointing to the real-time 3D map)
7 User A: (Looks at the map for a while in silence)
8 Avatar: Now, there is heavy traffic around the North Fourth Ring Road of Beijing. You could go to Bird's Nest by subway instead. (Independence behavior detected, transfer to "Chat" state)
9 User A: Where is the subway transfer station? (Asks with a puzzled expression)
10 Avatar: ZiZhiMen on subway line 2, where you transfer to subway line 5 at YongHeGong station. (Complement behavior detected, transfer to "Answer" state)
11 User A: Ah, thanks. (Smiles, satisfied with the avatar's answer)
12 Avatar: I am glad to be at your service. Are there other things I can help you with? (Independence behavior detected, transfer to "Chat" state)
13 User A: No, this is ok. Thanks, bye.
14 Avatar: See you, have a nice day. (Independence behavior detected, "Forget" state).

Table 9 The dialog history between user B and avatar

1 User B: How can I get to YiHeYuan? (Asks with calm expression)
2 Avatar: Would you like to go to YiHeYuan from ZhiChunLu? (Independence behavior detected, "Ask" state, spreading hands with interrogative tone)
3 User B: Yes
4 Avatar: From ZhiChunLu to YiHeYuan, you first get to Tsinghua University, then you will arrive at YiHeYuan. (Independence behavior detected, "Answer" state, with left hand pointing to the route map)
5 User B: How could I get to TianAnMen? (Asks again)
6 Avatar: From ZhiChunLu to QianMen, you first go to XiZhiMen, then go through XiDan. At last, …… (Independence behavior detected, "Answer" state)
7 User B: (Smiles and shakes head)
8 Avatar: Ah, won't you go to QianMen? (Conflict behavior detected, turn to "Ask" state)
9 User B: No, I want to go to TianAnMen. (Smiles and answers)
10 Avatar: Yes, from ZhiChunLu to TianAnMen, you first go to XiZhiMen, then go through XiDan. At last, you will arrive at TianAnMen. (Independence behavior detected, "Answer" state)
11 User B: That's fine, thank you. (Satisfied with the avatar's answer)
12 Avatar: It's my pleasure. (Independence behavior detected, "Chat" state, makes a bow to the user)

In the tables, the underlined words are departure or destination places, and the sentences in bold are important user behaviors and avatar state transfers detected by the dialog management. The sentences in the 3rd, 5th and 7th rows of each table present the users' early exchanges with the avatar. For example, in the third row of Tables 8 and 9, user A provided the departure place directly and user B nodded his head; the direct speech command and the head gesture both led the avatar to give the detailed traffic route information in the next turn (the 4th row in Tables 8 and 9). In the third row of Table 10, user C shook his head, which caused the avatar to ask the user again about the departure place. With the user behavior analysis in the BSDM, the user's different gestures and speech commands lead to different avatar responses in later dialog turns. Similar situations happen in the 9th row of Table 9 and the 7th row of Table 10.

The user's "conflict" behavior happens in the 7th row of Table 9, after the avatar said "From ZhiChunLu to QianMen, …" in the previous turn (in this query, "TianAnMen" was recognized as "QianMen" by the ASR because of user B's spoken accent). The mistake was immediately noticed by the user, who smiled and shook his head at the avatar's answer. The conflicting user behavior was detected by the BSDM with a two or three second delay. The avatar then transferred to the "Ask" state and asked the user "Ah, won't you go to QianMen?" to let the user clarify his intention. In this case, the "trial-and-error" method kept the dialog process smooth and gave the user a natural interaction experience.

"Independence" behaviors were detected in most dialog turns. In these turns, the user's behavior can be viewed as independent of the dialog process. A complement behavior happened in the 9th row of Table 8, where "that place" was an obvious linguistic cue for detecting the complement type. In this case, the BSDM searched the slots with the ASR results. The key words "subway transfer" and "that place" led the avatar to answer the user's query by searching the subway route lines, where an undirected connectivity map of bus and subway routes supports the route information query given the departure and destination places.

5.5 Subjective evaluation on BSDM

Subjective evaluations of the user behavior sensitive multi-modal dialog management are given by mean opinion score (MOS) scores, which are collected for three different conversation styles: 1) pure speech conversation with no behavior analysis in the DM; 2) speech conversation with user behavior analysis in the DM, but with the "Forget" state ignored; 3) speech conversation with the proposed full BSDM (Fig. 5).

Table 10 The dialog history between user C and avatar

1 User C: How can I go to YiHeYuan? (Asks with a calm expression)
2 Avatar: Would you like to go to YiHeYuan from ZhiChunLu? (Independence behavior detected, "Ask" state, spreading hands with an interrogative tone)
3 User C: (Shakes head)
4 Avatar: Where would you like to leave for YiHeYuan? (Independence behavior detected, "Ask" state)
5 User C: Beijing Zoo.
6 Avatar: From Beijing Zoo to YiHeYuan, you go to ZhiChunLu at first, then go through Tsinghua University, ... (Independence behavior detected, "Answer" state, with a hand pointing to the route map)
7 User C: Ah, no, I want to go to YiHeYuan from Beijing West Railway Station. (User interrupts the avatar and changes the departure place)
8 Avatar: From Beijing West Railway Station to YiHeYuan, you first go to XiZhiMen, then go through ZhiChunLu, and at last you arrive at YiHeYuan. (Independence behavior detected, "Answer" state)
9 User C: (Smiles with a happy expression)
10 Avatar: There is a thunder shower this afternoon, please be careful on your drive. (Independence behavior detected, turn to "Chat" state)
11 User C: Yes, thank you, you are very nice. (Satisfied with the avatar's answer)
12 Avatar: I am very glad that you are satisfied with my service. (Independence behavior detected, "Chat" state)
13 User C: Ok, see you.
14 Avatar: Bye. (Independence behavior detected, "Forget" state)

As introduced in the previous section, ten participants in total were asked to score the expressivity of the HCI conversation system on a five-level MOS scale. The MOS score range is [0, 5] (the higher the score, the better the user experience). The MOS results are shown in Fig. 10. The average MOS scores of the three sessions are 2.2 (session 1), 2.9 (session 2) and 3.7 (session 3). Pure speech conversation without any behavior analysis in the DM is consistently judged unnatural. Speech conversation with behavior analysis connected to the original DM of Fig. 4 obtains a higher score than the first style. When the STTD model is connected to the full DM, where the "Forget" state is added for system-initiative topic transfer, the HCI dialog shows a natural and realistic conversation procedure and improves the MOS by 0.8 points over conversation style 2. This confirms that user behavior analysis in dialog management with multi-modal history cues improves the naturalness of human-computer conversation.
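As a small worked illustration of how the per-session MOS values are obtained, the snippet below averages ten participants' ratings for each conversation style; the listed scores are placeholders for illustration only, not the actual ratings collected in the experiment.

```python
# Placeholder ratings (0-5) from ten participants for each conversation
# style; these are illustrative values, not the experiment's real data.
ratings = {
    "style 1 (speech only)":            [2, 3, 2, 2, 3, 2, 2, 2, 3, 2],
    "style 2 (behavior, no Forget)":    [3, 3, 2, 3, 3, 3, 3, 2, 3, 3],
    "style 3 (full BSDM)":              [4, 4, 3, 4, 4, 3, 4, 4, 4, 3],
}

for name, scores in ratings.items():
    mos = sum(scores) / len(scores)  # mean opinion score for the style
    print(f"{name}: MOS = {mos:.1f}")
```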

5.6 Discussions

We mainly verify two points in the experiments: (1) the modified high-level facial features are effective for user behavior fusion and classification in the STTD model; (2) the proposed user behavior fusion model (complementation, conflict and independence) and the BSDM are effective for topic management in the dialog process. First, we appended 16 high-level visual features based on the features proposed in [42]. As the two-layer SVM fusion model [37] has been shown to be effective for emotion classification, we adopt single-layer SVMs to test emotion classification on the feature sets V1–V6, V7–V18 and V1–V18.

Fig. 10 MOS scores of users' interaction experience with three kinds of DM

The experiments show that the recognition precision of V7–V18 is similar to that of V1–V6, and combining V1–V6 and V7–V18 improves the recognition precision by about 2.5 % over either V1–V6 or V7–V18 alone on facial emotion polarity classification. The experiments prove that the proposed high-level features are effective for facial emotion detection. V21–V22 are proposed for head gesture estimation; we combine these two features with the prosody and linguistic features at the second stage of the STTD model. With the definition of the three user behavior categories, we constructed a multi-modal user behavior database. As the dialog process mainly considers the user's conflict behavior, and the dialog attitude for the other two categories, we verify the STTD model on the constructed multi-modal database. Since the classification precisions for conflict behavior and for positive and negative emotion reach about 75–85 %, the detected user behaviors can be taken as effective cues for later topic management with the STTD model. The records of three users' dialogs with the traffic 3D talking avatar provide direct evidence for behavior estimation and dialog state transfer. Starting from the same opening sentence, the users were asked to talk freely with the avatar in front of the microphone and camera. With the "trial-and-error" schema for conflict behavior and the estimation of the users' behavior attitude, the BSDM lets the avatar sense the users' behavior and emotion in real time and clarify user intention actively. An additional advantage of the BSDM is that it can reduce the influence of ASR errors in the dialog process (for example the case in the 7th and 8th rows of Table 9, where "TianAnMen" is wrongly recognized as "QianMen" because of the user's accent). With conflict behavior detection, the avatar presents a satisfactory answer in the third turn after clarification. Overall, the proposed BSDM helps to keep the human-computer dialog smooth and improves the user interaction experience in comparison with traditional DMs without user behavior sensitive techniques (see Section 5.4).
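A minimal sketch of this feature-subset comparison is shown below, assuming scikit-learn and random placeholder data in place of the real audio-visual corpus; the column layout (V1–V6 in the first six columns, V7–V18 in the next twelve) is an assumption made only for illustration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Placeholder data: 200 samples with the 18 high-level visual features
# V1-V18 and binary emotion-polarity labels (not the real corpus).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 18))
y = rng.integers(0, 2, size=200)

feature_sets = {
    "V1-V6":  slice(0, 6),    # original high-level features
    "V7-V18": slice(6, 18),   # appended high-level features
    "V1-V18": slice(0, 18),   # combined set
}

for name, cols in feature_sets.items():
    clf = SVC(kernel="rbf", C=1.0, gamma="scale")  # single-layer SVM
    acc = cross_val_score(clf, X[:, cols], y, cv=5).mean()
    print(f"{name}: cross-validated accuracy = {acc:.3f}")
```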

6Conclusion

In this study, we propose a multi-modal dialog management that is sensitive to user behavior. We first define three kinds of user behavior; each behavior type reflects the user's talking attitude in the conversation. User behavior can be detected from high-level facial features, prosody, speech and linguistic cues with the STTD model. The experiments verified that the STTD model benefits users' intention classification and helps the DM decide on topic maintenance or transfer among the "Ask", "Answer", "Chat" and "Forget" states with a mixed-initiative strategy, which contributes to the avatar's flexible response in user interaction. Moreover, with real-time user behavior detection, the moment to enter the "Forget" state can be determined when a new user arrives or the current user says goodbye. In the practical traffic route query system, we embedded the slot-filling technique in the DM to maintain the conversation information. With the user behavior analysis techniques, the DM submits the departure and destination places to the route information system and generates an appropriate response according to the user's positive or negative attitude. This keeps the dialog process sensitive to user behavior and improves the user's interaction experience. Based on the proposed BSDM framework, our future work focuses on the following points. First, beyond the positive and negative emotions in the BSDM, we plan to include a subtler response mechanism for more emotions (such as "happiness", "sadness", "anger" and "surprise") in the DM. Second, as user hand gestures and body movements also contain important interaction cues, we consider constructing a more comprehensive DM by importing body motion and gesture into the BSDM. Finally, since the current talking avatar application targets a traffic route system, the slot-filling technique is appropriate for information query; for other avatar dialog applications, such as virtual museum guides and language learning, it is important future work to apply and evaluate the effectiveness of different techniques in the BSDM, such as graphical models, reinforcement learning and POMDPs.
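As a concrete illustration of the slot-filling step described above, here is a minimal sketch; the place list, the `fill_slots` helper and the commented-out `query_route` back-end call are hypothetical stand-ins for the system's ASR output handling and route information service.

```python
KNOWN_PLACES = ["YiHeYuan", "Beijing Zoo", "TianAnMen", "ZhiChunLu"]  # illustrative

def fill_slots(slots, asr_text, known_places=KNOWN_PLACES):
    """Fill the departure/destination slots from a recognized utterance.
    Simple keyword matching stands in for the real language understanding."""
    text = asr_text.lower()
    for place in known_places:
        pos = text.find(place.lower())
        if pos < 0:
            continue
        before = text[:pos].rstrip()
        if before.endswith("from"):
            slots["departure"] = place
        elif before.endswith("to"):
            slots["destination"] = place
    return slots

slots = fill_slots({"departure": None, "destination": None},
                   "How can I go to YiHeYuan from Beijing Zoo")
if all(slots.values()):
    # route = query_route(slots["departure"], slots["destination"])  # hypothetical back end
    print(slots)  # {'departure': 'Beijing Zoo', 'destination': 'YiHeYuan'}
```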

Acknowledgments This work is supported by the National Natural Science Foundation of China (NSFC) (No. 61273288, No. 61233009, No. 61203258, No. 61530503, No. 61332017, No. 61375027), and the Major Program for the National Social Science Fund of China (13&ZD189).

References

1. Ananova. http://en.wikipedia.org/wiki/Ananova. Accessed 18 Jan 2014
2. Baltrušaitis T, Ramirez GA, Morency L-P (2011) Modeling latent discriminative dynamic of multi-dimensional affective signals. Affect Comput Intell Interact, p 396–406. Springer, Berlin
3. Bell L, Gustafson J (2000) Positive and negative user feedback in a spoken dialogue corpus. In: INTERSPEECH, p 589–592
4. Bianchi-Berthouze N, Meng H (2011) Naturalistic affective expression classification by a multi-stage approach based on hidden Markov models. Affect Comput Intell Interact 3975:378–387
5. Bohus D, Rudnicky A (2005) Sorry, I didn't catch that! - an investigation of non-understanding errors and recovery strategies. In: Proceedings of SIGdial. Lisbon, Portugal
6. Bousmalis K, Zafeiriou S, Morency L-P, Pantic M, Ghahramani Z (2013) Variational hidden conditional random fields with coupled Dirichlet process mixtures. In: European conference on machine learning and principles and practice of knowledge discovery in databases
7. Brustoloni JC (1991) Autonomous agents: characterization and requirements
8. Cerekovic TPA, Igor P (2009) RealActor: character animation and multimodal behavior realization system. In: Intelligent virtual agents, p 486–487

9. Dobrisek S, Gajsek R, Mihelic F, Pavesic N, Struc V (2013) Towards efficient multi-modal emotion recognition. Int J Adv Robot Syst 10:1–10
10. Engwall O, Balter O (2007) Pronunciation feedback from real and virtual language teachers. J Comput Assist Lang Learn 20(3):235–262
11. Goddeau HMD, Poliforni J, Seneff S, Busayapongchait S (1996) A form-based dialogue management for spoken language applications. In: International conference on spoken language processing. Pittsburgh, PA, p 701–704
12. GoogleAPI. www.google.com/intl/en/chrome/demos/speech.heml
13. Heloir A, Kipp M, Gebhard P, Schroeder M (2010) Realizing multimodal behavior: closing the gap between behavior planning and embodied agent presentation. In: Proceedings of the 10th international conference on intelligent virtual agents. Springer
14. Hjalmarsson A, Wik P (2009) Embodied conversational agents in computer assisted language learning. Speech Comm 51(10):1024–1037
15. Jones MJ, Viola PA (2004) Robust real-time face detection. Int J Comput Vis 57(2):137–154
16. Kaiser M, Willmer M, Eyben F, Schuller B (2012) LSTM-modeling of continuous emotions in an audiovisual affect recognition framework. Image Vis Comput 31:153–163
17. Kang Y, Tao J (2005) Features importance analysis for emotion speech classification. In: International conference on affective computing and intelligence interaction - ACII 2005, p 449–457
18. kth. http://www.speech.kth.se/multimodal/. Accessed 18 Jan 2014
19. Lee C, Jung S, Kim K, Lee D, Lee GG (2010) Recent approaches to dialog management for spoken dialog systems. J Comput Sci Eng 4(1):1–22
20. Levin RPE, Eckert W (2000) A stochastic model of human-machine interaction for learning dialog strategies. IEEE Trans Speech Audio Process 8(1):11–23
21. Litman DJ, Tetreault JR (2006) Comparing the utility of state features in spoken dialogue using reinforcement learning. In: North American Chapter of the Association for Computational Linguistics - NAACL. New York City
22. MapAPIBaidu. http://developer.baidu.com/map/webservice.htm
23. McKeown G, Valstar MF, Cowie R, Pantic M (2010) The SEMAINE corpus of emotionally coloured character interactions. In: Proc IEEE Int'l Conf Multimedia and Expo, p 1079–1084
24. mmdagent. http://www.mmdagent.jp/. Accessed 18 Jan 2014
25. nlprFace. http://www.cripac.ia.ac.cn/Databases/databases.html and http://www.idealtest.org/dbDetailForUser.do?id=9. Accessed 18 Jan 2014
26. Pietquin TDO (2006) A probabilistic framework for dialog simulation and optimal strategy. IEEE Trans Audio Speech Lang Process 14(2):589–599
27. Rebillat M, Courgeon M, Katz B, Clavel C, Martin J-C (2010) Life-sized audiovisual spatial social scenes with multiple characters: MARC SMART-I2. In: 5th meeting of the French association for virtual reality
28. Schatzmann KWJ, Stuttle M, Young S (2006) A survey of statistical user simulation techniques for reinforcement-learning of dialogue management strategies. Knowl Eng Rev 21(2):97–126
29. Schels M, Glodek M, Meudt S, Scherer S, Schmidt M, Layher G, Tschechne S, Brosch T, Hrabal D, Walter S, Traue HC, Palm G, Schwenker F, Campbell MR (2013) Multi-modal classifier-fusion for the recognition of emotions. Chapter in: Coverbal synchrony in human-machine interaction. CRC Press, Boca Raton, FL, USA
30. Scherer KR (1999) Appraisal theory. In: Dalgleish T, Power M (eds) Handbook of cognition and emotion. Wiley, Chichester
31. Schwarzlery SMS, Schenk J, Wallhoff F, Rigoll G (2009) Using graphical models for mixed-initiative dialog management systems with realtime policies. In: Annual conference of the International Speech Communication Association - INTERSPEECH, p 260–263
32. Shan S, Niu Z, Chen X (2009) Facial shape localization using probability gradient hints. IEEE Signal Process Lett 16(10):897–900
33. SPTK. http://sp-tk.sourceforge.jp. Accessed 18 Jan 2014
34. Steedman M, Badler N, Achorn B, Bechet T, Douville B, Prevost S, Cassell J, Pelachaud C, Stone M (1994) Animated conversation: rule-based generation of facial expression, gesture and spoken intonation for multiple conversation agents. In: Proceedings of SIGGRAPH, p 73–80
35. Tao J, Pan S, Yang M, Li Y, Mu K, Che J (2011) Utterance independent bimodal emotion recognition in spontaneous communication. EURASIP J Adv Signal Process 11(1):1–11
36. Tao J, Yang M, Mu K, Li Y, Che J (2012) A multimodal approach of generating 3D human-like talking agent. J Multimodal User Interfaces 5(1–2):61–68

37. Tao J, Yang M, Chao L (2013) Combining emotional history through multimodal fusion methods. In: Asia Pacific Signal and Information Processing Association (APSIPA 2013). Kaohsiung, Taiwan, China
38. Tschechne S, Glodek M, Layher G, Schels M, Brosch T, Scherer S, Schwenker F (2011) Multiple classifier systems for the classification of audio-visual emotion states. Affect Comput Intell Interact, p 378–387. Springer, Berlin
39. Van Reidsma D, Welbergen H, Ruttkay ZM, Zwiers Elckerlyc J (2010) A BML realizer for continuous, multimodal interaction with a virtual human. J Multimodal User Interfaces 3(4):271–284
40. Williams JD (2003) A probabilistic model of human/computer dialogue with application to a partially observable Markov decision process
41. Williams JD, Poupart P, Young S (2005) Partially observable Markov decision processes with continuous observations for dialogue management. In: Proceedings of the 6th SIGdial workshop on discourse and dialogue. Lisbon
42. Xin L, Huang L, Zhao L, Tao J (2007) Combining audio and video by dominance in bimodal emotion recognition. In: International conference on affective computing and intelligence interaction - ACII, p 729–730
43. Young S (2006) Using POMDPs for dialog management. In: IEEE workshop on spoken language technology - SLT
44. Zeng Z, Pantic M, Roisman GI, Huang TS (2009) A survey of affect recognition methods: audio, visual, and spontaneous expressions. IEEE Trans Pattern Anal Mach Intell 31(1):39–58

Minghao Yang received his PhD from the Institute of Automation, Chinese Academy of Sciences in 2010, and his MS degree from the Department of Computer Animation Technique, Communication University of China in 2006. He is currently working as an Assistant Professor in the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences. His current research interests include speech generation and visualization, multimodal human-computer interaction and virtual reality. He received the 2013 best paper award from the NCMMSC conference and has published over 20 papers in JMUI, JCAD, ICMI, ICASSP, ACII and other important international journals and conferences.

Jianhua Tao received his PhD from Tsinghua University in 2001 and his MS degree in 1996. He is currently a Professor in NLPR, Institute of Automation, Chinese Academy of Sciences. His current research interests include speech synthesis and coding methods, human-computer interaction, multimedia information processing and pattern recognition. He has published more than eighty papers in major journals and proceedings, including IEEE Trans. on ASLP, and has received several awards from important conferences such as Eurospeech and NCMMSC. He serves as chair or program committee member for several major conferences, including ICPR, ACII, ICMI, ISCSLP and NCMMSC. He also serves as a steering committee member for IEEE Transactions on Affective Computing, associate editor for the Journal on Multimodal User Interfaces and the International Journal of Synthetic Emotions, and Deputy Editor-in-Chief for the Chinese Journal of Phonetics.

Linlin Chao received his BS in electronic information engineering from Jilin University, China, in 2011. Currently, he is working toward the PhD in pattern recognition and intelligent systems at the Institute of Automation, Chinese Academy of Sciences, Beijing, China. His research interests include pattern recognition, machine learning and especially multimodal interaction and affective computing.

Hao Li received his BS in automation from Xidian University, China, in 2010. Currently, he is working toward the PhD degree in pattern recognition and intelligent systems at the Institute of Automation, Chinese Academy of Sciences, Beijing, China. His research interests include pattern recognition, machine learning and especially talking heads, visual speech synthesis and multimodal interaction.

Dawei Zhang received his BS from China University of Mining & Technology in 2012. He is currently pursuing the PhD in the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences. His current research interests include human-computer interaction, multi-modal information fusion and pattern recognition.

Hao Che received his Master's degree in software engineering from the University of Science and Technology of China in 2007. He received his BS in automatic control from Huazhong University of Science & Technology in 2004.

Currently, he is working toward the PhD degree in pattern recognition and intelligent systems at the Institute of Automation, Chinese Academy of Sciences, Beijing, China. His research interests include speech synthesis, speech recognition and especially speech prosody analysis.

Tingli Gao received her MS from the CASS Graduate School in 2012. She is currently an Assistant Engineer in NLPR, Institute of Automation, Chinese Academy of Sciences. Her current research interests include human-computer interaction and natural language processing. She has been involved in many project and system development efforts at NLPR, covering research directions such as speech synthesis, speech recognition and human-computer dialogue systems. She has published papers in the Journal of Chinese Information and participated in academic conferences.

Bin Liu received his BS and MS degrees from Beijing Institute of Technology (BIT), Beijing, in 2007 and 2009, respectively. He is currently pursuing the PhD in the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing. His current research interests include low bit rate speech coding and single channel speech enhancement.