Applying Psychology of Persuasion to Conversational Agents through Reinforcement Learning: an Exploratory Study

Francesca Di Massimo1, Valentina Carfora2, Patrizia Catellani2 and Marco Piastra1
1 Computer Vision and Multimedia Lab, Università degli Studi di Pavia, Italy
2 Dipartimento di Psicologia, Università Cattolica di Milano, Italy
[email protected], [email protected], [email protected], [email protected]

Abstract

This study is set in the framework of task-oriented conversational agents in which dialogue management is obtained via Reinforcement Learning. The aim is to explore the possibility to overcome the typical end-to-end training approach through the integration of a quantitative model developed in the field of persuasion psychology. Such integration is expected to accelerate the training phase and improve the quality of the dialogue obtained. In this way, the resulting agent would take advantage of some subtle psychological aspects of the interaction that would be difficult to elicit via end-to-end training. We propose a theoretical architecture in which the psychological model above is translated into a probabilistic predictor and then integrated in the Markov Decision Process, intended in its partially observable variant. The experimental validation of the architecture proposed is currently ongoing.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

A typical conversational agent has a multi-stage architecture: spoken language, written language and dialogue management, see Allen et al. (2001). This study focuses on dialogue management for task-oriented conversational agents. In particular, we focus on the creation of a dialogue manager aimed at inducing healthier nutritional habits in the interactant.

Given that the task considered involves psychosocial aspects that are difficult to program directly, the idea of achieving an effective dialogue manager via machine learning techniques, reinforcement learning (RL) in particular, may seem attractive. At present, many RL-based approaches involve training an agent end-to-end from a dataset of recorded dialogues, see for instance Liu (2018). However, the chance of obtaining significant results in this way entails substantial efforts in both collecting sample data and performing experiments. Worse yet, such efforts would rely on the even stronger hypothesis that the RL agent is able to elicit psychosocial aspects on its own.

As an alternative, in this study we envisage the possibility to enhance the RL process by harnessing a model developed and accepted in the field of persuasion psychology, so as to provide a more reliable learning ground and a substantial accelerator for the process itself.

Our study relies on a quantitative, causal model of human behavior being studied in the field of social psychology (see Carfora et al., 2019), aimed at assessing the effectiveness of message framing to induce healthier nutritional habits. The goal of the model is to assess whether messages with different frames can be differentially persuasive according to the users' psychosocial characteristics.

2 Psychological model: Structural Equation Model

Three relevant psychosocial antecedents of behaviour change are the following: Self-Efficacy (the individual perception of being able to eat healthy), Attitude (the individual evaluation of the pros and cons) and Intention Change (the individual willingness of adhering to a healthy diet). These psychosocial dimensions cannot be directly observed and need to be measured as latent variables. To this purpose, questionnaires are used, each composed of a set of questions or items (i.e. observed variables). Self-Efficacy is measured with 8 items, each associated to a set of answers ranging from "not at all confident" (1) to "extremely confident" (7). Attitude is assessed through 8 items associated to a differential scale ranging from 1 to 7 (the higher the score, the more positive the attitude). Intention Change is measured with three items on a Likert scale, ranging from 1 ("definitely do not") to 7 ("definitely do"). See Carfora et al. (2019).

In our study, the psychosocial model was assessed experimentally on a group of volunteers. Each participant was first proposed a questionnaire (Time 1 – T1) for measuring Self-Efficacy, Attitude and Intention Change. In a subsequent phase (i.e. message intervention), participants were randomly assigned to one of four groups, each receiving a different type of persuasive message: gain (i.e. positive behavior leads to positive outcomes), non-gain (negative behavior prevents positive outcomes), loss (negative behavior leads to negative outcomes) and non-loss (positive behavior prevents negative outcomes) (Higgins, 1997; Cesario et al., 2013). In a last phase (Time 2 – T2), the effectiveness of the message intervention was then evaluated with a second questionnaire, to detect changes in participants' Attitude and Intention Change in relation to healthy eating.

The overall model is described by the Structural Equation Model (SEM, see Wright, 1921) in Figure 1. For simplicity, only three items are shown for each latent variable. Besides allowing the description of latent variables, SEMs are causal models, in the sense that they allow a statistical analysis of the strength of the causal relations among the latents themselves, as represented by the arrows in the figure. SEMs are linear models, and thus all causal relations underpin linear equations.

Figure 1: SEM simplified model for the case at hand.

Note that latent variables in a SEM have different roles: in this case, gain/non-gain/loss/non-loss messages are independent variables, Intention Change is a dependent variable, Attitude is a mediator of the relationship between the independent and the dependent variables, and Self-Efficacy is a moderator, namely, it explains the intensity of the relation it points at. Intention Change was measured at both T1 and T2, Attitude was measured at both T1 and T2, and Self-Efficacy was measured at T1 only. Note that the time transversality (i.e. T1 → T2) is implicit in the SEM depiction above.
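Since the SEM is linear, the mediation and moderation structure sketched in Figure 1 corresponds to a small system of regression equations. The following is only an illustrative sketch: the coefficients β, γ and the error terms ε are placeholders, the indicator variables M_g, M_ng, M_l for the gain, non-gain and loss frames (with non-loss as reference) are a notational convenience, and the moderation is exemplified on a single message–attitude path, whereas the actual items, paths and estimated coefficients are those of the model in Carfora et al. (2019):

Attitude_T2 = β_0 + β_1·M_g + β_2·M_ng + β_3·M_l + β_4·Attitude_T1 + β_5·Self-Efficacy + β_6·(M_g · Self-Efficacy) + ε_A

Intention Change = γ_0 + γ_1·Attitude_T2 + γ_2·Self-Efficacy + ε_I

In the measurement part, each observed Item A_i loads linearly on the latent Attitude (and analogously for the Self-Efficacy items), which is what allows the latents to be estimated from the questionnaire answers.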
3 Probabilistic model: Bayesian Network

Once the SEM is defined, we aim to translate it into a probabilistic model, so as to obtain the probability distributions needed for the learning process. We resort to a graphical model, and in particular to a Bayesian Network (BN, see Ben Gal, 2007), namely a graph-based description of both the observable and latent random variables in the model and their conditional dependencies. In BNs, nodes represent the variables and edges represent dependencies between them, whereas the lack of edges implies their independence, hence a simplification in the model. As a general rule, the joint probability of a BN factorizes as follows:

P(X_1, ..., X_N) = ∏_{i=1}^{N} P(X_i | parents(X_i)),

where X_1, ..., X_N are the random variables in the model and parents(X_i) indicates all the nodes having an edge oriented towards X_i.

In the case at hand, a temporal description of the model, accounting for the time steps T1 and T2, is necessary as well. For this purpose, we use a Dynamic Bayesian Network (DBN, see Dagum et al., 1992). The DBN thus obtained is shown in Figure 2.

Figure 2: DBN translation of the SEM shown in Figure 1.

Notice that the messages are only significant at T2, as they have not been sent yet at T1. We gathered the messages in a single node, Message Type, assuming it can take four mutually exclusive values.

The mediator Attitude is measured at both time steps, while the moderator Self-Efficacy is constant over time, as suggested in Section 2. Intention Change has relevance at T2 only since, as we will mention in Section 5, it will be used to estimate a reward function once the final time step is reached.

4 Learning the BN

The collected data are as follows. The analysis was conducted on 442 interactants, divided into four groups, each one receiving a different type of message (the original study also included a control group, which we do not consider here). The answers to the items of the questionnaire always had a range of 7 values. However, this induces a combinatorial explosion, making it impossible to cover all the subspaces (7^8 = 5,764,801 different combinations for Attitude, for instance). We thus decided to aggregate: low := (1 to 2); medium := (3 to 5); high := (6 to 7).

Our aim is to learn the Joint Probability Distribution (JPD) of our model, as that would make us able to answer, through marginalizations and conditional probabilities, any query about the model itself. The conditional probability distributions to be learnt in the case in point are then the following:

• P(Item A_i), for i = 1, ..., 8;
• P(Item SE_i), for i = 1, ..., 8;
• P(Message Type);
• P(Attitude T1 | Item A_i, i = 1, ..., 8);
• P(Self-Efficacy | Item SE_i, i = 1, ..., 8);
• P(Attitude T2 | Item A_i, i = 1, ..., 8, Message Type, Self-Efficacy);
• P(Intention Change | Attitude T2, Self-Efficacy).

The first three can be easily inferred from the raw data as relative frequencies. As for the following four, even aggregating the 7 values as mentioned, a huge amount of data would still be necessary (3^8 · 2^4 · 3 = 314,928 subspaces for Attitude T2, for instance). As conducting a psychological study on that amount of people would not be feasible, we address the issue with an appropriate choice of the method. To allow using Maximum Likelihood Estimation (MLE) to learn the BN, we resort to the Noisy-OR approximation (see Oniśko, 2001). According to this, through a few appropriate changes (not shown) to the graphical model, the number of subspaces can be greatly reduced (e.g. 3 · 2 · 3 = 18 for Attitude T2).
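To make the estimation step concrete, the following is a minimal sketch, not the actual implementation: the function and variable names (mle_cpt, noisy_or, Attitude_T2, and so on) are hypothetical. It shows how a conditional table could be estimated by MLE as relative frequencies, and how a Noisy-OR style parameterization needs only one "activation" probability per parent instead of one entry per combination of parent values.

from collections import Counter

def mle_cpt(samples, child, parents):
    """Relative-frequency (MLE) estimate of P(child | parents).

    samples: list of dicts mapping variable names to (aggregated) values,
    e.g. {"Attitude_T2": "high", "Message_Type": "gain", "Self_Efficacy": "low"}.
    Returns: dict mapping each parent-value tuple to {child value: probability}.
    """
    joint = Counter((tuple(s[p] for p in parents), s[child]) for s in samples)
    totals = Counter(tuple(s[p] for p in parents) for s in samples)
    return {cfg: {val: cnt / totals[cfg]
                  for (c, val), cnt in joint.items() if c == cfg}
            for cfg in totals}

def noisy_or(p_activate, leak=0.0):
    """Noisy-OR distribution for a binary child with binary parents.

    p_activate[i] is the probability that parent i alone turns the child on,
    so one parameter per parent replaces a full conditional table.
    """
    def p_child_on(parent_values):  # parent_values: tuple of 0/1 flags
        q = 1.0 - leak
        for p, x in zip(p_activate, parent_values):
            if x:
                q *= 1.0 - p
        return 1.0 - q
    return p_child_on

# Example: three parents, only the second one active.
print(noisy_or([0.7, 0.5, 0.2])((0, 1, 0)))  # 0.5

With three-valued parents such as low/medium/high, the same idea applies after a suitable encoding of the parent values; the exact graphical changes adopted in our model are not shown here.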

5 Reinforcement Learning: Markov Decision Problems

The translation into a tool to be used for reinforcement learning is obtained in terms of Markov Decision Processes (MDPs), see Fabiani et al. (2010). Roughly speaking, in an MDP there is a finite number of situations or states of the environment, at each of which the agent is supposed to select an action to take, thus inducing a state transition and obtaining a reward. The objective is to find a policy determining the sequence of actions that generates the maximum possible cumulative reward over time. However, due to the presence of latents, in our case the agent is not able to have complete knowledge about the state of the environment. In such a situation, the agent must build its own estimate of the current state based on the memory of past actions and observations. This entails using a variant of the MDPs, that is, Partially Observable Markov Decision Processes (POMDPs, see Kaelbling, 1998). We then define the following, with reference to the variables mentioned in Figure 2:

S := {states} = {Attitude T2, Self-Efficacy};
A := {actions} = {ask A_1, ..., ask A_8} ∪ {ask SE_1, ..., ask SE_8} ∪ {G, NG, L, NL}, where A_i denotes the question for Item A_i, SE_i denotes the question for Item SE_i, and G, NG, L, NL denote the actions of sending Gain, Non-gain, Loss and Non-loss messages respectively;
Ω := {observations} = {Item A_1, ..., Item A_8, Item SE_1, ..., Item SE_8}.

Starting from an unknown initial state s_0 (often taken to be uniform over S, as no information is available), the agent takes an action a_0 that brings it, at time step 1, to state s_1, unknown as well. There, an observation o_1 is made. The process is then repeated over time, until a goal state of some kind has been reached. Hence, we can define the history as an ordered succession of actions and observations:

h_t := {a_0, o_1, ..., a_{t-1}, o_t},   h_0 = ∅.

As at all steps there is uncertainty about the actual state, a crucial role is played by the agent's estimate of the state of the environment, i.e. by the belief state. The agent's belief at time step t, denoted as b_t, is driven by its previous belief b_{t-1} and by the new information acquired, i.e. the action taken a_{t-1} and the observation made o_t. We then have:

b_{t+1}(s_{t+1}) = P(s_{t+1} | b_t, a_t, o_{t+1}).

In the POMDP framework, the agent's choices about how to behave are influenced by its belief state and by the history. Thus, we define the agent's policy

π = π(b_t, h_t),

which we aim to optimize. To complete the picture, we define the following functions to describe the model evolution in time (the prime notation indicates a reference to the subsequent time step):

state-transition function: T: (s, a) ↦ P(s′ | s, a) := T(s′, s, a);
observation function: O: (s, a) ↦ P(o′ | a, s′) := O(o′, a, s′);
reward function: R: (s, a) ↦ E[r′ | s, a] := R(s, a).

These functions can be easily adapted to the specifics of the case at hand. It can be seen that, once the JPD derived from the DBN is completely specified, the reward is deterministic. In particular, it is computed by evaluating the changes in the values of the latent Intention Change.
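The belief update defined above can be made explicit through T and O, as in the standard Bayesian filter b′(s′) ∝ O(o′, a, s′) Σ_s T(s′, s, a) b(s). The sketch below is a minimal illustration under that assumption (the function and argument names are hypothetical, not our implementation); in the case at hand, T and O would be obtained from the conditional distributions of the learnt DBN.

def update_belief(belief, action, observation, T, O, states):
    """One step of POMDP belief filtering:
    b'(s') is proportional to O(o', a, s') * sum_s T(s', s, a) * b(s).

    belief: dict mapping each state to a probability;
    T(s_next, s, a) and O(obs, a, s_next) are plain callables, here assumed
    to be backed by the conditional distributions of the learnt DBN.
    """
    new_belief = {}
    for s_next in states:
        predicted = sum(T(s_next, s, action) * belief[s] for s in states)
        new_belief[s_next] = O(observation, action, s_next) * predicted
    norm = sum(new_belief.values())
    if norm == 0.0:
        raise ValueError("observation has zero probability under current belief")
    return {s: p / norm for s, p in new_belief.items()}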
As we are interested in finding an optimal policy, we now need to evaluate the goodness of each state when following a given policy. As there is no certainty about the states, we define the value function as a weighted average over the possible belief states:

V_π(b_t, h_t) := Σ_{s_t} b_t(s_t) V_π(s_t, b_t, h_t),

where V_π(s_t, b_t, h_t) is the state value function. The latter depends on the expected reward (and on a discount factor γ ∈ [0, 1] stating the preference for fast solutions):

V_π(s_t, b_t, h_t) := R(s_t, π(b_t, h_t)) + γ Σ_{s_{t+1}} T(s_{t+1}, s_t, π(b_t, h_t)) · Σ_{o_{t+1}} O(o_{t+1}, π(b_t, h_t), s_{t+1}) V_π(s_{t+1}, b_{t+1}, h_{t+1}).

Finally, we define our target, namely the optimal value function and the related optimal policy, as:

V*(b_t, h_t) := max_π V_π(b_t, h_t),
π*(b_t, h_t) := argmax_π V_π(b_t, h_t).

It can be shown that the optimal value function in a POMDP is always piecewise linear and convex, as exemplified in Figure 3. In other words, the optimal policy (in bold in Figure 3) combines different policies depending on their belief state values.

Figure 3: Basic example of computation of V_π in a case where S = {s_1, s_2}; p_1, p_2, p_3 are three possible policies.

The next step is to use the POMDP to detect the optimal policy, that is, the sequence of questions to ask the interactant in order to draw her/his profile, and hence the message to send, which maximizes the effectiveness of the interaction. To this end, the contribution of the DBN is fundamental: from the associated JPD, in fact, we construct the probability distributions necessary to define the functions T, O, R that compose the value function.

6 Policy from Monte Carlo Tree Search

Figure 4: Expansion of the policy tree. l, m, h stand for low, medium and high.

It is evident from Figure 4, describing the full expansion of the policy tree for the case in point, that the computational effort and power required for a brute-force exploration of all possible combinations is unaffordable. Among all the policies that can be considered, we want to select the optimal ones, thus avoiding policies that are always underperforming. In other words, with reference to Figure 3, we want to find V_{p1}, V_{p2}, V_{p3} among those of all possible policies, and use them to identify the optimal policy V*.

To accomplish this, we select the Monte Carlo Tree Search (MCTS) approach, see Chaslot et al. (2008), due to its reliability and its applicability to computationally complex practical problems. We adopt the variant including an Upper Confidence Bound formula, see Kocsis et al. (2006). This method combines exploitation of the previously computed results, allowing to select the game action leading to better results, with exploration of different choices, to cope with the uncertainty of the evaluation. Thus, using V_π(s_t, b_t, h_t) as defined before to guide the exploration, the MCTS method reliably converges (in probability) to optimal policies. These latter will be applied by the conversational agent in the interaction with each specific user, to adapt both the sequence and the amount of questions to her/his personality profile and to select the message which is most likely to be effective.
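As an illustration of the selection rule only, the following sketch (hypothetical class and attribute names, not our implementation) shows the UCB1 formula used inside MCTS to balance exploitation of the average return of an action with exploration of rarely tried ones.

import math

class PolicyNode:
    """A node of the policy tree: one reachable (belief, history) situation."""

    def __init__(self, actions):
        self.children = {}                      # action -> PolicyNode
        self.visits = {a: 0 for a in actions}   # how often each action was tried
        self.returns = {a: 0.0 for a in actions}

    def ucb_action(self, c=1.4):
        """UCB1 selection: argmax_a mean(a) + c * sqrt(ln N / n(a))."""
        n_total = sum(self.visits.values()) or 1

        def score(a):
            if self.visits[a] == 0:
                return float("inf")             # try every action at least once
            mean = self.returns[a] / self.visits[a]
            return mean + c * math.sqrt(math.log(n_total) / self.visits[a])

        return max(self.visits, key=score)

    def backup(self, action, reward):
        """Propagate the return of a simulated play-out back into the node."""
        self.visits[action] += 1
        self.returns[action] += reward

In a full MCTS loop, each iteration would descend the policy tree with ucb_action, sample the interactant's answers from the DBN to simulate the remainder of the dialogue, and back up the reward derived from the change in Intention Change; the snippet above is only meant to clarify the exploitation/exploration trade-off.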
7 Conclusions and future work

In this work we explored the possibility of harnessing a complete and experimentally assessed SEM, developed in the field of persuasion psychology, as the basis for the reinforcement learning of a dialogue manager that drives a conversational agent whose task is inducing healthier nutritional habits in the interactant. The fundamental component of the method proposed is a DBN, which is derived from the SEM above and acts as a predictor for the belief state value in a POMDP. The main expected advantage is that, by doing so, the RL agent will not need a time-consuming period of training, possibly requiring the involvement of human interactants, but can be trained 'in house' – at least at the beginning – and be released in production at a later stage, once a first effective strategy has been achieved through the DBN. This method still requires experimental validation, which is the current objective of our working group.

Acknowledgments

The authors are grateful to Cristiano Chesi of IUSS Pavia for his revision of an earlier version of the paper and his precious remarks. We also acknowledge the fundamental help given by Rebecca Rastelli during her collaboration on this research.

References

Allen, J., Ferguson, G., & Stent, A. 2001. An architecture for more realistic conversational systems. In Proceedings of the 6th International Conference on Intelligent User Interfaces, 1-8. ACM.

Anderson, Ronald D. & Vastag, Gyula. 2004. Causal modeling alternatives in operations research: Overview and application. European Journal of Operational Research, 156, 92-109.

Auer, Peter & Cesa-Bianchi, Nicolò & Fischer, Paul. 2002. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47, 235-256.

Bandura, A. 1982. Self-efficacy mechanism in human agency. American Psychologist, 37, 122-147.

Baron, Robert A. & Byrne, Donn Erwin & Suls, Jerry M. 1989. Exploring Social Psychology, 3rd ed. Boston, Mass.: Allyn and Bacon.

Ben Gal, I. 2007. Bayesian networks. In Encyclopedia of Statistics in Quality and Reliability. John Wiley & Sons.

Bertolotti, M., Carfora, V., & Catellani, P. 2019. Different frames to reduce red meat intake: The moderating role of self-efficacy. Health Communication, 58, in press.

Carfora, V., Bertolotti, M., & Catellani, P. 2019. Informational and emotional daily messages to reduce red and processed meat consumption. Appetite, 141, 104331.

Cesario, J., Corker, K. S., & Jelinek, S. 2013. A self-regulatory framework for message framing. Journal of Experimental Social Psychology, 49, 238-249.

Chaslot, Guillaume & Bakkes, Sander & Szita, Istvan & Spronck, Pieter. 2008. Monte-Carlo Tree Search: A new framework for game AI. Bijdragen.

Dagum, Paul & Galper, Adam & Horvitz, Eric. 1992. Dynamic network models for forecasting. In Proceedings of the Eighth Conference on Uncertainty in Artificial Intelligence.

Dagum, Paul & Galper, Adam & Horvitz, Eric & Seiver, Adam. 1999. Uncertain reasoning and forecasting. International Journal of Forecasting.

De Waal, Alta & Yoo, Keunyoung. 2018. Latent variable Bayesian networks constructed using structural equation modelling. In 2018 21st International Conference on Information Fusion, 688-695.

Fabiani, Patrick & Teichteil-Königsbuch, Florent. 2010. Markov Decision Processes in Artificial Intelligence. Wiley-ISTE.

Gupta, Sumeet & Kim, Hee W. 2008. Linking structural equation modeling to Bayesian networks: Decision support for customer retention in virtual communities. European Journal of Operational Research, 190, 818-833.

Heckerman, David. 1995. A Bayesian Approach to Learning Causal Networks.

Higgins, E. T. 1997. Beyond pleasure and pain. American Psychologist, 52, 1280-1300.

Howard, Ronald A. 1972. Dynamic Programming and Markov Processes. The Mathematical Gazette, 46.

Kaelbling, Leslie Pack & Littman, Michael & Cassandra, Anthony R. 1998. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101, 99-134.

Kocsis, Levente & Szepesvári, Csaba. 2006. Bandit based Monte-Carlo planning. In Machine Learning: ECML 2006, 282-293. Springer Berlin Heidelberg.

Lai, T. L. & Robbins, Herbert. 1985. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6, 4-22.

Liu, Bing. 2018. Learning Task-Oriented Dialog with Neural Network Methods. PhD thesis.

Murphy, Kevin. 2012. Machine Learning: A Probabilistic Perspective. The MIT Press.

Oniśko, Agnieszka & Druzdzel, Marek J. & Wasyluk, Hanna. 2001. Learning Bayesian network parameters from small data sets: Application of Noisy-OR gates. International Journal of Approximate Reasoning, 27.

Pearl, Judea. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Francisco, California: Morgan Kaufmann.

Silver, David & Veness, Joel. 2010. Monte-Carlo planning in large POMDPs. Advances in Neural Information Processing Systems, 23, 2164-2172.

Spaan, Matthijs T. J. 2012. Partially observable Markov decision processes. In Reinforcement Learning: State of the Art, 387-414. Springer Verlag.

Sutton, Richard S. & Barto, Andrew G. 1998. Reinforcement Learning: An Introduction. The MIT Press.

Wright, Sewall. 1921. Correlation and causation. Journal of Agricultural Research, 20, 557-585.

Young, Steve & Gasic, Milica & Thomson, Blaise & Williams, Jason. 2013. POMDP-based statistical spoken dialog systems: A review. Proceedings of the IEEE, 101, 1160-1179.