Self-learning algorithms for the personalized interaction with people with dementia

Bram Steenwinckel

Supervisors: Prof. dr. ir. Filip De Turck, Dr. Femke Ongenae Counsellors: Dr. ir. Femke De Backere, Ir. Jelle Nelis

Master's dissertation submitted in order to obtain the academic degree of Master of Science in Computer Science Engineering

Department of Information Technology Chair: Prof. dr. ir. Bart Dhoedt Faculty of Engineering and Architecture Academic year 2016-2017

PREFACE

In the last couple of years, my interest in healthcare has grown through the several summer jobs I did within this sector. It was there that I saw the need for more computational aids in the battle against various diseases. This same need is noticeable in the problem description of this dissertation. Being able to design possible solutions was, therefore, a privilege, an honour and a motivator.

"Self-learning algorithms for the personalized interaction with people with dementia" has been written in order to obtain the academic degree of Master of Science in Computer Science Engineering at Ghent University. I was engaged in researching and writing this dissertation from September 2016 till June 2017. After an intensive period of eight months, the last words of this dissertation belong to this note of thanks. During this journey of intensive learning, many people supported and helped me to achieve the findings stated in this master thesis. Words of thanks are in order to these people.

I would first like to thank my supervisors, Prof. dr. ir. Filip De Turck and Dr. Femke Ongenae of the Department of Information Technology at Ghent University. Prof. De Turck gave valuable feedback on the different concepts in this dissertation and enlarged my knowledge of computer science during my full master programme. Dr. Ongenae was always available whenever I had a question about my research and steered me in the right direction whenever needed, during multiple interactive sessions.

I would also like to thank the researchers who shared additional knowledge about the different concepts discussed in this dissertation: Ir. Steven Bohez, Ir. Christof Mahieu and Ir. Jelle Nelis, who helped with the fundamental concepts of reinforcement learning, the Nao robotic interactions and the design of external sensors. Also, I would like to thank Jeroen Schaballie and Stijn De Pestel; my research would not have been possible without their help. I would particularly like to single out my counsellor, Dr. ir. Femke De Backere: thank you for all the provided feedback and the encouraging words when needed. I would also like to acknowledge a friend, Joris Heyse, who tested my designed application. We were not only able to support each other by deliberating over our problems and findings on reinforcement learning techniques, but also, happily, by talking about things other than just our papers.

Finally, I must express my very profound gratitude to my parents, my sister Lien and my lovely Laura for providing me with unfailing support and continuous encouragement throughout my years of study and through the process of researching and writing this dissertation. This accomplishment would not have been possible without you. Thank you.

Bram Steenwinckel

The author gives permission to make this master dissertation available for consultation and to copy parts of this master dissertation for personal use. In the case of any other use, the copyright terms have to be respected, in particular with regard to the obligation to state expressly the source when quoting results from this master dissertation.

Gent, 2 July 2017.

Self-learning algorithms for the personalized interaction with people with dementia

Supervisors: Prof. dr. ir. Filip De Turck, Dr. Femke Ongenae
Counsellors: Dr. ir. Femke De Backere, Ir. Jelle Nelis
Department of Information Technology
Chair: Prof. dr. ir. Bart Dhoedt
Faculty of Engineering and Architecture
Academic year 2016-2017

Abstract: The number of people with dementia (PwD) residing in nursing homes (NH) increases rapidly. Behavioural disturbances (BDs) such as wandering and aggression are the main reasons to hospitalise these people. Social robots could help to resolve these BDs by performing simple interactions with the patients. The WONDER project investigates the necessary functionality to have such a robot autonomously walking from one resident to another, each time engaging in a personalised interaction. This paper examines whether self-learning algorithms can be designed to select the robotic interactions preferred by the patients during these interventions. K-armed bandit algorithms were compared in simulated environments for single and multiple patients to find the most beneficial learning agents and action selection policies. The single-patient tests show the advantages of selecting actions according to a UCB policy, while the multi-patient tests analyse the benefits of using additional, contextual information. Afterwards, the learning application was provided with a framework to operate in more realistic situations. Tests with real PwD will still be needed before this learning application can be integrated within the full WONDER system.

Keywords: Robot-Assisted Intervention, People with Dementia, Personalised interaction, Bandit algorithms

Self-learning algorithms for the personalized interaction with people with dementia

Bram Steenwinckel

Supervisors: Prof. dr. ir. Filip De Turck, Dr. Femke Ongenae
Counsellors: Dr. ir. Femke De Backere, Ir. Jelle Nelis

Abstract— The number of people with dementia (PwD) residing in nursing homes (NH) increases rapidly. Behavioural disturbances (BDs) such as wandering and aggression are the main reasons to hospitalise these people. Social robots could help to resolve these BDs by performing simple interactions with the patients. The WONDER project investigates the necessary functionality to have such a robot autonomously walking from one resident to another, each time engaging in a personalised interaction. This paper examines whether self-learning algorithms can be designed to select the robotic interactions preferred by the patients during these interventions. K-armed bandit algorithms were compared in simulated environments for single and multiple patients to find the most beneficial learning agents and action selection policies. The single-patient tests show the advantages of selecting actions according to a UCB policy, while the multi-patient tests analyse the benefits of using additional, contextual information. Afterwards, the learning application was provided with a framework to operate in more realistic situations. Tests with real PwD will still be needed before this learning application can be integrated within the full WONDER system.

Keywords— Robot-Assisted Intervention, People with Dementia, Personalised interaction, Bandit algorithms

I. Introduction

Worldwide, almost 44 million people have dementia-related diseases, with the highest prevalence in Western Europe [1]. Approximately 43% of these people with dementia (PwD) are staying in nursing homes (NH), rising to 76% of those with advanced dementia [2]. Besides the amnesia, all these PwD suffer from so-called behavioural disturbances (BDs) like mood disorders, hallucinations, wandering and aggression. Pharmacological interventions are used only for acute situations in the management of these BDs because these treatments do not address the underlying psychosocial reasons and may have adverse side effects [3]. Many different non-pharmacological therapies are designed to resolve specific BDs by interacting with the PwD, without the harmful effects of medical interventions [4]. However, these therapy sessions are more time consuming, and due to the increased strain on the available resources within healthcare, many NH avoid them.

Robots can help the nursing staff to alleviate several BDs, delivering the same benefits as the non-pharmacological therapies while reducing the burden and stress of the caregivers. Many different studies investigated the effects of Robot-Assisted Therapies (RAT) on these PwD, and neuropsychiatric symptoms tend to improve when robots are involved in simple interactions, for example storytelling, singing a song or performing a dance [5–8]. However, these robots only interact in a preprogrammed manner, and the nursing staff still needs to install and control these therapy sessions [8].

Caregivers are convinced about these robotic interactions but suggested robot-assisted interventions rather than therapy sessions. IMEC currently designs such a robotic intervention system in its WONDER project, where Zora, a care application built on top of the Nao robot [9], could be integrated into the daily care processes for the prevention and alleviation of BDs. WONDER will research the necessary functionality to have the robot autonomously walking from one resident to another, each time engaging in a personalised interaction, for example playing a favourite song or asking questions about memorable events in the lifetime of the PwD [10]. The main idea behind these interactions is that Zora will generate stimuli to elicit personal memories with associated positive feelings that have calming and reassuring effects on these PwD.

Fig. 1. Conceptual architecture of the WONDER project

The WONDER system lets the care coordinator create profiles with personal information about the residents, together with organisational data such as the NH map and activity timetables. Combined with information provided by the available sensors in the NH and the customised wearables of the patients, a per-resident intervention strategy is determined. In acute situations, the robot can be sent immediately to distract the PwD temporarily. Pro-active interventions are scheduled collaboratively over multiple robots throughout the day and night, taking into account the limitations of the robot.

When the robot operates entirely autonomously, the effects of the elicited personal memories should be analysed to make sure the executed personalised interaction had a positive impact on the PwD. There could, for instance, be several interactions which have a negative or no effect on a patient, and such interactions should be avoided. The influence of the actions can change over time and is different for every patient. The robot should somehow be able to 'learn' which interactions are the most preferable, for multiple situations.

The aim of this work is to investigate how a learning system can be built to determine which action should be executed to alleviate a particular BD for a specific PwD. The learning algorithms are based on reinforcement learning, and Section II gives a summary of this learning technique. In Section III, the results of Section II are used to design the problem-specific bandit algorithms.
Several simulations with virtual patients and a virtual robot were performed to investigate the performance of these bandit algorithms. The learning components were optimised and eventually surrounded by a framework to perform more practical tests, with real people and a real robot. Section IV gives more details about the simulator and the designed framework. The results of the tests, for different situations and for both the simulated and the more realistic environment, are discussed in Section V.

II. Background

The process resulting in positive interactions should learn similarly to how we, humans, do. When we learn to ride a bike, no clear description is given of how we should perform. We just try and, more than likely, we fall or at least stop abruptly and have to catch ourselves. The stumbling continues until we get some little success after riding for a few meters, before we fall again. During this learning process, the feedback signals that tell us how well we did, either pain or reward, are generated by our brain and by how the environment, for instance our parents, reacts during this process [11]. This feedback is considered "reinforcement" for doing or not doing a particular small action before receiving a much larger reward. The same technique is applicable in the field of robotics and is there more commonly known as reinforcement learning (RL). RL belongs to the area of machine learning research. In its simplest definition, RL is learning the best actions based on reward or punishment and is, therefore, frequently used in robotic applications [12].

Fig. 2. Top: bandit problem, where only one action affects the reward. Middle: contextual bandit problem, where state and action affect the reward. Bottom: full reinforcement learning problem, where actions affect multiple states, and rewards may be delayed in time [15]

A. Reinforcement Learning

There are three basic concepts in RL: state, action and reward. The state describes the current situation. In the case where the robot should select the most appropriate action, the state will reflect the status of the patient. An action is everything the robot can do in a particular state. The robot will have a fixed amount of possible interactions, but different RL techniques exist to deal with infinite action spaces. When the robot executes an action in a state, it receives a reward when the intervention finishes. The term "reward" describes the feedback from the environment and can be positive or negative. A positive feedback corresponds to the usual meaning of reward; when the feedback is negative, it corresponds to what is usually called a "punishment". The interaction between state, action and rewards is simple and straightforward: once the state is known, an action is chosen that, hopefully, leads to a positive reward [13]. Sutton and Barto [14] described these different RL concepts more in depth.

The full RL paradigm is interested in the long-term reward after several actions were taken. In this full reinforcement problem, every action influences the next states, and different actions must be taken to receive the positive rewards. Other difficulties can arise when there is an infinite action space or when the states change during the action selection procedure.

In the problem of determining the patient's action preferences, the concepts of this full RL can be simplified, because the direct result of executing a single action already gives enough information to determine the action preferences. Another simplification is the execution of one action per intervention, which avoids the occurrence of multiple states. The simplest version of this problem bypasses all these state concepts and learns the link between the executed actions and the possible benefits of the interactions for each patient personally. Such a simplified RL problem is known as a bandit problem.
B. (Contextual) Bandit problem

When an agent has to act in only one situation, the analogy with a simple casino slot machine, where a gambler tries to maximise his or her revenue, is easily made. Consecutive handle pulls will eventually reveal some knowledge about the probability distribution of the game, and based on this gathered information, the gambler can decide which lever to pull next. Because, in the end, the casino always wins, these slot machines are more commonly known as bandits.

The problem of determining the patient's intervention preferences is similar to such a bandit problem. When an intervention is needed, multiple actions can be chosen, and the goal is to get positive feedback from the patient. The probability distributions of how the patient reacts to these interactions are not known and stay hidden inside the 'bandit'. The possibility to select between multiple actions is more commonly described by k-armed bandits [14].

A tuple of k actions and the unknown probability distributions over the rewards, R_a(r) = P[r | a] with a an action, defines a k-armed bandit problem. At each time step t, the agent selects an action a_t ∈ A, and the reward is generated according to the selected action. The goal of a k-armed bandit algorithm is to select those actions which maximise the cumulative reward \sum_{t=1}^{\tau} r_t, with r_t the received reward in intervention t.

Because the probability distributions of the rewards are unknown, the value of each action should be estimated. The most optimal value for this problem is denoted by v* and follows directly from selecting the best possible action. The total regret score, which is the total difference of expectations between the optimal interaction and the chosen action in each time step, expresses the performance of a bandit algorithm. The optimal action must be known a priori to calculate these regret scores, but this will never be the case in practice. Therefore, the bandit algorithm will have to explore the action space to find the best actions.

Exploring the action space, and deciding when to exploit the current best action, is one of the main tasks within a bandit problem. Several exploration-exploitation tactics are available to balance between searching for the optimal actions and exploiting the best one, but the best tactic depends on the problem. Each of these so-called policies estimates the value of the currently chosen action in every intervention.

RL algorithms usually benefit from environmental information during the decision-making procedure. Many patients will react differently to the robotic interventions, and a single bandit algorithm will not differentiate between the patients. Learning the benefits of the interactions for every patient separately can solve this issue, but patients can have many different BDs, and these can occur in various situations. Separating all these different situations per patient leads to a fully personalised learning system, but the learning effort grows largely without guaranteeing that the learning phase converges. A second solution incorporates contextual information in the learning phase and tries to learn globally over multiple patients, using some easily definable characteristics. Studies showed there is a link between the different types of dementia and the BDs [16]. A bandit algorithm can use these links to learn over multiple patients, by using a single start state defining the useful additional information of the patient. Algorithms trying to solve this problem are called contextual bandit algorithms, and situations where such contextual information cannot be utilised efficiently are rare in practice.
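To make the k-armed bandit setting and the regret measure concrete, the following minimal Python sketch (not part of the dissertation's implementation) simulates a bandit with hidden Bernoulli reward probabilities and tracks the cumulative regret of a plugged-in selection rule. The reward probabilities, the sample-average estimates and the greedy rule in the demo are illustrative assumptions; any of the policies of Section III-B can be substituted.

    import random

    class KArmedBandit:
        """Simulated k-armed bandit: each arm pays a Bernoulli reward with a
        success probability that stays hidden from the learning algorithm."""

        def __init__(self, probs):
            self.probs = probs              # hidden reward distributions, one per action
            self.best = max(probs)          # expected reward of the optimal action

        def pull(self, arm):
            return 1.0 if random.random() < self.probs[arm] else 0.0

    def run(bandit, select_action, n_interventions=100):
        """Play the bandit and accumulate the regret, i.e. the expected reward
        lost by not always choosing the optimal action."""
        k = len(bandit.probs)
        estimates, counts, regret = [0.0] * k, [0] * k, 0.0
        for t in range(1, n_interventions + 1):
            arm = select_action(estimates, counts, t)
            reward = bandit.pull(arm)
            counts[arm] += 1
            estimates[arm] += (reward - estimates[arm]) / counts[arm]   # sample-average update
            regret += bandit.best - bandit.probs[arm]
        return estimates, regret

    if __name__ == "__main__":
        # Four hypothetical interventions, one clearly preferred (the 'optimistic' case).
        env = KArmedBandit([0.2, 0.3, 0.8, 0.25])
        greedy = lambda q, counts, t: max(range(len(q)), key=lambda a: q[a])
        print(run(env, greedy))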
III. Bandit design

Before the preferred actions can be learned, bandit algorithms should be designed which can act upon the received feedback. Such algorithms are composed of an agent, a policy and a reward signal.

A. Agent

The agent will control and execute the actions but has no capability to decide which actions should be performed. It gathers the knowledge from the environment after an intervention finishes and provides it to the learning mechanism of the bandit algorithm. The collected knowledge is usually received in the form of one or multiple reward signals. This study analysed three different types of agents.

Normal agent: The normal agent, described by Sutton and Barto [14], uses the reward signals to represent the feedback directly. For each action, the mean value of the unknown probability distribution is updated with every new intervention. Based on these mean values, the algorithm can determine which action generates the highest reward.

Gradient agent: The normal agent is beneficial when the differences between the actions are significant. When the reward signals of different interactions are close to each other, the agent will alternate between these actions, and this could be confusing for the PwD. To cope with these small reward differences, a gradient agent will try to learn the relative difference between the actions, instead of determining estimates of the rewards. By doing this, it can efficiently determine a preference of one action over another. The implemented gradient agent updates the preference for an action in every observation [14]. In observation t, the agent selects action a with probability e^{Q[t,a]/τ} / Σ_a e^{Q[t,a]/τ}, where τ > 0 is the temperature specifying how randomly values should be chosen and Q[t,a] is the action preference of action a at timestep t. When τ is high, the actions are chosen in almost equal amounts. As the temperature is reduced, the highest-valued actions are more likely to be selected and, in the limit when τ goes to zero, the best action is always chosen. The average of all the rewards until observation t acts as a baseline. When a newly received reward is higher than this baseline, the probability of taking action a increases; when it is below this baseline, the probability decreases respectively.
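A minimal Python sketch of such a gradient agent is given below, following the softmax selection with temperature τ and the average-reward baseline described above; the step size alpha and the default parameter values are assumptions for illustration, not the dissertation's actual settings.

    import math
    import random

    class GradientAgent:
        """Gradient (softmax) bandit agent: learns relative action preferences
        instead of reward estimates, using the running average reward as baseline."""

        def __init__(self, n_actions, alpha=0.1, tau=1.0):
            self.preferences = [0.0] * n_actions   # Q[t, a] in the text
            self.alpha = alpha                     # preference step size (assumed)
            self.tau = tau                         # softmax temperature
            self.baseline = 0.0                    # average of all rewards so far
            self.t = 0

        def probabilities(self):
            exps = [math.exp(p / self.tau) for p in self.preferences]
            total = sum(exps)
            return [e / total for e in exps]

        def select(self):
            # Sample an action according to the softmax distribution.
            r, acc = random.random(), 0.0
            for action, p in enumerate(self.probabilities()):
                acc += p
                if r <= acc:
                    return action
            return len(self.preferences) - 1

        def observe(self, action, reward):
            # Raise the preference of the chosen action when the reward beats the
            # baseline, lower it otherwise; the other actions move the opposite way.
            self.t += 1
            self.baseline += (reward - self.baseline) / self.t
            probs = self.probabilities()
            for a in range(len(self.preferences)):
                if a == action:
                    self.preferences[a] += self.alpha * (reward - self.baseline) * (1 - probs[a])
                else:
                    self.preferences[a] -= self.alpha * (reward - self.baseline) * probs[a]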
Contextual agent: The contextual agent uses several patient characteristics to make a prediction of the probability distributions in each intervention. This agent is based on the contextual abilities of Vowpal Wabbit, an open source package for efficient, scalable implementation of machine learning techniques [17].

B. Policy

Policies determine which action should be executed. They are responsible for letting the agent explore its action space before exploiting the best possible interactions. This problem is more commonly known as the exploration-exploitation tradeoff [14]. Six different policies were analysed.

Random policy: Select the actions at random for every intervention. This policy will only be beneficial when there is no task to learn, but it gives a lower bound for the other solutions.

Greedy policy: At every intervention, there is at least one action whose estimated reward value is the greatest. A Greedy policy will always select this action.

Epsilon-Greedy policy: While the random policy keeps exploring and the Greedy policy starts exploiting directly, it could be beneficial to combine these two policies. The Epsilon-Greedy algorithm selects a random action with probability ϵ and the current best action with probability 1 − ϵ. Any value 0 < ϵ < 1 trades off the exploration and exploitation of the interactions.

Decaying Epsilon-Greedy policy: The Epsilon-Greedy policy has the disadvantage of exploring forever with a predefined ϵ > 0. The first phase of the learning process could need more exploration, while later on, always exploiting the best actions is required. It could be beneficial to start with ϵ = 1 and decay ϵ by a small amount after every intervention. A predefined step-size parameter lowers the ϵ value systematically until it reaches a lower bound. If this lower bound equals zero, the policy acts greedily from that point on.

Upper Confidence Bound policy: Exploring the non-optimal actions according to their potential for actually being optimal, taking into account the uncertainties in those estimates, can be more beneficial. One effective way of doing this is to select actions according to the following equation:

    a_t = \arg\max_a \left[ q_t(a) + c \sqrt{\frac{\log t}{N_t(a)}} \right],    (1)

with q_t(a) the preference of action a at timestep t, log t the natural logarithm of t, N_t(a) the number of times action a has been selected up to time step t and c > 0 the degree of exploration, defining the confidence level. The quantity being maximised is an upper bound on the possible actual preference of action a. Each time action a is selected, the uncertainty of its choice is reduced; when an action different from a is selected, the uncertainty of action a increases. The Upper Confidence Bound (UCB) policy selects actions according to Equation 1.

Contextual policy: It can be beneficial to use the predictions based on the contextual information to determine the action preferences. The patient's characteristics are given as input to this policy, and it outputs the expected reward for each action. The most presumable action is then selected to be executed. This policy can only be used together with a contextual agent, because the other agents do not have the predictive capacity to determine the action probabilities given the patient's characteristics.
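The two most important selection rules can be expressed in a few lines of Python, assuming sample-average value estimates q and per-action selection counts. The default parameter values are placeholders; c = 0.01 is the exploration degree later used in the robotic test of Figure 7.

    import math
    import random

    def epsilon_greedy(q, t, epsilon=0.1, decay=0.0, floor=0.0):
        """Explore with probability epsilon (optionally decayed with the
        intervention number t), otherwise exploit the highest estimate."""
        eps = max(floor, epsilon - decay * t)
        if random.random() < eps:
            return random.randrange(len(q))
        return max(range(len(q)), key=lambda a: q[a])

    def ucb(q, counts, t, c=0.01):
        """Upper Confidence Bound selection (Equation 1): pick the action whose
        estimate plus uncertainty bonus is highest; untried actions come first."""
        for action, n in enumerate(counts):
            if n == 0:
                return action
        return max(range(len(q)), key=lambda a: q[a] + c * math.sqrt(math.log(t) / counts[a]))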
C. Reward

The reward signal should represent the positive or adverse effects of the executed action during an intervention. The Naoqi framework, which operates on the Nao robot, has some useful modules to perform sentiment analyses on both the vocal and visual data captured through its sensors [18]. The Naoqi framework can detect facial expressions, and these values were used to design a reward signal.

The facial expression analysis by the Naoqi module returns a confidence score between 0 and 1, indicating how likely an estimation is, for each of the following five categories: neutral, happy, surprised, angry or sad. During an intervention, the facial expressions of the patient are analysed multiple times, and the corresponding confidence values are saved. When the intervention has finished, a service personalisation algorithm is used to determine whether the executed action had a positive effect on the PwD. Khosla et al. [19] designed such a service personalisation algorithm for analysing song preferences based on facial expressions using a social robot. Their algorithm was adapted in this research to return an appropriate reward signal based on all the facial expression data captured during an intervention.

The facial expression values were divided into two groups. The first group collected all the neutral and happy facial expressions and the second group gathered the other three expression types. Equation 2 calculates the frequency of occurring expressions in both groups:

    \tilde{f} = \begin{bmatrix} f_\oplus \\ f_\ominus \end{bmatrix} = \begin{bmatrix} n_\oplus / T \\ n_\ominus / T \end{bmatrix},    (2)

with n_\oplus and n_\ominus the number of positively, respectively negatively, detected expressions and T the total amount of registered expressions. While the frequencies give more knowledge about the occurrences of the expressions, the amount of positivity or negativity should be relevant as well. The energy of the recorded expressions in both the positive and the negative group is calculated using Equation 3:

    m_e = \begin{cases} 0, & \text{if } n = 0 \\ \left( \sum_{1}^{n} e \right) / n, & \text{if } n > 0 \end{cases},    (3)

with e the values of a captured facial expression according to the associated group. For example, in the positive group, only the values for the happy and neutral categories are combined to calculate the total energy of this group.

    R = \frac{f_\oplus m_\oplus + f_\ominus m_\ominus}{f_\oplus + f_\ominus}    (4)

Equation 4 calculates the reward signal using the energies and frequencies of both groups.
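The reward construction of Equations 2-4 can be sketched as follows. The record format of the captured data and the rule that assigns a sample to the positive or negative group by its dominant category are assumptions of this sketch, and Equation 4 is applied literally as stated above.

    def reward_signal(observations):
        """Combine the facial-expression confidence scores captured during one
        intervention into a single reward value (Equations 2-4).

        `observations` is assumed to be a list of dicts mapping the five Naoqi
        categories (neutral, happy, surprised, angry, sad) to confidences in [0, 1]."""
        positive_categories = ("neutral", "happy")
        pos, neg = [], []
        for sample in observations:
            dominant = max(sample, key=sample.get)          # most confident category
            (pos if dominant in positive_categories else neg).append(sample[dominant])

        total = len(observations)
        if total == 0:
            return 0.0
        f_pos, f_neg = len(pos) / total, len(neg) / total    # frequencies (Equation 2)
        m_pos = sum(pos) / len(pos) if pos else 0.0          # group energies (Equation 3)
        m_neg = sum(neg) / len(neg) if neg else 0.0
        denominator = f_pos + f_neg
        return (f_pos * m_pos + f_neg * m_neg) / denominator if denominator else 0.0   # Equation 4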

IV. Application

The designed application combines a single agent together with a policy to learn from the developed reward signal. The policy first selects the action to be executed, and this command is sent to the robot. When the intervention finishes, the agent receives the reward and observes it by comparing its performance according to the executed action.

Various policies can be used with different agents, and many different tests are needed to find the best combinations. Testing these bandit algorithms directly on PwD could result in stress and was therefore avoided. A simulated application was designed to mimic the behaviour of these people, and many tests were executed with this simulator. How a PwD reacts to a robotic action is, however, unknown. The simulator therefore analysed three different reaction strategies. One strategy defines one action which has a highly positive effect on the PwD in comparison with the other actions. Another strategy has four similar action effects, and the last strategy defines four actions having an adverse effect on the PwD, but with one action slightly less bad. These strategies resemble an optimistic, neutral and worst case scenario respectively. Several tests, in which the action preferences for a single and for multiple virtual patients must be determined, compared these three different cases.

Fig. 3. Robotic application overview

Figure 3 shows the designed application, which should eventually be used inside the WONDER project. The application aims to react upon BD events. When an intervention is needed, the application is signalled using such events. The application will then first register several modules from the Nao robot to enable the communication. The bandit package can determine the most beneficial action for this PwD, based on the historical data in the Wonder database. The robot executes the chosen action, and facial expression data is sent to a database during the intervention for further analyses. When the intervention finishes, the bandit algorithm gathers all this stored data and builds a reward signal to learn the influence of the executed action. The agent observes this feedback, and the preferences are updated for further interventions.

The most promising algorithms from the simulated tests were used in a more realistic setting. During these practical tests, a real person mimicked expressions in front of a Nao robot using the designed application in Figure 3.

V. Results

Three different test approaches were used to investigate whether the designed bandit algorithms can be used for learning the action preferences of PwD. A first test case examines the effects of the bandit algorithms on a single patient in a simulated environment. The second test case analyses the bandit algorithms on multiple patients, again in a simulated environment, using contextual information. The last test case uses the designed application of Figure 3 to test the action preferences of two real persons, using a real robot.

A. Single patient simulations

The normal and gradient agents are compared, together with the five possible policies. Results are visualised using three different plots, describing the average amount of reward the bandit algorithms received over multiple interventions, the number of optimal actions selected during the interventions and the cumulative regret score over multiple interventions.

Tests for an optimistic case were executed for 20 randomly generated virtual patients. Four actions could influence the mood of a virtual patient during 100 consecutive interventions. The results in Figure 4 (a) show various bandit algorithms after performing 100 such experiments. The random policy illustrates the effect of randomly selecting actions and acts as a baseline for the other bandit algorithms. The UCB policy can learn the preferred action after four interventions and has a low optimal regret score. Both the gradient and the normal agent act similarly using this UCB policy. A Greedy policy in combination with a gradient agent has almost the same average reward values but frequently selects sub-optimal actions.

The same tests were executed in a neutral and a worst case situation. The neutral case is shown in Figure 4 (b). The similarity between the actions changes the action selection strategy: in only 50% of the situations the optimal actions were selected. The algorithms are still better than the random policy, which selects the best action one time out of four. Figure 4 (c) shows the worst case situation. The average reward values are very low. Despite this, most bandit algorithms succeeded in detecting the preferred actions and tried to exploit them.

Fig. 4. Overview of the bandit algorithm tests for the (a) optimistic, (b) neutral and (c) worst case when a single patient is considered.

B. Simulation - Multiple patients

During the multi-patient tests, contextual information from the virtual patients can be used to determine the action preferences. These tests try to learn the action preferences for multiple patients; in each intervention, the virtual patient can now be different. The tests examine the benefits of using contextual information instead of learning the action preferences separately for each patient. Figure 5 summarises the results for 100 experiments of 200 consecutive interventions each. The gradient and normal agents are still individualised for all these tested patients and need more time to reach the global optimum. A contextual agent clearly benefits from the additional available information: the optimal actions are selected more frequently over the whole duration of these tests. Similar tests for the neutral and worst case scenarios were executed and behaved similarly to the described tests for a single patient.

Fig. 5. Overview of the bandit algorithm tests for the optimistic case when multiple patients are considered.

Behavioural patterns of patients often change, and several tests examined the effect of these changes on the bandit algorithms. Figure 6 shows the results of the test in which the behavioural pattern of multiple virtual patients changes randomly at the 200th intervention. All the bandit algorithms can detect this change, but the contextual agent using a contextual policy can recover easily from such situations using the additional contextual information. The other bandit algorithms have more difficulties recovering from this abrupt change.

Fig. 6. Overview of the bandit algorithm tests when a change in behaviour occurs for multiple patients during the 200th timestep.
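To make the difference between fully individualised learning and context-shared learning concrete, the following table-based sketch keeps one set of estimates per patient context (for example, the dementia type), so that patients with the same characteristics share experience. It is a simplified stand-in for the Vowpal Wabbit based contextual agent used in the thesis, with an epsilon-greedy rule and parameter values chosen only for illustration.

    import random
    from collections import defaultdict

    class TabularContextualAgent:
        """Keeps sample-average reward estimates per (context, action) pair, so
        new patients with a known context start from the shared estimates
        instead of from scratch."""

        def __init__(self, n_actions, epsilon=0.1):
            self.n_actions = n_actions
            self.epsilon = epsilon
            self.values = defaultdict(lambda: [0.0] * n_actions)
            self.counts = defaultdict(lambda: [0] * n_actions)

        def select(self, context):
            # Explore occasionally, otherwise exploit the best action for this context.
            if random.random() < self.epsilon:
                return random.randrange(self.n_actions)
            q = self.values[context]
            return max(range(self.n_actions), key=lambda a: q[a])

        def observe(self, context, action, reward):
            self.counts[context][action] += 1
            n = self.counts[context][action]
            self.values[context][action] += (reward - self.values[context][action]) / n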

C. Real robotic test

Results of the tests which analysed the correct functioning of the designed application, using a real Nao robot and a real person, are shown in Figure 7 (a) and Figure 7 (b). Two different individuals without a dementia-related disease mimicked several facial expressions in front of the robot, which announced actions according to a UCB policy in combination with the gradient agent. Four different actions were possible, one per predefined mimicked expression. The test persons were respectively 24 and 27 years old, both male.

In both tests, the robot could easily detect the action preferences after 20 interventions. While both test persons had similar facial expressions, the UCB policy could easily differentiate between the small differences and exploit the optimal action after a limited number of interventions. When the certainty of an optimal action lowers due to less convinced expressions, the policy shifts its selection procedure. The last two interventions in Figure 7 (a) show such a change in preference, because the reward values became too low for the current optimal action to be considered beneficial.

Fig. 7. Robotic test with 20 interventions and a UCB policy with c=0.01 for (a) the first test person (duration of a single event: 30s) and (b) the second test person (duration of a single event: 10s).

VI. Conclusion

This paper investigated whether self-learning algorithms could be designed for the personalised interaction with PwD when a particular BD occurs and needs to be alleviated. RL techniques were used to determine the action preferences of the patients and to select the action which resulted in the highest reward. The less complex bandit algorithms can be applied when the different state occurrences are ignored. Three different bandit agents were designed, of which one agent could benefit from the available contextual information. Several different bandit policies were compared to find the best balance between exploring and exploiting the action space. The facial expressions of the patients, analysed by the Nao robot, were used to provide a feedback signal for these bandits. Tests in a simulated environment with a single virtual patient showed the advantages of using a UCB policy, which could determine the preferred actions quickly. During multi-patient tests in the same simulated environment, the patient-specific information gave some clear benefits for learning globally; the contextual agent using a contextual policy could therefore recover quickly from changes in behavioural patterns. In a more realistic setting, with a real robot and real people, tests showed the correct functioning of both the learning algorithm and the designed framework. Further tests on real PwD will be needed to optimise the developed learning application further before it can be integrated within the WONDER system. The interactions between the robot and the patient could also be used to detect changes in behaviour, by searching for additional links between the patient's expressions and his or her contextual information.

References

[1] Alzheimer.net, "Alzheimer's Statistics," 2016.
[2] An Vandervoort, Lieve Van den Block, Jenny T. van der Steen, Ladislav Volicer, Robert Vander Stichele, Dirk Houttekier, and Luc Deliens, "Nursing Home Residents Dying With Dementia in Flanders, Belgium: A Nationwide Postmortem Study on Clinical Characteristics and Quality of Dying," Journal of the American Medical Directors Association, vol. 14, no. 7, pp. 485–492, Jul. 2013.
[3] Fightdementia, "Alzheimer's Australia | Drugs used to relieve behavioural & psychological symptoms of dementia."
[4] Jiska Cohen-Mansfield, "Nonpharmacologic Interventions for Inappropriate Behaviors in Dementia," Am J Geriatr Psychiatry, vol. 94, no. 9, pp. 361–381, 2001.
[5] Kazuyoshi Wada, Takanori Shibata, Toshimitsu Musha, and Shin Kimura, "Effects of robot therapy for demented patients evaluated by EEG," in 2005 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2005.
[6] Meritxell Valenti Soler, Luis Agüera-Ortiz, Javier Olazaran Rodriguez, Carolina Mendoza Rebolledo, Almudena Pérez Muñoz, Irene Rodriguez Pérez, Emma Osa Ruiz, Ana Barrios Sanchez, Vanesa Herrero Cano, Laura Carrasco Chillon, Silvia Felipe Ruiz, Jorge Lopez Alvarez, Beatriz Leon Salas, José María Cañas Plaza, Francisco Martin Rico, and Pablo Martínez-Martín, "Social robots in advanced dementia," Frontiers in Aging Neuroscience, vol. 7, 2015.
[7] Kaoru Inoue, Naomi Sakuma, Maiko Okada, Chihiro Sasaki, Mio Nakamura, and Kazuyoshi Wada, "Effective application of PALRO: A humanoid type robot for people with dementia," in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2014, vol. 8547 LNCS, pp. 451–454.
[8] Francisco Martín, Carlos E. Agüero, José M. Cañas, Meritxell Valenti, and Pablo Martínez-Martín, "Robotherapy with dementia patients," International Journal of Advanced Robotic Systems, 2013.
[9] Zorarobotics, "Welcome to Zorabots - we make your life easier!".
[10] Imec, "Project: WONDER".
[11] Nikhil Buduma, "Deep Learning in a Nutshell 3," Blog, vol. 2, no. December, pp. 1–7, 2014.
[12] Jens Kober, J. Andrew Bagnell, and Jan Peters, "Reinforcement learning in robotics: A survey," The International Journal of Robotics Research, vol. 32, no. 11, pp. 1238–1274, 2013.
[13] Junling Hu, "Reinforcement learning explained," 2016.
[14] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2013.
[15] Arthur Juliani, "Simple Reinforcement Learning with Tensorflow Part 1.5: Contextual Bandits," 2016.
[16] Ming-Jang Chiu, Ta-Fu Chen, Ping-Keung Yip, Mau-Sun Hua, and Li-Yu Tang, "Behavioral and Psychologic Symptoms in Different Types of Dementia," Journal of the Formosan Medical Association, vol. 105, no. 7, pp. 556–562, 2006.
[17] J. Langford, L. Li, and A. L. Strehl, "Vowpal Wabbit (fast online learning)," 2007.
[18] Aldebaran-Robotics, "NAOqi Framework," 2016.
[19] Rajiv Khosla, Khanh Nguyen, Mei-Tai Chu, and Yu-Ang Tan, "Robot Enabled Service Personalisation Based On Emotion Feedback," in Proceedings of the 14th International Conference on Advances in Mobile Computing and Multi Media (MoMM '16), pp. 115–119, 2016.

CONTENTS

Preface

Abstract

Extended abstract

List of Figures

List of Tables

List of abbreviations

1 Introduction
1.1 Research objective
1.2 Outline

2 Literature Study
2.1 Dementia
2.1.1 Dementia types
2.1.2 Behavioural disturbances (BDs)
2.1.3 Dementia Outcome Measures
2.1.4 Non-pharmacological treatments
2.2 Social Robotics
2.2.1 Robot-Assisted Therapy (RAT)
2.2.2 Robot interventions
2.3 Artificial Intelligence
2.3.1 Bayesian network
2.3.2 Expert system
2.3.3 Machine Learning
2.3.4 Comparison

3 Reinforcement Learning
3.1 General Concepts
3.1.1 Conceptual details
3.1.2 Solving the reinforcement learning problem
3.2 Model-based solutions
3.2.1 Policy iteration
3.2.2 Value iteration
3.3 Model-free solutions
3.3.1 Model-free prediction
3.3.2 Model-free control
3.4 Bandits
3.4.1 The exploration-exploitation paradigm
3.4.2 Contextual bandits
3.5 Deep Reinforcement Learning
3.5.1 Meta reinforcement learning
3.6 Main findings

4 The WONDER Project
4.1 Project background
4.2 Contribution of dissertation

5 Scenarios
5.1 General scenario
5.1.1 Actions
5.2 Screaming behaviour
5.3 Wandering behaviour
5.4 Advanced wandering behaviour
5.5 Unwanted visiting behaviour
5.6 Multi-action intervention

6 Bandit Design
6.1 Bandit agents
6.1.1 Normal agent
6.1.2 Gradient agent
6.1.3 Contextual agent
6.2 Policy implementations
6.2.1 Random policy
6.2.2 Greedy policy
6.2.3 Epsilon-greedy policy
6.2.4 Decaying Epsilon-greedy policy
6.2.5 Upper Confidence Bound policy
6.2.6 Contextual policy
6.3 Reward signal
6.3.1 Sensors
6.3.2 Reward design
6.4 Linked bandits
6.5 Main findings

7 Implementation
7.1 Bandit application
7.1.1 Application packages
7.1.2 Analytical packages
7.1.3 Data storage
7.2 Framework
7.2.1 Simulation
7.2.2 Hardware

8 Simulation
8.1 Architectural design
8.1.1 Bandit package
8.1.2 Environment
8.1.3 Virtual patient
8.1.4 Virtual robot
8.1.5 Simulator
8.2 Simulation tests
8.2.1 Overview
8.2.2 Single Patient tests
8.2.3 Multi-Patients tests
8.3 Conclusion

9 Learning Application
9.1 Architectural design
9.1.1 Application
9.1.2 Wonder module
9.1.3 Event Manager
9.1.4 BanditSetup
9.1.5 Bandit package
9.2 Robot tests
9.2.1 Gradient agent performance
9.3 Conclusion

10 Conclusion and Future Research
10.1 Summary
10.2 Conclusion
10.3 Recommendations for further research

Bibliography

Appendices

LIST OF FIGURES

2.1 Dementia types
2.2 Different Behavioural Disturbances
2.3 Non-pharmacological Interventions
2.4 Three different types of social robots

3.1 Reinforcement Learning scheme
3.2 Reinforcement learning types

4.1 Conceptual architecture of WONDER

5.1 General intervention scenario
5.2 Screaming behaviour intervention
5.3 Wandering intervention
5.4 Advanced wandering intervention during day and night
5.5 Unwanted visit intervention
5.6 Multiple actions intervention

6.1 Bernoulli distributed actions with a normal agent
6.2 Bernoulli distributed actions with a gradient agent
6.3 Grovepi module

7.1 Implementation packages overview
7.2 HDF5 structure

8.1 Simulation architecture
8.2 Dementia diagnose distribution
8.3 Alzheimer disease symptom's progression
8.4 MMSE score when first diagnosed
8.5 Naoqi architecture
8.6 Patient robotic analyses module
8.7 External sensor architecture
8.8 Test selection procedure

9.1 Robot application overview
9.2 Robot application class diagram

LIST OF TABLES

6.1 Robot sensors
6.2 Body sensors
6.3 Environment sensors

8.1 Age population

9.1 Proposed reactions according to the robotic actions
9.2 Average facial expression scores for test person 1
9.3 Average facial expression scores for test person 2

LIST OF ABBREVIATIONS

RL2 Meta-reinforcement learning.

AAT Animal-Assisted Therapy.
ACE Addenbrooke's Cognitive Examination.
AD Alzheimer disease.
AES Apathy Evaluation Scale.
AI Artificial Intelligence.
API Application Programming Interface.

BD Behavioural Disturbance.
BDS Blessed Dementia Scale.
BEHAVE-AD Behavioral pathology in Alzheimer's Disease Rating Scale.

CDR Clinical Dementia Rating.
CJD Creutzfeldt-Jakob disease.
CMAI Cohen-Mansfield Agitation Inventory.
CSDD Cornell Scale for Depression in Dementia.
CT Computerized Tomography.

DCM Device Communication Manager.
DLB Dementia with Lewy bodies.
DOM Dementia outcome measures.

FTD Frontotemporal degenerations dementia.

HAL Hardware Abstraction Layer.

MDP Markov decision process.
MMSE Mini-Mental State Examination.
MRI Magnetic Resonance Imaging.

NH Nursing Home.
NPI Neuropsychiatric Inventory.

PD Parkinson's disease.
PET Positron Emission Tomography.
PwD People with Dementia.

QoL Quality of Life.
QUALIDEM Quality of life in late-stage dementia.

RAID Rating Anxiety in Dementia.
RAT Robot Assisted Therapy.
RSSI Received Signal Strength Indicator.

S-MMSE Standardized Mini-Mental State Examination.
Sarsa State-Action-Reward-State-Action.

TD Temporal difference.

UCB Upper Confidence Bound.

VW Vowpal Wabbit.

1 INTRODUCTION

Worldwide, 44 million people have dementia, an overall term that describes a broad range of symptoms associated with a decline in memory or other thinking skills severe enough to reduce a person's ability to perform everyday activities [1, 2]. In Flanders, one in ten of the elderly over the age of 65 has a dementia-related disease, and this amount even increases to nearly 20% at the age of 80 [3]. Almost 43% of People with Dementia (PwD) are staying at Nursing Homes (NH), rising to 76% of those with advanced dementia [4]. Almost all these PwD suffer from so-called Behavioural Disturbances (BDs) like mood disorders (depression, apathy), sleep disorders, psychotic symptoms (hallucinations) and agitations (wandering and aggression) [5]. These BDs impose an enormous emotional toll on the patient's family and increase the burden and stress of caregivers because of the increased strain on the available resources within healthcare [6]. Current best practice in the management of BDs mandates pharmacological intervention only in acute situations, since these interventions do not address the underlying psychosocial reasons and may have adverse side effects [7]. The guidelines for person-centric care encourage nursing homes to reduce the amount of drug-related interventions, and alternatives were found in non-pharmacological treatments such as music and animal therapy [8]. Several studies show Animal-Assisted Therapy (AAT) to be beneficial in decreasing agitation in PwD [9]. However, most hospitals and nursing homes are afraid of adverse effects of animals on human beings, such as allergy, infection, bites and scratches, even though they admit the benefits of AAT [10]. To alleviate the drawbacks of these animal therapies, robots were used instead, and several studies noticed similar positive effects [10–12]. While these robotic therapies are beneficial in group sessions, individualised robotic care applications are scarce and have never been applied in a dementia context. The preprogrammed interactions and the need for a more user-friendly framework for the nursing staff to install and control the individualised sessions are the main reasons for the absence of such studies [13]. These individualised care applications are, however, more effective than the more global group therapies, but the nursing staff will not be able to provide them in the future without additional aid [14].

1.1 Research objective

This dissertation investigates how personalised and individualised interactions can be performed for the PwD by using a fully autonomous social robot. Self-learning algorithms will be designed to determine which type of information is ideally used to define the current state of the PwD and which actions are the most appropriate to prevent and alleviate a specific BD. The overall goal is to improve the Quality of Life (QoL) of the residents and reduce the prevalence of BDs. All related information can be useful to determine the current state of the PwD. Data from sensors available on the body of the PwD will be combined with the personal information of the patients to get a complete overview of their current status. The sensors available in the NH will analyse the environment in which the PwD is currently present. The combination of both the data available on the PwD and the information gathered in the NH will influence the action selection procedure of the robot for the current situation, for a specific PwD.

1.2 Outline

The goal of this dissertation is to build a system that can learn, or at least tries to learn, the most appropriate action for each individual PwD who suffers from a particular BD, using the sensory data available from a robot and from additionally installed sensors, both on the PwD and in their environment.

Before a system can be built to learn the most appropriate action, different learning algorithms were researched and summarised in Chapter 2. Artificial intelligence has gained popularity during the last decades because of the development of new techniques, the cheap computational power offered by public clouds and the influence of the big data hype [15]. The learning techniques that were the most beneficial for the described problem are used during this dissertation. After researching these different learning technologies, knowledge about dementia and the various stages of this disease could help during the design of the application. The representation of the environment and the psychological states of a PwD are analysed in Chapter 2 as well, to give the system more degrees of freedom to operate in many different situations. The most suitable learning technique for this dissertation, reinforcement learning, is described in depth in Chapter 3.

This master dissertation is part of a bigger project in which a robot could autonomously walk from one resident to another, each time engaging in a personalised interaction, e.g. playing a favourite song or asking questions about memorable events in the lifetime of the PwD [16]. This so-called WONDER project is briefly described in Chapter 4, together with the contribution of this dissertation.

Using the concepts of the provided learning techniques, several situations in which the system should be able to operate were analysed and summarised in Chapter 5. This dissertation will mainly focus on wandering and screaming situations. All the described scenarios show why a learning system should be beneficial and how the concepts described in the previous chapters fit in this system. The effective design and implementation details of these learning techniques are described in Chapter 6 and Chapter 7.

While the learning algorithms will be necessary to select the most appropriate actions, feedback signals will provide the main tools to determine whether the executed action had a positive effect on the PwD or not. These feedback signals are usually called reward signals and will be a combination of both robotic sensor values and values from sensors in the environment or on the body of the PwD. Which sensors are used, and how they can be combined to give accurate feedback on the currently executed action, is researched in Chapter 6 as well.

Different learning methods and algorithms can have different benefits or drawbacks. The most appropriate technique for the personalised interactions can only be selected by comparing them with each other. Because real data will be very limited during this dissertation and real experiments in nursing homes are impossible to perform in a constant and controlled environment, a simulated environment is designed in Chapter 8 to compare the different learning algorithms. The scenarios described in Chapter 5 can easily be tested using this simulated environment, but a simulated PwD will not match exactly with a real one, and these differences must be kept in mind. Therefore, Chapter 9 describes some first, more realistic tests using a real person and a real robot.

The simulated environment also gives the opportunity for some additional scenarios. When a new PwD is admitted to the NH, it would be useful if the most appropriate actions could be selected without a whole new learning process having to be executed. This scenario requires that the system can learn from multiple patients while actions are still personalised for each PwD individually. Different learning techniques will be useful in this situation. Secondly, when a patient's dementia status changes, actions different from the currently learned action can become preferred. With the use of a simulated environment, a controlled change in the behavioural pattern can be analysed, and an algorithm can be designed which deals with these fluctuating mood patterns.
Different learning techniques will be useful in this situation. Secondly, when a patient’s dementia status changes, actions different from the currently learned action can become preferred. With the use of a simulated environment, a controlled change in the behavioural pattern can be analysed, and an algorithm can be designed which deals with these fluctuating mood patterns.

2 LITERATURE STUDY

Before the algorithm which will select the most appropriate action for a particular PwD can be designed, knowledge about robotics, dementia and learning algorithms must be acquired. This chapter summarises the following three research fields of this dissertation:
• A brief overview of dementia-related diseases, together with the different behavioural disturbances and their treatments, is given in Section 2.1.
• Section 2.2 gives a summary of the most interesting studies in social robotics.
• Artificial intelligence, described in Section 2.3, addresses the different learning possibilities and explains why several of these methods could not be used in this dissertation.

2.1 Dementia

Dementia is not a disease itself, but a so-called major neurocognitive disorder: a group of symptoms caused by other conditions. It manifests when parts of the brain used for learning, memory, decision-making and language are damaged or diseased [17]. In 2015, 46.8 million people worldwide lived with dementia. This number will almost double every 20 years, reaching 74.7 million in 2030 and 131.5 million in 2050 [18]. Almost all PwD exhibit BDs, and these impose an enormous emotional toll on the patient and family and increase the burden and stress of caregivers [19]. The following sections give an overview of the different types of dementia and the different BDs. While there exists medication to suppress several types of BDs, the benefits are rather limited in time, and there is a need for frequent monitoring of the PwD [20]. Therefore, this section will give an overview of the different types of non-pharmacological therapies only.

2.1.1 Dementia types

The different types of dementia can be classified by the region of the brain that is affected. The brain is divided into two parts: the cortical and the subcortical region. The cortical region consists of the cerebral cortex, while the subcortical region contains the thalamus, hypothalamus, cerebellum and brain stem [21]. Dementia can affect a single part of the brain, but there exist dementia types which affect both regions at the same time.

Figure 2.1: The brain can be divided into the cortical and subcortical regions [21].

2.1.1.1 Cortical dementias

The outer layer of the brain plays a critical role in the memory and language abilities of a person. It is, therefore, that PwD with cortical dementia have severe memory loss and cannot remember words or have difficulties speaking. The dementia types affecting this region are commonly caused by the Alzheimer or Creutzfeldt-Jakob disease [17].
• Alzheimer Disease: Alzheimer Disease (AD) starts with a subtle and poorly recognised failure of memory and slowly becomes more severe and, eventually, incapacitating. 60% to 80% of the PwD are diagnosed with this disease, and 95% of the people with AD are above the age of 65. Establishing the diagnosis of AD relies on clinical-neuropathologic assessments. The diagnosis of AD is correct approximately 80%-90% of the time and combines a CT and a PET scan of the brain together with interviews of both the PwD and their family members. The treatment for AD is currently supportive, managing each symptom on an individual basis. Assisted living or care in a nursing home is usually necessary for these PwD [22].
• Creutzfeldt-Jakob Disease: In comparison with AD, the Creutzfeldt-Jakob Disease (CJD) is a type of dementia that advances rapidly. CJD is a rare, degenerative and fatal brain disorder characterised by progressive dementia, blindness and involuntary movement. Over one year, it affects only one person per one million people worldwide. In the early stages of the disease, people may suffer from failing memory, lack of coordination and visual disturbances. As the illness progresses, mental deterioration becomes pronounced, and involuntary movements, blindness, weakness of extremities and coma may occur. 90% of the PwD with a diagnosis of CJD die within one year. For doctors, it is difficult to determine whether a PwD has AD or CJD. In fact, at least 20 percent of the AD diagnoses are supposedly in error, the error being that the disease is CJD. However, a correct diagnosis is necessary because CJD patients must be handled with much more care: the disease can be contagious through an injection or the consumption of infected brain or nervous tissue. Current treatments for CJD try to alleviate the symptoms, making the individual as comfortable as possible [23][24].

2.1.1.2 Subcortical dementias

While the dementia types in the previous section mostly affect the cortical region of the brain, other kinds of dementia damage the regions lying under the cortex. These diseases give rise to a category known as subcortical dementias. These dementia types are more likely to affect the attention, motivation and emotionality of the patients. People with subcortical dementia often show early symptoms of depression, clumsiness, irritability or apathy. As the disease progresses further, memory loss and decision-making problems arise, such that the end stages of subcortical and cortical dementias are mostly similar [18][25]. Parkinson's disease is the most common cause of subcortical dementia.
• Parkinson's Disease: While this disease is commonly known as a motor disorder, 30% of patients with Parkinson's disease (PD) also develop subcortical dementia. In a first stage, PD breaks down nerve cells in the brain. These nerve cells mostly generate dopamine, which is used to send signals to the part of the brain that controls movement. When such nerve cells break down, the dopamine level decreases, resulting in movement disorders. The disease gradually spreads to other parts of the brain and often affects mental functions, including memory and the ability to pay attention. A person originally diagnosed with PD usually develops the symptoms of dementia after one year or later. Since PD patients are at high risk of dementia as their disease progresses, signs of changing thinking or changing emotional behaviour are examined more often using MRI scans or personal interviews. There are currently no treatments to slow down or stop the brain cell damage causing dementia for people with PD. Current strategies focus on alleviating the symptoms or suppressing the side effects [18][25].

2.1.1.3 Mixed dementia forms

In many cases, the disease causing dementia affects both the cortical and subcortical parts of the brain. Three common diseases are discussed here.
• Dementia with Lewy Bodies: Dementia with Lewy Bodies (DLB) is a progressive dementia disease that leads to a decline in thinking, reasoning and independent functioning because of abnormal microscopic deposits, so-called Lewy bodies, which damage the brain cells over time. These Lewy bodies have been found in other dementia diseases as well, such as Parkinson's disease, but a link between these two diseases has not been established yet [26]. While memory loss is less prominent than in Alzheimer's disease, DLB is mostly expressed by changes in thinking and reasoning, hallucinations, delusions and sleeping disorders. The brain cell damage caused by the Lewy bodies cannot be slowed down or stopped. Therefore, current treatments focus on alleviating the symptoms as much as possible [18].
• Vascular Dementia: The brain has one of the richest networks of blood vessels and is, therefore, particularly vulnerable to strokes. Blocked blood vessels can damage or kill many brain cells by depriving them of vital oxygen and nutrients. In vascular dementia, changes in thinking skills sometimes occur due to these strokes. Symptoms such as confusion, disorientation and trouble speaking or understanding other people already appear soon after a major stroke. The risk factors for vascular dementia are the same ones that raise the risk of heart problems and strokes, for example, smoking and high blood pressure. There are currently no drugs which can alleviate the side effects of vascular dementia. However, there is some clinical evidence that certain drugs approved for AD may have a modest benefit for people diagnosed with vascular dementia. As with other stroke symptoms, cognitive changes may sometimes improve during recovery and rehabilitation because of the generation of new blood vessels and the ability of brain cells outside the damaged region to take on new roles. Vascular dementia is, for this reason, sometimes “curable” [18].
• Frontotemporal Dementia: Progressive nerve cell losses can occur in the brain's frontal lobes, the area behind the forehead, or its temporal lobes, the regions behind the ears. The group of disorders caused by these cell losses is called frontotemporal degenerations or Frontotemporal Dementia (FTD). The frontal lobe plays a significant role in voluntary movement, for example, simple walking. The temporal lobe plays a key role in the formation of explicit long-term memory. The nerve cell damage caused by FTD therefore leads to the loss of several brain functionalities. These damages cause deterioration in behaviour and personality, language disturbances, or alterations in muscle or motor functions. There is currently no specific treatment for FTD. Current treatments try to decrease the agitation, irritability and depression of these PwD to improve their quality of life [18].

2.1.2 Behavioural disturbances (BDs)

Behavioural disturbances are the primary source of a caregiver's burden and the most important reason why PwD must be institutionalised. Despite the recent advances in the development of new drug treatments, the capability of modern medicines to improve cognitive functions or delay the mental deterioration process for PwD remains modest [27]. These BDs are mostly grouped into seven different categories according to the Behavioural Pathology in Alzheimer's Disease Rating Scale (BEHAVE-AD) [28][29]:
• Paranoid and delusional ideation: Delusions are false ideas originating in a misinterpretation of a situation. For example, individuals may think that family members are stealing from them or that the police are following them. This kind of suspicious delusion is more commonly known as paranoia [18].

• Hallucinations: False perceptions of objects or events are more commonly known as hallucinations and are sensory in nature. When PwD have hallucinations, they see, hear, smell, taste or even feel things that are not real. For example, individuals see insects crawling on their hands, or they may hear people talking to them and respond to these voices [18].
• Activity disturbance: Inappropriate behaviour of a PwD without the manifestation of aggression or agitation is defined as an activity disturbance. Wandering is the most common disturbance in this category. It is quite common that a PwD moves from place to place without a fixed plan or goal.
• Aggressiveness: PwD can suffer from different types of agitation: verbal or physical outbursts, general emotional distress, restlessness, pacing, shredding paper or tissues [18].
• Diurnal rhythm disturbances: Difficulties with time and daily patterns are defined as diurnal rhythm disturbances. Sleep disorder is the most common disturbance in this category and is common in all dementia diseases.
• Affective disturbances: Disorders caused by mood changes are called affective disturbances. Screaming or a depressed mood are common disturbances in this category.
• Anxieties and phobias: A feeling of worry, nervousness, or unease about something with an uncertain outcome is called anxiety. A phobia is an extreme or irrational fear of or aversion to something. Both symptoms are common for PwD.
It is common that PwD suffer from more than one of these BDs. However, Chia et al. [27] found a link between the different types of dementia and the behavioural and psychological symptoms. In their research, hallucinations were more common for dementia with Lewy bodies, activity disturbances were noticed more frequently in frontotemporal dementia, and vascular dementia patients were more sensitive to aggressiveness, diurnal rhythm disturbances or paranoid/delusional characteristics. A huge number of conditions can cause BDs, and an accurate detection of these causes is key to an efficient solution for the BDs. The correct identification of symptoms such as depression or hallucinations, together with the proper diagnosis of the dementia disease, will greatly enhance the success of pharmacological or non-pharmacological interventions [30]. Figure 2.2 gives an overview of the most popular BDs. The most common causes of these BDs are recognised as “unmet needs”. Many PwD inside nursing homes suffer from sensory deprivation, loneliness and boredom. Alleviating these unmet needs will eventually lead to fewer behavioural disturbances [31].

2.1.3 Dementia Outcome Measures

The Dementia Outcome Measures (DOMs) are useful tools to validate the assessment of various aspects of dementia by healthcare professionals. The detection, severity and progression of dementia can all be determined with these measures. This section gives an overview of the most commonly used DOMs, grouped by the dementia aspect they address: cognitive decline, staging, behavioural disturbances and quality of life.

2.1.3.1 Cognitive decline

Cognitive decline is one of the earliest symptoms of dementia. An early diagnosis is necessary to start the most appropriate treatments. Cognitive screening can be used to make a diagnosis.

Figure 2.2: Overview of the most popular behavioural disturbances for people with dementia

• Mini-Mental State Examination (MMSE): The MMSE is a commonly used set of questions for screening cognitive function. This examination can be used to indicate the presence of cognitive impairment and has been validated in many different populations. Scores of 25-30 out of 30 are considered normal, while 21-24 indicate mild, 10-20 moderate and below 10 severe impairment. The MMSE may not be an appropriate assessment if the patient has learning, communication or other disabilities [32,33].
• Standardized Mini-Mental State Examination (S-MMSE): The S-MMSE is a more advanced version of the MMSE for which the administration and scoring of the test are standardised. The S-MMSE has, in contrast to the MMSE, a detailed manual describing how to administer and score each item. There is some evidence that this standardisation improves the reliability and diagnostic capacity of the test [34].
• Addenbrooke's Cognitive Examination (ACE): The ACE is a comprehensive screening tool and the most recommended instrument for all dementias when shorter screenings are inconclusive. It is useful for the differential diagnosis between AD, FTD, PD and related neurodegenerative conditions [35].

2.1.3.2 Staging

Staging measures are used to assess the severity and progression of a dementia disease.
• Clinical Dementia Rating (CDR): The CDR is a worldwide standard for assessing the severity and progression of dementia. The CDR calculates two different scores: a global score, which is the standard regularly used in clinical and research settings, and a score which provides more comprehensive information [36].
• Blessed Dementia Scale (BDS): The BDS is a very brief staging instrument that can be completed by any caregiver. It has a very long history of use, and its simplicity makes it suitable for use by nursing staff working in care facilities. The BDS consists of 22 items that reflect changes in the performance of everyday activities, changes in habits including self-care, and changes in personality, interests, and drives [37].

2.1.3.3 Behavioural and Psychological Symptoms

The majority of people with dementia suffer from so-called behavioural disturbances. Some scales are designed to assess multiple BDs, while others are specific to a single disturbance.

• Neuropsychiatric Inventory (NPI): The NPI can discriminate persons with FTD from those with AD based on their symptom profile, and detect clinically significant changes in BDs over the course of the dementia disease. This technique has been very well validated and is highly popular worldwide. The NPI covers the following types of BDs: delusions, hallucinations, agitation or aggression, dysphoria or depression, anxiety, euphoria or elation, apathy or indifference, disinhibition, irritability or lability, aberrant motor behaviours, sleep disorders and eating disturbances [38].
• Apathy Evaluation Scale (AES): The AES comprises 18 core items that assess and quantify the affective, behavioural, and cognitive domains of apathy. Three variants exist based on the interrogator: self (AES-S), informant (AES-I) or clinician (AES-C) [39].
• Cohen-Mansfield Agitation Inventory (CMAI): The CMAI is a comprehensive measure of agitation which has been well validated for people with dementia. It assesses three domains: aggressive behaviour, non-aggressive behaviour, and verbally aggressive behaviour [40].
• Cornell Scale for Depression in Dementia (CSDD): The CSDD was specifically developed to assess signs and symptoms of major depression in PwD through two semi-structured interviews: an interview with an informant and an interview with the patient. The CSDD can discriminate people with depression from those with dementia [41].
• Rating Anxiety in Dementia (RAID): RAID is a rating scale to measure the anxiety level in PwD. RAID scores have been shown to be unrelated to the degree of dementia severity or cognitive impairment [42].

2.1.3.4 Quality Of Life (QoL)

QoL broadly refers to a person's sense of subjective well-being across several domains, including the physical, psychological and social domains. Dementia-specific QoL measures assess the efficacy of health and social service interventions for persons with dementia.
• Health-related quality of life for people with dementia (DEMQOL): DEMQOL is a self-report QoL measure designed for people with mild-to-moderate dementia. The test has 28 items that cover the four most important QoL dimensions: daily activities, memory, negative emotion and positive emotion [43].
• Quality of Life in Late-stage Dementia (QUALIDEM): QUALIDEM was explicitly designed to assess QoL in people with moderate-to-severe dementia living in long-term care facilities. The QUALIDEM test contains 11 items describing observable behaviours, including affective state, behavioural signs of comfort, and engagement in activities and interactions with others, rated on a five-point ordered category scale [44].

2.1.4 Non-pharmacological treatments

In the past, behavioural disturbances were mostly treated with psychotic drugs or physical restraints, or even ignored. Several research studies and clinical observations have questioned these practices, leading to the OBRA 87 mandate to reduce the use of drugs in nursing homes where possible [45]. In response to this mandate, a significant number of non-pharmacological interventions have been initiated and analysed [8].

Non-pharmacological interventions have two key benefits:
1. Medication can mask the actual needs by eliminating the real behaviour serving as a signal for the need. Non-pharmacological interventions address the physiological or environmental reason underlying the BD.
2. Pharmacological treatments can have many restrictions, for example, side effects or drug-to-drug interactions. Non-pharmacological interventions avoid these limitations.
Cohen-Mansfield made an excellent overview of the different non-pharmacological interventions and their effects on PwD, summarising them in eight categories [8]:
• Sensory enhancement interventions are mostly classified as relaxation methods and reduce stress in PwD.
• Social contacts can be either real or simulated and tend to reduce the verbal aggression of the patients.
• Behavioural interventions try to reinforce the PwD, including through social reinforcements.
• Staff training focuses on understanding inappropriate behaviours, improving verbal and non-verbal communications with PwD, and improving methods of addressing their needs.
• Recreational interventions in the form of structured activities are helpful in relieving agitation. Outdoor walks are the most common structured activity.
• Environmental changes, for example, free access to an outdoor area or a natural environment consisting of recorded songs of birds and babbling brooks, decrease a PwD's level of agitation.
• Nursing interventions remove restraints and offer pain management to reduce the amount of psychotropic medication used by these PwD.
• Combination interventions usually combine structured activities and nursing care interventions and can have an individualised approach, where the treatment plan is tailored to each patient based on his or her previous treatments, abilities, and type of problems.

Figure 2.3: Overview of all different non-pharmacological interventions [8]

2.2 Social Robotics

While there exists a large number of non-pharmacological interventions, robots within healthcare have gained much popularity in the last decade [46]. Healthcare robots are primarily concerned with helping users to improve or monitor their health. Rehabilitative robots can assist, encourage, monitor and remind patients to perform their exercises to increase their personal health or recover more quickly, and robot receptionists currently help patients to get registered when hospitalised and can guide them to the correct nursing ward [11, 47]. In this section, the effects of social robotics on elderly with dementia are investigated.

2.2.1 Robot-Assisted Therapy (RAT)

Section 2.1.4 summarises several different non-pharmacological treatments. One of the most interesting techniques for alleviating behavioural disturbances is real or simulated social contact, and robot-assisted therapy is such a possible treatment. This type of therapy has been used in care for elderly who suffer from dementia for over ten years, and strong effects like improved interaction and signs of a higher sense of well-being have been reported [48]. This section compares three different robot types together with their possibilities in RAT.

2.2.1.1 Paro

Elderly people who have social interactions with other people are less likely to develop a dementia-related disease. Prevention is important because of the growing population and the steady increase of PwD. Animal interactions have long been known to have a beneficial effect on people, and Animal-Assisted Therapy (AAT) became widely used in hospitals and nursing homes [49]. While animals have psychological, physiological and social effects, most hospitals and nursing homes do not accept them because they fear the adverse effects of animals on human beings. Therefore, a baby harp seal shaped robot, Paro, has been designed, which has the same beneficial effects on the patients, but without the adverse effects of real animals such as allergy, infection, bites, and scratches. The Paro robot is visualised in the left picture of Figure 2.4. Paro was originally designed for autistic and inpatient children between the age of 2 and 15 years old, to improve and encourage their communication with each other and the caregivers. After 11 days of observation, Paro had a rehabilitative as well as a mental effect on these children [50]. Several researchers used Paro with PwD in several NH around the world [10,12]. These studies showed that cortical neurone activity improves by interacting with these seal robots, and an increased QoL score was noticed as well, especially for patients who liked the robot. Paro even encouraged people with dementia to interact, both verbally and non-verbally, with each other.

2.2.1.2 Parlo

While Paro reduces stress, aggression and other disturbances for PwD, the animal-shaped robot is unable to communicate with the patients. In contrast, Parlo is a humanoid robot which can communicate with humans through voice and even recognises faces [51]. The Parlo robot is presented in the middle picture of Figure 2.4. While Parlo was the first humanoid robot able to interact with PwD, the robot is better suited to people suffering from mild dementia. People with severe dementia could not enjoy the whole Parlo program because they were unable to engage in its conversations [51].

2.2.1.3 Nao

The Nao humanoid robot offers much more functionality than the two previously described robots. Nao is nice to look at, can walk, move its head, has lights and can make sounds. Nao is most useful in physiotherapy, as it can perform physical exercises that can be directly mimicked by the patients [13]. The right picture in Figure 2.4 shows a Nao robot.

Because of its ease of use, its high availability and its toy-looking appearance, the Nao robot is the most suitable humanoid robot for robot-assisted therapy with PwD [13]. Neuropsychiatric symptoms tend to improve when the robot performs simple actions, for example, storytelling, music therapy, physiotherapy and logic-language therapy. Patients following other types of therapies had less or even no improvement [13][12]. In this dissertation, the Nao robot will be used because of these positive effects on the PwD at the NH. The free and open tools the robot manufacturer Aldebaran provides can be used to develop small applications for the Nao robot. In contrast, the Paro robot does not have the ability to communicate with the PwD, and the Parlo interfaces are a lot more closed, its possible actions being rather limited. The high availability of the Nao robot and its toy-looking character are the other aspects which affected this decision.

Figure 2.4: Left: the Paro seal robot [50], Middle: The Parlo humanoid [51], Right: The Nao robot [52]

2.2.2 Robot interventions

While group therapy sessions using a robot are common nowadays, personalised, person-centric interactions have hardly been researched. The overall goal of an intervention is to track and resolve a behavioural disturbance as quickly as possible, while therapy sessions are more likely to prevent these disturbances. Both strategies increase the QoL of the PwD, and in both cases, a robot could assist the nursing staff and facilitate their daily routines. Healthcare robots have the potential to meet the needs identified by staff. They can help people with dementia by entertaining, stimulating and calming them down, making life better for both nursing staff and residents [11]. Since interventions are patient-specific, personalised actions are needed to ensure better results. Several robots today can detect and recognise faces and can even analyse several characteristics based on voice and facial expressions [53]. A robot that is aware of a user's patterns can recommend more relevant actions, and the feedback for a particular action can be used to regulate its relevancy [54]. The methods behind these robotic actions have the ability to learn from their behaviour and are more commonly known as artificial intelligence.

2.3 Artificial Intelligence

The research area concerned with letting a computer program or a machine think and learn is called Artificial Intelligence (AI) [55]. The ultimate goal of AI research is to create programs that can learn, solve problems, and think logically just like humans do [56]. Such an intelligent machine is a flexible agent that perceives its environment and takes actions to maximise its chance of success at some goal [56]. This section gives a brief summary of the most common subfields within AI and their abilities to learn. A comparison between these subfields is made to select the most useful technique for this dissertation.

2.3.1 Bayesian network

While the environment, the area in which an agent operates, can be difficult to examine, most intelligent systems describe a model which is an abstract representation of their operating environment. This abstraction hides many details from the agent, and decisions can be made more easily. One such representation is a Bayesian network, where nodes represent random variables. When two nodes are connected by an edge, this edge carries an associated probability describing the dependency between the two variables. Inference can then be used to deduce relations between the nodes [56]. Bayesian networks are quite often used for medical diagnosis [57,58].

2.3.2 Expert system

Another AI technique uses information from a field expert to produce enough rules to act almost identically to this expert. Such an expert system has the advantage of knowing a lot about the environment, and this knowledge base can be extended or updated. These systems can be highly dynamic, and knowledge from within the problem field can be incorporated into the system [59].

2.3.3 Machine Learning

Instead of representing the model directly, it can be more beneficial to use data to represent the environment and analyse its changes under different circumstances. Techniques which use data to replicate the environment's model are more commonly known as machine learning methods and have gained attention during the last decade [60]. Machine learning algorithms overcome strictly static program instructions by making data-driven predictions or decisions [61]. Machine learning tasks are usually divided into three categories, depending on the available data [56]:
• Supervised learning: Techniques that learn the mapping function from input values to the desired output values are called supervised learning methods. The goal is to approximate the mapping function so well that when new input data is used, the algorithm can predict the output variables for that data [62]. Applications where historical data predicts future events frequently use supervised learning techniques [63]. The main reason to use a supervised learning method is the availability of labelled data. Supervised learning problems can be further grouped into regression and classification problems [62].
– A classification problem is one where the output variable is a category, such as “red” or “blue” or “disease” and “no disease”.
– A regression problem is one where the output variable is a real value, such as “dollars” or “weight”.
• Unsupervised learning: When only the input data, without a corresponding output, is given, unsupervised learning techniques try to model the underlying structure or distribution in the data to learn more about it. Unlike supervised learning

above, there are no correct answers, and algorithms are left to their own devices to discover and present the interesting structure in the data. Unsupervised learning problems can be clustering or association problems [62].
– Clustering problems try to discover the natural groupings in the data, such as grouping customers by their purchasing behaviour.
– Association rule learning tries to determine rules that describe large portions of the data, such as people that buy X also tend to buy Y.
• Reinforcement learning: Reinforcement learning lies somewhere in between supervised and unsupervised learning. Whereas supervised learning has a target label for all the available data and unsupervised learning has no labels at all, reinforcement learning has sparse and time-delayed labels, called rewards. Based on those rewards, the agent has to learn to behave in the environment. Reinforcement learning is an important model of how we, humans and all animals in general, learn. Praise from our parents, grades in school and a salary at work are all examples of rewards [64]. Reinforcement learning techniques are frequently used in the domain of robotics because of the promise of enabling autonomous robots to learn large repertoires of behavioural skills with minimal human intervention [65].

2.3.4 Comparison

The previously described AI methods use either a representative model of the environment or a large amount of data describing this environment. In this dissertation, knowledge on how a certain patient reacts to several different BDs is not given and is even difficult to analyse. A Bayesian network usually exploits such given probabilities. In our case, this information is not available and cannot even be determined due to the high diversity of patients and possible interactions. Using the knowledge of the nursing staff is another option to gather enough information about the different patient types and their reaction to the interventions. Building an expert system which acts identically to a real nurse could be beneficial in this context. However, these expert systems do not have the ability to learn from mistakes and have difficulties in mimicking the exact knowledge of an expert when the proposed tasks are complex [59]. Learning to operate in a certain situation is such a difficult task because of the unpredictability of the different patients. A more useful option is to use data to represent the environment and the behavioural patterns of the PwD. Supervised techniques would be very beneficial to determine when an intervention had a positive or adverse effect on the patient. Based on this data, learning models could be trained to act upon a new situation. Labelled data for this problem is, however, not available. Gathering this information would be difficult because of different interpretations of whether a certain intervention had a positive or adverse effect on the PwD. Gathering unlabelled data is much easier because the available sensors in a NH can be used to collect raw data without interpreting their values. Unsupervised learning methods use this kind of data to detect underlying patterns. In this dissertation, we are not interested in these hidden patterns between the action preferences and the patients, but want to search for the most beneficial action in many different situations. Reinforcement learning is in this perspective something in between supervised and unsupervised learning. The problem in this dissertation can be described as maximising the elicitation of positive feelings by choosing the most appropriate action within a set of possible actions. This issue can easily be translated into the more general reinforcement learning problem, which tries to maximise the feedback received from executing an action. Reinforcement learning techniques find a balance between trying different things to see if they are better than what has been tried before and trying the things that have worked best in the past. Supervised learning techniques do not perform this balancing and are purely exploitative [66]. Changes in behavioural patterns are quite common for PwD [67]. A purely exploitative method would

not be able to detect these changes and retraining would be necessary. When such a behavioural change occurs is never known upfront. Therefore, a supervised learning technique would make our system less useful. Reinforcement learning is currently the most interesting technique for learning the personalised interaction in different situations and for different PwD, due to the several drawbacks of the other described methods. The concepts of reinforcement learning are described more in depth in Chapter 3.

3 REINFORCEMENT LEARNING

Whether we, humans, are learning how to ride a bike or learning how to communicate with other people, we are aware of how our environment responds to what we do, to the actions we take. Eventually, these responses give us enough information to adapt our behaviour and influence our learning process. Learning from interactions is a foundational idea underlying nearly all theories of learning and intelligence. Reinforcement learning is such learning from interactions, where actions are chosen in certain situations or states to maximise a reward signal [68]. This chapter describes the fundamental concepts needed to solve reinforcement learning problems. Different methods, findings and concepts from Sutton et al. [68] and Silver [69] are combined in this chapter to give a brief overview of the possibilities of reinforcement learning. This chapter first describes the general concepts of a reinforcement learning algorithm in Section 3.1. Two main categories of reinforcement learning problems are distinguished by whether the environment can be modelled or not. Solutions for both subfields are given in Section 3.2 and Section 3.3. When information from the environment is not needed to solve the reinforcement learning problem, simpler solutions exist. These simple reinforcement learning problems are discussed in Section 3.4. Finally, techniques from both supervised and unsupervised learning can be used in combination with reinforcement learning to predict the action preferences when the environment changes fast or the action space is infinite. These more advanced techniques are summarised shortly in Section 3.5.

3.1 General Concepts

In essence, reinforcement learning problems are called closed-loop problems, because the selected actions influence the following states. The goal of a reinforcement learning problem is to find those actions or series of actions which yield the most reward. The learning algorithm is not told which steps to take, but must discover which actions yield the most reward by trying them out. The learner or sequential decision maker is called the agent. The goal of this agent is to select actions which maximise the total future reward. Everything outside the agent is known as the environment, and the agent can interact with this environment by performing actions.

The interactions between an agent and the environment are described as follows: the agent executes an action $a_t$ at time step $t$, and the environment responds to this action by presenting a new situation or state to the agent. The environment also returns a numerical value to identify how right or wrong the previously executed action $a_t$ was. The agent tries to maximise these numerical values or reward values $r_t$ over time by selecting more appropriate actions at time step $t + 1$. A more schematic overview of a reinforcement learning problem is given in Figure 3.1.

Figure 3.1: The agent-environment interaction in reinforcement learning.

The whole reinforcement learning problem is based on sacrificing immediate rewards to gain a higher long-term reward, because actions can have long-term consequences while rewards only reflect the current state without knowledge of previously executed actions.

3.1.1 Conceptual details

Details about how several components are related to each other are given in this section. Mathematical descriptions are given to specify the underlying concepts where needed.

3.1.1.1 Agent-Environment interaction

At each time step $t = 0, 1, 2, \ldots$, the agent receives some representation of the environment $s_t \in \mathcal{S}$, where $\mathcal{S}$ is the set of all possible states. Based on this state, the agent selects an action $a_t \in \mathcal{A}(s_t)$, where $\mathcal{A}(s_t)$ is the set of actions available in state $s_t$. At time step $t + 1$, as a consequence of the action taken at time step $t$, the agent receives a numerical reward $r_{t+1} \in \mathcal{R} \subset \mathbb{R}$ and receives or senses a new state $s_{t+1}$. The whole interaction then repeats.
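To make this interaction concrete, the following minimal Python sketch loops over the time steps described above. The `env` and `agent` objects and their method names (`reset`, `step`, `select_action`, `update`) are illustrative assumptions for the sketch, not an existing API.

```python
# Minimal sketch of the agent-environment loop: the agent picks a_t,
# the environment returns r_{t+1} and s_{t+1}, and the agent learns from it.
def run_episode(env, agent, max_steps=1000):
    state = env.reset()                                   # initial state s_0
    for t in range(max_steps):
        action = agent.select_action(state)               # a_t ~ pi_t(.|s_t)
        next_state, reward, done = env.step(action)       # s_{t+1}, r_{t+1}
        agent.update(state, action, reward, next_state)   # learn from the transition
        state = next_state
        if done:                                          # terminal state reached
            break
```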

3.1.1.2 Policy

At each time step, the agent implements a mapping from states to actions or, more specifically, calculates the probabilities of selecting each possible action. This mapping is called a policy and is denoted by $\pi_t$ when used at time step $t$. The probability of selecting action $a = a_t$ in state $s = s_t$ is denoted by $\pi_t(a|s)$. The agent's goal is still to maximise the total amount of reward it receives over the long run or, in other words, to find the optimal policy. Reinforcement learning methods specify how the agent changes its policy as a result of its received experience.

3.1.1.3 Markov Decision Process (MDP)

A reinforcement learning problem that satisfies the Markov property is called a Markov Decision Process (MDP). The chosen action influences the probability that the process moves into its new state $s'$. Specifically, this probability is given by the state transition function $P_a(s, s')$. Thus, the next state $s'$ depends on the current state $s$ and the decision maker's action $a$. Given $s$ and $a$, it is conditionally independent of all previous states and actions and therefore satisfies the Markov property. A finite MDP, where the state and action spaces are finite, is defined by these state and action sets and by the one-step dynamics of the environment. Given any state $s$ and action $a$, the probability of each possible pair of next state $s'$ and reward $r$ is denoted by Equation 3.1.

$$p(s', r \mid s, a) = \Pr\{s_{t+1} = s', r_{t+1} = r \mid s_t = s, a_t = a\}, \qquad (3.1)$$

These quantities completely specify the dynamics of a finite MDP. Given the dynamics as specified by Equation 3.1, one can compute anything else one might want to know about the environment, such as the expected reward for every state-action pair:
$$r(s, a) = \mathbb{E}[r_{t+1} \mid s_t = s, a_t = a] = \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} p(s', r \mid s, a) \qquad (3.2)$$
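As a small illustration of Equation 3.2, the sketch below computes $r(s, a)$ from a tabular dynamics function; the dictionary representation and the example numbers are assumptions made purely for the sketch.

```python
# Sketch: computing the expected reward r(s, a) from tabular one-step dynamics.
# dynamics[(s, a)] maps (next_state, reward) pairs to their probability p(s', r | s, a).
def expected_reward(dynamics, s, a):
    return sum(prob * r for (s_next, r), prob in dynamics[(s, a)].items())

# Toy example: from state 0, action "a1" yields reward 1.0 with probability 0.3.
dynamics = {(0, "a1"): {(1, 1.0): 0.3, (0, 0.0): 0.7}}
print(expected_reward(dynamics, 0, "a1"))  # 0.3
```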

3.1.1.4 Value function

The sequence of rewards received after time step $t$ can be denoted by $r_{t+1}, r_{t+2}, r_{t+3}, \ldots$. The goal is still to maximise the expected return for the next time step, where this expected return $g_t$ is defined as some specific function of the reward sequence. In the simplest case, the return is the sum of the rewards:

$$g_t = r_{t+1} + r_{t+2} + r_{t+3} + \cdots + r_T, \qquad (3.3)$$
where $T$ is the final time step. This approach makes sense in applications in which there is a natural notion of a final time step. When this is the case, the agent-environment interaction breaks naturally into subsequences, which are called episodes.

In many cases, the agent-environment interaction does not break naturally into separable episodes but is continuous. The expected return in Equation 3.3 is problematic for these continuous tasks, because the return to be maximised, with $T = \infty$, could easily be infinite. This problem can be solved by discounting future rewards. The agent tries to select actions so that the sum of the discounted rewards it receives is maximised. In particular, it chooses $a_t$ to maximise the expected discounted return as denoted by Equation 3.4:
$$g_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}, \qquad (3.4)$$
where $\gamma$ is called the discount factor and $0 \leq \gamma \leq 1$. This discount factor determines the present value of future rewards: a reward received after $k$ time steps is worth only $\gamma^{k-1}$ times what it would be worth if received immediately. If $\gamma < 1$, the infinite sum has a finite value as long as the reward sequence $r_k$ is bounded. If $\gamma = 0$, the agent will only maximise immediate rewards; the agent's objective in this case is to learn how to choose $a_t$ so as to maximise only $r_{t+1}$.
Before the reinforcement learning problem can be solved, it would be beneficial if the agent knew how good it is to perform a given action in a particular state. The notion of “how good” is defined in terms of the future rewards that can be expected or, more precisely, the expected return. Of course, all the future rewards the agent can expect to receive depend on what actions it will take. The action selection procedure is defined by the policy, the mapping from states to actions. Informally, the value $v_\pi(s)$ of a state $s$ under a policy $\pi$ is defined as the expected return when starting in $s$ and following $\pi$ from then on. $v_\pi(s)$ is more formally defined in Equation 3.5:
$$v_\pi(s) = \mathbb{E}_\pi[g_t \mid s_t = s] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s\right], \qquad (3.5)$$
where $\mathbb{E}_\pi[\cdot]$ denotes the expected value of a random variable given that the agent follows the policy $\pi$. The function $v_\pi$ is called the state-value function for policy $\pi$. Analogously, a similar function which defines the value of taking action $a$ in state $s$ under a policy $\pi$ is denoted by $q_\pi(s, a)$:
$$q_\pi(s, a) = \mathbb{E}_\pi[g_t \mid s_t = s, a_t = a] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s, a_t = a\right], \qquad (3.6)$$
$q_\pi$ is called the action-value function for policy $\pi$.
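The discounted return of Equation 3.4 can be accumulated directly from a reward sequence, as in the short sketch below; the reward values are arbitrary examples.

```python
# Sketch: discounted return g_t = sum_k gamma^k * r_{t+k+1} over a finite reward list.
def discounted_return(rewards, gamma=0.9):
    g, discount = 0.0, 1.0
    for r in rewards:
        g += discount * r       # add gamma^k * r_{t+k+1}
        discount *= gamma
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.0 + 0.81 * 2.0 = 2.62
```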

3.1.1.5 Bellman equation

For any policy π and any state s, the following consistency condition exists between the value of s and the value of its possible successor states:

$$\begin{aligned}
v_\pi(s) &= \mathbb{E}_\pi[g_t \mid s_t = s] \\
&= \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s\right] \quad \text{(see Equation 3.5)} \\
&= \mathbb{E}_\pi\!\left[r_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \,\middle|\, s_t = s\right] \\
&= \sum_{a} \pi(a|s) \sum_{s'} \sum_{r} p(s', r \mid s, a)\!\left[r + \gamma\, \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \,\middle|\, s_{t+1} = s'\right]\right] \\
&= \sum_{a} \pi(a|s) \sum_{s', r} p(s', r \mid s, a)\!\left[r + \gamma\, v_\pi(s')\right], \qquad \forall s \in \mathcal{S}, \qquad (3.7)
\end{aligned}$$

Note how the final Equation 3.7 can be read very easily as an expected value. It is a sum over all values of the three variables $a$, $s'$ and $r$. For each triple, the probability $\pi(a|s)\,p(s', r \mid s, a)$ is computed, the quantity in brackets is weighted by that probability, and the results are summed over all possibilities to obtain the expected value.

Equation 3.7 is the so-called Bellman equation for $v_\pi$ and expresses the relation between the value of a state and the values of its successor states. This Bellman equation is a fundamental property used throughout reinforcement learning.

3.1.2 Solving the reinforcement learning problem

A reinforcement learning problem is solved when the optimal policy is found. The optimal policy is the one which achieves the most reward over the long run. For finite MDPs, this optimal policy can be defined as follows: a policy $\pi$ is defined to be better than or equal to a policy $\pi'$ if its expected return is greater than or equal to that of $\pi'$ for all states. In other words, $\pi \geq \pi'$ if and only if $v_\pi(s) \geq v_{\pi'}(s)$ for all $s \in \mathcal{S}$. There is always at least one policy that is better than or equal to all other policies. These are called optimal policies, denoted by $\pi_*$. They share the same state-value function, denoted by $v_*$. This optimal state-value function is defined by Equation 3.8:
$$v_*(s) = \max_{\pi} v_\pi(s) \qquad (3.8)$$

Because $v_*$ is the value function for a policy, it must satisfy the self-consistency given by the Bellman equation for state values, given in Equation 3.7. The optimal value function's consistency condition can be written in a particular form without reference to any specific policy. This form is called the Bellman optimality equation and expresses the fact that the value of a state under an optimal policy must equal the expected return for the best action in that state.

$$\begin{aligned}
v_*(s) &= \max_{a \in \mathcal{A}(s)} q_{\pi_*}(s, a) \\
&= \max_{a} \mathbb{E}_{\pi_*}\!\left[g_t \mid s_t = s, a_t = a\right] \\
&= \max_{a} \mathbb{E}\!\left[r_{t+1} + \gamma v_*(s_{t+1}) \mid s_t = s, a_t = a\right] \\
&= \max_{a \in \mathcal{A}(s)} \sum_{s', r} p(s', r \mid s, a)\!\left[r + \gamma v_*(s')\right]. \qquad (3.9)
\end{aligned}$$

The Bellman optimality equation is a system of equations, one for each state. If there are $N$ states, then there are $N$ equations in $N$ unknowns. If the dynamics of the environment, $p(s', r \mid s, a)$, are known, then in principle one can solve this system of equations for $v_*$, using any one of a variety of methods for solving systems of nonlinear equations.

Once one has $v_*$, it is relatively easy to determine an optimal policy. For each state $s$, there will be one or more actions at which the maximum in the Bellman optimality equation is obtained. Any policy that assigns a nonzero probability only to these actions is an optimal policy. An agent that can learn the optimal policy by solving the Bellman equations is very useful, but in practice, this rarely happens. Even if a complete and accurate model of the environment's dynamics is available, it is usually not possible to simply compute an optimal policy by solving the Bellman optimality equation, due to the huge number of parameters. This massive number of parameters is often referred to as the curse of dimensionality. The available memory is a significant constraint, and large sets of equations need additional computing power. Building up approximations of value functions, policies, and models often requires far less memory. Therefore, approximations of value functions are frequently used in reinforcement learning algorithms.

3.2 Model-based solutions

A finite MDP describes an environment where both the state and action sets are finite and the dynamics of the system are given by a set of probabilities $p(s', r \mid s, a)$. If these two conditions are met, the reinforcement learning problem is a so-called model-based problem, because the MDP fully describes the environment. Dynamic programming techniques can be applied to these MDPs to obtain the optimal policies and value functions. The key idea of this dynamic programming approach is to use the value functions to organise and structure the search for better policies using the Bellman equations defined in Section 3.1.1.5. Iterative solutions exist to solve these Bellman optimality equations, and the two most used iterative algorithms are policy iteration and value iteration.

3.2.1 Policy iteration

Policy iteration improves a policy $\pi$ using $v_\pi$ to yield a better policy $\pi'$, then computes $v_{\pi'}$ to improve it again and yield an even better policy $\pi''$, and so on. A sequence of monotonically improving policies and value functions is obtained:
$$\pi_0 \xrightarrow{E} v_{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} v_{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E} \cdots \xrightarrow{I} \pi_* \xrightarrow{E} v_*$$

where $\xrightarrow{E}$ denotes a policy evaluation and $\xrightarrow{I}$ denotes a policy improvement. Each policy is guaranteed to be a strict improvement over the previous one, unless it is already optimal. A finite MDP has only a finite number of policies, from which it can be concluded that this process must converge to an optimal policy and optimal value function in a finite number of iterations. The drawback of policy iteration is that each iteration includes a policy evaluation step, which may itself require an iterative computation. When the policy evaluation is executed iteratively, $v_\pi$ converges only after an infinite number of steps. In practice, the policy evaluation step during policy iteration can be truncated in several ways without losing the convergence guarantees of policy iteration.
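A compact sketch of policy iteration for a small finite MDP is given below. The tabular representation of the dynamics, where `P[s][a]` is a list of (probability, next state, reward) triples, is an assumption made for illustration.

```python
import numpy as np

# Sketch of policy iteration: alternate iterative policy evaluation and greedy improvement.
def policy_iteration(P, n_states, n_actions, gamma=0.9, theta=1e-8):
    policy = np.zeros(n_states, dtype=int)       # arbitrary initial policy
    V = np.zeros(n_states)
    while True:
        # Policy evaluation: sweep the Bellman equation for the current policy.
        while True:
            delta = 0.0
            for s in range(n_states):
                v = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < theta:
                break
        # Policy improvement: act greedily with respect to the evaluated values.
        stable = True
        for s in range(n_states):
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                 for a in range(n_actions)]
            best = int(np.argmax(q))
            if best != policy[s]:
                policy[s], stable = best, False
        if stable:                               # no change anywhere: policy is optimal
            return policy, V
```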

3.2.2 Value iteration

The special case where the policy evaluation stops after exactly one iteration is called value iteration. It can be written as an operation that combines the policy improvement and truncated policy evaluation steps:

$$\begin{aligned}
v_{k+1}(s) &= \max_{a} \mathbb{E}\!\left[r_{t+1} + \gamma v_k(s_{t+1}) \mid s_t = s, a_t = a\right] \\
&= \max_{a} \sum_{s', r} p(s', r \mid s, a)\!\left[r + \gamma v_k(s')\right],
\end{aligned}$$
for all $s \in \mathcal{S}$. For arbitrary $v_0$, value iteration can be shown to converge to $v_*$ under the same conditions that ensure the existence of $v_*$.
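The corresponding sketch for value iteration, under the same tabular dynamics assumption as above, combines the evaluation and improvement steps into a single sweep.

```python
import numpy as np

# Sketch of value iteration: a single Bellman-optimality backup per state and sweep.
def value_iteration(P, n_states, n_actions, gamma=0.9, theta=1e-8):
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                 for a in range(n_actions)]
            best = max(q)
            delta = max(delta, abs(best - V[s]))
            V[s] = best                          # truncated evaluation + improvement
        if delta < theta:
            break
    # Extract a greedy policy from the converged value function.
    policy = [int(np.argmax([sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                             for a in range(n_actions)]))
              for s in range(n_states)]
    return policy, V
```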

3.3 Model-free solutions

As opposed to the model-based solutions where the dynamics of an MDP are known, complete knowledge of the environment is not assumed for the reinforcement learning problems considered in this section. Sometimes, the state and action sets can be continuous, or the state transition probability function can be hard to define. Reinforcement learning is primarily concerned with how to obtain the optimal policy when a model of the environment is not known in advance.

The agent must interact with its environment directly to get valuable information which, using an appropriate algorithm, can be processed to yield an optimal policy. These algorithms are so-called model-free solutions. Estimates of the value function of these unknown MDPs are needed before the policy can be optimised.

3.3.1 Model-free prediction

When the MDP is assumed to be episodic, which means it is a terminating sequence of events, Monte Carlo or Temporal Difference (TD) methods can be used to solve this reinforcement learning problem.

3.3.1.1 Monte Carlo method

Monte Carlo methods sample complete episodes and average the returns gathered for each state-action pair. However, the return after taking an action in one state depends on the actions taken in the later states of the same episode. Because all the selected actions can induce changes in the environment, the problem becomes non-stationary from the point of view of the earlier states. The general policy iteration is adapted to handle this non-stationarity, but the value functions are learned from the sample returns instead of being computed from a given model of the MDP. The term generalised policy iteration will be used to refer to the general idea of letting policy evaluation and policy improvement processes interact, independent of the granularity and other details of the two processes seen in Section 3.2.

Suppose $v_\pi(s)$ needs to be evaluated, the value of a state $s$ under policy $\pi$. A set of episodes is obtained by following $\pi$ and passing through $s$. Each occurrence of $s$ in an episode is called a visit of $s$. The value of a state is the expected return starting from that state. A clear way to estimate it from experience is simply to average the returns observed after each visit to that state. As more and more returns are observed, the average should converge to the expected value. This idea underlies all Monte Carlo methods.

The total return can be accumulated incrementally by $S(s) \leftarrow S(s) + g_t$, where $g_t$ is defined by Equation 3.4. The value of the state $s$ is then estimated by the mean return $V(s) = S(s)/N(s)$ and, due to the law of large numbers, $V(s)$ converges to $v_\pi(s)$ as $N(s)$ goes to infinity. While the mean value is computed immediately here, incremental mean calculations are more beneficial because of the iterative characteristics of the policy evaluation and improvement steps. The incremental mean calculation is given in Equation 3.10:
$$V(s_t) \leftarrow V(s_t) + \frac{1}{N(s_t)}\left(g_t - V(s_t)\right) \qquad (3.10)$$
$g_t$ is the actual return, and the incremental Monte Carlo algorithm updates the value of $V(s_t)$ towards this actual return.
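The incremental update of Equation 3.10 can be applied to every visited state of a finished episode, as sketched below. The episode format, a list of (state, reward) pairs where the reward is the one received after leaving that state, is an assumption made for the sketch.

```python
from collections import defaultdict

# Sketch of incremental every-visit Monte Carlo prediction (Equation 3.10).
def mc_update(V, N, episode, gamma=0.9):
    g = 0.0
    # Walk the episode backwards so g is the discounted return from each state onwards.
    for state, reward in reversed(episode):
        g = reward + gamma * g
        N[state] += 1
        V[state] += (g - V[state]) / N[state]    # move the mean towards the observed return
    return V, N

V, N = defaultdict(float), defaultdict(int)
V, N = mc_update(V, N, [("s0", 0.0), ("s1", 1.0), ("s2", 0.0)])
```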

3.3.1.2 Temporal Difference method

Whereas Monte Carlo methods must wait until the end of the episode to determine the increment to $V(s_t)$, because only then is $g_t$ known, there exist methods that only have to wait until the next time step to calculate these values. These methods use a combination of Monte Carlo and dynamic programming ideas and are more commonly known as Temporal Difference (TD) methods.

At time $t + 1$, a TD method immediately makes a useful update using the observed reward $r_{t+1}$ and the estimate $V(s_{t+1})$ via the following equation:

$$V(s_t) \leftarrow V(s_t) + \alpha\left[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\right] \qquad (3.11)$$

Because the TD method bases its updates on the current estimate, it is a so-called bootstrapping method. In contrast to Monte Carlo methods, TD can learn the updated values before knowing the outcome of the episode and even

without knowing the final result, as shown in Equation 3.11. TD can thus learn from so-called incomplete sequences. However, Monte Carlo methods are less sensitive to the initial value and have some good convergence properties. Monte Carlo methods are usually more efficient in environments which do not fulfil the Markov properties. The TD method described in Equation 3.11 is the so-called TD(0) estimator, which only updates the current state by observing the next state. Looking $n$ steps into the future can be more desirable. Equation 3.12 describes this $n$-step temporal-difference learning estimator:
$$V(s_t) \leftarrow V(s_t) + \alpha\left(g_t^{(n)} - V(s_t)\right) \qquad (3.12)$$
with:
$$g_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V(s_{t+n})$$

Instead of using these $n$-step returns directly, it is more common to average them over many $n$. For example, the average of the 2-step and 4-step returns is described in Equation 3.13:
$$\frac{1}{2} g^{(2)} + \frac{1}{2} g^{(4)} \qquad (3.13)$$
However, it would be even more beneficial to combine the information of all these time steps. The so-called $\lambda$-return weights the $n$-step returns with $(1 - \lambda)\lambda^{n-1}$:
$$g_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} g_t^{(n)} \qquad (3.14)$$
Moreover, analogous to TD(0), the more general form TD($\lambda$) is given in Equation 3.15:
$$V(s_t) \leftarrow V(s_t) + \alpha\left(g_t^{\lambda} - V(s_t)\right) \qquad (3.15)$$

TD(0) is a special case of TD($\lambda$). When $\lambda = 0$, only the current state is updated, which is equivalent to Equation 3.11. When $\lambda = 1$, all time steps are weighted equally until the end of the episode. Therefore, TD(1) is roughly equal to Monte Carlo updating, with the small difference that Monte Carlo methods learn nothing from an episode until it is over. Monte Carlo updates are so-called offline update techniques: the updates are accumulated during the episode but only applied in batches at the end. TD($\lambda$) methods are online, incremental update methods. If something right or wrong happens during an episode, control methods based on TD($\lambda$) can learn this behaviour immediately and make changes during the episode.
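A single TD(0) update (Equation 3.11) reduces to a few lines, as in the sketch below; the state names and the step-size values are illustrative assumptions only.

```python
from collections import defaultdict

# Sketch of the online TD(0) update of Equation 3.11, applied after every transition.
def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.9):
    td_target = reward + gamma * V[next_state]   # bootstrapped estimate of the return
    V[state] += alpha * (td_target - V[state])   # move V(s_t) towards the TD target
    return V

V = defaultdict(float)
V = td0_update(V, "s0", 1.0, "s1")
```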

3.3.2 Model-free control

When a model is predefined, state values alone are sufficient to determine a policy: the agent simply looks ahead one step and chooses whichever action leads to the best combination of reward and next state. Without such a model, one must explicitly estimate the value of each action before a possibly optimal policy can be suggested.

The problem in the evaluation of these action values is to estimate $q_\pi(s, a)$, the expected return when starting in state $s$, taking action $a$, and thereafter following policy $\pi$. The state-action combinations can be estimated by Equation 3.16:
$$q_\pi(s, a) = \mathbb{E}\!\left[r_{t+1} + \gamma V_\pi(s_{t+1}) \mid s_t = s, a_t = a\right] \qquad (3.16)$$
There are two common techniques to find the optimal state-action values.

3.3.2.1 Sarsa

In Sarsa, the initial q values are randomly selected and updated in every time step k according to Equation 3.17.

$$q(s_k, a_k) = (1 - \alpha)\, q(s_k, a_k) + \alpha\left[r_{k+1} + \gamma\, q(s_{k+1}, a_{k+1})\right] \qquad (3.17)$$

Sarsa is an on-policy method and learns the action values corresponding to the policy it follows.
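The sketch below shows one Sarsa step (Equation 3.17) combined with an epsilon-greedy behaviour policy; the action names and the epsilon value are assumptions made for illustration.

```python
import random
from collections import defaultdict

def epsilon_greedy(q, state, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)                     # explore
    return max(actions, key=lambda a: q[(state, a)])      # exploit current estimates

# One Sarsa step: the bootstrap uses the action that will actually be executed next.
def sarsa_step(q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
    a_next = epsilon_greedy(q, s_next, actions, epsilon)  # on-policy next action
    q[(s, a)] = (1 - alpha) * q[(s, a)] + alpha * (r + gamma * q[(s_next, a_next)])
    return q, a_next                                      # a_next is taken in the next step

q = defaultdict(float)
q, next_action = sarsa_step(q, "s0", "talk", 1.0, "s1", actions=["talk", "music"])
```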

3.3.2.2 Q-learning

In Q-learning, the initial q values are randomly selected and updated in every time step k using Equation 3.18.

$$q(s_k, a_k) = (1 - \alpha)\, q(s_k, a_k) + \alpha\left[r_{k+1} + \gamma \max_{a} q(s_{k+1}, a)\right] \qquad (3.18)$$
Both techniques have their advantages for particular types of reinforcement learning problems. Q-learning learns the action values corresponding to a greedy policy by selecting the best possible state-action value in the calculation of the update. Q-learning has the disadvantage of converging more slowly than Sarsa, but can react more easily to changing behaviours [65].
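For comparison, one Q-learning step (Equation 3.18) is sketched below: the only difference with Sarsa is that the bootstrap uses the greedy action in the next state, regardless of the action that will actually be executed. The action names are again illustrative assumptions.

```python
from collections import defaultdict

# Sketch of one Q-learning step (Equation 3.18).
def q_learning_step(q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    best_next = max(q[(s_next, a2)] for a2 in actions)    # max_a q(s_{k+1}, a)
    q[(s, a)] = (1 - alpha) * q[(s, a)] + alpha * (r + gamma * best_next)
    return q

q = defaultdict(float)
q = q_learning_step(q, "s0", "talk", 1.0, "s1", actions=["talk", "music"])
```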

3.4 Bandits

Problems exist where the agent does not have to act in more than one situation. These reinforcement learning problems can be solved more easily than the ones described in the previous sections because of the reduced state space. Acting in only one situation can be illustrated by a simple casino slot machine where the agent, a gambler, tries to maximise his or her revenue. Consecutive handle pulls eventually reveal some knowledge about the machine's reward distribution, and based on this gathered information, the gambler can decide on his or her next action, namely to pull again or to stop playing. Because a casino always wins in the end, these slot machines were called bandits. A k-armed bandit problem is defined by a tuple of $k$ actions. Instead of a single lever, there are $k$ different handles or actions, and the unknown probability distribution over the rewards is given by Equation 3.19:

$$\mathcal{R}^a(r) = \mathbb{P}[r \mid a], \qquad (3.19)$$
with $a$ an action. At each time step $t$, the agent selects an action $a_t \in \mathcal{A}$ and the environment generates a reward according to the selected action. The goal of a k-armed bandit algorithm is to select those actions which maximise the cumulative reward $\sum_{t=1}^{\tau} r_t$. Because the probability distribution of the rewards is unknown, the action value can be estimated similarly to Equation 3.16, but without the state parameter, because the bandit acts in only one situation.

$$q(a) = \mathbb{E}[r \mid a]$$

The optimal value $V_*$ follows directly from this equation by selecting the best action:

$$V_* = q(a_*) = \max_{a \in \mathcal{A}} q(a)$$

The performance of a bandit algorithm is mostly expressed in the form of the total regret, which is the total expectation of the differences between the optimal value and the value of the selected action in each time step. Maximising the cumulative reward equals minimising the total regret. The total regret function defined in Equation 3.20 can be written as a function of

the expected number of selections for action $a$ and the difference in value between action $a$ and the optimal action $a_*$:
$$L_t = \mathbb{E}\!\left[\sum_{t=1}^{\tau} \left(V_* - q(a_t)\right)\right] = \sum_{a \in \mathcal{A}} \mathbb{E}[N_t(a)]\left(V_* - q(a)\right) \qquad (3.20)$$
with $N_t(a)$ the number of selections of action $a$. The optimal action must be known a priori when the difference in value between action $a$ and the optimal action $a_*$ is calculated, and this is never the case; otherwise, solving a k-armed bandit problem would be trivial.

3.4.1 The exploration-exploitation paradigm

The bandit algorithm will have to explore the action space to find the best possible action. Exploring the action space and deciding when to exploit the current best action is one of the major concerns in a bandit problem. Several different exploration-exploitation tactics are currently available, but the best policy depends on the problem definition. Each of these policy algorithms estimates the value $\hat{q}_t(a) \approx q_t(a)$. The value of each action is estimated by a Monte Carlo estimation, analogous to the model-free estimation in Section 3.3.1.

$$\hat{q}_t(a) = \frac{1}{N_t(a)} \sum_{t=1}^{T} r_t \, \mathbf{1}(a_t = a)$$

The exploration-exploitation paradigm became prominent when bandit algorithms were massively used in online advertising. Many advertising companies wanted to know which advertisements should be exposed to the visitors of their sites to maximise their revenue. Knowledge about the click rate of these users in relation to the exposed advertisements was gathered by exposing different advertisements using a predefined policy. This approach explores the different revenue probabilities. When enough certainty about the advertisements was gathered, companies started to show the best commercials on their sites. The knowledge gathered during the exploration provides a base for the exploitation phase. Advertisements change over time, but visitors and revenue patterns evolve over time as well. Finding a balance between exploring the possibilities and exploiting the best possible action is the main difficulty of a bandit problem.
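A minimal epsilon-greedy k-armed bandit, using the incremental sample-average estimate of $\hat{q}_t(a)$ described above, could look like the sketch below; the Bernoulli arms and their success probabilities are made up for the example.

```python
import random

# Sketch of an epsilon-greedy k-armed bandit with incremental value estimates.
# `pull(a)` stands in for the unknown environment returning a reward for arm a.
def run_bandit(pull, k, steps=1000, epsilon=0.1):
    q_hat = [0.0] * k                                 # estimated action values
    counts = [0] * k                                  # N_t(a)
    total_reward = 0.0
    for _ in range(steps):
        if random.random() < epsilon:
            a = random.randrange(k)                   # explore a random arm
        else:
            a = max(range(k), key=lambda i: q_hat[i]) # exploit the best estimate
        r = pull(a)
        counts[a] += 1
        q_hat[a] += (r - q_hat[a]) / counts[a]        # incremental sample average
        total_reward += r
    return q_hat, total_reward

probs = [0.2, 0.5, 0.7]                               # example Bernoulli arms
q_hat, total = run_bandit(lambda a: float(random.random() < probs[a]), k=3)
```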

3.4.2 Contextual bandits

The previously defined bandit problem tries to find a single best action when the task is stationary, or tracks the best action as it changes over time when the task is non-stationary. However, in a general reinforcement learning problem, there is usually more than one situation to act in, and the goal is to learn a policy for all these different states. Suppose there are several different bandit problems and that at each time step a random one occurs, because the bandit task changes randomly from step to step. Unless the real action values change slowly over time, the previously defined methods will not work very well. If the currently presented task is defined properly, there is some distinctive clue about its identity, but the action values still stay hidden. A policy should be learned, associated with every task, that takes the best possible action when facing the task signalled by the given distinctive clue. With such a policy, one can usually distinguish one bandit task from another much better than when additional information is absent. These problems are so-called associative tasks, because they involve both trial-and-error search for the best actions and the association of these actions with the situations in which they work best. These associative tasks lie in between bandit problems and the full reinforcement learning problem. If both the actions and the rewards are allowed to affect the next situation, a full reinforcement learning problem is described. This problem type is shown in the bottom diagram of Figure 3.2 and describes the case where the action affects the state, and rewards can

be delayed in time. If each action only affects the immediate reward, a bandit definition, shown in the top diagram of Figure 3.2, suffices. The contextual bandit problem, the more general name for these associative tasks, has many applications and is often more suitable than the standard bandit problem. Environments where no contextual information can be used efficiently are rare in practice. Figure 3.2 shows the diagram for these contextual bandit problems in the middle, where both a single state and the action affect the reward. The simple exploration-exploitation algorithms are adapted to take this extra information into account [70, 71].

Figure 3.2: Overview of the different reinforcement learning problems [72].

3.5 Deep Reinforcement Learning

With the recent achievements in deep learning, benefiting from the big data hype, powerful computation and new algorithmic techniques, reinforcement learning has gained extra attention during the last couple of years, especially the combination of reinforcement learning and deep neural networks, i.e., deep reinforcement learning [73]. In Section 3.3.2, the Q-function was built based on the Bellman equation by using the Q-value iteration update function in Equation 3.16. When there are a lot of different states with a limited number of actions, the table associated with this function, the so-called Q-table, will be too large to store, and the algorithm will be unable to work properly. For example, suppose a picture of 80 × 80 grayscale pixels represents the possible states of a reinforcement learning problem. 256 different values per pixel are possible, and there are in total 6400 pixels, so this learning problem has 256^6400 possible states. When there are only four possible actions, 256^6400 × 4 different Q-values would have to be stored. Using deep neural networks, there is no longer a need to store all state-action Q-values. Deep neural networks work well for inferring the mapping implied by data, giving them the ability to predict an approximated output for any new input. These networks also automatically learn a set of internal features which are useful in complex non-linear mapping domains, such as image processing. While single action values can easily be predicted with these types of neural networks, the Deep Q-network from

Google DeepMind can output all the Q-values for a given state in a single forward pass, so only one forward pass is needed per optimal future value estimate [74].
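The idea can be made concrete with a minimal sketch in Python with NumPy. It is purely illustrative and not part of the WONDER implementation: the layer sizes, initialisation and activation are assumptions, and the point is only that a small network maps a state vector to one Q-value per action, so the intractable Q-table is replaced by the network's weights.

import numpy as np

# Minimal sketch: a two-layer network that maps a state vector to one
# Q-value per action, replacing an intractably large Q-table.
rng = np.random.default_rng(0)
n_state, n_hidden, n_actions = 6400, 64, 4

W1 = rng.normal(scale=0.01, size=(n_hidden, n_state))
W2 = rng.normal(scale=0.01, size=(n_actions, n_hidden))

def q_values(state):
    """Single forward pass: returns the Q-value of every action at once."""
    hidden = np.maximum(0.0, W1 @ state)   # ReLU hidden layer
    return W2 @ hidden

state = rng.random(n_state)                # e.g. a flattened 80x80 grayscale image
print(q_values(state))                     # four Q-values, one per action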

3.5.1 Meta reinforcement learning

Deep reinforcement learning has managed to learn sophisticated behaviours automatically. However, the learning process requires a huge number of trials. In contrast, humans can learn new tasks in just a few trials, benefiting from their prior knowledge about the world. The previous sections describe the learning process for situations where the environment stays almost identical. Imagine an agent trained to solve a two-armed bandit task in which one arm always provides a positive reward and the other arm always provides no reward. Using any reinforcement learning algorithm such as Q-learning, the agent can quickly learn to choose the arm with the positive reward. If we take this trained agent and present it with a nearly identical bandit problem, except with the values of the arms switched, the agent will perform worse since it will simply keep picking the arm it previously believed to be the rewarding one. In the traditional reinforcement learning paradigm, our only recourse would be to train a new agent on this new bandit problem. Meta-reinforcement learning (or RL²) tries to solve this problem by learning a policy for learning new policies. An agent can then be trained for all similar bandits it may encounter in the future as well. Meta-reinforcement learning uses a recurrent neural network to learn and exploit the structure of the problem dynamically. By simply adjusting its hidden state, it can solve new problems without the need for retraining [75].

3.6 Main findings

The overview in Figure 3.2 clearly describes the differences between the multiple reinforcement learning problems defined in this chapter. First, full reinforcement learning problems were described in detail to show the full power of this learning technique. A simpler representation of a reinforcement learning problem was given in Section 3.4. Learning from interactions can be defined by such a simple bandit problem when the actions are executed per intervention. The execution of an action during such an intervention does not involve the performance of further ones. This problem is more commonly known as a non-associative problem with a time horizon of zero, because current events will not affect later ones. Different bandit algorithms can be designed and compared to determine the most beneficial ones for our problem. A learning system using a bandit algorithm will be able to learn the action preferences for a single patient. In a NH, multiple patients live side by side, and each patient will need such a bandit system to define his or her action preferences. Due to the exploration-exploitation paradigm, the bandit algorithms will try all the different actions in the action space. Learning can converge slowly when this action space is large or when the reward values are inaccurate. Other techniques, which still offer a personalised interaction but learn on a more global scale, could be beneficial to solve the issues of maintaining multiple bandit algorithms. Learning globally can be performed by using contextual information about the patients and their environment. Contextual bandit algorithms will be useful in this respect, and their concepts are built upon those of the general bandits. Adding this additional layer of information on top of the general bandit algorithms could be beneficial for our learning system. Full reinforcement learning techniques will not be necessary during this dissertation because effects over multiple interventions will be hard to analyse. The situations of the PwD will differ a lot between these interventions. Dealing with those differences is more important than performing actions to acquire a high long-term reward.

4 THE WONDER PROJECT

Chapter 2 discussed the benefits of using non-pharmacological treatments instead of medication for PwD residing in a NH. Many different therapies exist, nowadays often applied by a social robot, which can prevent or alleviate BDs. However, individualised robotic care applications are rather scarce because the preprogrammed interactions and long set-up times prevent the nursing staff from applying these sessions for every patient individually. The WONDER project will build a system where such individualised non-pharmacological interventions with a robot are possible. This short chapter first gives some general information about the WONDER project in Section 4.1. This dissertation is part of this project, and its contribution is explained in Section 4.2.

4.1 Project background

WONDER proposes interventions for Wandering and Other behavioural disturbaNces of people with DEmentia in nursing homes by personalised Robot interactions. This project will build a non-pharmacological intervention system by implementing a service where the humanoid Zora robot [76], a care application built on top of the Nao robot, elicits personal memories associated with positive feelings. As PwD revisit the past more and more, Zora providing personalised interactions with the residents will have a positive effect on PwD and in turn prevent or alleviate the BDs [16]. Today, Zora is already successfully used for entertainment and therapeutic purposes in NHs. Inspired by the enthusiastic reactions of residents and caregivers to Zora, the question arose during feedback sessions with nursing staff whether Zora could be integrated into the daily care processes for the prevention and alleviation of BDs. The WONDER project consists of two important technological pillars:
• First, the manifestation of a BD must be automatically detectable. Probabilistic pattern recognition algorithms are researched to process the raw data coming from sensors, wearables and robots into a BD detection system. Next, the output of these algorithms can be combined with the available information about the PwD together with some knowledge about the current context. The final system has the ability to prioritise the detected manifestations and to assess their validity with more accuracy.
• Second, an intervention strategy must ensure that the Zora robot has 24/7 coverage of the NH floor. In the WONDER project, the Zora robot works semi-autonomously and walks or rides from one resident to another to start a personalised interaction, instead of being manually activated and executing some predefined sequences of instructions.
The WONDER system will let the care coordinator create a profile with personal information about the residents, for example their favourite song or a youth memory. He or she can also input the organisational information (menu, NH map, activity/meal timetable). Combined with information provided by sensors and the customisable wearable, a per-resident intervention strategy is determined. In acute situations, the staff is alerted, or the robot is immediately sent to distract the PwD temporarily. In addition, pro-active interventions are scheduled collaboratively over multiple robots throughout the day/night (24/7), taking into account the limitations of Zora. A schematic overview of the full WONDER project is given in Figure 4.1.

Figure 4.1: Conceptual architecture of the WONDER project.

4.2 Contribution of dissertation

Many different parts of the WONDER project are eventually combined to build the intervention system. When the robot is sent to distract the PwD temporarily, it has a fixed number of actions it can use. Currently, the executed action is determined by a predefined set of rules using information about the PwD and the current situation analysed by the available sensors. Such a rule set is fixed, and updates will be required when behavioural patterns change or new PwD are admitted. More global rules could help to reduce the number of updates, but this would affect the personalised interactions. This dissertation investigates whether it is possible to learn the preferable actions for these PwD. A personalised intervention system which can learn from its interactions can keep eliciting positive effects, while changes in behavioural patterns can be noticed and adaptations can be made after every executed intervention. Chapter 2 and Chapter 3 discussed which learning techniques can be used to determine these action preferences.

5 SCENARIOS

The proposed learning system will determine how to select an action to resolve a specific BD and analyse the effects on the PwD during the action's execution. BDs occur in many different situations. The learning system will have to deal with these various situations to make sure it executes the most appropriate actions. This dissertation will mainly focus on the wandering and screaming BDs. The designed concepts will, however, be applicable to other BDs as well. In this chapter, several use case scenarios are explained to illustrate how a learning system should operate based on the elementary concepts of reinforcement learning described in Chapter 3. After the general flow of an intervention scenario is described, the other use case scenarios are listed in order of functionality and necessity for a NH. The last couple of scenarios illustrate more advanced problem-solving use cases.

5.1 General scenario

When an intervention is needed, and the robot is sent to resolve this problem, it will drive autonomously to the PwD and start an intervention. All the scenarios start from the intervention point of view; the robot movement patterns, patient recognition and BD detection system are used implicitly. The details of these components and their current availability are left out of scope. During the first steps of the intervention, the robot will have to select an action to execute. The chosen action is based on the action selection procedure, defined as a policy in Section 3.1.1.2. The robot executes the selected action to attract the attention of the PwD. The reactions of the PwD are monitored and analysed during the execution of the chosen action by the robot. When the intervention finishes, which is when the executed action finishes, the learning system will determine whether the executed action had a positive or adverse effect on the PwD based on the information gathered during this intervention. All the gathered data can be combined into a so-called feedback value or reward signal, and this signal will influence the action selection procedure whenever another intervention for this particular PwD is needed. A schematic overview of this flow is given in Figure 5.1.

Figure 5.1: Flowchart representation of the general intervention scenario.

How the reactions of the PwD can be monitored and analysed will be discussed in a later chapter. The goal of these scenarios is to show the link between a reinforcement learning problem and the problem defined in this dissertation. The robot acts here as an agent that has to determine which actions should be executed to receive the highest positive effect from the PwD.

5.1.1 Actions

During an intervention, the robot will have to execute an action to attract the attention of the PwD. The robot manufacturer defines the possible actions, or these actions are designed by companies which have gained a high level of expertise in these robotic applications [76]. Examples of these actions are: • Sing a song: the robot will start singing a song and will try to encourage the patient to sing along.

• Tell a story: the robot tells a story with sound effects to keep the attention of the patient.

• Dance: the robot does a little dance.

• Play some music: the robot plays a song from a given period which can elicit positive memories.

• Show a picture: when the robot can show visual content, it can be used to show different pictures of, for example, family members and say something about them.

• Small talk: the robot talks with the patient about the weather or the daily activities. Each of these actions can be divided into multiple sub-actions where, for example, different songs can be sung. The goal of our learning system is not to differentiate between these sub-actions, because the nursing staff or family members can easily determine, for example, the patient's song preferences. Some actions will be more beneficial in a given situation than others, and these action preferences also differ between PwD. The goal of this dissertation is to determine when to use a particular action.

5.2 Screaming behaviour

This scenario describes the situation where a PwD has started to scream. There are no pain indicators, and the PwD has not fallen. The robot can execute some predefined actions to stop the current behavioural disturbance. One or multiple actions can be appropriate in this situation, but which ones is currently unknown. The robot selects a single action and measures a feedback signal during this action through sensors available on the robot and in the rooms and hallways of the NH. When the intervention finishes, all the gathered sensor values are analysed to determine whether the executed action had a positive effect on the PwD or not. The intervention proceeds according to the same pattern shown in Figure 5.1, but now using a loop to detect the end of the intervention. When the intervention finishes, a positive effect can be assigned to the executed action when the PwD has stopped screaming. A negative effect is assigned when the PwD still screams after the intervention. The assignment of such a positive or adverse reward will influence the learning process of the application. Figure 5.2 visualises this scenario.

Figure 5.2: Flowchart representation of the screaming behaviour intervention.

5.3 Wandering behaviour

Wandering behaviour can manifest itself in several different ways and can become dangerous for the PwD in several situations. PwD have a purpose for these wanderings, and may be agitated about whatever that reason is. The reason can be real or imagined, but the patient's emotional state is no less disrupted when the threat is imagined. Wandering can

also occur when the patient is unaware of the real surroundings and proceeds to wander according to an imagined environment [77]. This behaviour can fatigue the patients, and such wandering behaviours are therefore undesirable. Wandering can be resolved by gently interacting with these patients and attracting their attention. The robot will execute an action to trigger the attention of the PwD. When the patient moves during the intervention, the robot can assume the wandering behaviour still continues, and the intervention is stopped. The nursing staff is alerted that the executed action was not sufficient for this PwD and that someone has to look after this patient. When the PwD stays at his or her current location during the whole intervention, a positive feedback signal supports the choice of the current action. An overview of this scenario is given in Figure 5.3.

Figure 5.3: Flowchart representation of the wandering intervention.

5.4 Advanced wandering behaviour

How the robot should act upon the wandering behaviour of a PwD differs between situations. A more advanced approach to handling this BD can, therefore, be appropriate. This advanced scenario describes a similar situation where a PwD wanders around in his or her room or a hallway. During the night, the PwD is lured back into his/her room, and the robot motivates the PwD to go back to sleep. During the day, the robot tries to lure the PwD to a general meeting place, for example the cafeteria or relaxation area. The robot will assign a positive feedback signal to the action which:
• Leads the PwD to the most appropriate location during the intervention.
• Decreases the distance between the PwD and a predefined area, even if the PwD did not reach the location during the intervention.
• Gets the PwD back to his/her bed during the night.
The different steps of this scenario are visualised in Figure 5.4, which shows when positive reward values should be assigned when the intervention was successful. Staff members will always be informed so they can analyse the robotic interventions. Additional sensors will be needed for both these wandering scenarios to determine the location of the PwD and to detect whether the PwD walks to the correct predefined place.

Figure 5.4: Flowchart representation of the advanced wandering intervention.

5.5 Unwanted visiting behaviour

The situation where a PwD is in the room of another PwD can be negative for both, because it involves stress, distraction and sometimes aggression. The robot tries to get the attention of the intruding PwD by calling his or her name and leading him or her away from the other PwD. When the robot has lured the PwD outside the room, a positive reward is assigned. Next, the goal is to keep the PwD at a certain location until a staff member has arrived. This scenario shows the ability to combine multiple interventions to obtain the desired behaviour: an intervention to hold the patient at his or her current location until a staff member arrives is executed once the patient has been lured outside the other patient's room. An overview of the scenario is given in Figure 5.5.

Figure 5.5: Flowchart representation of the unwanted visit intervention.

5.6 Multi-action intervention

A more general situation using multiple actions during an intervention can be defined by extending the use case of Section 5.5. When the robot tries to resolve a particular disturbance, it can be beneficial to split the entire intervention into several different sub-interventions. During each sub-intervention, an action is executed and analysed for further improvement. A general feedback signal is assigned to the whole intervention when all sub-interventions are finished, and this global feedback is based on the rewards given during the sub-interventions. After each sub-intervention, the decision is made to keep executing the current action or to change to another one, based on the gathered intermediate feedback values. When the intervention is finished, after the execution of a predefined number of sub-interventions, the intermediate feedback values together with the analysis of whether the BD was resolved are used to assign a single global feedback value to the executed action sequence. Figure 5.6 represents the flow of this scenario.

Figure 5.6: Flowchart representation of the multiple actions intervention.

6 BANDIT DESIGN

The whole action selection procedure was simplified to a problem that does not involve learning to act in more than one situation. Therefore, the complexity of the full reinforcement learning problem is not needed in this dissertation. Following the overview in Section 3.4, multi-armed bandits will be designed to select those actions which are the most beneficial for the PwD. To simplify the problem, the agent will operate in a non-associative setting where actions only operate in a single situation, as described in Section 3.6. This chapter discusses all the necessary components to design the bandit learning algorithms. A bandit algorithm needs three main components: an agent, a policy and a reward signal. Section 6.1 compares two types of basic bandit agents and defines a contextual agent that can deal with additional information, making it possible to learn on a global level. In Section 6.2, several different policies are discussed which handle the exploration-exploitation problem. Different reward signals are designed in Section 6.3 to determine the feedback between the executed action and the PwD. Action selection procedures in an associative setting, where actions are taken in more than one situation, can be beneficial as well in the context of this dissertation because they provide more control over the executed intervention. The scenario in Section 5.6 showed these benefits by executing multiple sub-interventions. Combinations of multiple linked multi-armed bandits will be designed in Section 6.4 to deal with actions which act upon each other.

6.1 Bandit agents

Agents in a reinforcement learning problem interact with the environment by performing a well-chosen action. In the full reinforcement learning problem, this action selection procedure was influenced by the current state observation of the environment together with the reward obtained from executing an action. In this section, there is only one state in the state space of the problem, and therefore, the action selection procedure will depend on the feedback signal only. At each request, an agent can take one action from a set of actions. The action is chosen using a strategy based on the history of previously selected actions and the outcomes of previous observations. These outcomes are based on the reward signals the agent receives. This section discusses three types of agents.

6.1.1 Normal agent

For a normal agent, the reward signals inherently represent the feedback scores directly. For each action, the mean value of the unknown probability distribution will be updated with every new observation. Based on these mean values, the agent can exploit the action which results in the highest reward. The dynamics of this agent are given in Equations 6.1 and 6.2.

g(a) = 1/N(a), (6.1)

estimate_t(a) += g(a) · (r_t − estimate_{t−1}(a)) (6.2)

with N(a) the number of times action a was selected in previous observations, estimate_t(a) the mean value of action a at observation t, and r_t the reward signal of the currently executed action.

33 6.1.1.1 Normal agent example

The concepts of a normal agent are illustrated by an example. A bandit algorithm has to make a decision between two actions. The agent receives a reward of one for the first action 75% of the time and a reward of one for the second action only 50% of the time. In the remaining 25% and 50% of the cases, the agent receives a reward of zero. Both actions are thus Bernoulli distributed. The exploration-exploitation problem for this bandit algorithm is neglected during this simple example. 200 episodes were performed, where a single action was chosen in every episode. This experiment was repeated 1000 times, and the average results per episode together with the optimal action selection rates are given in Figure 6.1. In this figure, the optimal action selections increase from 23% to 79% during the first 50 observations. The average reward values increase from 60% to 80% because the first action is selected more frequently.
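A minimal Python sketch of this experiment is given below. It is purely illustrative: the class and variable names are my own, the update is the sample-average rule of Equations 6.1 and 6.2 with random tie-breaking, and a single greedy run is shown instead of the 1000 averaged repetitions behind Figure 6.1.

import random

class NormalAgent:
    """Sample-average (incremental mean) agent, cf. Equations 6.1 and 6.2."""
    def __init__(self, n_actions):
        self.counts = [0] * n_actions        # N(a)
        self.estimates = [0.0] * n_actions   # estimate(a)

    def select(self):
        best = max(self.estimates)
        return random.choice([a for a, e in enumerate(self.estimates) if e == best])

    def update(self, action, reward):
        self.counts[action] += 1
        g = 1.0 / self.counts[action]                                     # Equation 6.1
        self.estimates[action] += g * (reward - self.estimates[action])   # Equation 6.2

# Two Bernoulli arms: a reward of one with probability 0.75 and 0.50 respectively.
probs = [0.75, 0.50]
agent = NormalAgent(len(probs))
for episode in range(200):
    a = agent.select()
    r = 1.0 if random.random() < probs[a] else 0.0
    agent.update(a, r)
print(agent.estimates)   # should approach [0.75, 0.50]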

Figure 6.1: Normal agent design for a Bernoulli-distributed action selection procedure.

6.1.2 Gradient agent

A normal agent is only beneficial when the differences between the actions can be measured and these differences are significant. When the reward signals of several actions are close to each other, the agent will alternate between these actions. In cases where these alternations could be confusing for the system or environment, the normal agent will be of little use. To cope with small differences between actions, a gradient agent tries to learn the relative differences between actions instead of estimating the rewards themselves. By doing this, it can effectively learn a preference of one action over another. The action preferences are updated at every observation according to Equation 6.3 for the selected action and Equation 6.4 for the other, non-selected actions.

preference_{t+1}(a) = preference_t(a) + α(r_t − r̄_t)(1 − π_t(a)), and (6.3)

preference_{t+1}(a_n) = preference_t(a_n) − α(r_t − r̄_t)π_t(a_n), ∀a_n ≠ a (6.4)

π_t(a) = e^{preference_t(a)} / ∑_{x=1}^{k} e^{preference_t(x)} (6.5)

with α > 0 the step-size parameter and π_t(a) the probability of taking action a at observation t, which is based on the action preferences. In observation t, the agent selects action a with probability e^{Q[t,a]/τ} / ∑_a e^{Q[t,a]/τ}, where τ > 0 is the temperature specifying how randomly values should be chosen and Q[t, a] are the state-action values. When τ is high, the actions are chosen in almost equal amounts. As the temperature is reduced, the highest-valued actions are more likely to be selected. In the limit when τ goes to zero, the best action is always chosen. r̄_t is the average of all the rewards up to observation t. This r̄_t value acts as a baseline, and the rewards are compared with this baseline. If the reward is higher than the baseline, then the probability of taking action a in the future increases. When the reward is below the baseline, the probability decreases accordingly. All the probabilities of the non-selected actions a_n move in the opposite direction.
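A minimal Python sketch of this preference update is shown below. It is illustrative only: the names are my own, the softmax of Equation 6.5 is used without a temperature parameter, and the running average reward serves as the baseline r̄_t.

import math
import random

class GradientAgent:
    """Preference-based (gradient) bandit agent, cf. Equations 6.3-6.5."""
    def __init__(self, n_actions, alpha=0.1):
        self.preferences = [0.0] * n_actions
        self.alpha = alpha
        self.avg_reward = 0.0   # baseline, the running average reward
        self.t = 0

    def probabilities(self):
        exps = [math.exp(p) for p in self.preferences]            # Equation 6.5
        total = sum(exps)
        return [e / total for e in exps]

    def select(self):
        return random.choices(range(len(self.preferences)),
                              weights=self.probabilities())[0]

    def update(self, action, reward):
        self.t += 1
        self.avg_reward += (reward - self.avg_reward) / self.t
        pi = self.probabilities()
        for a in range(len(self.preferences)):
            if a == action:   # Equation 6.3: push the selected action towards the reward
                self.preferences[a] += self.alpha * (reward - self.avg_reward) * (1 - pi[a])
            else:             # Equation 6.4: move the other actions in the opposite direction
                self.preferences[a] -= self.alpha * (reward - self.avg_reward) * pi[a]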

6.1.2.1 Gradient agent example

To illustrate the concepts of a gradient agent, an example is given in Figure 6.2. This example is built with the same specifications defined in Section 6.1.1.1, where the agent can select two actions which have respectively a 75% and a 50% chance of returning a reward of one. During the first 50 episodes, the optimal action selections increase from 50% to 86%. Which bandit agent is the most beneficial depends on the situation and the reward signal.

Figure 6.2: Gradient agent design for a Bernoulli-distributed action selection procedure.

6.1.3 Contextual agent

The normal and gradient agents were designed to learn the preferred action for a PwD. These agents only take into account the reward signal, gathered at the end of the intervention. If only a single PwD suffered from a single BD, where every intervention could be executed under similar conditions, the previously proposed bandit agents would perform well. However, the goal of this dissertation is to develop an action selection procedure which can deal with many different situations, or more precisely, can act in different contexts. Using extra information gathered from the environment will eventually lead to a more accurate action selection procedure, personalised for each PwD and the BD it tries to resolve. As visualised in Figure 3.2, the contextual agent uses both the state and the action to build its reward signal and

would be ideal in this case. On the one hand, the context of an intervention can be expressed as a combination of patient and environmental features. All these features can be encoded in a feature vector X. Every PwD has different characteristics, for example a different dementia type and a different MMSE score, which can all easily be transformed into features. On the other hand, the time of day or the current weather can influence the execution of an intervention and can be described as contextual features as well. In this dissertation, the focus lies on the differences between the patients, and only their characteristics are represented in the feature vector X.

6.1.3.1 Simple Contextual Bandits

When a relatively small feature vector X with, for example, ten binary features is used, and there are only four possible actions to choose from, one of the following techniques can be used to cope with the 2^10 = 1024 different states:
• Ignore the additional context and build the probability distributions, one for each arm. This method was described in Section 6.1.1 and Section 6.1.2 and is a good baseline to use in comparisons.
• Another approach is to build a probability distribution for each context. This technique is feasible for a small number of states and a limited set of actions and can easily be implemented by transforming the normal agent of Section 6.1.1 into a contextual agent. This transformation remains rather simple for our problem because an intervention only occurs for a specific PwD and this PwD is known a priori. Saving the mean values of the actions separately for each PwD results in a first simple contextual agent, because the learning system can then differentiate between several PwD. The action selection procedure will then select the most appropriate action based on the previous observations of a specific PwD, given as input to the selection mechanism as defined by Equation 6.6.

g(a, p) = 1/N(a, p),

estimate_t(a, p) += g(a, p) · (r_t − estimate_{t−1}(a, p)), (6.6)

with N(a, p) the number of times action a was selected in previous observations for patient p, estimate_t(a, p) the mean value of action a at observation t for patient p, and r_t the reward signal of the currently executed action. While this method can easily differentiate between several patients, the number of stored values increases rapidly when the number of PwD increases, and many observations are needed to train the bandits separately.
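A minimal Python sketch of this per-patient variant is given below; it is illustrative only, with hypothetical patient identifiers, and simply keys the counts and estimates on the (patient, action) pair of Equation 6.6.

from collections import defaultdict

class PerPatientAgent:
    """Simple contextual agent: one sample-average estimate per (patient, action) pair."""
    def __init__(self, n_actions):
        self.n_actions = n_actions
        self.counts = defaultdict(int)        # N(a, p)
        self.estimates = defaultdict(float)   # estimate(a, p)

    def select(self, patient):
        values = [self.estimates[(patient, a)] for a in range(self.n_actions)]
        return max(range(self.n_actions), key=lambda a: values[a])

    def update(self, patient, action, reward):
        key = (patient, action)
        self.counts[key] += 1
        g = 1.0 / self.counts[key]
        self.estimates[key] += g * (reward - self.estimates[key])   # Equation 6.6

agent = PerPatientAgent(n_actions=4)
agent.update("patient_7", action=2, reward=1.0)   # hypothetical patient identifier
print(agent.select("patient_7"))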

6.1.3.2 Gradient Contextual Bandits

Another option, which can deal with a larger number of features, is to represent the reward of a particular action a by a linear function of the context X. Such a representation is given in Equation 6.7.

r_a(X) = E_a[r | X] = w_a^T X, (6.7)

with w_a an unknown weight vector for action a which determines the influence of the different features on the reward signal. The problem of dealing with different situations is transformed into learning the weight vectors for each situation. During a new intervention, the feature vector is used together with these weights to determine which action should be executed to gain the highest possible reward. This problem is now very similar to a classification problem in machine learning, as discussed in Section 2.3.3: when a classification is needed, some feature vector X is presented, and the label has to be guessed. In the contextual bandit case, a feature vector X is presented, and the (expected) payout has to be guessed. The expectation is represented here by a linear function, and the preferable action can easily be determined by Equation 6.8.

action_selection_procedure(X) = arg max_a w_a^T X. (6.8)

Solving this multi-class classification problem is almost identical to Equation 6.8. However, in a classification context, the value of only a single action will be one and all others will be zero. In the bandit problem, all the actions can return a positive reward, and the action with the largest reward must be selected, taking into account their relative values. If, for example, three selection procedures generate rewards of 0.1, 0.9 and 0.91 for a certain context, differentiating between procedure 2 or 3 is less important; the application must only be sure that it does not select the first procedure. Most supervised learning techniques generate the same classifier no matter what costs are assigned to the different classes. Techniques which take into account the cost of misclassifying each sample are needed in this case to deal with the non-zero (non-optimal) reward values. Such techniques are called cost-sensitive classifiers [78].

A second problem is the unavailability of full information in a bandit problem. While training a contextual bandit, only the reward of the currently selected action is visible, and the argmax only exploits the current best solution. Dealing with partial feedback and exploring additional actions is still needed for a contextual bandit, while this is not the case in supervised learning. Dealing with this partial feedback requires estimating, for each sample, the unobserved costs. This estimation can be done using different policy evaluation techniques. Doubly robust policy evaluation is such a technique, which estimates incomplete data using a statistical approach with an important property: if either the direct method without cost-weighting, which requires an accurate model of the rewards, or the method with cost-weighting, which calls for an accurate model of the past policy, is correct, then the estimation is unbiased [79]. The doubly robust policy evaluation method thus increases the chances of drawing a reliable inference. For example, when conducting a survey, questions such as age, sex and family income may be asked. Since not everyone responds to the survey, these values along with census statistics may be used to form an estimator of the probability of a response conditioned on age, sex and family income. Using importance weighting inverse to these estimated probabilities, one estimator of overall opinions can be formed. An alternative estimator can be formed by directly regressing to predict the survey outcome given any available sources of information. Doubly robust estimation unifies these two techniques, so that unbiasedness is guaranteed if either the probability estimate is accurate or the regressed predictor is accurate.

John Langford, who co-researched the doubly robust policy evaluation and learning technique, built a library called Vowpal Wabbit which implements these contextual bandit learning algorithms. Because this technique and library are commonly used for simple contextual bandit designs, they are utilised during this dissertation. Section 7.1.1.3 discusses the implementation details of the Vowpal Wabbit module. When a prediction is made, and the real reward value is determined after the intervention, the weights of the actions must be updated before a new intervention takes place. Due to the similarity between the contextual bandit and the classification problem, the stochastic gradient descent method can be used to optimise these weights.

Stochastic gradient descent updates the weight vector after the reward value of an intervention is received by the agent. The weight vector changes according to line 6 of Algorithm 1, which gives the pseudocode of the stochastic gradient descent algorithm. η is the predefined learning rate and defines how fast the values of the weight vector change.

Algorithm 1 Stochastic Gradient Descent

1: Given: ∀i : w_i = 0
2:
3: for every given context X do
4:     Make a prediction ŷ = ∑_i w_i x_i
5:     Receive the true reward value y ∈ [0, 1]
6:     Update w_i ← w_i + η · 2(y − ŷ) x_i
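As an illustration, a minimal Python sketch of this update is given below. It assumes a fixed action set, binary context features and the squared-error update of Algorithm 1, and it uses the greedy argmax of Equation 6.8 without the exploration and doubly robust machinery that Vowpal Wabbit provides.

import random

N_ACTIONS, N_FEATURES, ETA = 4, 10, 0.1

# One weight vector per action (Equation 6.7), initialised to zero.
weights = [[0.0] * N_FEATURES for _ in range(N_ACTIONS)]

def predict(action, x):
    """Linear reward estimate w_a^T x (line 4 of Algorithm 1)."""
    return sum(w * xi for w, xi in zip(weights[action], x))

def select_action(x):
    """Greedy action selection of Equation 6.8."""
    return max(range(N_ACTIONS), key=lambda a: predict(a, x))

def update(action, x, reward):
    """SGD update of line 6 of Algorithm 1, applied to the executed action only."""
    error = reward - predict(action, x)
    for i in range(N_FEATURES):
        weights[action][i] += ETA * 2 * error * x[i]

# Hypothetical usage: one intervention with a random binary patient context.
x = [random.randint(0, 1) for _ in range(N_FEATURES)]
a = select_action(x)
update(a, x, reward=1.0)   # reward observed after the intervention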

6.2 Policy implementations

This section will design several tactics to cope with the exploration-exploitation paradigm, described in Section 3.4.1. The goal of the action selection procedure is still to select the best possible action. Currently, the agent does not have enough information to make this decision. Therefore, exploring the action space is necessary before it can exploit a single action. Policies define the point where an agent can stop the exploration phase. The total regret function in Equation 3.20 can help to compare the different policies during this balancing act because a high total regret value usually indicates that the bandit algorithm explores too much.

6.2.1 Random policy

Selecting actions at random at each observation is only beneficial when there is no task to learn. When the system is only interested in the mean reward values of each action, a random policy can be applied. While this will never be the case for a learning problem, the random policy gives a baseline for the total regret of a particular problem. When other learning policies perform equally, worse or only slightly better than the random policy, they should not be used, because of the time and capacity drawbacks they have compared to the random policy.

6.2.2 Greedy policy

At any time step, there is at least one action whose estimated reward value is the greatest. When this action is indeed the best possible action, a greedy policy will always select it. This circumstance is the best possible situation, and the learning algorithm can stop after one iteration. However, this optimistic case will almost never happen in bandit problems, and selecting actions greedily can lock the system into a suboptimal action selection procedure. This policy generally results in a linear total regret, because the algorithm never explores the other, possibly better, actions.

6.2.3 Epsilon-greedy policy

While the random policy keeps exploring and the greedy policy starts exploiting directly, it can be beneficial to combine these two policies [68]. The epsilon-greedy algorithm selects a random action with probability ϵ and selects the current best action with probability 1 − ϵ. When ϵ = 1, this policy acts completely randomly, and when ϵ = 0, this policy is identical to the greedy policy. Any value 0 < ϵ < 1 trades off the exploration and exploitation of the actions. Because this policy keeps exploring forever with probability ϵ, a linear total regret is obtained.
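A minimal Python sketch of this policy is given below; the function name and default value of ϵ are illustrative assumptions.

import random

def epsilon_greedy(estimates, epsilon=0.1):
    """Select a random action with probability epsilon, else the current best one."""
    if random.random() < epsilon:
        return random.randrange(len(estimates))
    best = max(estimates)
    return random.choice([a for a, e in enumerate(estimates) if e == best])

# Hypothetical usage with the current estimates of four actions.
print(epsilon_greedy([0.2, 0.7, 0.5, 0.1], epsilon=0.1))

The decaying variant discussed in the next section can be obtained by simply lowering the epsilon value passed to this function after every observation, until a predefined lower bound is reached.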

6.2.4 Decaying Epsilon-greedy policy

To avoid the endless exploration of the epsilon-greedy policy, ϵ can be decayed after every observation [68]. With this policy, the algorithm acts fully randomly in the first observation if we start with ϵ = 1. A predefined step parameter lowers the ϵ value systematically until a lower bound is reached. If this lower bound equals zero, the policy acts greedily from that point on. Both the step size and the lower bound are configurable parameters and differ for every application. Decaying epsilon-greedy leads to logarithmic asymptotic total regret. Unfortunately, such a decaying schedule needs some prior knowledge to explore just enough before the system can start exploiting.

38 6.2.5 Upper Confidence Bound policy

Bandit problems with similar-looking arms but different mean values are difficult to solve. An epsilon-greedy policy forces the non-greedy actions to be explored, but indiscriminately, with no preference for those that are nearly greedy or particularly uncertain. It would be better to select among the non-greedy actions according to their potential for actually being optimal, taking into account both how close their estimates are to being maximal and the uncertainties in those estimates [68]. One effective way of doing this is to select actions according to Equation 6.9.

a_t = arg max_a [ q_t(a) + c · √(log t / N_t(a)) ], (6.9)

with log t the natural logarithm of t, N_t(a) the number of times action a has been selected before time step t, and c > 0 the degree of exploration. The square root term in Equation 6.9 is a measure of the uncertainty or variance in the estimate of a's value. The quantity being maximised is an upper bound on the possible true value of action a, while parameter c determines the confidence level.

Each time action a is selected, the uncertainty is reduced due to the increment of N_t(a) in the denominator of the uncertainty term. However, when an action different from a is selected, the uncertainty of action a increases, because the numerator log t increases without increasing the denominator. The natural logarithm gives a smaller increase over time, but all actions will eventually be selected. As time goes by, actions with a lower average reward or actions that have already been chosen frequently will be selected less often. Because actions are selected according to the upper bound in Equation 6.9, this is called an Upper Confidence Bound (UCB) policy. UCB often performs better than the epsilon-greedy algorithms but has some difficulties when dealing with non-stationary problems.
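A minimal Python sketch of this selection rule is given below; the extra branch that tries every action at least once is an assumption I add to avoid a division by zero, and the default value of c is illustrative.

import math

def ucb_select(estimates, counts, t, c=2.0):
    """Upper Confidence Bound selection, cf. Equation 6.9."""
    # Try every action at least once before applying the bound.
    for a, n in enumerate(counts):
        if n == 0:
            return a
    bounds = [q + c * math.sqrt(math.log(t) / n) for q, n in zip(estimates, counts)]
    return max(range(len(bounds)), key=lambda a: bounds[a])

# Hypothetical usage after 10 time steps with four actions.
print(ucb_select([0.2, 0.7, 0.5, 0.1], [3, 4, 2, 1], t=10))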

6.2.6 Contextual policy

The learning abilities of the contextual agent described in Section 6.1.3 can be exploited when the preferable action must be determined from additional information. In this dissertation, this extra information will be the patient's characteristics. The contextual data is given as input to the learning system, and the output returns the probabilities of each action. In this policy, the most probable action is selected for execution. This policy executes only line 4 of Algorithm 1 and can only be used together with a contextual agent. The other agents do not have the predictive capacity to determine the action probabilities given the patient's characteristics.

6.3 Reward signal

While the previous sections of this chapter describe some general bandit components, the reward signal is always application specific. The reward signal provides the feedback to our agent after the intervention has finished. This feedback can be used to differentiate actions from each other and is needed to detect the preferred ones. Reward signals are mostly combinations of sensor values when reinforcement learning is applied in robotics. Before going into the details of the design of the different reward signals, a brief overview is given of all elements which can influence the reward signal in this dissertation.

39 6.3.1 Sensors

A reward signal based on a weighted sum of different sensor values is the most common way to cope with the various states in a reinforcement learning problem. The more specific the reward signal, the more easily the differences between the actions can be measured, and the more accurate the learning process will be. Sensors on the robot, in the NH environment and on the PwD are available during this dissertation, and a summary of their availability is given in this section.

6.3.1.1 Robot sensors

This dissertation makes use of the Nao robot, which is manufactured by the Softbank Robotics company and has some benefits over other humanoid robots, as discussed in Section 2.2.1.3. The robot manufacturer provides several APIs to interact with the Nao robot and to analyse its available sensors. The API has many different modules, but the people perception module is the most interesting one for this dissertation. This module uses the robot's camera to analyse the direction of the gaze of a person and can provide estimated human characteristics based on facial analysis. Facial expression data will be used to build a first reward signal in the next section. Other modules provide tools to analyse the audio received by the microphones of the robot. Voice analysis can be used to identify the emotion expressed by the speaker's voice. More general methods can detect sound events and localise them. The voice analysis methods will not be used during this dissertation because of the reduced oral communication of the patients. Detecting sound events could be beneficial during the intervention for a screaming BD. Other modules give some additional knowledge about the position of the person relative to the robot, or provide methods to detect faces, to detect whether the person is seated or not, or to detect whether an individual tries to attract the robot's attention. The interfaces to all these different modules are listed in Table 6.1, and a small retrieval sketch follows the table.

Interface: Description
ALGazeAnalysis: Analyses the direction of the gaze of a detected person, to know if he/she is looking at the robot.
AlFaceCharacteristics: Gives additional information such as an estimation of age and gender. It also tries to detect whether the person is smiling and estimates whether a face is neutral, happy, surprised, angry or sad.
AlVoiceEmotionAnalysis: Identifies the emotion expressed by the speaker's voice, independently of what is being said.
AlAudioDevice: Provides other NAOqi modules with access to NAO's audio inputs (microphones) and outputs (speakers).
AlSoundDetection: Detects the significant sounds in the incoming audio buffers. This detection is based on the audio signal level.
AlSoundLocalization: Identifies the direction of any loud enough sound heard by the robot.
AlEngagementZones: Allows classifying a detected person and/or his or her movements using their position in space relative to the robot. The robot's field of vision is partitioned into several different areas, called engagement zones.
AlFaceDetection: The robot tries to detect and optionally recognise faces in front of it.
AlPeoplePerception: Extractor which keeps track of the people around the robot and provides basic information about them. It gathers visual information from RGB cameras and a 3D sensor if available.
AlSittingPeopleDetection: Detects whether the currently tracked person is seated or standing. When the position of the body changes, these values are updated.
AlWavingDetection: Detects if a person is moving his/her arms to catch the robot's attention (for example waving at the robot). This functionality requires a 3D sensor.

Table 6.1: Interfaces of the most interesting sensors on the robot.
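As an illustration of how such data could be retrieved, the sketch below polls facial expression values through the NAOqi Python SDK. It is only a sketch: the robot's IP address and the ALMemory key for the expression properties are assumptions that would have to be checked against the NAOqi documentation of the installed robot.

from naoqi import ALProxy

ROBOT_IP, ROBOT_PORT = "192.168.1.10", 9559   # assumed address of the Nao robot

# ALMemory holds the values written by the perception modules.
memory = ALProxy("ALMemory", ROBOT_IP, ROBOT_PORT)

def expression_properties(person_id):
    """Read the [neutral, happy, surprised, angry, sad] estimate for a tracked person.
    The key below is an assumed ALMemory key; verify it in the NAOqi documentation."""
    key = "PeoplePerception/Person/%d/ExpressionProperties" % person_id
    return memory.getData(key)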

40 6.3.1.2 PwD sensors

A PwD is currently tracked with a Bluetooth module built into necklaces and bracelets inside the NH. These Bluetooth trackers provide enough information to determine in which region of the NH a PwD currently is and give rough estimates of his or her position. Extra information about the current status of the PwD can be provided by several on-body sensors. The current body position, whether the patient stands, sits or lies down, can be useful to detect whether the patient is awake or asleep. These sensors could help during the intervention for a wandering BD at night, described in Section 5.4. Other, more common sensors such as a body temperature, blood pressure or heart rate sensor could easily detect stress and can help analyse the effects of an intervention. More advanced sensors could assist these decisions based on the tracked activity of a PwD. Table 6.2 gives a summary of all these on-body sensors together with some more advanced sensors to detect falls and sensors that can track the patients more accurately inside their rooms or in the NH. During this dissertation, only the already available Bluetooth trackers were used to make the reward signals more accurate.

Sensor: Description
Body position: Could be beneficial to detect five body positions: standing/sitting, supine, prone, left and right. This sensor could help in fall detection, rest positions or sleeping positions.
Body temperature: Body temperature can be used as an activity indicator. PwD with sleeping disorders have, in general, a higher body temperature at night [80].
Blood pressure: A high blood pressure is a major indicator of stress. Measuring it with a sensor could detect these types of disturbances.
Heart rate: An increased heart rate can be an indicator of stress or illness.
Body posture: This sensor measures the spine position while standing up. A correct posture can increase a person's health, but it will be difficult to adjust this sensor for PwD.
Fall detector: Similar to a body position sensor, but the duration of the current position is measured as well. This sensor increases the confidence of the elderly, while staff or family can react more quickly to incidents (see the FATE example [81]).
Tracking sensor: Tracks the current position (X, Y coordinates) of the patient inside a building using advanced techniques.
Galvanic skin response sensor: Measures the resistance of the skin, which changes in certain situations (for example sweating). This sensor can detect stressful situations, in combination with the blood pressure and heart rate sensors.
Activity sensor: Tracks, for example, the number of steps taken, calories burned or distances travelled.

Table 6.2: The most interesting sensors on the body of the PwD.

6.3.1.3 Environment sensors

Sensors are available in every room, hall, entrance and hallway of the NH. GrovePis, electronics boards that can connect up to hundreds of different sensors in combination with a Raspberry Pi, control the major part of these sensors. Anyone can program them to monitor, control and automate devices for daily life applications [82]. Figure 6.3 visualises such a GrovePi.

Figure 6.3: GrovePi module on top of a Raspberry Pi.

Because this board is built upon the Raspberry Pi, small Internet of Things applications can easily be made and controlled. The list of available sensors is large; a small set of useful sensors is listed in Table 6.3. Light sensors can be used to differentiate between day and night, and mood patterns can also change according to the daily brightness. The motion sensor returns a binary value indicating whether there is motion nearby or not. This sensor can be used to determine whether a patient is still awake in his or her room during the night. Sound sensors and background noise sensors can be used to indicate undesirable behaviour, for example screaming. This type of sensor will be used later on in this dissertation to design a more accurate reward signal. Finally, humidity and temperature sensors can give some basic information about the current conditions inside the NH. Disturbances related to malfunctioning air-conditioning or heating systems could be detected more easily. Xetal sensors will allow the analysis and understanding of various human activities such as sitting, restless sleeping, wandering and room usage patterns, as well as the detection of emergency events, for example fire, falls or intrusions [83]. These sensors can provide extra information for a personalised interaction and will allow Xetal to improve and validate the accuracy of its detection algorithms. No such sensors were used in the implementation of the reward signals because they are still under development by the manufacturer. They will be installed in the nursing homes in the next phases of the WONDER project.

Sensor: Description
Light sensor: Differentiates between day and night.
Motion sensor: Detects if there is motion in the room. Motion awareness can be an indicator of undesirable behaviour for a PwD.
Sound sensor: Detects sounds. Some sound levels can be indicators of several behavioural disturbances (for example screaming).
Background noise sensor: Detects if, for instance, radio/television is playing, or any other background sounds.
Temperature sensor: Detects the temperature (can indicate problems with heating).
Humidity sensor: Detects the humidity (can indicate changing weather).

Table 6.3: The most interesting sensors available to connect with the GrovePi.

6.3.1.4 Additional information

While sensor data can help differentiate between several BDs, differences in the daily or nightly patterns can give some additional information as well. Data registered in databases can provide additional information such as a patient's age, gender, life events and dementia state. This extra knowledge can be beneficial to differentiate patients from each other and is necessary for person-centric care. Because these databases are operated by the NH, the available data can be used to personalise a chosen action.
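A minimal sketch of how such information could be turned into the feature vector X of Section 6.1.3 is given below; the field names, categories and encodings are purely illustrative assumptions and not the WONDER data model.

# Purely illustrative encoding of patient data into a numeric feature vector X.
DEMENTIA_TYPES = ["alzheimer", "vascular", "lewy_body", "other"]   # assumed categories

def patient_features(record):
    """Encode a patient record (a dict with assumed keys) into a flat feature list."""
    onehot = [1.0 if record["dementia_type"] == t else 0.0 for t in DEMENTIA_TYPES]
    return [
        record["age"] / 100.0,                     # scaled age
        1.0 if record["gender"] == "F" else 0.0,   # binary gender feature
        record["mmse_score"] / 30.0,               # MMSE ranges from 0 to 30
    ] + onehot

X = patient_features({"age": 82, "gender": "F", "mmse_score": 14,
                      "dementia_type": "alzheimer"})
print(X)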

42 6.3.2 Reward design

The sensors available in the rooms and on the PwD, described in Section 6.3.1, will eventually help to detect the manifestation of a BD. In contrast, robotic information is only available during the interventions. Therefore, data available from the robot sensors was investigated first to design the reward signal. Later on, additional sensors available in the environment or on the PwD will help to increase the accuracy of these first proposed reward signals.

6.3.2.1 Consecutive decreasing reward

People with dementia residing in a NH are often demure and modest. Most of these people avoid conversations and do not speak very often. Mostly, simple words or short, insignificant sentences are spoken for no reason, or responses to questions become delayed [84]. Facial expressions largely compensate for the rather limited vocal expressions. Analysing these facial expressions was the first point of interest to determine whether a patient liked or disliked the executed intervention. The AlFaceCharacteristics interface of the Nao robot was used to detect facial expressions. This interface splits facial characteristics into five categories: neutral, happy, surprised, angry or sad, each ranging between 0 and 1, but with a total cumulated result of 1, for example [neutral: 0.2, happy: 0.2, surprised: 0.2, angry: 0.2, sad: 0.2]. These facial characteristics of a PwD are measured several times during the intervention. The measurement moments depend on the head position of the PwD relative to the robot. When the robot can detect that a person is looking at it, the facial characteristics can be measured. In all other cases, this analysis will not give useful results because facial analysis will not be possible. Two modules will be designed to detect when a person starts or stops looking at the robot. Immediately after the robot has detected that the person is looking, the facial characteristics are analysed, and this analysis repeats every x seconds while the patient is still watching the robot. All this facial data is saved and used to determine a single reward value after the intervention. For the design of a first reward signal, the consecutive decrease of the surprised, angry and sad facial characteristics was measured together with the possible increase of the happy and neutral facial features. The neutral or happy feature values are supposed to increase during the intervention and should be high at the end when the action was beneficial. How an action influences this consecutive increase is both patient and action specific. The improvement of the neutral and happy features leads to a decrease of the other three facial characteristics. The goal of this reward signal was to analyse how quickly an action decreases these negative characteristics and to give high reward values to the action with the fastest decrease. The reward signal is built according to the pseudocode given in Algorithm 2. When the intervention has finished, the algorithm compares, for every facial expression event in the list L of all gathered expressions, the previous facial characteristics with the current ones and saves the differences, as shown in lines 6 till 10 of Algorithm 2. Lines 11 till 16 show how the facial characteristic differences are combined per category and how a weighted reward signal is made. The weighted summation of the different categories enables this algorithm to emphasise increasing the positive effects or decreasing the negative effects. The weights in the pseudocode of Algorithm 2 are illustrative. Optimisation techniques will be needed to assign correct values to these weights.

Algorithm 2 Consecutive decreasing reward
 1: Input: events, duration intervention
 2: Output: Reward signal
 3:
 4: L ← list with facial expressions extracted from events
 5: q ← empty list
 6: for i := 1 To length(L)-1 Step 1 do
 7:     f ← empty list
 8:     for j := 0 To 4 Step 1 do
 9:         f.add(L[i+1][j] − L[i][j])
10:     q.add(f)
11: v ← [0, 0, 0, 0, 0]
12: for i := 0 To 4 Step 1 do
13:     for j := 0 To length(q)-1 Step 1 do
14:         v[i] += q[j][i]
15:     v[i] = v[i] ∗ length(events) / duration intervention
16: return (1 ∗ v[0] + 3 ∗ v[1]) − (1 ∗ v[2] + 2 ∗ v[3] + 3 ∗ v[4])
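To make the computation concrete, the following minimal Python sketch implements the same consecutive decreasing reward. The event representation (a list of five-valued expression vectors in the order neutral, happy, surprised, angry, sad) and the weights are illustrative assumptions taken over from the pseudocode, not a definitive implementation.

def consecutive_decreasing_reward(expressions, duration):
    """Reward based on consecutive changes in facial characteristics.

    expressions: list of [neutral, happy, surprised, angry, sad] vectors,
                 one per facial analysis during the intervention.
    duration:    intervention duration in seconds.
    """
    if len(expressions) < 2 or duration <= 0:
        return 0.0

    # Differences between consecutive facial expression vectors (lines 6-10).
    diffs = [[curr[j] - prev[j] for j in range(5)]
             for prev, curr in zip(expressions, expressions[1:])]

    # Sum the differences per characteristic and scale by the measurement
    # rate, i.e. the number of events per second (lines 11-15).
    rate = float(len(expressions)) / duration
    v = [sum(d[j] for d in diffs) * rate for j in range(5)]

    # Illustrative weights: reward increases in neutral/happy, penalise
    # increases in surprised/angry/sad (line 16).
    return (1 * v[0] + 3 * v[1]) - (1 * v[2] + 2 * v[3] + 3 * v[4])


# Example: expressions shift from neutral towards happy during the intervention.
events = [[0.6, 0.1, 0.1, 0.1, 0.1],
          [0.4, 0.4, 0.1, 0.05, 0.05],
          [0.2, 0.7, 0.05, 0.025, 0.025]]
print(consecutive_decreasing_reward(events, duration=30.0))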

6.3.2.2 Service personalisation

A linear decrease in some facial characteristics will be rather rare when observing a PwD. Expressions are rather abrupt and fluctuate over time: a patient will rarely show a gradual decrease in sadness together with a simultaneous, equally gradual increase in happiness. Therefore, techniques found in adaptive service personalisation were used to define a more appropriate reward signal [54]. Adaptive service personalisation is built around balancing positive and adverse reactions during an intervention. When more positive responses are detected, the end reward will be more likely to be positive as well. The values of the responses change adaptively when the service is personalised. This adaptive character is already implemented inside the agents and is therefore not needed in the design of a reward signal for learning the action preferences. The service personalisation-based reward signal is given in Algorithm 3 and uses two values to provide feedback for the executed action:
• The numbers of positive, respectively negative, detected expressions are calculated in the frequency variable in lines 10 till 15 of Algorithm 3. These values are calculated according to Equation 6.10.

f̃ = [f⊕, f⊖]ᵀ = [n⊕/N, n⊖/N]ᵀ, (6.10)

with n⊕ and n⊖ the numbers of positive and negative expressions respectively, and N the total number of received expressions.
• While the frequencies give some knowledge about the occurrences of the expressions during an intervention, the amount of positivity or negativity can be relevant as well. The energy of the recorded expressions in both the negative and the positive group is calculated using Equation 6.11.

m_e = 0 if n = 0; m_e = (Σ e)/n if n > 0, (6.11)

with e the values of a captured facial expression belonging to the associated group. For example, in the positive group, only the values for the happy and neutral categories are used in Equation 6.11. The energy values are also determined in lines 10 till 15 of Algorithm 3.

Line 19 calculates the reward signal using both the energies and the frequencies according to Equation 6.12, but normalised to an interval between 0 and 1.

R = (f⊕ m⊕ + f⊖ m⊖) / (f⊕ + f⊖) (6.12)

Counting the number of reactions during an intervention can already give information about the status changes of a PwD, but the influence of the responses can differ. Therefore, the total energy of all reactions is registered in this algorithm as well. While there can be many positive responses, a few high-energy negative reactions can lead to a negative reward and represent a dislike of the chosen action. This algorithm is based on the emotion-aware music recommendation system of Khosla et al. [54], which used facial expression characteristics analysed by a robot to determine the song preferences of different people.

Algorithm 3 Adaptive service personalisation
 1: Input: events, duration intervention
 2: Output: Reward signal
 3:
 4: L ← list with facial expressions extracted from events
 5: freq ← [0, 0]
 6: exprE ← [0.0, 0.0]
 7: for i := 0 To length(L)-1 Step 1 do
 8:     positive ← L[i][0] + L[i][1]
 9:     negative ← L[i][2] + L[i][3] + L[i][4]
10:     if positive > negative then
11:         freq[0] += 1
12:         exprE[0] += positive
13:     else
14:         freq[1] += 1
15:         exprE[1] += negative
16: if freq[0] == 0 And freq[1] == 0 then
17:     return 0
18: else
19:     return (1 + (exprE[0] ∗ freq[0] − exprE[1] ∗ freq[1]) / (length(L) ∗ (freq[0] + freq[1]))) / 2
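The following Python sketch mirrors Algorithm 3; the expression layout (neutral, happy, surprised, angry, sad) is the same illustrative assumption as before, and the sketch is not the dissertation's implementation.

def service_personalisation_reward(expressions):
    """Adaptive-service-personalisation reward from facial expressions.

    expressions: list of [neutral, happy, surprised, angry, sad] vectors.
    Returns a reward in the interval [0, 1].
    """
    freq = [0, 0]        # number of positive / negative dominated readings
    energy = [0.0, 0.0]  # accumulated positive / negative energy

    for e in expressions:
        positive = e[0] + e[1]          # neutral + happy
        negative = e[2] + e[3] + e[4]   # surprised + angry + sad
        if positive > negative:
            freq[0] += 1
            energy[0] += positive
        else:
            freq[1] += 1
            energy[1] += negative

    if freq[0] == 0 and freq[1] == 0:
        return 0.0

    # Balance the weighted positive and negative contributions and rescale
    # the result from [-1, 1] to [0, 1].
    balance = (energy[0] * freq[0] - energy[1] * freq[1]) / \
              (len(expressions) * (freq[0] + freq[1]))
    return (1.0 + balance) / 2.0


print(service_personalisation_reward([[0.2, 0.6, 0.1, 0.05, 0.05],
                                      [0.1, 0.7, 0.1, 0.05, 0.05]]))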

6.3.2.3 External sensors

The facial expression scores provided by the Nao robot are a good starting point to determine which actions are the most beneficial. However, any additional sensor can increase the accuracy of these reward values. The goal is to design a mechanism to add sensors that provide additional reward signals and to combine them all with the facial expression scores in the end. In this section, two sensor values will be analysed for two different BDs: screaming and wandering. Both BDs have different care needs and require a different intervention strategy with different actions. Which actions should be used in which situation and how to detect these behavioural disturbances are not in the scope of this dissertation, but possible solutions are discussed in the WONDER project. • Screaming: The screaming behaviour of a PwD can be monitored with the GrovePi sound sensor, listed in Table 6.3. The reward signal for this BD is also based on the adaptive service personalisation approach. During the intervention, sound levels are measured at certain timestamps, where a parameter in the application defines the time between these measurements. When a measurement is higher than a predefined upper level, a

negative effect is registered. When the measurement is lower than or equal to the predefined level, a positive effect is registered. Tests will have to determine the best value for this predefined level. The algorithm for this screaming reward is written in pseudocode in Algorithm 4. In line 13, the timestamps of the events are used to determine the importance of a given sound level. With this approach, sound levels during the first seconds of the intervention, which are more likely to be negative, have less impact on the final reward signal than the values near the end. This approach puts weights onto the sound levels based on their occurrence times. From line 15 till 27, the same service personalisation techniques are used as described in Section 6.3.2.2, but with weighted energy values. These weighted energy values guarantee that the robot's actions are rewarded for their long-run effects.

Algorithm 4 Screaming reward
 1: Input: events, duration intervention
 2: Output: Reward signal
 3:
 4: pl ← predefined sound upper bound level
 5: x ← list with timestamps in seconds of events
 6: y ← list with sound levels of events
 7: end ← x.last() (end of interval)
 8: xx ← empty list
 9: yy ← empty list
10: if length(x) == 0 then
11:     return 0
12: for i := 1 To length(events)-1 Step 1 do
13:     xx.add(((i+1) ∗ end − (x[i+1] − x[i])) / (end ∗ length(x)))
14:     yy.add((y[i−1] + y[i]) / 2)
15: freq ← [0, 0]
16: exprE ← [0.0, 0.0]
17: for i := 0 To length(xx)-1 Step 1 do
18:     if yy[i] < pl then
19:         freq[0] += 1
20:         exprE[0] += 1 − (yy[i] ∗ xx[i]) / 1023.0
21:     else
22:         freq[1] += 1
23:         exprE[1] += 1 − (yy[i] ∗ xx[i]) / 1023.0
24: if freq[0] == 0 And freq[1] == 0 then
25:     return 0
26: else
27:     return (1 + (exprE[0] ∗ freq[0] − exprE[1] ∗ freq[1]) / (length(xx) ∗ (freq[0] + freq[1]))) / 2

• Wandering: The interventions for the wandering BD can be divided into two cases. The first one tries to stop the circular wandering pattern of a PwD: an intervention strategy will be designed to interact with the PwD and keep the patient at his or her current position. A second strategy tries to lead the PwD to a certain location, for example, the restaurant of the NH. In both cases, the GrovePi sensors track the position of the PwD during the intervention and assign a positive reward if the corresponding goal was reached. Because both cases are essentially each other's inverse, only the first case is discussed and implemented in detail; leading the PwD to a certain location can be analysed as a continuing change of the current position.

The reward signal is based on the Received Signal Strength Indicator (RSSI) value registered by the GrovePi's Bluetooth module and transmitted by the wearables. The RSSI gives an indication of how far the PwD is from the GrovePi, based on the signal strength of the connection between both. Determining an accurate position of the PwD with these RSSI values is difficult because of the many interferences that can occur in a NH and the fluctuating values they cause. Therefore, the reward signal tries to represent a change in pattern at certain intervals during the intervention. Averaging all RSSI values within an interval can flatten the fluctuations, but there is no guarantee that there are enough measurements in each interval to obtain an accurate average. Therefore, a Kalman filter was applied to these measurements to subdue the fluctuations and increase their reliability, resulting in a more reliable average value in every interval [85]. A Kalman filter is commonly known as a state estimator that estimates an unobserved variable from noisy measurements; it is a recursive algorithm, as it takes the history of measurements into account. RSSI values are measured in dB, and a small change between consecutive values has more impact for larger RSSI values. For our reward signal, a more linear comparison between the values of two intervals was more appropriate. Therefore, the RSSI values are converted to distance measures using Equation 6.13.

distance = 10^((TxPower − RSSI) / (10 ∗ n)), (6.13)

with TxPower given by the wearables and a parameter n between 2 and 5, indicating the material between sender and receiver, where 2 indicates free space. When there are obstacles between the sender and receiver, the value of n can be increased. These distance values only give an indication of how far the PwD is from the GrovePi that received the signal; they cannot be used to acquire an accurate position of the PwD. They are only used to measure the linear difference in position between several time intervals. When these differences are small, the PwD did not move during the intervention.

The weighted average of the individual subrewards, based on the external sensors and the robot's facial expression analyses, results in a more accurate total reward signal. The more precise a single subreward is, the higher its weight can be and the more its value influences the total reward signal. Such a reward signal based on the weighted sum of subrewards from different sensors is given in Equation 6.14, and a small illustrative sketch follows the equation. The parameters α, β, . . . are the weights of the subrewards. The total sum of weighted values is divided by the number of subrewards to obtain a reward value between 0 and 1, provided all subrewards and weights have values between 0 and 1.

R = (α ∗ r1 + β ∗ r2 + ...) / #subrewards, with 0 < α, β, · · · < 1 (6.14)
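As a concrete illustration of the steps just described, the sketch below converts raw RSSI samples to distances with Equation 6.13, smooths them with a simple one-dimensional Kalman filter, derives a wandering subreward from the change in position, and finally combines subrewards according to Equation 6.14. All function names, the filter parameters, the TxPower value and the weighting are illustrative assumptions, not the dissertation's implementation.

def rssi_to_distance(rssi, tx_power=-59.0, n=2.5):
    """Convert an RSSI sample (dB) to a distance estimate (Equation 6.13)."""
    return 10 ** ((tx_power - rssi) / (10.0 * n))


def kalman_smooth(samples, process_var=1e-3, measurement_var=4.0):
    """Very simple 1-D Kalman filter to subdue RSSI fluctuations."""
    estimate, error = samples[0], 1.0
    smoothed = []
    for z in samples:
        error += process_var                      # predict step
        gain = error / (error + measurement_var)  # update step
        estimate += gain * (z - estimate)
        error *= (1.0 - gain)
        smoothed.append(estimate)
    return smoothed


def wandering_subreward(rssi_samples):
    """High reward when the PwD stayed roughly at the same distance."""
    distances = [rssi_to_distance(r) for r in kalman_smooth(rssi_samples)]
    movement = abs(distances[-1] - distances[0])
    return max(0.0, 1.0 - movement)  # clipped to [0, 1]


def combined_reward(subrewards, weights):
    """Weighted combination of subrewards, cf. Equation 6.14."""
    return sum(w * r for w, r in zip(weights, subrewards)) / len(subrewards)


facial_r = 0.8                       # e.g. the output of Algorithm 3
position_r = wandering_subreward([-60, -61, -59, -60, -62, -61])
print(combined_reward([facial_r, position_r], weights=[0.7, 0.9]))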

6.4 Linked bandits

Executing a single action during the whole intervention will in some cases lead to suboptimal behaviour and will not result in the best possible outcome. If, for example, a PwD needs to be guided to a certain location, it can be more beneficial to analyse and execute a number of intermediate actions or several interventions. Therefore, splitting the whole intervention into sub-interventions can be beneficial. Each sub-intervention will to some extent influence the following sub-intervention. This pattern can be associated with the full reinforcement learning problem given in Figure 3.2, with the exception that an action only influences the next sub-intervention, whereas in a full reinforcement learning problem an action may affect all subsequent states. The design of these linked bandits is rather straightforward. The action for each sub-intervention is determined by a contextual bandit to which an additional feature is added: the reward from the previous bandit (except for the bandit of the first sub-intervention). Together with the updated contextual parameters, the contextual bandit in every sub-intervention can then determine whether the previous action is still the most beneficial one or not.
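A minimal sketch of this chaining idea is given below: each sub-intervention is handled by a contextual bandit, and the reward obtained in the previous sub-intervention is appended to the context of the next one. The ContextualBandit class and the execute callback are hypothetical stand-ins for the agents and robot interface of this chapter, under simple epsilon-greedy assumptions.

import random


class ContextualBandit(object):
    """Hypothetical stand-in for the contextual agents of Chapter 6."""

    def __init__(self, n_actions):
        self.n_actions = n_actions
        self.values = {}   # (context, action) -> running mean reward
        self.counts = {}

    def choose(self, context, epsilon=0.1):
        if random.random() < epsilon:
            return random.randrange(self.n_actions)
        scores = [self.values.get((context, a), 0.0) for a in range(self.n_actions)]
        return scores.index(max(scores))

    def observe(self, context, action, reward):
        key = (context, action)
        self.counts[key] = self.counts.get(key, 0) + 1
        old = self.values.get(key, 0.0)
        self.values[key] = old + (reward - old) / self.counts[key]


def run_linked_intervention(bandits, patient_context, execute):
    """Run one intervention split into len(bandits) sub-interventions.

    patient_context: tuple of contextual features of the PwD.
    execute(action): performs the action and returns its reward in [0, 1].
    The previous sub-intervention's reward is added to the next context.
    """
    previous_reward = None
    rewards = []
    for bandit in bandits:
        context = patient_context if previous_reward is None else \
            patient_context + (round(previous_reward, 1),)
        action = bandit.choose(context)
        reward = execute(action)
        bandit.observe(context, action, reward)
        previous_reward = reward
        rewards.append(reward)
    return rewards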

6.5 Main findings

In this chapter, the three most essential components to build bandit algorithms were defined. Agents are the processing units of our learning algorithm and analyse incoming rewards to determine the effects of previously executed actions. Three different agents were described; which one is the best is problem and situation specific, and tests will eventually have to determine the preference of one agent over the others. Policies determine which actions should be executed based on the information inside the agent. These policies try to find a balance between exploring new actions and exploiting the best ones in order to minimise the total regret. Six different policies were designed in this chapter, and again tests will determine which policy to use in which situation. Finally, several sensors were analysed to build an appropriate reward signal that can be used to determine the preferable actions to alleviate and prevent BDs during an intervention for a PwD. Facial expression characteristics, analysed by the robot, will be used to build a first reward signal. Several other reward signals for individual cases were examined and implemented. While these specialised reward signals look promising to alleviate the wandering or screaming BDs, the more general learning algorithms based on the robot's facial expression analysis should be analysed first, because these provide a base reward signal for all intervention types.

7 IMPLEMENTATION

Chapter 6 described different agents, policies and reward signals, but the goal of this dissertation is to find which combinations of these components can select the most beneficial actions. Before the different learning algorithms can be compared, every component must be implemented. Embedded in a framework, all these parts can be linked together to form different bandit algorithms. These bandit algorithms are then used to test and analyse several situations to find the most appropriate selection procedure for this problem. Section 7.1 of this chapter gives the implementation details of the components, together with additional information about the used modules, packages and libraries. Figure 7.1 presents an overview of all these used packages, divided per category. A summary of the hardware and software used in each phase of the analysis process is given in Section 7.2.

Figure 7.1: Overview of the different used packages.

7.1 Bandit application

The algorithms defined in Chapter 6 are inspired by methods and techniques described by Sutton and Barto and implemented by modifying the code from Galbraith [68, 86]. Every bandit component is implemented in Python version 2.7. The Python programming language was chosen for three reasons:
1. The code provided by Sutton and Barto is mostly written in Python.
2. Python is well known for its simulation purposes and ease of use.
3. The communication with the Nao robot can be written in Python, and the manufacturer offers a Python API to interact with the robot.
During the implementation of the application, several Python packages were used. The most used packages, together with a brief summary of their functionality, are described here.

7.1.1 Application packages

The next Python packages are used during the development of the bandit application and will affect the functionality of both the simulation and real robotic tests.

7.1.1.1 NumPy

NumPy is the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more. NumPy version 1.12.1 was used in this dissertation [87]. This package is used extensively to generate random numbers with certain probability distributions to mimic the behaviour of a PwD during the simulations.

7.1.1.2 Pynaoqi

Pynaoqi is the library that lets the designed application interact with a (virtual) Nao robot. This library abstracts the communication between these two entities and provides an API for consulting the different modules of the robot. The most used modules are the memory module to gather useful information from stored features, the facial characteristics module to enable and disable the analysis process before and after the intervention respectively, and the gaze analysis module to detect when the patient starts and stops looking at the robot. Version 2.1.4.13 of the Naoqi SDK was used to build the application [88].

7.1.1.3 Vowpal Wabbit

Vowpal Wabbit (VW) is an open-source, fast, out-of-core learning library and program originally developed at Yahoo! Research and currently at Microsoft Research. It is notable as a scalable and efficient implementation of online machine learning techniques and supports a number of machine learning reductions, importance weighting, and a selection of different loss functions and optimisation algorithms. VW has a Python wrapper that can easily be used for matrix factorisation and cost-sensitive reduction for multi-class classification, and it even has predefined learning algorithms for contextual bandits. It is in the context of these contextual bandit algorithms that the VW Python wrapper was used. Vowpal Wabbit has many different versions and wrappers for several programming languages. The Python wrapper based on version 8.3.2 is used in this dissertation [89].
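As a hedged illustration of how such a contextual bandit could look through the Python wrapper, the snippet below uses VW's textual action:cost:probability | features format; the exact import path, constructor arguments and prediction type may differ between wrapper versions, and the feature names are purely illustrative.

from vowpalwabbit import pyvw  # import path may differ per wrapper version

# Contextual bandit with 4 actions (one per intervention type).
vw = pyvw.vw("--cb 4 --quiet")

# Training example: action 2 was taken with probability 0.5 and incurred
# cost 0.3 (i.e. reward 0.7), given the patient context features.
vw.learn("2:0.3:0.5 | dementia_type=AD mmse=14 bd=wandering")

# Ask the model for the preferred action in a similar context; with --cb
# this is expected to return the index of the chosen action.
chosen_action = vw.predict("| dementia_type=AD mmse=14 bd=wandering")
print(chosen_action)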

7.1.1.4 time and threading

These modules provide various time-related and multi-threading functionalities and are built into every Python version. The sleep function of the time package is used to suspend the execution of the current thread for a given number of seconds. The threading package provides all the necessary tools to create, control and execute different threads, which made the analysis of a simulated PwD possible.

7.1.2 Analytical packages

After the execution of several tests, which could be either simulated or real robotic tests, a rather large amount of data was available for analysis. The next two Python packages were used to represent this data and helped in further analyses.

7.1.2.1 Pandas

Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labelled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis. Internally, a Pandas Dataframe, which is a 2-dimensional labelled data structure with columns of potentially different types, stores the data. These dataframes are similar to a conventional spreadsheet or SQL table. These dataframes can

handle enormous amounts of data, and several built-in operations are available to perform simple analyses. Pandas 0.19.1 was used during the development of this application [90].

7.1.2.2 Plotly

In the first phases of the development process, the Python packages Matplotlib and Seaborn were used to generate multiple graphs from the processed data. While these packages offered enough possibilities to visualise the results, a more user-friendly analytical tool, Plotly, was used. Plotly provides online graphing, analytics and statistics tools, as well as scientific graphing libraries for Python, R and MATLAB. The graphing tool consists of a graphical user interface for modifying labels and texts after the graphs have been generated, and it can visualise simple interactive plots with data from several sources as well. The scientific Plotly library for Python, version 2.0.5, was used to visualise all the results [91].

7.1.3 Data storage

The data gathered during the tests was initially saved to disk before post-processing and analysis. A Python package that can load and save huge amounts of data was used to simplify these operations. During the real robotic tests, data gathered from the robot and sensors was collected in several external databases because of the limited internal storage possibilities of the robot. A wrapper that can connect to these databases and query the data reduced the complexity and workload of the application.

7.1.3.1 h5py

The h5py package is a Pythonic interface to the HDF5 binary data format. It can store huge amounts of numerical data, which can easily be manipulated from NumPy. An HDF5 structure is shown in Figure 7.2. The advantages of this structure are the ability to define groups, which can be, for example, several subtests in different test cases, and the possibility to add metadata to each component in the structure. With these metadata tags, the various test cases can be separated before the whole dataset is read into memory. The processing units used h5py version 2.6.0 [92].

Figure 7.2: Example of a HDF file structure which contains groups, datasets and associated metadata [93].
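A minimal sketch of how simulation results could be grouped and tagged with metadata in an HDF5 file is shown below; the group, dataset and attribute names are illustrative assumptions rather than the layout used in the dissertation.

import h5py
import numpy as np

with h5py.File("simulation_results.h5", "w") as f:
    # One group per test case, e.g. the optimistic single-patient scenario.
    grp = f.create_group("single_patient/optimistic")

    # Store the rewards of 100 experiments x 100 interventions as a dataset.
    rewards = np.random.rand(100, 100)
    dset = grp.create_dataset("rewards", data=rewards)

    # Metadata tags allow filtering test cases before loading everything.
    dset.attrs["agent"] = "gradient"
    dset.attrs["policy"] = "UCB"
    dset.attrs["confidence_level"] = 0.1

# Read back only the metadata of interest.
with h5py.File("simulation_results.h5", "r") as f:
    dset = f["single_patient/optimistic/rewards"]
    print(dset.attrs["policy"], dset.shape)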

7.1.3.2 InfluxDB-Python

All the sensory information gathered in this application is provided with a timestamp and should therefore be stored in a system that is optimised for handling time series data and can be accessed easily in a real-time setting. InfluxDB is an open-source time series database developed by InfluxData. It is written in Go and optimised for fast, high-availability storage and retrieval of time series data in fields such as operations monitoring, application metrics, Internet of Things sensor data and real-time analytics. During the real robotic tests, the Influx database stores all the data from the external sensors and the Nao facial expression measurements. The application can calculate the reward by querying the data of a certain intervention. The connection mechanism and the querying of the data are handled by the InfluxDB-Python client. Version 4.0.0 of this Python client was used during the development of this application [94].
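The snippet below sketches how sensor samples could be written to and queried from such an InfluxDB instance with the InfluxDB-Python client; the database name, measurement and tag names are illustrative assumptions.

from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="wonder")

# Write one sound-level sample recorded by a GrovePi during an intervention.
client.write_points([{
    "measurement": "sound_level",
    "tags": {"patient": "patient_01", "sensor": "grovepi_1"},
    "fields": {"value": 312},
}])

# Query the sound levels of the last two minutes to compute a reward.
result = client.query(
    "SELECT value FROM sound_level WHERE time > now() - 2m")
for point in result.get_points(measurement="sound_level"):
    print(point["time"], point["value"])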

7.2 Framework

This section gives a brief overview of the hardware used during both the simulation and the robotic tests, together with its specifications. These hardware components provide a framework for the different test approaches.

7.2.1 Simulation

All the simulation tests were performed on a MacBook Pro (Retina, 13-inch, Mid 2014) running macOS Sierra 10.12.3 with a dual-core 2.6 GHz Intel Core i5 processor and 16 GB of 1600 MHz DDR3 RAM. A virtual Nao communication module was used to make sure the designed simulation application could run on a real robot without major additional modifications. While test set-ups with far more processing power and internal memory were available during this dissertation (for example, the iLab.t virtual wall or the HPC at UGent), the interoperability between the application and this Nao communication interface inhibited the usage of these systems.

7.2.2 Hardware

In a more realistic setting, a real Nao robot and external sensors were used. The specifications for this robot and the GrovePi module are given here.

7.2.2.1 Nao robot

During the robotic tests, the Nao robot V3 was used, which has a 1.6 GHz CPU and 1 GB of RAM. Because this system has limited computational power, all sensory data was pushed to an Influx database and analysed on the same MacBook Pro described in Section 7.2.1. Because pushing sensory data from the robot to an application could be beneficial in further projects as well, a REST API was designed to make the current application independent of the underlying database.

7.2.2.2 GrovePi

GrovePi is an add-on board that connects modular sensors to the Raspberry Pi (version 3 in this dissertation). Sound and background noise sensors gathered environmental information that can reveal a change in behaviour during the intervention. A Bluetooth module connected directly to the Raspberry Pi detected the beacon signals emitted by the patient trackers.

Because of the rather limited processing power of the Raspberry Pis and the possibility of using several of them in a NH, the same REST API was adapted to send all the collected sensory information, together with the robotic data, to an underlying database. The implemented REST API and the databases ran locally on the same MacBook Pro described in Section 7.2.1 during the tests. This REST API is implemented in Node.js version 5.7.0 and summarised in Appendix A [95].

8 SIMULATION

Chapter 6 described all the components, namely agents, policies and reward signals, to design bandit algorithms. This chapter combines all these components into a first learning application. Data on how a PwD behaves in certain situations or how they react when the nursing staff tries to alleviate a specific BD was not available during this dissertation. Directly evaluating the designed bandits on these PwD could resolve this unavailability, but performing a significant amount of tests could have an adverse impact on the QoL of these PwD. Therefore, a simulation program was designed to mimic the behaviour of the PwD and their environment. This simulation program has some additional benefits:
• The robotic interactions and external sensors can be simulated as well, which reduces the time to test.
• Noisy sensors can be simulated, and their influence can be analysed during simulation.
• Different situations and scenarios from Chapter 5 can easily be analysed with a simulator.
The only drawback of this simulation approach is the possibly non-realistic representation of the patient and his or her reactions to the robotic interventions. Therefore, real tests will still be needed, but only for those algorithms that perform well during the simulation tests. Fewer tests with real PwD will be necessary using this simulation approach. In this chapter, Section 8.1 describes the architectural design of the simulation application. Details about how the PwD are simulated are given, together with a description of the different bandit algorithms. The performed tests with this simulator are then described in Section 8.2, while Section 8.3 gives some general conclusions on this simulation approach.

8.1 Architectural design

The design of the simulation application consists of roughly three main parts. A first part implements the bandit components and combines them. Agents must receive an appropriate reward signal to determine the action probabilities, and policies need these values to determine the action selection procedure. Because these components frequently interact with each other, they are all combined into the bandit package. A second part defines the virtual character of the simulated application. The environment, the PwD and the robot are all abstracted versions of the real world, and their behaviours should be mimicked. The third part combines these two parts by defining the link between the bandit package and the virtual robot. This part also provides the main class to run the different tests. An overview of the simulation application is given in Figure 8.1. The next sections give a brief explanation of the different classes.

Figure 8.1: Overview of the simulation application

8.1.1 Bandit package

The bandit package combines all the components described in Chapter 6 into problem-specific bandit algorithms. This package is separated from the other parts of the simulation application because it will be reused completely during the real robotic tests, so dependencies on simulation classes were avoided. The different policies on top of this package are the direct implementation of the policies described in Section 6.2. The importance of this package lies in the interaction between the agent and the multi-armed bandit class.

8.1.1.1 Agent

The agent class implements the mechanisms to take one of a set of actions during each time step of the simulation. The multi-armed bandit defines the possible actions, and the agent chooses an action using a specific strategy, implemented by an installed policy. This selection procedure uses the history of prior actions and observation outcomes. These outcomes are the reward signals received for the bandit's executed actions. After observing and analysing these feedback values, the actions' influences are directly saved inside the agent's model. The type of agent defines the specifics of this observation method. The different agent types and the pseudocode for their observation methods were described in Section 6.1.

8.1.1.2 Multi-armed bandit

This class provides the execution of the selected action. The naoqibandit class inherits from this multi-armed bandit class and becomes the interface between the bandit application and the robot. The agent's chosen action is 'pulled' to the Act class, which defines whether the current action must be executed in the simulated environment or on a real robot. The pull method waits for the corresponding reward signal. With this approach, the bandit package abstracts the communication with the robot.

8.1.2 Environment

Every simulation test consists of a predefined number of trials or timesteps. In every trial, an action is executed based on the information gathered from previously executed actions. Because most of the methods are influenced by random parameters, for example, a random virtual patient or a random selection of actions by a policy, the simulated experiments are repeated a chosen number of times. During a single experiment, the agent and multi-armed bandit objects are linked together and execute three steps which analyse the bandit algorithm over several trials:

action = agent.choose(patient) reward = bandit.pull(action, patient) agent.observe(reward, patient)

Together with the specifications of the agent and policy type, these three calls define a bandit algorithm. The agent and multi-armed bandit classes do not interact directly with each other, to make a clear distinction between what the agent decides and how the environment reacts to these decisions. When the experiment finishes, the results are gathered and saved for further analysis. This storage process uses several Pandas DataFrames to combine all the trials of the experiment. The full experiment is saved into an HDF5 structure afterwards. More details about these storage capabilities can be found in Section 7.1.3 and Section 7.1.2. The environment class has some small analysis methods to determine whether a certain bandit algorithm is useful or not. Because these analyses could delay the whole simulation process, post-processing procedures on the saved data were used to compare the different bandit algorithms.
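A sketch of how such an experiment loop could be wired together is given below; the choose/pull/observe calls follow the description above, but the concrete class signatures and column names are assumptions.

import pandas as pd

def run_experiment(agent, bandit, patient, n_trials=100):
    """Run one experiment of n_trials interventions and collect the results."""
    records = []
    for trial in range(n_trials):
        action = agent.choose(patient)            # policy picks an action
        reward = bandit.pull(action, patient)     # environment/robot reacts
        agent.observe(reward, patient)            # agent updates its model
        records.append({"trial": trial, "action": action, "reward": reward})
    return pd.DataFrame(records)


def run_experiments(agent_factory, bandit, patient, n_experiments=100):
    """Repeat the experiment with a fresh agent each time and collect results."""
    frames = []
    for experiment in range(n_experiments):
        agent = agent_factory()                   # reset the learned preferences
        df = run_experiment(agent, bandit, patient)
        df["experiment"] = experiment
        frames.append(df)
    return pd.concat(frames, ignore_index=True)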

8.1.3 Virtual patient

The agent should eventually execute the selected action on a patient. During simulations, the bandit algorithm executes these actions on a virtual PwD. Because the intervention strategy for a specific patient is defined during the first steps of a simulation test, a virtual patient object travels through the whole application. The Patient class has methods to generate a random PwD. Several characteristics determine this virtual PwD:
• name: Used to differentiate more easily between patients and to gather the results from a virtual patient database.
• gender: Men and women could react differently to a certain intervention. The demographic data of the Belgian population was used to differentiate between men and women, and this same data was used to sample the ages of the virtual PwD.
• dementia type: The dementia type can influence the reactions of a PwD, and different dementia types incorporate different BDs. Patients are likely to react differently to the robot according to their dementia type. Some dementia types are more common than others, and these different occurrences were taken into account during the creation of a random virtual PwD. The different dementia type distributions were selected according to the diagram in Figure 8.2.

Figure 8.2: Distribution of diagnoses of dementia occurring in later life [96].

• age: The age of a PwD largely influences the course of the dementia disease and is an important indicator for many other characteristics. Because the virtual patient population must resemble a realistic one, the randomly selected age of a virtual PwD is based on an actual age-dementia severity distribution. Patients with severe dementia were simulated based on the age distribution given in Table 8.1.

Table 8.1: Distribution of elderly diagnosed with AD [97]

• MMSE: Several dementia measures were discussed in Section 2.1.3, but the most common and widely used measure is the MMSE. The capabilities of a PwD and how they react to an intervention are affected by their MMSE score. Additional patient characteristics were needed to simulate the MMSE score of a virtual patient. Two features that influence this score were already described: age and gender. Furthermore, both the number of years since the disease was detected and the degree of education have an effect on the score. How these characteristics influence the MMSE score is visualised in Figure 8.3 and Figure 8.4. To define an appropriate random MMSE score for a simulated PwD, a combination of these two figures was implemented.

Figure 8.3: Progression of the dementia symptoms over the several detected years in AD [98].

Figure 8.4: MMSE score at the time of first diagnosis according to age, educational level and gender [99].

• Personality: The current mood of the virtual PwD is defined using five randomly generated personality characteristics. These five values identify the chance of being happy, neutral, psychotic, depressed or agitated respectively and are similar to what the facial expressions module of the Nao robot can detect. The link between the dementia type and the behavioural symptoms of a PwD suggests that the dementia type influences these characteristics. Therefore, the random values of these personality properties mainly depend on the chosen dementia type of the virtual patient. The largest of the five personality characteristics determines the current mood of the virtual PwD.
• Preferable actions: Interventions influence the mood of the PwD. With this parameter, the impact of each action on the virtual PwD is determined, and a preferable action for the current situation is selected. The effect on the five personality characteristics determines the influence of each action on the virtual PwD. Some actions will increase the 'happy' characteristic while decreasing the 'agitated' and 'depressed' ones; other actions will act differently.

Three different influence strategies were used during the simulation tests, because how an action influences a PwD is not known. These three strategies are:
1. Optimistic case: There is one action in the action space which increases the happy or neutral personality characteristics, while all the other actions perform badly on this virtual PwD.
2. Neutral case: All actions have a similar effect on the virtual PwD. The difference between the personality characteristics is minimal.
3. Worst case: All the actions increase the negative personality characteristics of the virtual PwD, but one action is slightly less bad than the others.
The preferred action for a virtual PwD is predefined and based on the dementia type during the simulations. The next section of this chapter discusses the performance of the bandit algorithms for these three strategies; a small sketch of how such a virtual patient could be updated is given at the end of this section.
• Others: There are some additional person-specific features which could be beneficial in later simulations. The heart rate, blood pressure and BMI of a PwD during a BD can be simulated as well and can give additional information about the preferences for certain actions. These values are simulated but not used further on, because there are currently no such sensors available in the NHs to monitor these values during real robotic interventions.
The main purpose of this Patient class is the representation of the personality characteristics during the simulations. These five characteristics represent the expressions of the PwD, which can be analysed by the robot. The initial values of these five features are determined randomly but related to a certain behavioural disturbance; the values for the neutral or happy characteristics will, for example, have a small chance of being high. These BDs are also randomly generated, based on the dementia type, age and MMSE score. During a simulation, the executed action updates the virtual PwD's characteristics using one of the three previously discussed strategies. The virtual robot analyses the changes between the previous personality features and the current ones and determines a feedback signal for the bandit algorithm based on this difference. During the real robotic tests, this virtual patient object is not used, but a similar object delivers additional information received from the internal databases. The bandit-selected action is then executed on a real person. Again, this motivates why the interactions with a virtual PwD were abstracted from the bandit package.
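The sketch below illustrates one way such a virtual PwD could be updated by an executed action under the three influence strategies; the magnitudes, the mapping from action to effect and the characteristic names as Python keys are illustrative assumptions, not the dissertation's implementation.

import random

CHARACTERISTICS = ["happy", "neutral", "psychotic", "depressed", "agitated"]


def random_patient():
    """Random personality characteristics that sum to one."""
    values = [random.random() for _ in CHARACTERISTICS]
    total = sum(values)
    return dict(zip(CHARACTERISTICS, [v / total for v in values]))


def apply_action(patient, action, preferred_action, strategy):
    """Shift the personality characteristics depending on the strategy."""
    if strategy == "optimistic":
        delta = 0.2 if action == preferred_action else -0.2
    elif strategy == "neutral":
        delta = 0.02  # every action has a small, similar positive effect
    else:  # worst case: every action hurts, the preferred one slightly less
        delta = -0.1 if action == preferred_action else -0.2

    # Move probability mass between a positive and a negative characteristic.
    patient["happy"] = min(1.0, max(0.0, patient["happy"] + delta))
    patient["agitated"] = min(1.0, max(0.0, patient["agitated"] - delta))

    # Renormalise so the five values again sum to one.
    total = sum(patient.values())
    for key in patient:
        patient[key] /= total
    return patient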

8.1.4 Virtual robot

There are several modules which, when combined, will simulate the robotic behaviour. This section gives an overview of these modules.

8.1.4.1 Naoqi-bin

The Aldebaran manufacturer, which is part of SoftBank Robotics, provides several solutions to simulate a virtual Nao robot. The solution used in this simulator is the Naoqi-bin. This binary package simulates a real Nao robot without the speech recognition, audio player, LED functionalities or graphical output. Robotic commands and movements can be sent to this virtual robot, and the binaries calculate the new motor positions. Figure 8.5 gives an overview of the Naoqi architecture. The Naoqi binary abstracts all the modules outside this Naoqi package. The broker is an object that provides two main roles:
• It provides directory services, allowing Naoqi modules and methods to be found.
• It provides network access, allowing the methods of attached modules to be called from outside the process.

Figure 8.5: An overview of the Naoqi architecture [52].

ALMemory defines the robot's memory. All other modules can read data from or write data to this module, or can subscribe to events so as to be called when they are raised. For this dissertation, a simple Naoqi module was designed and registered onto the robot using the broker's communication functionalities. This module subscribed several methods to events of the ALMemory module to react upon changed sensor values. Section 8.1.4.2 defines the functionalities of this module in the Nao Handler. The Device Communication Manager (DCM) is the software module that is in charge of the communication with all electronic devices inside the robot, such as the boards, sensors and actuators, except for the microphones and the cameras. It uses two interfaces:
• the Hardware Abstraction Layer (HAL) daemon, which handles the hardware,
• the DCM Naoqi module, which connects the HAL with the Naoqi package.

8.1.4.2 Nao Handler

The Nao Handler registers several methods which can react to events fired by the robot. The goal of this class is to detect when a person responds to the robot's interactions and to start the analysis of the facial expressions of a PwD, without knowing whether the patient is a real or a virtual one. A first method analyses the current head position of the PwD. The robot fires a PersonStartsLookingAtRobot event from the Naoqi GazeAnalysis module when a PwD starts looking at the robot. The designed method reacts to this event by executing the analyzeFaceCharacteristics method from the Naoqi ALFaceCharacteristics module. Facial analysis requires that the patient's head is right in front of the robot's camera, and this can be ensured using these designed methods. The facial expressions are analysed every x seconds as long as the patient keeps looking at the robot. If the PwD stops looking, the PersonStopsLookingAtRobot event is fired from the same Naoqi GazeAnalysis module, and the facial analysis is suspended. This procedure can be repeated for a certain amount of time, more exactly, for the entire duration of the intervention. All facial expression values are gathered in a list and returned to the bandit algorithm. A proper reward signal can then be built using these gathered facial analyses.
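A hedged sketch of how such a handler could subscribe to these gaze events through the NAOqi Python SDK is shown below; the robot address, the memory key for the expression values and the broker settings are assumptions, and exact event keys may depend on the NAOqi version.

from naoqi import ALBroker, ALModule, ALProxy


class NaoHandler(ALModule):
    """Reacts to gaze events and triggers the facial characteristics analysis."""

    def __init__(self, name, robot_ip, robot_port=9559):
        ALModule.__init__(self, name)
        self.memory = ALProxy("ALMemory", robot_ip, robot_port)
        self.faces = ALProxy("ALFaceCharacteristics", robot_ip, robot_port)
        self.expressions = []
        # Assumed event keys of the GazeAnalysis module.
        self.memory.subscribeToEvent("GazeAnalysis/PersonStartsLookingAtRobot",
                                     name, "onStartsLooking")
        self.memory.subscribeToEvent("GazeAnalysis/PersonStopsLookingAtRobot",
                                     name, "onStopsLooking")

    def onStartsLooking(self, event_name, person_id, subscriber):
        # Ask the robot to analyse the face of the detected person and read
        # the result back from ALMemory (assumed memory key layout).
        self.faces.analyzeFaceCharacteristics(person_id)
        key = "PeoplePerception/Person/%s/ExpressionProperties" % person_id
        self.expressions.append(self.memory.getData(key))

    def onStopsLooking(self, event_name, person_id, subscriber):
        pass  # suspend the periodic facial analysis here


# A broker is needed so the module can receive callbacks from the robot;
# the module instance must stay globally reachable under its registered name.
broker = ALBroker("pythonBroker", "0.0.0.0", 0, "nao.local", 9559)
handler = NaoHandler("handler", "nao.local")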

8.1.4.3 Robot simulator

During the real robotic tests, it is easy to start and stop looking at the robot. The simulator must be able to fire these events itself. The RobotSimulator class raises these events and stores the corresponding facial expression data in the ALMemory module.

Figure 8.6: An overview of the robotic analyses procedure of a PwD.

When a PwD starts and stops looking at the robot is unknown upfront; these moments are therefore randomly determined by this robot simulator. The events are raised according to a normal distribution based on the patient's MMSE score. A PwD with a lower MMSE score will react less frequently to a robotic intervention, but the simulator can change these frequencies to investigate different situations.

8.1.4.4 Act

The Act class combines the facial expression values and builds an appropriate reward signal for our bandit algorithm. This reward signal resembles the pseudocode of Algorithm 3. There are two versions of this Act class:
1. One version uses the virtual robot modules described in Section 8.1.4. Using these classes, interoperability is ensured between the designed simulator and real robotic applications, because of the extensive use of the Naoqi-bin, which is an abstraction of a real robot. The drawback of this method is the event-firing mechanism. The Naoqi-bin has some limitations on the number of fired events. These limits are almost never reached in real-life applications, but in this simulated application, the events are fired more rapidly to reduce the test durations. The simulation speed is, unfortunately, limited by these predefined Naoqi boundaries.
2. To overcome these limits, the second version of this class avoids the virtual robot classes and simulates the facial expressions directly. This version does not use the Nao Handler class to extract the facial expression values, because these values are simply inserted in the ALMemory module by the Robot Simulator class. The analyses of this version did not show major differences for these facial expressions. The main advantage is the speed-up of the simulated tests; this version can execute simple tests almost instantly.
The first, rather small, tests were executed using the first version to ensure the whole application could run on a real robot as well. All the tests described in the next section were executed using the second variant of the Act class. While the Act class mainly uses the facial expression data to build the reward signal, data from external sensors can be added in this class as well. A weighted total reward can be designed in that case. Figure 8.7 shows the architectural details of how additional sensors could be added to this Act class. Future sensors can be easily implemented and added

to the total reward with this approach. The two external sensors described in Section 6.3.2 could be implemented using this paradigm.

Figure 8.7: Simple architectural design to combine additional sensors with the robotic facial expression rewards.

8.1.5 Simulator

The correct parameters for the different test setups must be defined in the Simulator class. Inside this class, various agents, policies and virtual patients can be defined, and it is also responsible for correctly registering the Nao Handler onto the Naoqi-bin for the first version of the Act class tests. Because of the significant number of test cases, this class makes use of multiprocessing capabilities. In combination with the second variant of the Act class, running the simulations on multiple processor cores considerably reduced the test durations.

8.2 Simulation tests

The different situations in which the bandit algorithms should perform were described in Chapter 5. These scenarios were used to simulate realistic situations in which several bandit components can be analysed and compared. This section first gives an overview of the different possible test combinations. Then, tests using the simulator were performed on both a single patient and on multiple virtual patients, to show the effects of learning the action preferences for a single patient versus learning more globally, over multiple different situations.

8.2.1 Overview

An overview of all the possible tests is provided in Figure 8.8. There are in total 135 different test cases, but not all of them were tested with this simulator. Some tests were omitted because the algorithms involved had already shown poor performance in previously executed tests. While these tests could give additional insights, the goal of this dissertation is to find an optimal learning algorithm which is preferred in most situations. The different decision points in Figure 8.8 are briefly summarised next.

Figure 8.8: Overview of the test selection procedure or test plan.

8.2.1.1 Agent selection

In Section 6.1, three different bandit agents were designed, and tests will identify which agent should be used to resolve a BD for one or multiple PwD. These three agents were compared in different situations. During multi-patient tests, a contextual agent should be able to handle the differences between the virtual PwD when trained together. The most preferred action of these virtual patients depends on the dementia type, and the learning system should be able to detect the similarities between patients with similar characteristics. The gradient agent has both a baseline parameter and an influence factor; several tests were needed to optimise these parameters for this bandit problem.

8.2.1.2 Policy selection

Section 6.2 defined six different policies, and their performances are compared in several tests. The exploration-exploitation trade-off is crucial for the solution of a bandit problem. Some policies, such as the random policy, were already stated to be inefficient for this bandit problem, but evaluating them gives a lower bound for the other bandits' performances. Some policies have tunable parameters, and several tests were needed to define the best parameter values for this bandit problem.

8.2.1.3 Strategy selection

During the design of the virtual patient in Section 8.1.3, three different strategies were described, namely the optimistic, neutral and worst case. These strategies are used during the tests to simulate the different behaviours of a PwD during an intervention. Algorithms which perform ideally in one situation could perform badly in another one. All algorithms are therefore tested for these three cases, and the real robotic tests should use the best overall bandit algorithm.

8.2.1.4 Adaptation rate

With this simulated approach, additional case studies can be tested, for example, the effects on the learning system when a new PwD is admitted or when the dementia status of a PwD changes suddenly. Tests analysed these use cases by changing the preferable action patterns at a certain moment in time. Bandit algorithms could react differently to this change in behaviour, and these reactions were analysed.

8.2.1.5 Noise

The reward signals are based on sensory data, and during the first tests, the sensors were assumed to work with high accuracy. It would be beneficial to analyse the bandit algorithms when these sensors work less accurately. Tests with noise on the reward signal were executed for several different bandit algorithms to find the most robust ones.

8.2.1.6 Additional sensors

The service personalisation algorithm, defined as pseudocode in Algorithm 3, was used to construct the reward signal during the simulated tests. Additional sensors could increase the accuracy of the reward signal during an intervention. The influence of an additional sensor was tested over several different bandit algorithms and for the three different strategies.

8.2.2 Single Patient tests

During the single patient tests, the bandit algorithms try to learn the action preference for one virtual patient. The situation remained fixed during multiple interventions: the same virtual patient was used during consecutive interventions, and the same BD was encountered before an intervention started. Because contextual information is not needed in this case, the contextual bandit algorithms were left out. The different test cases from Figure 8.8 for a single virtual patient are compared and visualised in the next sections. A first major test compares the different agents and policies for detecting the optimal action procedure. A second test analyses the influence of noise on the reward signal. Finally, the adaptation rate of several algorithms is analysed when the behavioural pattern of the patient changes suddenly.

8.2.2.1 Agent-policy comparison

The normal and the gradient agent determine the action preferences differently. Tests were executed to compare these two agent types, combined with one of the five possible non-contextual policies. All the tests started by creating a pool of 20 randomly generated virtual patients. For every virtual patient, an experiment was repeated 100 times. A single experiment consists of 100 consecutive interventions of the virtual robot. The action preferences were learned over these 100 consecutive interventions and reset in every new experiment. After every intervention, the received reward and the executed action were saved.

Further analyses combined all these interventions and experiments for all the virtual patients in the pool and calculated the average reward, the action optimality score and the total regret value for every intervention timestep. The action optimalities were calculated using the predefined preferred action of every patient in the pool and the registered executed actions. The total regret score in every observation is the difference between the reward value registered during that observation and the best possible reward value for this problem. The reward signal lies in the interval [0, 1]; therefore, the best possible reward value is always 1.

The results of the test cases with different bandit algorithms are summarised in a scoreboard figure. These scoreboards compare the average cumulative total regret scores of the different experiments over the 20 different patients. The cumulative total regret value defines the averaged total loss in reward over the 100 interventions. The lower bound of this score is 0 and the upper limit is 100 for 100 consecutive interventions, because the reward values lie in the interval [0, 1]. Figure 8.9 shows such a scoreboard for the differences between the bandit algorithms when using the optimistic case for the action influences. The colour rating in this scoreboard visualises how optimal a policy is compared to the best possible bandit algorithm, i.e. the algorithm which would obtain a cumulative total regret of zero. The closer a bandit algorithm is to this optimal bandit algorithm, the bluer its cell will be. As discussed in Section 6.2, the random policy performs badly for this problem but can be used as a good lower bound for the other bandit algorithms' performances.

In the scoreboard of Figure 8.9, the gradient agent should use an active baseline. Disabling this baseline parameter resulted in significantly less optimal regret scores compared with the bandit algorithms where this baseline parameter was active. The other gradient parameter, the step-size parameter α, must always be greater than zero; otherwise, it neglects the update values and acts almost randomly in all the bandit algorithms. The UCB policy outperforms all the other policies in this test: the lowest cumulative total regret scores were encountered using this policy for this test situation. Insights into the performances of the, at first sight, beneficial bandit algorithms are given in an overview figure containing three different plots. Figure 8.10 gives such an overview plot for the scoreboard of Figure 8.9.
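The three reported metrics can be computed directly from the saved experiment data; the sketch below shows one way to do so with Pandas, assuming a hypothetical DataFrame layout with experiment, trial, action and reward columns and a known preferred action per patient.

import pandas as pd

def summarise(df, preferred_action, optimal_reward=1.0):
    """Average reward, optimal-action percentage and cumulative regret per trial.

    df: DataFrame with columns 'experiment', 'trial', 'action', 'reward'.
    """
    df = df.copy()
    df["optimal"] = (df["action"] == preferred_action).astype(float)
    df["regret"] = optimal_reward - df["reward"]

    grouped = df.groupby("trial")
    per_trial = pd.DataFrame({
        "avg_reward": grouped["reward"].mean(),
        "pct_optimal": grouped["optimal"].mean(),
        "avg_regret": grouped["regret"].mean(),
    })
    # Cumulative total regret over the consecutive interventions.
    per_trial["cum_regret"] = per_trial["avg_regret"].cumsum()
    return per_trial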


Figure 8.9: Scoreboard with the total cumulative regret scores for different bandit algorithms when the optimal strategy is simulated for a single patient.

The three different graph types are the average reward plot over the multiple interventions on top, a graph visualising the percentage of optimal action selections over the multiple interventions in the middle, and the cumulative total regret scores evolving over the multiple interventions at the bottom. The overview in Figure 8.10 visualises these three plots for two gradient agents using a UCB policy and a Greedy policy respectively, and two normal agents using a UCB policy and the random policy. The results of the random bandit algorithm were added to show the worst case solution for this situation.

Figure 8.10: Detailed overview of the most promising algorithms of the scoreboard in Figure 8.9, or the single patient optimistic case.

Figure 8.10 shows the optimal action selection behaviour of the UCB policy when used with both a normal and a gradient agent. The Greedy policy grows more smoothly towards the optimal value over the multiple interventions when used with a gradient agent, but it does not reach the optimal action selection rate of the UCB policies. Only a small difference can be noted in the smoothness of the optimal action curves between the normal and the gradient agent when used with a UCB policy. Compared with the other policies, the UCB policy usually finds the optimal action faster. In the first few interventions, less optimal actions are selected to reduce the amount of uncertainty about a possibly better action. This behaviour is visible in the first four time steps of the average reward graph of Figure 8.10 and can give the impression of unwanted behaviour. However, other bandit algorithms spread this uncertainty over more than these four initial time steps; because the results are averaged over multiple tests, this behaviour is less visible there. An Epsilon-Greedy-based bandit algorithm will, for example, select the less optimal action more frequently, while the UCB policy evolves fast to an optimal action selection procedure. Similar tests were executed for the other two action influence strategies. The scoreboards for the similar-influences case and for the worst case can be found in Figure 8.11 and Figure 8.12. The same test procedure using 100 consecutive interventions in 100 experiments for 20 randomly created virtual patients was used in these tests.


Figure 8.11: Scoreboard with the total cumulative regret scores for different bandit algorithms when a similar-looking action influence strategy is simulated for a single patient.

Figure 8.12: Scoreboard with the total cumulative regret scores for different bandit algorithms when the worst case strategy is simulated for a single patient.

For these similar and worst case strategies, an increase in total regret scores is noticeable. The previously beneficial bandit algorithms are less optimal in the similar case and even have high regret scores in the worst case. These increases are normal, because the reward signals for these strategies are very low, even when the most beneficial action is selected; comparing these algorithms with the best possible reward value of 1 results in these higher regret scores. The UCB policy still has a better total regret score in these two cases compared to the other policies. Looking at the detailed views of these promising UCB bandit algorithms in Figure 8.13 and Figure 8.14, for the similar and worst case respectively, shows the clear advantages of this UCB policy. The differences between the optimal bandit algorithms are less obvious when using similar-looking action influences, and the optimal action is selected in only 50 percent of the interventions. In this case, the virtual PwD reacts similarly to all the actions, which still makes this a relatively high optimality score; the low average reward values are therefore not surprising. The adaptive Epsilon-Greedy policy, shown in the detailed overview of Figure 8.13, has a similar effect but is less accurate in selecting the most beneficial action patterns. For the worst case strategy, the same explanation applies but with a higher optimal action selection rate. The two UCB bandit algorithms have high total regret scores and very low average reward values, but the optimal policy still selects the preferred action 75% of the time. Even for this worst case strategy, the UCB policy finds the slightly less bad actions to exploit.

Figure 8.13: Detailed overview of the most promising algorithms of the scoreboard in Figure 8.11, for the single patient with similar-looking action influences.

Figure 8.14: Detailed overview of the most promising algorithms of the scoreboard in Figure 8.12, for the single patient worst case.

The first tests in Section 8.2.2.1 show that the UCB policy performs extremely well for this bandit problem. For the optimistic, similar and worst case strategies, this UCB policy leads to a low total cumulative regret score and a high optimal action selection rate after only a few exploring interventions. The gradient agent and the normal agent behave similarly with this UCB policy, but when the gradient agent is used, an active baseline and a step size α > 0 are important. More interesting are the best-performing algorithms when the action influences are similar: these are largely the same algorithms as for the optimistic strategy. When the reward values are almost the same for every action, the normal agent with a UCB policy can still select the most beneficial actions 48% of the time. The UCB policy can efficiently determine the most beneficial actions after exploring a small number of interventions to reduce the amount of uncertainty. The other bandit algorithm, using an adaptive Epsilon-Greedy policy and a gradient agent, has similar effects but is slightly less accurate in selecting the most beneficial actions during the first ten interventions. For the worst case strategy, this same explanation is applicable but with higher optimal action selection rates. The parameters for both the gradient agents and the different policies were not chosen arbitrarily. Appendix B describes the tests for finding the optimal parameters for the Epsilon-Greedy and UCB policies. The optimal values of the other policy and agent parameters were determined with similar approaches. All the following tests were performed with an active baseline and α = 0.8, because different α values do not give additional information about the optimality of a chosen bandit algorithm. The confidence level parameter of the UCB policy was mostly kept at c = 0.1.
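To make these parameter choices concrete, the following sketch shows one possible combination of a gradient agent (per-action preferences updated against an average-reward baseline with step size α) and a UCB selection rule with confidence parameter c. The class names, method names and exact update rules are illustrative assumptions and do not necessarily mirror the implemented bandit package.

import numpy as np

class GradientAgent:
    # Sketch: per-action preferences H(a) updated against an average-reward baseline.
    def __init__(self, n_actions, alpha=0.8, use_baseline=True):
        self.alpha = alpha
        self.use_baseline = use_baseline
        self.preferences = np.zeros(n_actions)   # H(a)
        self.counts = np.zeros(n_actions)        # N(a), used by the UCB policy
        self.avg_reward = 0.0                    # running baseline
        self.t = 0
        self.last_action = None

    def observe(self, reward):
        self.t += 1
        baseline = self.avg_reward if self.use_baseline else 0.0
        pi = np.exp(self.preferences) / np.sum(np.exp(self.preferences))
        one_hot = np.zeros_like(self.preferences)
        one_hot[self.last_action] = 1.0
        # Gradient-bandit update: push preferences towards actions that beat the baseline.
        self.preferences += self.alpha * (reward - baseline) * (one_hot - pi)
        self.avg_reward += (reward - self.avg_reward) / self.t
        self.counts[self.last_action] += 1

class UCBPolicy:
    # Sketch: rank actions by preference plus an exploration bonus that shrinks
    # as an action's selection count grows (c = 0.1 in most tests).
    def __init__(self, c=0.1):
        self.c = c

    def choose(self, agent):
        untried = np.where(agent.counts == 0)[0]
        if untried.size > 0:
            agent.last_action = int(untried[0])   # try every action at least once
        else:
            bonus = self.c * np.sqrt(np.log(agent.t) / agent.counts)
            agent.last_action = int(np.argmax(agent.preferences + bonus))
        return agent.last_action

A larger c enlarges the exploration bonus and lengthens the initial exploration phase discussed above.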

8.2.2.2 Noisy sensor

Another interesting test case is to investigate how the bandit algorithms perform when the sensors are noisy. Therefore, the effect of noise on the facial expression values was analysed for several different bandit algorithms. A noise level parameter was systematically increased over several test iterations. Again, these tests simulated 100 interventions for a virtual PwD and were repeated 100 times for 20 randomly created patients per noise level. The total regret values after 100 interventions for varying noise levels were analysed and visualised in a detailed view for the most promising algorithms determined in Section 8.2.2.1. The results for both the gradient and the normal agent with a UCB policy are given in Figure 8.15 for the optimistic strategy. The total regret values for increasing noise levels are shown in the first graph of Figure 8.15. The other three graphs of this figure show the detailed view of the algorithms for a random noise level corrupting about 61% of the real sensory data. The average rewards of all bandit algorithms become similar when the noise level increases. The gradient agent with a UCB policy can still select the most preferred actions with high probability.
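How the noise is injected is not spelled out here, so the sketch below only illustrates one plausible noise model: with probability equal to the noise level, a facial-expression reading is replaced by a uniformly random value in [0, 1]. The function name and the exact corruption mechanism are assumptions, not the simulator's actual implementation.

import numpy as np

def add_sensor_noise(expression_values, noise_level, rng=None):
    # Hypothetical noise model: each reading is replaced by a random value in [0, 1]
    # with probability noise_level (e.g. 0.61 for roughly 61% corrupted data).
    rng = rng or np.random.default_rng()
    values = np.asarray(expression_values, dtype=float)
    corrupted = rng.random(values.shape) < noise_level
    return np.where(corrupted, rng.random(values.shape), values)

# Example: a noisy version of one intervention's averaged expression scores.
clean = [0.43, 0.04, 0.02, 0.18, 0.16]   # neutral, happy, surprised, angry, sad
noisy = add_sensor_noise(clean, noise_level=0.61)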

Figure 8.15: Detailed overview of several bandit algorithms when noisy sensory data influenced the learning process for an optimistic strategy.

The same test was executed for the worst case strategy, and Figure 8.16 visualises these results. The first graph of this figure shows a decrease in total regret score when the noise level increases. The average reward values for all bandit algorithms increase when noise is added in this situation, but this increase does not influence the optimal action selection procedure. The other three graphs of Figure 8.16 show the detailed view of the algorithms for a random noise level corrupting about 61% of the real sensory data. The differences between the algorithms are subtle, but noise added to the reward values makes them look similar. Therefore, the similar-looking action influences strategy would not give interesting results when noise is incorporated into the data, and those results are not listed here.

Figure 8.16: Detailed overview of several bandit algorithms when noisy sensory data influenced the learning process for a worst case strategy.

Increasing the noise on the sensory data decreases the accuracy of the bandit algorithms. The test in Section 8.2.2.2 concludes that a high noise level makes the average rewards look similar, so the bandit algorithms have more difficulties in determining the most optimal actions. In the optimistic case, the UCB policy with a gradient agent needs more time to determine the optimal actions. For the worst case strategy, the total regret scores decrease because of the increasing average reward values. However, the optimal action rates decrease when the noise level increases. It is clear that none of the proposed bandit algorithms can handle a significant amount of noise and that they eventually all become identical. The different bandit algorithms did not differ much from each other when noise was incorporated into the data.

8.2.2.3 Change in behaviour

The previous tests mimicked the patient’s behaviour identically for each of the 100 interventions. However, changes in behavioural patterns frequently occur for PwD. This test investigates how the bandit algorithms react to these pattern variations. At the 50th intervention of every experiment, the simulated patient’s optimal action changes to a different one. The bandit algorithms must try to detect this change in behaviour and adapt their optimal action procedure. In total, 120 consecutive interventions per experiment were executed in 100 experiments for each of the 20 randomly created virtual patients. The results are visualised in the detailed overview graphs of Figure 8.17. Actions still influenced the patient’s characteristics according to the optimal strategy.
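A minimal sketch of this test case is given below. It assumes a toy virtual patient whose preferred action yields a high expected reward and whose preference is swapped halfway through; the simulator's own patient model is richer, so the class and the reward distribution are illustrative only. The agent and policy objects are expected to expose the choose/observe interface of the earlier sketch.

import numpy as np

class ToyVirtualPatient:
    # Toy stand-in for the simulator's virtual PwD: the preferred action yields a
    # high expected reward, all other actions a low one.
    def __init__(self, n_actions, optimal_action, rng):
        self.n_actions = n_actions
        self.optimal_action = optimal_action
        self.rng = rng

    def react(self, action):
        mean = 0.9 if action == self.optimal_action else 0.2
        return float(np.clip(self.rng.normal(mean, 0.05), 0.0, 1.0))

def run_behaviour_change(agent, policy, n_actions=4, n_interventions=120, change_at=50, seed=0):
    # The patient's preferred action is swapped at intervention `change_at`,
    # and the bandit algorithm has to re-learn the optimal selection procedure.
    rng = np.random.default_rng(seed)
    patient = ToyVirtualPatient(n_actions, optimal_action=1, rng=rng)
    rewards = []
    for t in range(n_interventions):
        if t == change_at:
            patient.optimal_action = 3
        action = policy.choose(agent)
        reward = patient.react(action)
        agent.observe(reward)
        rewards.append(reward)
    return rewards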

Figure 8.17: Detailed overview of the most promising algorithms for the change in behaviour pattern recognition.

A clear decrease in performance is noticed around the 50th intervention. Both the average reward values and the percentages of optimal action selections drop to zero, and the cumulative regret curves change to a more linear increase. The bandit algorithms can all detect this change in behaviour and adapt their policies according to the newly received information.

The gradient agent with a UCB policy adapts its action selection procedure within 20 interventions. The gradient agent with an adaptive Epsilon-Greedy policy reacts quickly to the change in behaviour but has more difficulties in optimising its action selection plan again. The other bandit algorithms need more time to adapt their selection procedures. Pattern changes are quite common for PwD, and all the bandit algorithms can detect and react to these changes in behaviour. The stability of the UCB bandit algorithms is shown in the cumulative regret graph of Figure 8.17. The cumulative total regret value at the end of the experiment increases the least for the UCB policy in combination with a gradient agent. The UCB policy together with a normal agent has a similar adaptation pattern, but needed more time to optimise its action selection procedure in this test.

8.2.3 Multi-patient tests

In this section, bandit algorithms try to learn the action preferences of multiple patients. Section 8.2.2 analysed bandit algorithms for several different patients in separate experiments. In this section, every intervention can involve a different patient, and the bandit algorithms have to deal with these different patient occurrences. Contextual information is now available, and contextual agents can be used to learn to act globally. First, the contextual bandit algorithms are analysed in more detail to show the effects of the global learning algorithms. A second test case compares the results of a one-bandit-per-patient strategy with those of a single-bandit-for-all strategy, for different policies and agents. Next, noisy sensors are taken into account to search for robust bandit algorithms. Behavioural pattern changes are analysed in a fourth test case, similar to the test case in Section 8.2.2.3. Finally, a virtual external sensor is added to show the effect of using combined reward values.

8.2.3.1 Contextual bandit algorithm

Contextual bandit algorithms can use the information gathered from several different interventions to predict the action preferences of the patient currently receiving the intervention. These predictions are based on all the previously performed interventions, without differentiating between patients or BDs. Therefore, the bandit algorithms must be able to distinguish between these situations by themselves in order to execute a personalised action for every patient. All policies described in Section 6.2 can be used together with a contextual agent. In a first test, these policies were compared in combination with this contextual agent. In every intervention, the simulator chooses a virtual patient from a pool of 30 randomly created virtual patients. Note that different patients are selected in consecutive interventions, while the tests in Section 8.2.2 chose a different patient per experiment. This experiment was repeated 100 times for 20 different patient pools. Figure 8.18 shows the detailed overview when actions influence the patient characteristics according to the optimal strategy. The contextual agent uses all the information of the virtual PwD to determine which action is the most probable. When a contextual agent with a policy other than the UCB and random policies is analysed, the action selection performance increases to more than 80% after 100 interventions for all the different patients. These bandit algorithms benefit from the additional knowledge available inside the contextual agent and exploit the preferred actions quite often. The average reward values increase slowly to their optimal values.
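One possible shape of such a contextual agent is sketched below: the patient's characteristics are encoded as a feature vector, and a separate linear (ridge-regression) reward model is kept per action. This is only an assumption about how the predictions could be implemented; the dissertation's contextual agent and contextual policy may use a different model.

import numpy as np

class ContextualAgent:
    # Sketch: one ridge-regression reward model per action, fitted on the patient
    # contexts observed in previous interventions.
    def __init__(self, n_actions, n_features, ridge=1.0):
        self.A = [ridge * np.eye(n_features) for _ in range(n_actions)]  # X^T X + ridge*I
        self.b = [np.zeros(n_features) for _ in range(n_actions)]        # X^T y

    def predict_rewards(self, context):
        context = np.asarray(context, dtype=float)
        return np.array([np.linalg.solve(A, b) @ context for A, b in zip(self.A, self.b)])

    def choose(self, context):
        # Greedy selection on the predicted rewards; any of the policies from
        # Section 6.2 could be plugged in here instead.
        return int(np.argmax(self.predict_rewards(context)))

    def observe(self, context, action, reward):
        context = np.asarray(context, dtype=float)
        self.A[action] += np.outer(context, context)
        self.b[action] += reward * context

Because the model is shared over all patients, every intervention for any patient refines the predictions for all of them.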

Figure 8.18: Comparison between the different contextual bandit algorithms.

The exploration-exploitation problem is different for these bandit algorithms, and the previously beneficial UCB policy does not perform well in combination with this contextual agent in this multi-patient setting. The UCB policy usually selects the first action during the first intervention. This action has a high probability of being optimal, because a virtual PwD is most likely to have AD, and the optimal action associated with this dementia disease was unintentionally the first one on the action list. The UCB policy will first reduce the uncertainty of the other possible actions, which results in the decreasing pattern during the first 50 interventions. Afterwards, enough certainty for all patients is gathered to exploit the best action, which it had coincidentally already selected in the first intervention. Similar tests, with a different action list, were executed, because the initially selected action influences the whole action selection procedure. The results are shown in the detailed overview of Figure 8.19. Here, the action optimality curves grow slowly to their optimal values. The end results of both tests are, however, the same for this UCB policy, and it is still less beneficial than the other policies. The more beneficial policies are the contextual policy, the Greedy policy, the adaptive Epsilon-Greedy policy and the Epsilon-Greedy policy. The contextual policy even slightly outperforms the other ones due to its predictive capacity.

Figure 8.19: Comparison between the different contextual bandit algorithms but with a different initial action list to differentiate from the test in Figure 8.18.

8.2.3.2 Agent-policy comparison

In the single patient tests, the action preferences were learned rather quickly. It could therefore be beneficial to build a system where the preferred bandit algorithms of Section 8.2.2 are used for every patient separately. Such a system could learn the optimal action procedures more quickly than the contextual single-bandit-for-all algorithms. During this test, the bandit algorithms analyse a pool of 30 random virtual patients during 200 interventions. This experiment is repeated 100 times for 20 randomly created patient pools. The differences between the one-bandit-per-patient algorithms and the single-bandit-for-all algorithms, using optimistic action influences, are visualised in the scoreboard of Figure 8.20. The contextual agent uses all the information of the virtual PwD to determine which action is the most probable. The other agents keep a mapping from patients to action values and update these values per patient. Different policies and agent types were compared in this situation. The contextual policy was excluded from these tests because the gradient and normal agent do not have the predictive capacities this policy requires. The scoreboard shows the advantages of using a single-bandit-for-all contextual algorithm to determine the most probable action. Figure 8.21 gives the details for the most promising algorithms.
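The two strategies compared here can be wrapped behind the same interface, as sketched below. The wrapper names and method signatures are illustrative; the per-patient factory is assumed to return a non-contextual bandit (for example a gradient agent with a UCB policy), while the shared bandit is assumed to expose the contextual choose/observe interface of the earlier sketch.

class OneBanditPerPatient:
    # Sketch: a separate, non-contextual bandit per patient identifier.
    def __init__(self, bandit_factory):
        self.bandit_factory = bandit_factory   # returns an object with choose()/observe(reward)
        self.bandits = {}

    def _bandit_for(self, patient_id):
        if patient_id not in self.bandits:
            self.bandits[patient_id] = self.bandit_factory()
        return self.bandits[patient_id]

    def choose(self, patient_id, context=None):
        return self._bandit_for(patient_id).choose()

    def observe(self, patient_id, action, reward, context=None):
        # The per-patient bandit tracked its own last action, so only the reward is needed.
        self._bandit_for(patient_id).observe(reward)

class SingleBanditForAll:
    # Sketch: one contextual bandit shared by all patients; the patient's
    # characteristics are passed in as the context vector.
    def __init__(self, contextual_agent):
        self.agent = contextual_agent

    def choose(self, patient_id, context):
        return self.agent.choose(context)

    def observe(self, patient_id, action, reward, context):
        self.agent.observe(context, action, reward)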


Figure 8.20: Scoreboard with the total cumulative regret scores for different bandit algorithms when the optimal strategy is simulated for multiple patients.

Figure 8.21: A detailed overview of the most promising algorithms of the scoreboard in Figure 8.20.

The Greedy policy in combination with a contextual agent benefits from the contextual information during the task of learning the optimal action patterns over multiple patients. The UCB policy shows the disadvantages of its exploration approach again: it first reduces the amount of uncertainty of all the possible actions for every single virtual patient, while the first selected actions were already nearly optimal. This result was also affected by the initially selected action, but, analogous to the previous test case, the results after 200 interventions are still worse than those of the other bandit algorithms. These same tests were repeated for both the similar action influences and the worst case strategies. The scoreboard plots of these strategies can be found in Figure 8.22 and Figure 8.23 respectively. When the actions influence the patient’s characteristics similarly, the algorithms have difficulties in determining the optimal action. The gradient agent outperforms the contextual agent for all policies except the UCB policy. A detailed view of these algorithms is given in Figure 8.24. The contextual agent has difficulties in learning the action preferences because of the similarity between the reward values. The action predictions inside the contextual agent fluctuate because in one intervention the preferred action is a, while in the next it is b. The gradient agent can differentiate between these small values more easily and will try to exploit them. In this situation, this results in a higher average reward value, but the optimal action selections do not differ from those of the contextual bandit algorithms.


Figure 8.22: Scoreboard with the total cumulative regret scores for different bandit algorithms when similar action influences are simulated for multiple patients.

Figure 8.23: Scoreboard with the total cumulative regret scores for different bandit algorithms when a worst case strategy is simulated for multiple patients.

Figure 8.24: A detailed overview of the most promising algorithms of the scoreboard in Figure 8.22, for the multi-patient similar case.

When these bandit algorithms are compared for a worst case strategy, the differences are small, and all bandit algorithms are close to the worst possible regret score. Figure 8.25 gives the detailed view of this worst case strategy. The average reward values are extremely low for all the bandit algorithms. The optimal action is selected in only 30% of the interventions and does not increase significantly after the 50th intervention. Contextual bandit algorithms are very efficient in learning the action preferences for multiple patients using the additionally available contextual information. However, when the action influences are more similar, the action preferences fluctuate more than needed. This drawback could result in a similar effect as the random policy, because different actions are selected in consecutive interventions for the same PwD. In this case, the set of actions will all have similar effects and would probably not result in huge distractions. If it does, however, it should be noticed by the robot’s sensors and will lead to a lower reward value. The system will then learn to avoid these actions.

Figure 8.25: A detailed overview of some algorithms of the scoreboard in Figure 8.23, for the multi-patient worst case.

8.2.3.3 Noisy sensor

The same noisy sensor tests from Section 8.2.2.2 were repeated in a multi-patient setting to investigate whether these contextual algorithms are sensitive to noise. The noise level of the robotic sensor was increased over several iterations of the test, during 200 interventions for a group of 30 virtual PwD, averaged over 100 experiments for 20 randomly created patient pools. The total regret value for multiple noise levels is analysed using the same detailed view of the bandit algorithms as shown in Figure 8.15. Figure 8.26 shows the beneficial bandit algorithms using an optimistic action influence strategy. The cumulative total regret after the 200th intervention for different noise levels is shown in the first graph of this figure. The other three graphs visualise the detailed view of the algorithms for a random noise level corrupting about 61% of the real sensory data. The average reward values of all these bandit algorithms become similar when the noise level increases. The contextual agent can still select the most optimal actions in about 70% of the interventions, but there is no clear benefit in robustness compared to the other bandit algorithms.

The same comparison was made for the worst case strategy, and Figure 8.27 summarises the results. The total regret decreases when the noise level increases for this action influence strategy. The detailed view of the algorithms for a random noise level corrupting about 61% of the real sensory data shows that the average reward of all bandit algorithms increases with the noise level, but this increase does not influence the action selection procedure. The differences between the algorithms are again subtle. Noise added to the reward values makes them look similar. Therefore, the neutral strategy would not give interesting results when noise is incorporated into the data.

Figure 8.26: A detailed overview of several multi-patient bandit algorithms when noisy sensory data influenced the learning process for an optimistic strategy.

Figure 8.27: A detailed overview of several multi-patient bandit algorithms when noisy sensory data influenced the learning process for a worst case strategy.

8.2.3.4 Change in behaviour

As discussed during the single patient tests in Section 8.2.2.3, changes in behavioural patterns are quite common for PwD. This test applies a similar change in behaviour to analyse the adaptation speed of the multi-patient bandit algorithms. At the 200th intervention of this test, the characteristics of the virtual PwD change and a different action becomes preferred. The bandit algorithms must try to detect this change in behaviour and adapt their optimal action policy. 400 interventions were analysed for a group of 30 virtual PwD, and every test was repeated 100 times for 20 randomly created patient pools. The results for this test case are visualised in the detailed overview plot of Figure 8.28.

Figure 8.28: A detailed overview of the most promising algorithms for the change in behaviour pattern recognition for multiple patients.

The one-bandit-per-patient algorithms show a large decrease in average reward values and less optimal action selection patterns after the 200th time step. This behaviour is not uncommon, given the results analysed earlier in Section 8.2.2.3. However, the contextual agent with a contextual policy outperforms all the other bandit algorithms. Using the contextual information from previous interventions, the agent knows which information influences the action selection procedures for different types of PwD. When the behavioural pattern changes, the contextual information can be used to determine the more beneficial actions. A contextual policy uses exactly this information to predict the action probabilities and selects the most beneficial one. This bandit algorithm shows only a small decrease in performance and recovers very quickly to the optimal action selection procedure for all the patients.

8.2.3.5 Additional virtual sensor

During the test in Section 8.2.3.2, in which actions influenced the patient’s characteristics according to a worst case strategy for multiple patients, the total regret scores were very high, as shown in the scoreboard of Figure 8.23. Additional sensors could help to give more information about the executed action and can provide the bandit algorithms with additional knowledge to improve the action selection procedure. These additional sensors can also be useful when the robot cannot determine the facial expressions of the patient, or when it detects a negative expression while this was not the case. During the tests in this section, multi-patient bandit algorithms are tested using an additional virtual sensor whose reward signals are accurate 80% of the time. The worst case strategy was used to generate the inaccurate reward values from the robot’s sensors. The bandit algorithms analyse a pool of 30 random virtual patients during 200 interventions. This test was repeated

100 times for 20 randomly created patient pools. Figure 8.30 displays a scoreboard with the differences between the bandit algorithms. When this scoreboard is compared with the scoreboard in Figure 8.23, some improvements are notable. A detailed overview of the most promising algorithms is given in Figure 8.29. The total regret scores show a substantial improvement when using an additional sensor, even though the robot itself detects the worst possible action influences. The overview of the most promising algorithms even shows a high increase in optimal action selections. The most beneficial actions are selected about 80% of the time at the end of the experiment when a contextual agent is used together with a Greedy or adaptive Epsilon-Greedy policy. The contextual bandit algorithms easily outperform the other agents when this additional sensor is used.
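How the two reward sources could be merged is sketched below under two explicit assumptions: the additional virtual sensor reports the true reward with 80% probability (and a random value otherwise), and the two signals are combined as a weighted average. The weights and the combination rule are illustrative and not necessarily those of the service personalisation algorithm of Section 6.3.

import numpy as np

def virtual_sensor_reading(true_reward, accuracy=0.8, rng=None):
    # Hypothetical external sensor: correct with probability `accuracy`, random otherwise.
    rng = rng or np.random.default_rng()
    return true_reward if rng.random() < accuracy else rng.random()

def combined_reward(robot_reward, sensor_reward, robot_weight=0.5, sensor_weight=0.5):
    # Weighted average of the robot's (possibly unreliable) reward and the external sensor reward.
    return (robot_weight * robot_reward + sensor_weight * sensor_reward) / (robot_weight + sensor_weight)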

Figure 8.29: A detailed overview of the most promising algorithms of the scoreboard in Figure 8.30, for the multi-patient worst case with an additional sensor.


Figure 8.30: Scoreboard with the total cumulative regret scores for a worst case strategy when an additional sensor is simulated.

8.3 Conclusion

In this chapter, all components discussed in previous chapters were combined in a simulated application. With this simulator, multiple tests can be executed very quickly and without possible negative impacts on PwD. Of course, a simulated environment abstracts away some essential characteristics of the real world, and completely relying on these simulations for selecting an appropriate bandit algorithm can result in unwanted behaviour. The results of the tests performed in this chapter are optimised for the simulated environment, and the conclusions can be different in real situations. However, bandit algorithms with a very low performance during the simulated tests can be ignored in further tests. During the first couple of tests, bandit algorithms were compared when only the preferred action must be decided for a single patient, who behaved identically during each intervention. These were the so-called single patient tests. Different test cases were analysed in Section 8.2.2, such as different action influences, incorporating noise into the reward signal and analysing a change in the behavioural pattern after the optimal selection strategy was already determined. These various tests showed that the UCB policy has some beneficial properties for the problems defined in this dissertation. The small exploration phase and the high accuracy of the exploited actions make this policy highly useful for this bandit problem. When combined with a gradient agent, this policy can even react quickly to pattern changes. When many similar rewards are produced for various actions, this policy can still determine which actions are more preferable. Noisy sensors are harder to deal with, but no other policies perform better than the UCB policy. In a second part, patients could differ between consecutive interventions of a test, and the bandit algorithms had to learn the preferred actions for multiple patients at once. These tests were called multi-patient tests, and two different strategies were applied. The first strategy took the benefits from the single patient tests and used a different bandit algorithm for every patient. Personalised interactions can be guaranteed using this approach, but a significant number of interventions is needed to let all the bandit algorithms converge. Another strategy is to use contextual information to differentiate between patients using their characteristics and, hopefully, find some similarities between these patients and their action preferences. Fewer interventions are needed to learn this single-bandit-for-all algorithm, but whether this approach leads to a personalised intervention strategy for each patient is less certain. The different tests in Section 8.2.3 compared these two strategies, and, as mentioned, the one-bandit-per-patient strategy needed more interventions to reach the same level of optimality as the contextual bandit approaches. These global learning techniques had the additional advantage of dealing very efficiently with changes in behavioural patterns, as tested in Section 8.2.3.4. The cases where these algorithms have difficulties determining an appropriate reward signal can easily be resolved by adding additional sensors. Accurate reward values are critical, because the whole decision process relies on these feedback signals.

9 LEARNING APPLICATION

Chapter 8 used a simulator to test different situations, some of which were described in Section 5. All these simulated tests showed the possible benefits of learning from interactions with PwD and that determining the personalised interactions is possible with the use of bandit algorithms and robotic sensors. However, these tests were designed in an optimal environment where robotic failures did not happen and sensory data was highly accurate. In this chapter, the simulated application is adapted to run in a NH, and several brief tests with a real Nao robot were executed to show the correct functioning of this more realistic application.

9.1 Architectural design

The proposed real-time application has several external instances, such as a database and a real Nao robot, with which it should communicate. Figure 9.1 shows all the interactions necessary to perform an intervention. Before such an intervention takes place, the application should install all the proper modules to start the communication with the robot. When an intervention is needed, the application signals the bandit package, where the corresponding bandit algorithm determines the most beneficial action, whether or not based on contextual data of the patient. After an action is selected, the bandit package informs the Nao robot to start performing. The robot analyses the PwD during the whole intervention and sends its sensory data through an API directly to a database. After the intervention is finished, the bandit package gathers all this sensory data and determines the reward value for this intervention. Finally, the patient’s profile is updated with the results of the last executed action.

Figure 9.1: Overview of the robotic application

Both the application and the bandit package will run on an external system and not directly on the robot because of memory and computational constraints. The two databases can run on this same system, but this is not required. In most NHs, a patient database will already be available; this database stores all personal information of

the PwD. The WONDER database is designed to work with this application and will store the data from both the robot and the environmental sensors. A time-series database, such as InfluxDB as defined in Section 7.1.3, is sufficient for these tasks. A full class diagram of the implemented application is given in Figure 9.2. A brief explanation of the different classes is given next.

Figure 9.2: Class diagram of the robotic application

9.1.1 Application

The main class of the designed system. The Application class will start the communication with the Nao robot and register the WonderModule afterwards. It will wait for a BD event and initiate the intervention when such an event is received.

9.1.2 Wonder module

The Wonder module is essentially identical to the Nao Handler class of Section 8.1.4.2, but it now pushes the gathered facial characteristic values, using an API, directly to a time-series database instead of sending them back to the bandit package. Using this approach, the robot can execute actions independently from the bandit package and does not have to store the sensory data internally.
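A minimal sketch of such a push is shown below, assuming a REST endpoint in the spirit of the API from Section 7.2.2.1. The URL, route and payload fields are hypothetical placeholders, not the project's actual interface.

import time
import requests

API_URL = "http://localhost:5000/api/facial-expressions"   # hypothetical endpoint

def push_expression_values(patient_id, action, expressions):
    # Push one facial-expression reading to the time-series database via the API.
    payload = {
        "patient_id": patient_id,
        "action": action,
        "timestamp": time.time(),
        "expressions": expressions,   # e.g. {"neutral": 0.43, "happy": 0.04, ...}
    }
    response = requests.post(API_URL, json=payload, timeout=5)
    response.raise_for_status()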

9.1.3 Event Manager

During simulation, every single test created a new virtual robot and analysed the sensors directly while this object was active. In this real application, the robot is already up and running, and a BD can happen at any time. Therefore, it is beneficial to announce BD events and let the application react to such event types. How the events should be generated is not implemented in this dissertation, and the tests in this chapter fire the events manually. The EventManager class defines the reactions to these events and implements the observer pattern: when a BD event occurs, the event manager notices this and notifies all the currently registered handlers. These handlers can react to this event and can, for instance, change their behaviour based on the type of the event. The Application class is such a handler and will notify the bandit package to execute an action when an event is fired.

With this design, other types of sensors can be triggered when a certain event occurs and evaluated separately from the robot.
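The observer pattern described above can be sketched as follows; apart from the EventManager name, the class and method names (including the BanditSetup entry point) are illustrative assumptions.

class EventManager:
    # Observer pattern: handlers register themselves and are notified of every BD event.
    def __init__(self):
        self._handlers = []

    def register(self, handler):
        self._handlers.append(handler)

    def fire(self, event_type, patient_id):
        for handler in self._handlers:
            handler.handle_event(event_type, patient_id)

class ApplicationHandler:
    # Stand-in for the Application class: it reacts to a BD event by starting an intervention.
    def __init__(self, bandit_setup):
        self.bandit_setup = bandit_setup

    def handle_event(self, event_type, patient_id):
        # Assumed BanditSetup entry point; the real class may expose a different method.
        self.bandit_setup.run_intervention(patient_id)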

9.1.4 BanditSetup

This class eventually gathers all the results and updates the corresponding records in the patient’s database. The BanditSetup class looks very similar to the Environment class of Section 8.1.2, because it links the agent and bandit together and executes the same sequence of bandit algorithm steps:

action = agent.choose(patient)               # the policy selects an action for this patient
reward = self.bandit.pull(action, patient)   # the action is executed and a reward value is built
agent.observe(reward, patient)               # the agent updates its estimates with this reward

9.1.5 Bandit package

This is the same bandit package described in Section 8.1.1, but with all virtual settings set to false.

9.2 Robot tests

Some brief tests were performed to verify the correctness of the designed application and to show that the bandit package can learn to make correct decisions in real-life situations. Two different test persons were asked to mimic facial expressions in front of a Nao robot. The results of these two tests are given in this section.

9.2.1 Gradient agent performance

During the simulations, the gradient agent combined with a UCB policy gave some promising results when the action preferences had to be determined for a single patient. During this test, this UCB bandit algorithm was used in a more realistic situation, with a real Nao robot that examines the facial characteristics of a real person. 20 interventions were executed, in which the UCB bandit algorithm determined which action to execute. The resulting facial expressions were stored in a time-series database using the REST API described in Section 7.2.2.1 and afterwards analysed by the bandit algorithm. Two different people tested this real application; neither of them suffered from a dementia-related disease. During 20 interventions, the robot pronounced four different actions, and these people reacted to the proposed actions according to the schedule in Table 9.1.

action   reaction
0        Show a neutral face
1        Smile or laugh
2        Look scared or bored
3        Be angry / do not look at the robot

Table 9.1: Expected reactions according to the robotic actions

The two test persons were 24 and 27 years old, respectively, and both male.

9.2.1.1 First test approach

The facial expressions of the first test person were registered during all interventions, with a duration of 30 seconds per performed action. The averages of the facial expressions, organised per action, for the first test person are given in Table 9.2.

action   neutral      happy        surprised    angry        sad
0        0.42682725   0.04028256   0.01780493   0.17542878   0.15783828
1        0.13136363   0.44161287   0.02447744   0.16481418   0.12662075
2        0.27100786   0.18650316   0.05883752   0.25232805   0.23132339
3        0.15066298   0.32062203   0.02318146   0.30715455   0.11290887

Table 9.2: Average facial expression scores for the first test person

The registered average facial expressions in Table 9.2 show that this test person could easily differentiate between a neutral and a happy facial expression. Looking scared or bored was harder for this person, and the robot detected a high number of positive facial expressions when angry expressions were expected. The selected actions and reward values during the interventions are visualised in Figure 9.3 for a gradient agent using a UCB policy with confidence parameter c = 0.01. The labels of the data points are the actions chosen by the bandit algorithm.
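To make the link between the expression scores in Table 9.2 and the reward curve of Figure 9.3 more tangible, the sketch below shows one plausible way to map averaged expression scores to a scalar reward in [0, 1]. The weights and the clipping are purely illustrative assumptions and are not the weighted-average service personalisation algorithm of Section 6.3.

def expression_reward(expr, weights=None):
    # Hypothetical mapping: positive expressions contribute positively, negative ones negatively.
    weights = weights or {"neutral": 0.2, "happy": 1.0, "surprised": 0.3, "angry": -1.0, "sad": -0.8}
    raw = sum(weights[name] * value for name, value in expr.items())
    return max(0.0, min(1.0, 0.5 + raw))   # clip into the [0, 1] reward range

# Example with the action 1 row of Table 9.2 (smiling or laughing).
action_1 = {"neutral": 0.131, "happy": 0.442, "surprised": 0.024, "angry": 0.165, "sad": 0.127}
print(expression_reward(action_1))   # roughly 0.71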

Figure 9.3: Reward values of the gradient agent using an UCB(c=0.01) policy for the real robotic test of person 1.

The application received a high reward around the 9th intervention, when action 1 was selected. This substantial reward suggested to the bandit algorithm that it should choose this action more often. Between the 14th and 18th intervention, the reward values were not as expected, and the certainty about this action's optimality was reduced. The strength of the UCB policy is shown in this small example, because in the 20th intervention a sub-optimal action was selected to gain more certainty about the current action selection procedure.

9.2.1.2 Second test approach

The second test person’s expressions were recorded during all interventions with a duration of 10 seconds per action to investigate the robot’s detection capacities for shorter actions. This test person was asked to ignore the robot completely when action 3 was announced. Averages for the facial expressions, organised per action, for the second test person are given in Table 9.3.

action   neutral      happy        surprised    angry        sad
0        0.10785714   0.00642857   0.01333333   0.13785714   0.2345238
1        0.05774646   0.33179213   0.10022848   0.07839935   0.35491048
2        0.12407637   0.06319917   0.03101135   0.0945098    0.35386996
3        0.0          0.0          0.0          0.0          0.0

Table 9.3: Average facial expression scores for the second test person

Table 9.3 shows that the second test person had difficulties showing neutral and happy facial expressions. Facial expressions for the fourth action were never registered, because this person was supposed to ignore the robot during the execution of that action. The selected actions and reward values during the interventions are visualised in Figure 9.4.

Figure 9.4: Reward values of the gradient agent using an UCB(c=0.01) policy for the real robotic test of person 2.

This test clearly shows the adaptation rate of the bandit algorithm. The action selected in the first intervention should be the most beneficial one, but the application received a relatively low reward value. The most useful action was chosen again around the 6th intervention, but because the reward of the 7th intervention was low again, much uncertainty remained in the learning algorithm. The application started to exploit this beneficial action when the certainty about the other, less beneficial actions increased between the 7th and 9th intervention.

9.3 Conclusion

In this chapter, a more realistic application was designed to learn the action preferences of PwD residing in a NH. The application can run autonomously when a Nao robot is already up and running and can be integrated within the WONDER project later on. Several brief tests were executed using this designed application with two real test persons and a real Nao robot. The test persons were asked to mimic several predefined facial expressions, one per action, in front of the robot. The robot analysed these expressions and pushed the values to a database. The learning system analysed these facial expression data after the intervention was finished and built an appropriate reward signal for the executed action. The bandit algorithm used in both test cases was the gradient agent combined with a UCB policy. Both tests demonstrated the correct functioning of the designed application of Figure 9.1. The application can make accurate decisions, even if the robot was unable to detect the facial expressions during certain actions. Additional sensors can help to build a more precise reward signal when the facial expressions are scarce. The reward signal can be determined for both short and long actions. Actions with a longer duration have more time to build an accurate reward signal because more facial expressions are captured. Shorter actions can still produce a large reward, due to the weighted averages inside the service personalisation algorithm described in Section 6.3.

10 CONCLUSION AND FUTURE RESEARCH

This dissertation investigated whether self-learning algorithms could be designed for the personalised interaction with people with dementia when a particular behavioural disturbance occurs and needs to be alleviated. This last chapter revisits the conclusions of this dissertation and brings it to its closure. The chapter consists of three distinct sections: a summary of the whole study is given in Section 10.1, conclusions drawn from the performed tests are presented in Section 10.2, and suggestions for future research are outlined in Section 10.3.

10.1 Summary

The occurrence of behavioural disturbances, such as agitation and aggression, is still the main reason to hospitalise people with dementia in a nursing home. The increased strain on the available resources within health care has resulted in a very low number of caregivers in these specialised centres. Therefore, medical treatments are often used to suppress these disturbances. The drawback of these rather limited medical interventions is that they do not investigate the underlying needs of the patient and that they often work only for a short amount of time. The IMEC WONDER project examines how a robot could assist the nursing staff by using non-pharmacological interventions to alleviate the behavioural disturbances and improve both the working conditions of the caregivers and the quality of life of the patients. The WONDER project builds an intervention strategy in which a Zora robot can operate semi-autonomously, 24/7, on the nursing home floor to alleviate behavioural disturbances using person-centric care. The main goals of the WONDER project are the independent character of the robot, which should be able to walk or ride from one resident to another, and the ability to automatically detect the manifestation of a behavioural disturbance. This dissertation focused on the intervention part of this WONDER project and on how to make these interventions person-centric while ensuring that the executed interventions have a positive effect on the patient. Before solutions were investigated, a study of several different dementia types and their related disturbances was made. The main conclusion of this research was that there exists a clear link between the dementia type and the manifestation of the disease in a particular region of the brain. All people with a dementia-related disease suffer from one or more behavioural disturbances. Studies have shown that there is a relation between the dementia types and the occurrence of several behavioural disorders, together with the severity of the disease. Dementia diseases are progressive, and the development of the disease can be analysed by several so-called Dementia Outcome Measures. These measures evaluate the patients based on several metrics, from cognitive impairment and psychological symptoms up to their quality of life. Several studies showed that these outcome measures evaluated positively when non-pharmacological treatments were used. Robotic therapy within healthcare became a lot more popular during the last decade because of the robots’ simplicity to work with and their ability to react to and interpret voice commands. Several studies compared the usability of these robots when they operated within nursing homes specialised in people with dementia. These studies showed that toy-looking robots like Nao had a positive effect on the neuropsychiatric symptoms of these patients. Most of these studies used robots in group therapy sessions with some preprogrammed exercises or interactions. Studies performing robotic person-centric interventions for individual patients were rather uncommon. Person-centric care can only be used when there is some knowledge about the patient and about how he or she reacts to a particular situation during the execution of an appropriate action. When the nursing staff tries to alleviate a disturbance of a patient, they use the experience gathered from previous interactions with this patient to act more efficiently and resolve these problems as quickly as possible. These caregivers ’learned’ these interactions through practice.
Learning the preferences of different patients is crucial for person-centric care. When a robot should try to alleviate these

disturbances, it could be beneficial to determine these preferences in a similar way to how humans do. People learn by trial and error, and some tasks can be learned more quickly than others. This trial and error learning approach is adopted in many robotic applications and is known as reinforcement learning. Reinforcement learning is a subdivision of machine learning, which is more generally defined as allowing computers to find useful information in data without explicitly programming them. In reinforcement learning, this information is the link between the executed actions and the reward received from these performed actions. To find these links, an agent performs actions and analyses the effects on the environment through sensors. Many different reinforcement learning problems exist, from problems with continuous action spaces and various environment states to simpler cases where actions only influence the current situation. Here, the agent is a robot which can execute a fixed number of actions, and the environment is the patient’s current status. In this dissertation, a simplified reinforcement learning problem is applicable, because only one action should be performed during an intervention and the number of actions is limited. K-armed bandits were designed for such reinforcement learning problems and are historically defined as the problem in which a gambler is at a row of slot machines and has to decide which machines to play, how many times to play each machine and in which order to play them. When played, each machine provides a random reward from a distribution peculiar to that machine. The objective of the gambler is to maximise the sum of rewards earned through a sequence of lever pulls. This problem is analogous to the determination of the patient’s action preferences, because the gambler resembles the robot, which executes actions on people with dementia suffering from a behavioural disturbance instead of pulling levers at slot machines. After the explanation of the whole reinforcement learning problem, several bandit algorithms were developed by combining the essential components. A bandit algorithm consists of an agent which executes and observes the actions, a policy which determines which action should be performed, and a reward signal. This reward signal is a feedback value gathered from the environment and determines how well the executed action influences the patient. A distinguishing, correct and accurate reward signal is necessary to define which actions are the most beneficial to resolve a behavioural disturbance for a particular patient. The designed reward signal is based on an adaptive service personalisation algorithm and is composed of several sensor values to deliver this accuracy. Some sensory data will be more beneficial for certain disturbances than others. Three different types of agents were designed and analysed in this dissertation: a normal agent, which takes the reward values directly to influence the feedback scores; a gradient agent, which uses the relative differences between the actions to define more optimal feedback values; and a contextual agent, which uses information from different patients to determine the influence of the feedback value on the patients. This contextual agent is able to learn on a global scale and benefits from previously executed actions on different patients.
While the agents can determine the influence of each action on the patient, the action selection procedure is entirely controlled by an implemented policy. These policies make different choices regarding the exploration-exploitation paradigm. Before the best action can be exploited, enough information about all the actions should be gathered by exploring them. When the robot should stop exploring the action space and start exploiting the most beneficial action is rather unclear and depends on the situation and the patient. This dissertation compared six different policies to investigate their advantages regarding this problem. Testing all these different bandit algorithms in real time, using a robot at a nursing home, would be too cumbersome for the nursing staff and could have adverse effects on the residing patients. Therefore, the bandit algorithms were first tested using a simulated application where different scenarios could be tested more efficiently, without the negative side effects of failing sensors or unwanted behaviour of the patients. The faster execution of multiple interventions and the high number of available virtual patients are two of the main advantages of this simulation approach. During the simulation tests, a clear benefit of the gradient agent using a UCB policy was notable when behavioural disturbances for a single patient were considered. This bandit algorithm was able to select the action preferences to alleviate the behavioural disturbances in several different situations and had better results than other bandit algorithms in the case where the actions had comparable effects. Multi-patient tests showed the advantages of

learning on a global scale using the contextual agent. The full contextual bandit algorithm, where both the agent and the policy are contextual, was able to find a link between the executed action and the patient’s information and could even use this information in further predictions to determine the most beneficial actions more quickly. The tests where the behavioural pattern of the patient changed at a particular time step showed a clear benefit of using this full contextual bandit algorithm over other bandit algorithms, even when these other algorithms had very promising results in the single patient case. Further test cases gave some more insights into the performance of the algorithms when the noise level of the sensors was increased heavily or when additional sensors were added. In this last case, an increase in accuracy of the reward signal leads to a more accurate action selection procedure, and the incorporation of these additional sensors could be beneficial. In the last chapter, the simulated application was redesigned to work as a real robotic application. The Nao robot, which was used during this dissertation, has a limited amount of storage capacity and processing power, and intensive applications should preferably not run directly on it. The bandit algorithms were in this respect already isolated from most of the other modules and could easily be integrated into a new application. Sensory data should not be pushed directly to the application, in order to reduce the dependencies between the robot and the designed application. Instead, all sensor processing units push their values to a designed API which handles the further processing of these sensor values. This approach has the advantage that different sensor and database technologies can be used together and easily integrated into this design. The API stores the current values in a time-series database, and the application can easily interact with this database to construct an appropriate reward signal when an intervention has taken place. The correct functioning of this real application was tested using two test persons. The gradient agent together with the UCB policy was used in a predefined situation to investigate the correctness of the combined components and to verify the learning abilities of this bandit algorithm using a real robot. The results were promising and showed the benefits of this design.

10.2 Conclusion

The self-learning aspects of this dissertation were fulfilled by the design of different bandit algorithms which can benefit from the experience learned over several interventions to eventually execute the most beneficial actions. The bandit algorithm using a gradient agent in combination with a UCB policy was able to determine the preferred actions very rapidly in different situations, both during real and simulated tests. The personalised interaction aspects were achieved by learning more globally over multiple patients while using contextual information to distinguish their preferences from each other. These contextual bandit algorithms used this additional information to personalise the interaction for every patient. The additional contextual information resulted in further benefits, such as the detection of, and reaction to, a change in behavioural patterns. Finally, this application could be used to investigate the link between several dementia-related characteristics of a patient and their reactions during interventions. Studies have already shown the link between non-pharmacological therapies and dementia-related diseases, but using this application, other aspects such as personal patient information or dementia outcome scores could reveal other possible interactions as well. This designed learning application can be used in the WONDER project, which will make personalised robotic interventions possible rather than the robotic group therapy sessions offered now.

10.3 Recommendations for further research

Several aspects of this dissertation were tested in a simulator, and these simulations do not guarantee the correctness of the application in real world situations. Therefore, more real tests should be performed, preferably with elderly people or people with dementia. First tests should acquire enough facial expression data to adapt the simulator to a

more realistic setting, and the different bandit algorithms should be compared again to make sure the best possible algorithms are chosen for further tests. This adaptation will require some field knowledge to tune the corresponding parameters. More integrated tests should then be executed in nursing homes when the WONDER project is operational and the robot can autonomously move from one patient to another. Contextual information was used in several tests during simulation with multiple virtual patients, but it was never tested in a realistic setting. Several tests should be performed before the same conclusions as in the simulation tests can be drawn. Only the contextual agent can fully benefit from the use of this additional information. Other approaches could be investigated to let other systems benefit from this contextual information as well, for example by determining clusters of similar patients based on the preferred actions. These methods could lead to more insights into the link between contextual information and the proposed selection procedures. When feedback values were similar for all actions, tests showed that all bandit algorithms had difficulties determining the preferred action. A solution to this problem was the ability to add additional sensors which are more specialised in investigating a particular disturbance and can be used together with the robotic sensors to increase the accuracy of the reward signals. While two of these sensors were designed and implemented, for the wandering and the screaming behavioural disturbance, real tests were never executed using these sensors. Several other sensors could be investigated to detect and analyse even more behavioural disorders and can be combined in a weighted manner to eventually return an accurate reward value, personalised for the current situation and the current PwD. The designed application was specified for the case where action preferences must be determined for people with dementia to alleviate and prevent behavioural disturbances. This same application could be adapted to several other healthcare domains where learning from interactions is possible. For example, children with autism frequently interact and play with these social robots in order to increase their social capabilities. Learning robotic action preferences could be useful in this context as well. Small adaptations will be needed for the reward signal, but all other components can be reused in these studies.



100 [34] I. Hospital Pricing Authority, “Standardised Mini-Mental State Examination (SMMSE): Guidelines for admin- istration and scoring instructions,” 2014. [Online]. Available: https://www.ihpa.gov.au/sites/g/files/net636/f/ publications/smmse-guidelines-v2.pdf [35] L. Velayudhan, S.-H. Ryu, M. Raczek, M. Philpot, J. Lindesay, M. Critchfield, and G. Livingston, “Review of brief cognitive tests for patients with suspected dementia,” International Psychogeriatrics, vol. 26, no. 08, pp. 1247–1262, aug 2014. [Online]. Available: http://www.ncbi.nlm.nih.gov/pubmed/24685119 [36] J. W. Ashford, V. Kumar, M. Barringer, M. Becker, J. Bice, N. Ryan, and S. Vicari, “Assessing Alzheimer severity with a global clinical scale,” Int.Psychogeriatr., vol. 4, no. 1041-6102, pp. 55–74, 1992. [Online]. Available: http://www.ncbi.nlm.nih.gov/pubmed/1391672 [37] T. Erkinjuntti, L. Hokkanen, R. Sulkava, and J. Palo, “The blessed dementia scale as a screening test for dementia,” International Journal of Geriatric Psychiatry, vol. 3, no. 4, pp. 267–273, oct 1988. [Online]. Available: http://doi.wiley.com/10.1002/gps.930030406 [38] G. C. Léger and S. J. Banks, “Neuropsychiatric symptom profile differs based on pathology in patients with clinically diagnosed behavioral variant frontotemporal dementia,” Dementia and Geriatric Cognitive Disorders, vol. 37, no. 1-2, pp. 104–112, 2014. [Online]. Available: http://www.ncbi.nlm.nih.gov/pubmed/24135712 [39] D. E. Clarke, J. Y. Ko, E. A. Kuhl, R. van Reekum, R. Salvador, and R. S. Marin, “Are the available apathy measures reliable and valid? A review of the psychometric evidence,” pp. 73–97, jan 2011. [Online]. Available: http://www.ncbi.nlm.nih.gov/pubmed/21193104 [40] P. Voyer, R. Verreault, G. M. Azizah, J. Desrosiers, N. Champoux, and A. Bédard, “Prevalence of physical and verbal aggressive behaviours and associated factors among older adults in long-term care facilities,” BMC Geriatrics, vol. 5, no. 1, p. 13, nov 2005. [Online]. Available: http://www.ncbi.nlm.nih.gov/pubmed/16280091 [41] N. Wongpakaran, T. Wongpakaran, and R. Van Reekum, “Discrepancies in cornell scale for depression in dementia (CSDD) items between residents and caregivers, and the CSDD’s factor structure,” Clinical Interventions in Aging, vol. 8, pp. 641–648, 2013. [Online]. Available: http://www.ncbi.nlm.nih.gov/pubmed/23766640 [42] K. K. Shankar, M. Walker, D. Frost, and M. W. Orrell, “The development of a valid and reliable scale for rating anxiety in dementia (RAID),” Aging & Mental Health, vol. 3, no. 1, pp. 39–49, feb 1999. [Online]. Available: http://www.tandfonline.com/doi/abs/10.1080/13607869956424 [43] S. C. Smith, D. L. Lamping, S. Banerjee, R. Harwood, B. Foley, P. Smith, J. C. Cook, J. Murray, M. Prince, E. Levin, A. Mann, and M. Knapp, “Measurement of health-related quality of life for people with dementia: development of a new instrument (DEMQOL) and an evaluation of current methodology.” pp. 1–93, iii–iv, mar 2005. [Online]. Available: http://www.ncbi.nlm.nih.gov/pubmed/15774233 [44] P. C. Fiss, “Quality of life of residents with dementia in long- term care settings in the Netherlands and Belgium: design of a longitudinal comparative study in traditional nursing homes and small- scale living facilities,” Academy of Management Review, vol. 32, no. 4, pp. 1190–1198, oct 2007. [Online]. Available: http://amr.aom.org/cgi/doi/10.5465/AMR.2007.26586092 [45] S. Borson and K. 
Doane, “The impact of OBRA-87 on psychotropic drug prescribing in skilled nursing facilities.” Psychiatric services Washington DC, vol. 48, no. 10, pp. 1289–1296, oct 1997. [Online]. Available: http://www.ncbi.nlm.nih.gov/pubmed/9323748 [46] M. Ambasna-Jones, “How social robots are dispelling myths and caring for humans,” p. 25 August 2016, 2016. [Online]. Available: https://www.theguardian.com/media-network/2016/may/09/ robots-social-health-care-elderly-children [47] M. Gombolay, X. Jessie Yang, B. Hayes, N. Seo, Z. Liu, S. Wadhwania, T. Yu, N. Shah, T. Golen, and J. Shah, “Robotic Assistance in Coordination of Patient Care,” in Robotics: Science and Systems XII, 2016, pp. 1–11. [Online]. Available: http://www.roboticsproceedings.org/rss12/p26.pdf

101 [48] M. Heerink, J. Albo-Canals, M. Valenti-Soler, P. Martinez-Martin, J. Zondag, C. Smits, and S. Anisuzzaman, “Exploring requirements and alternative pet robots for robot assisted therapy with older adults with dementia,” in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 8239 LNAI. Springer, Cham, 2013, pp. 104–115. [Online]. Available: http://link.springer.com/10.1007/978-3-319-02675-6_11 [49] M. M. Baun, N. Bergstrom, N. F. Langston, and L. Thoma, “Physiological effects of human/companion animal bond- ing,” Nursing Research, vol. 33, no. 3, pp. 126–129, 1984. [50] T. Shibata, T. Mitsui, K. Wada, A. Touda, T. Kumasaka, K. Tagami, and K. Tanie, “Mental commit robot and its application to therapy of children,” in IEEE/ASME International Conference on Advanced Intelligent Mechatronics, vol. 2, no. July. IEEE, 2001, pp. 1053–1058. [Online]. Available: http://ieeexplore.ieee.org/document/936838/ [51] K. Inoue, N. Sakuma, M. Okada, C. Sasaki, M. Nakamura, and K. Wada, “Effective application of PALRO: A humanoid type robot for people with dementia,” in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 8547 LNCS, no. PART 1, 2014, pp. 451–454. [52] Aldebaran-Robotics, “NAOqi Framework,” 2016. [Online]. Available: http://doc.aldebaran.com/2-5/index_dev_ guide.html [53] H. Gunes, O. Celiktutan, E. Sariyanidi, and E. Skordos, “Real - time Prediction of User Personality for Social Human - Robot Interactions : Lessons Learned from Public Demonstrations *.” [Online]. Available: https://www.researchgate.net/publication/281269426 [54] R. Khosla, K. Nguyen, M.-T. Chu, and Y.-A. Tan, “Robot Enabled Service Personalisation Based On Emotion Feedback,” Proceedings of the 14th International Conference on Advances in Mobile Computing and Multi Media - MoMM ’16, pp. 115–119, 2016. [Online]. Available: http://dl.acm.org/citation.cfm?doid=3007120.3007167 [55] N. J. Nilsson, The Quest for Artificial Intelligence, 2009. [Online]. Available: http://www.cambridge.org/us/ 0521122937 [56] S. Russell and P. Norvig, Artificial Intelligence A Modern Approach, 2013. [Online]. Available: http://aima.cs. berkeley.edu/ [57] M. Berkan Sesen, A. E. Nicholson, R. Banares-Alcantara, T. Kadir, and M. Brady, “Bayesian networks for clinical decision support in lung cancer care,” PLoS ONE, vol. 8, no. 12, p. e82349, dec 2013. [Online]. Available: http://dx.plos.org/10.1371/journal.pone.0082349 [58] A. Onisko, M. J. Druzdzel, and H. Wasyluk, “A Bayesian network model for diagnosis of liver disorders,” Proceedings of the Eleventh Conference on Biocybernetics and Biomedical Engineering, vol. 2, pp. 842–846, 1999. [Online]. Available: http://www.pitt.edu/~druzdzel/psfiles/cbmi99a.pdf [59] M. Z. Bell, “Why Expert Systems Fail,” J. Opl Res. Soc, vol. 36, no. 7, pp. 613–619, jul 1985. [Online]. Available: http://www.jstor.org/stable/2582480?origin=crossref [60] C. Roe, “Machine Learning Continues its Growth,” 2015. [Online]. Available: http://www.dataversity.net/ machine-learning-continues-its-growth/ [61] C. M. Bishop, Pattern Recognition and Machine Learning, 2006, vol. 4, no. 4. [Online]. Available: http: //www.library.wisc.edu/selectedtocs/bg0137.pdf [62] J. Brownlee, “Supervised and Unsupervised Machine Learning Algorithms - Ma- chine Learning Mastery,” 2016. [Online]. 
Available: http://machinelearningmastery.com/ supervised-and-unsupervised-machine-learning-algorithms/ [63] S. Institute, “Machine Learning: What it is and why it matters.” [Online]. Available: https://www.sas.com/en_be/ insights/analytics/machine-learning.html

102 [64] T. Matiisen, “Demystifying Deep Reinforcement Learning,” Web Page, no. February, 2015. [Online]. Available: http://neuro.cs.ut.ee/demystifying-deep-reinforcement-learning/ [65] J. Andrew Bagnell, “Reinforcement Learning in Robotics: A Survey,” Springer Tracts in Advanced Robotics, vol. 97, pp. 9–67, 2014. [66] N. Welsh, “What are the advantages of reinforcement learning algorithms such as LinUCB over other online CTR prediction algorithms such as online logistic regression?” 2014. [Online]. Available: https://www.quora.com/ What-are-the-advantages-of-reinforcement-learning-algorithms-such-as-LinUCB-over-other-online-CTR-prediction-algorithms-such-as-online-logistic-regression [67] T. J. Ferman, G. E. Smith, and B. Melom, “Understanding Behavioral Changes in Dementia,” Lewy Body Dementia Association, pp. 1–19, 2008. [Online]. Available: https://www.lbda.org/content/ understanding-behavioral-changes-dementia [68] R. S. Sutton and A. G. Barto, Reinforcement learning : an introduction, 2013, vol. 9, no. 5. [69] D. Silver and G. Deepmind, “Deep Reinforcement Learning Reinforcement Learning.” [Online]. Available: http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html [70] W. Li, X. Wang, R. Zhang, Y. Cui, J. Mao, and R. Jin, “Exploitation and exploration in a performance based contextual advertising system,” Proceedings of the 16th …, pp. 27–35, 2010. [Online]. Available: http://dl.acm.org/citation.cfm?id=1835811 [71] J. Langford and T. Zhang, “The Epoch-Greedy Algorithm for Contextual Multi-armed Bandits,” Nips, pp. 1–8, 2007. [72] Arthur Juliani, “Simple Reinforcement Learning with Tensorflow Part 1.5: Con- textual Bandits,” 2016. [Online]. Available: https://medium.com/emergent-future/ simple-reinforcement-learning-with-tensorflow-part-1-5-contextual-bandits-bff01d1aad9c [73] Y. Li, “Deep Reinforcement Learning: An Overview,” jan 2017. [Online]. Available: http://arxiv.org/abs/1701.07274 [74] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, feb 2015. [Online]. Available: http://www.nature.com/doifinder/10.1038/nature14236 [75] Y. Duan, J. Schulman, X. Chen, P. Bartlett, I. Sutskever, and P. Abbeel, “RLˆ2: Fast Reinforcement Learning Via Slow Reinforcement Learning,” arXiv, pp. 1–14, 2016. [Online]. Available: https://arxiv.org/pdf/1611.02779.pdf [76] Zorarobotics, “welcome to Zorabots - we make your life easier!” [Online]. Available: http://zorarobotics.be/index. php/en [77] E. R. A. Beattie, J. Song, and S. LaGore, “A Comparison of Wandering Behavior in Nursing Homes and Assisted Living Facilities,” Research and Theory for Nursing Practice, vol. 19, no. 2, pp. 181–196, 2005. [Online]. Available: http://www.ncbi.nlm.nih.gov/pubmed/16025697 [78] M. Dudik, D. Hsu, S. Kale, N. Karampatziakis, J. Langford, L. Reyzin, and T. Zhang, “Efficient Optimal Learning for Contextual Bandits,” Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence, pp. 208–214, 2011. [Online]. Available: http://arxiv.org/abs/1106.2369 [79] M. Dudík, J. Langford, and L. Li, “Doubly Robust Policy Evaluation and Learning,” Icml, p. 8, 2011. [Online]. Available: https://arxiv.org/pdf/1103.4601.pdf [80] Alison Palkhivala, “Understanding What Keeps Dementia Sufferers Awake at Night,” 2001. [Online]. 
Available: http://www.webmd.com/alzheimers/news/20010412/ understanding-what-keeps-dementia-sufferers-awake-at-night [81] CETpD, “FATE - Fall Detector for the Older,” pp. 3–5, 2015. [Online]. Available: http://fate.upc.edu/index.php

103 [82] Dexter Industries, “GrovePi Internet of Things Robot Kit.” [Online]. Available: https://www.dexterindustries.com/ grovepi/ [83] XETAL NV, “Xetal | The Digital Sixth Sense.” [Online]. Available: http://xetal.net/ [84] P. M. Hendryx-Bedalov, “Alzheimer’s dementia. Coping with communication decline.” Journal of gerontological nursing, vol. 26, no. 8, pp. 20–24, aug 2000. [Online]. Available: http://www.healio.com/doiresolver?doi=10.3928/ 0098-9134-20000801-06 [85] F. Van Bossche and .-p. v. Van de Weghe, Nico, “Wireless local positioning met bluetooth,” 2011. [Online]. Available: http://lib.ugent.be/catalog/rug01:001787516 [86] B. Galbraith, “Python library for Multi-Armed Bandits,” 2016. [Online]. Available: https://github.com/bgalbraith/ bandits [87] SciPy, “Obtaining NumPy,” 2017. [Online]. Available: https://www.scipy.org/scipylib/download.html# [88] Aldebaran, “Python SDK Install Guide,” 2017. [Online]. Available: http://doc.aldebaran.com/2-1/dev/python/ install_guide.html [89] J. Langford, “Vowpal Wabbit download page,” 2017. [Online]. Available: https://github.com/JohnLangford/ vowpal_wabbit/wiki/Download [90] Pydata, “Pandas Installation,” 2017. [Online]. Available: http://pandas.pydata.org/pandas-docs/version/0.19.2/ install.html [91] Plotly, “Getting Started with Plotly for R,” 2017. [Online]. Available: https://plot.ly/python/getting-started/ [92] Pypi, “h5py 2.6.0: Python Package Index,” 2017. [Online]. Available: https://pypi.python.org/pypi/h5py/2.6.0 [93] L. A. Wasser, “Hierarchical Data Formats - What is HDF5?” 2015. [Online]. Available: http://neondataskills.org/ HDF5/About [94] Influxdata, “Python client for InfluxDB,” 2017. [Online]. Available: https://github.com/influxdata/influxdb-python [95] Nodejs, “Download Node.js,” 2017. [Online]. Available: https://nodejs.org/en/download/ [96] K. Jefferies and N. Agrawal, “Early-onset dementia,” Advances in Psychiatric Treatment, vol. 15, no. 5, pp. 380–388, 2009. [Online]. Available: http://apt.rcpsych.org/content/15/5/380 [97] L. F. Da Silva Talmelli, F. D. A. Carvalho Do Vale, A. C. M. Gratão, L. Kusumota, and R. A. P. Rodrigues, “Doença de Alzheimer: Declínio funcional e estágio da demência,” ACTA Paulista de Enfermagem, vol. 26, no. 3, pp. 219–225, 2013. [98] M. Notes, “The timely diagnosis of dementia,” 2011. [Online]. Available: https://thinkgp.com.au/sites/default/ files/tincan_files/31892/story_content/external_files/dementia_notes.html [99] C. G. Atkeson, C. G. Atkeson, and J. C. Santamaria, “A Comparison of Direct and Model-Based Reinforcement Learning,” IN INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION, pp. 3557—-3564, 1997. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.54.956

Appendices

A REST API

B ADDITIONAL RESULTS

This appendix describes some small additional tests that were performed before the actual simulation tests defined in Chapter 8.

B.1 Hyper-Parameter Selection

Policies and agents have tunable parameters. This section gives additional information on how these parameters were optimised.
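The general procedure can be summarised as a simple sweep: each candidate parameter value is run against the simulated PwD for a fixed number of interventions, and the resulting cumulative regret values are compared. The sketch below only illustrates this loop; the `run_simulation` interface and the candidate values are assumptions, not the actual test harness.

```python
# Illustrative parameter sweep. run_simulation is assumed to exist and to
# return the cumulative regret after the given number of interventions for
# one virtual PwD, when the policy uses the supplied parameter value.

def select_best_parameter(candidates, run_simulation, interventions=100):
    """Return the candidate value with the lowest cumulative regret."""
    results = {}
    for value in candidates:
        results[value] = run_simulation(value, interventions)
    best = min(results, key=results.get)
    return best, results


# Hypothetical usage for the UCB confidence level c:
# best_c, regrets = select_best_parameter([0.01, 0.1, 0.5, 1.0, 2.0], run_simulation)
```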

B.1.1 UCB

The UCB policy has a tunable confidence level parameter c. Figure B.1 summarises a UCB parameter selection test for a single virtual PwD and an optimistic strategy. The first graph in this figure visualises the total cumulative regret after 100 interventions for different values of the c parameter in the UCB policy. For this test case, an optimal value is already found at c = 0.01.

Figure B.1: Detailed overview for both a gradient and normal agent when fine-tuning the hyper-parameters of the UCB policy for an optimal strategy.
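For reference, the sketch below shows how such a confidence level enters a standard UCB1-style selection rule. It is a generic formulation with illustrative variable names, not the exact code of the implemented agent.

```python
import numpy as np

def ucb_select(value_estimates, action_counts, t, c=0.01):
    """Pick the action maximising Q(a) + c * sqrt(ln(t) / N(a)).

    value_estimates: numpy array of estimated rewards per action.
    action_counts:   numpy array with the number of times each action was tried.
    t:               current time step (number of interventions so far).
    c:               confidence level; larger values favour exploration
                     (c = 0.01 was the best value found in the test above).
    """
    # Actions that were never tried get priority.
    untried = np.where(action_counts == 0)[0]
    if untried.size > 0:
        return int(untried[0])
    bonus = c * np.sqrt(np.log(t) / action_counts)
    return int(np.argmax(value_estimates + bonus))
```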

B.1.2 Epsilon-Greedy

The Epsilon-Greedy policy has a tunable parameter ϵ, which defines the balance between exploring and exploiting the action space. Figure B.2 summarises an Epsilon-Greedy parameter selection test for a single virtual PwD and an optimistic strategy. The first graph in this figure visualises the total cumulative regret after 100 interventions for different values of the ϵ parameter in the Epsilon-Greedy policy. For this test case, an optimal value is found at ϵ = 0.2.

Figure B.2: Detailed overview for a normal agent when fine-tuning the hyper-parameters of the Epsilon-greedy policy for an optimal strategy.
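For comparison, the Epsilon-Greedy selection step can be sketched as follows. Again, this is a generic formulation; the implemented policy may differ in details such as tie-breaking.

```python
import random

def epsilon_greedy_select(value_estimates, epsilon=0.2):
    """With probability epsilon explore a random action, otherwise exploit.

    value_estimates: list of estimated rewards, one per robotic action.
    epsilon:         exploration rate; 0.2 was the best value found in the
                     test above.
    """
    if random.random() < epsilon:
        return random.randrange(len(value_estimates))  # explore
    # Exploit: pick the action with the highest estimated reward.
    return max(range(len(value_estimates)), key=lambda a: value_estimates[a])
```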
