
USING FACIALLY EXPRESSIVE ROBOTS TO INCREASE REALISM IN PATIENT SIMULATION

A Dissertation

Submitted to the Graduate School of the University of Notre Dame in Partial Fulfillment of the Requirements for the Degree of

Doctor of Philosophy

by Maryam Moosaei

Laurel D. Riek, Director

Graduate Program in Computer Science and Engineering

Notre Dame, Indiana

July 2017

USING FACIALLY EXPRESSIVE ROBOTS TO INCREASE REALISM IN PATIENT SIMULATION

Abstract by Maryam Moosaei

Social robots are a category of robots that can socially interact with humans. The presence of these robots is growing in different fields such as healthcare, entertainment, assisted living, rehabilitation, and education. Within these domains, human-robot interaction (HRI) researchers have worked on enabling social robots to interact more naturally with people using different verbal and nonverbal channels. Considering that most human-human interaction is non-verbal, it is worth considering how to enable robots to both recognize and synthesize non-verbal behavior.

My research focuses on synthesizing facial expressions, a type of nonverbal behavior, on social robots. During my research, I developed several new algorithms for synthesizing facial expressions on robots and avatars, and experimentally explored how these expressions were perceived. I also developed a new control system for operators which automates synthesizing expressions on a robotic head. Additionally, I worked on building a new robotic head, capable of expressing a wide range of expressions, which will serve as a next generation of patient simulators.

My work explores the application of facially expressive robots in patient simulation. Robotic Patient Simulator (RPS) systems are the most common application of humanoid robots. They are human-sized robots that can breathe, bleed, react to medication, and convey hundreds of different biosignals. RPSs enable clinical learners to safely practice clinical skills without harming real patients; these skills can include patient communication, patient condition assessment, and procedural practice.

Although commonly used in clinical education, one capability of RPSs is in need of attention: enabling face-to-face interactions between RPSs and learners. Despite the importance of patients' facial expressions in making diagnostic decisions, commercially available RPS systems are not currently equipped with expressive faces. They have static faces that cannot express any visual pathological signs or distress. Therefore, clinicians can fall into the poor habit of not paying attention to a patient's face, which can result in them going down an incorrect diagnostic path.

One motivation behind my research is to make RPSs more realistic by enabling them to convey realistic, clinically-relevant, patient-driven facial expressions to clinical trainees. We have designed a new type of RPS with a wider range of expressivity, including the ability to express pain, neurological impairment, and other pathologies in its face. By developing expressive RPSs, our work serves as a next generation educational tool for clinical learners to practice face-to-face communication, diagnosis, and treatment with patients in simulation. As the application of robots in healthcare continues to grow, expressive robots are another tool that can aid the clinical workforce by considerably improving the realism and fidelity of patient simulation.

During the course of my research, I completed four main projects and designed several new algorithms to synthesize different expressions and pathologies on RPSs. First, I designed a framework for generalizing the synthesis of facial expressions on robots and avatars with different degrees of freedom. Then, I implemented this framework as an ROS-based module and used it for performing facial expression synthesis on different robots and virtual avatars.
Currently, researchers cannot transfer the facial expression synthesis software developed for a particular robot to another robot, due to differences in hardware. Using ROS, I developed a general solution that is one of the first attempts to address this problem.

Second, I used this framework to synthesize patient-driven facial expressions of pain on virtual avatars, and studied the effect of an avatar's gender on pain perception. We found that participants were able to distinguish pain from commonly conflated expressions (anger and disgust), and showed that patient-driven pain synthesis can be a valuable alternative to laborious key-frame animation techniques. This was one of the first attempts to perform automatic synthesis of patient-driven expressions on avatars without the need for an animation expert. Automatic patient-driven facial expression synthesis will reduce the time and cost needed by operators of RPS systems.

Third, I synthesized patient-driven facial expressions of pain on a humanoid robot and its virtual model, with the objective of exploring how having a clinical education can affect pain detection accuracy. We found that clinicians have lower overall accuracy in detecting synthesized pain in comparison with lay participants. This supported other findings in the literature, showing that there is a need to improve clinical learners' skills in decoding expressions in patients. Furthermore, we explored the effect of embodiment (robot, avatar) on pain perception by both clinicians and lay participants. We found that all participants are overall less accurate in detecting pain from a humanoid robot in comparison to a comparable virtual avatar. Considering these effects is important when using expressive robots and avatars as an educational tool for clinical learners.

Fourth, I developed a computational (mask) model of atypical facial expressions such as those caused by stroke and Bell's palsy. We used our computational model to perform masked synthesis on a virtual avatar and ran an experiment to evaluate the similarity of the synthesized expressions to real patients. Additionally, we used our mask model to design a shared control system for controlling an expressive robotic face.

My work has multiple contributions for both the HRI and healthcare communities. First, I explored a novel application of facially expressive robots, patient simulation, which is a relatively unexplored area in the HRI literature. Second, I designed a generalized framework for synthesizing facial expressions on robots and avatars. My third contribution was to design new methods for synthesizing naturalistic patient-driven expressions and pathologies on RPSs. Experiments validating my approach showed that the embodiment or gender of the RPS can affect perception of its expressions. Fourth, I developed a computational model of atypical expressions and used this model to synthesize Bell's palsy on a virtual avatar. This model can be used as part of an educational tool to help clinicians improve their diagnosis of conditions like Bell's palsy, stroke, and Parkinson's disease. Finally, I used this mask model to develop a shared control system for controlling a robotic face. This control model automates the job of controlling an RPS's expressions for simulation operators.

My work enables the HRI community to explore new applications of social robots and expand their presence in our daily lives. Moreover, my work will inform the design and development of the next generation of RPSs that are able to express visual signs of diseases, distress, and pathologies. By providing better technology, we can improve how clinicians are trained, and this will eventually improve patient outcomes.

CONTENTS

FIGURES

TABLES

ACKNOWLEDGMENTS

CHAPTER 1: INTRODUCTION
  1.1 Motivation and Scope
  1.2 Contribution
  1.3 Publications
  1.4 Ethical Procedures
  1.5 Dissertation Overview

CHAPTER 2: BACKGROUND
  2.1 Emotional Space
  2.2 Common Techniques for Synthesizing Facial Expressions
    2.2.1 Facial Feature Tracking and Extraction
    2.2.2 Keyframe Synthesis
    2.2.3 Performance Driven Synthesis
  2.3 Synthesizing Facial Expressions on Patient Simulators
  2.4 Chapter Summary

CHAPTER 3: EVALUATING FACIAL EXPRESSION SYNTHESIS ON ROBOTS
  3.1 Background
    3.1.1 Subjective Evaluation
    3.1.2 Expert Evaluation
    3.1.3 Computational Evaluation
  3.2 Proposed Method Based on Combining Subjective and Computational Evaluation Methods
  3.3 Challenges in Synthesis Evaluation
  3.4 Chapter Summary

CHAPTER 4: PAIN SYNTHESIS ON VIRTUAL AVATARS
  4.1 Background
  4.2 Methodology
    4.2.1 Source Video Acquisition
    4.2.2 Feature Extraction
    4.2.3 Avatar Model Creation
    4.2.4 Stimuli Creation and Labeling
    4.2.5 Main Experiment
  4.3 Results
    4.3.1 Summary of Key Findings
    4.3.2 Regression Method
  4.4 Discussion
  4.5 Chapter Summary
  4.6 Acknowledgment of Work Distribution

CHAPTER 5: PAIN SYNTHESIS ON ROBOTS
  5.1 Introduction
    5.1.1 Our Work and Contribution
  5.2 Methodology
    5.2.1 Background
    5.2.2 Overview of Our Work
    5.2.3 Source Video Acquisition
    5.2.4 Feature Extraction
    5.2.5 Humanoid Robot
    5.2.6 Avatar Model Creation
    5.2.7 Stimuli Creation and Labeling
    5.2.8 Pilot Study
    5.2.9 Main Experiment
  5.3 Results
    5.3.1 Regression Method
    5.3.2 ANOVA Method
  5.4 Discussion
  5.5 Chapter Summary
  5.6 Acknowledgment of Work Distribution

CHAPTER 6: DESIGNING AND CONTROLLING AN EXPRESSIVE ROBOT
  6.1 Proposed Method
    6.1.1 ROS Implementation
    6.1.2 An Example of Our Method
  6.2 Validation
    6.2.1 Simulation-based Evaluation
    6.2.2 Physical Robot Evaluation
  6.3 General Discussion

  6.4 Chapter Summary
    6.4.1 Acknowledgment of Work Distribution

CHAPTER 7: MASKED SYNTHESIS
  7.1 Background
  7.2 Impact
  7.3 Research Questions
  7.4 Acquire Videos of Patients with Stroke and BP and Extract Facial Features Using CLMs
  7.5 Build Computational Models of Stroke and Bell's Palsy
    7.5.1 Find the Scaling Parameters βi for Bell's Palsy
  7.6 Test and Validate the Developed Mask Models
    7.6.1 Source Video Acquisition
    7.6.2 Feature Extraction
    7.6.3 Stimuli Creation
    7.6.4 Exploratory Case Study
    7.6.5 Data Analysis
    7.6.6 Results
    7.6.7 Discussion
  7.7 Develop a Shared Control System with Different Functionality Modes Using the Models from RQ1
    7.7.1 Manual Control
    7.7.2 Live Puppeteering
    7.7.3 Pre-recorded Puppeteering
    7.7.4 Masked Synthesis
  7.8 Chapter Summary
  7.9 Acknowledgment of Work Distribution

CHAPTER 8: CONCLUSION
  8.1 Contributions
    8.1.1 Developed a Method for Evaluating Facial Expression Synthesis Algorithms on Robots and Virtual Avatars
    8.1.2 Developed an ROS-based Module for Performing Facial Expression Synthesis on Robots and Virtual Avatars
    8.1.3 Synthesized Patient Driven Facial Expressions on Patient Simulators
    8.1.4 Investigated Usage of Facially Expressive Robots to Calibrate Clinical Pain Perception
    8.1.5 Created a Computational Model of Atypical Facial Expressions
  8.2 Future Work
    8.2.1 Design a Control Interface for Our Shared Control Model
    8.2.2 Conduct Additional Experiments to Evaluate Masked Synthesis and Shared Control Model

    8.2.3 Design New Clinical Educational Paradigms That Include Practicing Face-to-Face Communication Skills Using Our Bespoke Robotic Head
    8.2.4 Synthesize Speech for Our Bespoke Robotic Head
    8.2.5 Improving Clinical Educators' Cultural Competency Using Simulation with Our Bespoke Robotic Head
  8.3 Open Questions
    8.3.1 Is Clinicians' Lower Accuracy in Overall Facial Expression Detection a Result of the Decline in Empathy or Practicing with Inexpressive Patient Simulators?
    8.3.2 What Is the Long Term Effect of Practicing with Expressive Patient Simulators on Patient Outcomes and Preventable Harm?
    8.3.3 Is It Possible to Make Robotic Patient Simulators Fully Autonomous?
  8.4 Closing Remarks

BIBLIOGRAPHY

FIGURES

2.1 Examples of robots with facial expressivity used in HRI research with varying degrees of freedom. A: EDDIE, B: Sparky, C: Kismet, D: Nao, E: M3-Synchy, F: Bandit, G: BERT2, H: KOBIAN, F: Flobi, K: Diego, M: ROMAN, N: Eva, O: Jules, P: Geminoid F, Q: Albert HUBO, R: Repliee Q2.
2.2 A humanoid robot with 32 motors which are positioned based on FACS [116].
2.3 Left: two dimensional emotion space defined by Arousal and Valence. Right: three dimensional emotion space defined by Arousal, Valence and Stance.
2.4 Right: an example of markerless face tracking, left: an example of marker based face tracking [183].
2.5 Left: A team of clinicians treat a simulated patient, who is conscious during the simulation, but has no capability for facial expression. Right: A commonly used inexpressive mannequin head.

3.1 Subjective evaluation example from Becker-Asano and Ishiguro [18]; left: Geminoid F (left) and its model person (right), right: the interface.
3.2 Subjective evaluation example from Mazzei et al. [116].
3.3 An example of a side-by-side comparison. On the left is the real image, and on the right is the synthesized [190].
3.4 E-smiles-creator [133].

4.1 The spectrum of avatar genders we created, with the final three highlighted. From right to left: male, androgynous, female.
4.2 Sample frames from the stimuli videos and their corresponding source videos, with CLM meshes. The pain source videos are from the UNBC-McMaster pain archive [113]; the others are from the MMI database [137].

5.1 Sample frames from the BP4D database. From left to right: disgust, happiness, pain, anger.
5.2 Left: the created avatar model similar to PKD, right: PKD robot.
5.3 The CLM-based face tracker extracts 66 facial features.

5.4 Sample frames from the stimuli videos and their corresponding source videos, with CLM meshes. From left to right: pain, anger, disgust. From top to bottom: source videos, PKD robot, and avatar.

6.1 Overview of our proposed method.
6.2 Left: the 66 facial feature points of CLM face tracker, Right: an example robotic eyebrow with one degree of freedom and an example robotic eyeball with two degrees of freedom.
6.3 Sample frames from the stimuli videos and their corresponding source videos, with CLM meshes. The pain source videos are from the UNBC-McMaster pain archive [113]; the others are from the MMI database [137].

7.1 The setting for performing masked synthesis.
7.2 An example of a control station at a simulation center.
7.3 The facial weakness in BP usually involves the eyes, mouth, and forehead.
7.4 Sample frames from a source video presenting different facial expressions of a patient with Bell's palsy.
7.5 Assuming the person did not have atypical expressions, I calculated the 2D coordinates of feature points on the affected side of the face.
7.6 Sample frames from three patients with Bell's palsy. We used data from these patients to develop our three mask models.
7.7 Sample frames from the stimuli videos for the first part of the study. Upper row shows unmasked expressions and lower row shows masked expressions. From left to right: raising cheeks, furrowing brow, closing eyes, smiling, and raising eyebrows.
7.8 An example of a manual control interface for a virtual avatar from the Steam Source SDK.
7.9 Live puppeteering algorithm in our shared control system.

TABLES

4.1 FREQUENCIES AND PERCENTAGES OF HITS AND ERRORS IN THE MAIN STUDY
4.2 OMNIBUS TESTS OF MODEL COEFFICIENTS
4.3 VARIABLES IN THE REGRESSION EQUATION

5.1 FREQUENCIES AND PERCENTAGES OF HITS AND ERRORS IN THE MAIN STUDY
5.2 OMNIBUS TESTS OF MODEL COEFFICIENTS
5.3 INDEPENDENT VARIABLES IN THE REGRESSION EQUATION AND THEIR INDIVIDUAL EFFECTS ON THE CLASSIFICATION TASK
5.4 UNIVARIATE ANALYSIS OF VARIANCE (TESTS OF BETWEEN-SUBJECTS EFFECTS)

6.1 THE FACIAL PARTS ON THE ROBOT, AND CORRESPONDING SERVO MOTORS, AND CLM TRACKER POINTS
6.2 FULL RESULTS FOR EACH OF THE 10 FACE-SECTION EXPRESSIONS

7.1 FULL RESULTS FOR EACH FACIAL PART AND THE ENTIRE FACE FOR THE DIAGNOSTIC STIMULI VIDEOS ON A 4-POINT DVAS
7.2 FULL RESULTS FOR SYNTHESIZED NATURALISTIC EXPRESSIONS ON A 4-POINT DVAS

ACKNOWLEDGMENTS

Many people have supported me during my time as a PhD student, and I would like to thank them all.

First and foremost, I would like to thank my parents for their countless sacrifices to provide me the best possible education. They never stopped encouraging me and my brother to continue our education, even when each of us was attending graduate school on a different continent. I am forever thankful to them for always being supportive, especially during all the years that I have spent far away from home.

Of course, all of the work in this dissertation would not have been possible without the guidance of my advisor, Dr. Laurel Riek. She has contributed significantly to guiding my research and training me as a researcher. I thank her for her constant support and for giving me this amazing opportunity. I believe that as a researcher, I have grown in many ways that would not have happened without her.

I also thank all of my friends and labmates here at Notre Dame, particularly Michael J. Gonzales, Cory C. Hayes, Angelique Taylor, Darren Chan, and Tariq Iqbal, who were with me during both the good and the very stressful times in graduate school. I greatly appreciate the University of Notre Dame for the sense of community and support throughout its entire campus and for helping students stay motivated.

I thank Margot Hughan, from the Mechanical Engineering department at the University of Notre Dame, for contributing significantly to designing and building the hardware for our bespoke robotic head. I also thank Dr. Richard Strebinger, Associate Professional Specialist in the Mechanical Engineering department at the University of Notre Dame, for providing guidance on the mechanical design of the robot skull. He has been a valuable mentor

throughout the project.

I would also like to thank my committee members for taking the time to review this document and provide valuable feedback: Dr. Patrick J. Flynn, Dr. Jane Cleland-Huang, Dr. Bill Smart, and Dr. Ronald Metoyer.

Finally, this material is based upon work supported by the National Science Foundation under Grant No. IIS-1253935, Dr. Riek's NSF CAREER award entitled “Next Generation Patient Simulators”.

CHAPTER 1

INTRODUCTION

1.1 Motivation and Scope

Robotics as a field is expanding into many new sectors, such as healthcare, entertainment, education, assisted living, rehabilitation, and transportation. As robots find a stronger presence in our daily lives, their effect on society increases, which opens up new areas of research, including human-robot interaction (HRI), robot ethics, and robot perception. One area of particular interest is human-robot collaboration (HRC), where humans and robots will be working together in close proximity [85, 88].

Ideally, we would like robots to be capable partners, able to perform tasks independently and to effectively communicate their intentions toward us. A number of researchers have successfully designed robots in this area, including museum-based robots that can provide tours [51], nurse robots that can automatically record a patient's bio-signals and report the results [110], wait staff robots which can take orders and serve food [57], and toy robots which entertain and play games with children [172].

To facilitate HRC, it is vital that robots have the ability to convey their intention during interactions with people. In order for robots to appear more approachable and trustworthy, researchers must create robot behaviors that are easily decipherable by humans. These behaviors will help express a robot's intention, which will facilitate understanding of current robot actions or the prediction of actions that a robot will perform in the immediate future. Additionally, allowing a person to understand and predict robot behavior will lead to more efficient social interactions [61, 65, 76, 175].

HRI researchers have explored the domain of expressing robot intention by synthesizing robot behaviors, particularly those expressible through body movements, that are human-like and therefore more readily understandable [5, 52, 100, 129]. While many robots like the PR2 [173] are highly dexterous and can express intention through a wide range of body movements, one noticeable limitation is that there are some subtle cues that cannot easily be expressed without at least some facial features, such as confusion, frustration, boredom, and attention [58]. Indeed, the human face is a rich, spontaneous channel for the communication of social and emotional displays and serves an important role in human communication. Thus, robot behavior that includes at least some rudimentary, human-like facial expressions can enrich the interaction between humans and robots and add to a robot's ability to convey intention [122].

Robots able to convey facial expressions can be used in several domains. Applications for facially expressive robots and virtual avatars include entertainment (e.g., film animation and video games), education (e.g., virtual learning and teaching), and clinical and biomedical studies (e.g., patient simulation and autism therapy).

In my work, I focus on a specific clinical application of facially expressive robots: patient simulation. Practicing with patient simulators before treating real patients is very common in the United States for learning different clinical skills, such as communication, assessment of patient condition, and procedural practice. Despite the importance of patients' facial expressions in making diagnostic decisions, commercially available patient simulators are not equipped with an expressive face. My work focuses on filling this technology gap by applying our knowledge to make patient simulators more realistic.

This dissertation addresses this technology gap in patient simulation through an exploration of the following research questions.

• Can we build a facially expressive robotic head that is suitable for use in a clinical setting as an expressive patient simulator?

• How can we automatically synthesize a wide range of patient-driven expressions and pathologies on humanoid robots and virtual avatars?

• How do people (clinicians and lay) perceive the synthesized expressions on patient simulators?

• How can we develop a computational model of atypical facial expressions and use it to perform masked synthesis on robots and avatars?

• How can we develop a shared control system for simulation operators to automati- cally control a robot’s face?

1.2 Contribution

The primary contributions of this dissertation are as follows:

• Developed a method for evaluating different facial expression synthesis algorithms on robots and virtual avatars. A standard evaluation method is necessary to ensure that the expressions synthesized using our algorithms are perceived as intended. Currently, subjective evaluation and computationally-based evaluations are the two most commonly used methods in HRI for evaluating different facial expression synthesis methods. I identify the advantages and weaknesses of these two methods and describe a method that combines the advantages of both.

• Worked on building an expressive robotic head, as a next generation patient simulator. We have been building our own bespoke robotic head with 16 degrees of freedom on its face, with a wide range of expressivity, including the ability to express pain, neurological impairment, and other pathologies in its face. By developing expressive robotic patient simulators (RPSs), our work serves as a next generation educational tool for clinical learners to practice face-to-face communication with patients. As the application of robots in healthcare continues to grow, expressive robots are another tool that can aid the clinical workforce by considerably improving the realism and fidelity of patient simulation.

• Developed an automatic module for performing facial expression synthesis on robots and avatars. We developed an ROS-based module for performing facial expression synthesis on a wide range of robots and virtual avatars. This module, which serves as the software for our bespoke robotic head, is robust and is not tied to a specific number of degrees of freedom. HRI and clinical researchers can use this open-source software system to perform low-cost, realistic facial expression synthesis on their robots and avatars. (A minimal illustrative sketch of such an ROS node appears after this list.)

• Synthesized patient-driven facial expressions on a humanoid robot and an avatar. While finishing the hardware of our robot, we began to test our algorithms in simulation with an avatar as well as another humanoid robot. We synthesized naturalistic painful facial expressions on a humanoid robot and its virtual model, and answered various research questions regarding their perception by both lay participants and clinicians.

• Developed a mask model for atypical expressions. We developed a computational model of atypical expressions and used this model to synthesize Bell's palsy on a virtual avatar. To validate our model, we compared the similarity of the avatar's expressions to real patients. We also used our developed model to build a new shared control system for operators of expressive robots and virtual avatars.
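To give a sense of the shape of such an ROS-based module, the following is a minimal sketch of a node that subscribes to tracked facial feature points and publishes servo commands. The topic names, message layout, and linear mapping below are illustrative assumptions for this sketch, not the actual interface of our module.

```python
#!/usr/bin/env python
# Minimal sketch of an ROS node that maps tracked facial feature points to
# servo commands. Topic names, message layout, and the mapping matrix are
# illustrative placeholders, not the actual interface of our module.
import numpy as np
import rospy
from std_msgs.msg import Float64MultiArray

N_LANDMARKS = 66   # e.g., a CLM-style tracker with 66 (x, y) points
N_SERVOS = 16      # e.g., a head with 16 degrees of freedom

# Placeholder linear mapping from flattened landmark coordinates to servo
# positions; in practice this would be learned or hand-designed per robot.
W = np.zeros((N_SERVOS, 2 * N_LANDMARKS))
b = np.full(N_SERVOS, 90.0)  # neutral pose (assumed servo units of degrees)

def on_landmarks(msg):
    """Convert one frame of tracked landmarks into one frame of servo commands."""
    x = np.asarray(msg.data, dtype=float)        # flattened (x1, y1, ..., xN, yN)
    servo = np.clip(W.dot(x) + b, 0.0, 180.0)    # keep commands in servo range
    pub.publish(Float64MultiArray(data=servo.tolist()))

if __name__ == "__main__":
    rospy.init_node("facial_expression_synthesis")
    pub = rospy.Publisher("servo_commands", Float64MultiArray, queue_size=10)
    rospy.Subscriber("facial_landmarks", Float64MultiArray, on_landmarks)
    rospy.spin()
```

Because the node only exchanges generic landmark and servo arrays, the same structure can, in principle, drive robots or avatars with different numbers of degrees of freedom by swapping the mapping.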

1.3 Publications

Some of the work presented in this dissertation is based on the following publications. In addition, some of the work in this dissertation is based on papers that have yet to be published.

• Moosaei, M., Das, S.K., Popa, D.O., and Riek, L.D. “Using Facially Expressive Robots to Calibrate Clinical Pain Perception”, In Proc. of the 12th ACM/IEEE International Conference on Human Robot Interaction (HRI), 2017. [Best Paper Award Nominee][Acceptance Rate: 24%]

• Hayes, C.J., Moosaei, M., and Riek, L.D. “Exploring Implicit Human Responses to Robot Mistakes in a Learning from Demonstration Task”. IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN 2016), pp. 1-6, 2016.

• Iqbal, T., Moosaei, M., and Riek, L.D. “Tempo Adaptation and Anticipation Methods for Human-Robot Teams”. In Proceedings of the Robotics: Science and Systems (RSS), Planning for Human-Robot Interaction: Shared Autonomy and Collaborative Robotics Workshop, 2016.

• Moosaei, M., and Riek, L.D. “Pain Synthesis for Humanoid Robots”, Midwest Robotics Workshop (MWRW) at the Toyota Technical Institute of Chicago (TTIC), March 2016.

• Moosaei, M., Hayes, C.J., and Riek, L.D. “Performing Facial Expression Synthesis on Robot Faces: A Real-time Software System”, In Proceedings of the 4th International AISB Symposium on New Frontiers in Human-Robot Interaction, AISB, 2015.

• Moosaei, M., Hayes, C.J., and Riek, L.D. “Facial Expression Synthesis on Robots: An ROS Module.” Proceedings of the Tenth Annual ACM/IEEE International Conference on Human-Robot Interaction Extended Abstracts, 2015.

• Moosaei, M., Gonzales, M.J., and Riek, L.D. “Naturalistic Pain Synthesis for Virtual Patients”, In Proceedings of the 14th International Conference on Intelligent Virtual Agents (IVA), 2014. [Acceptance Rate: 17%].

• Gonzales, M.J., Moosaei, M., and Riek, L.D. “A Novel Method for Synthesizing Naturalistic Pain on Virtual Patients”, In Proceedings of the International Journal of Simulation in Healthcare, Vol. 8, Issue 6. [Best Overall Paper and Best Student Paper, Technological Innovation, IMSH 2014].

• Moosaei, M. and Riek, L.D. “Evaluating Facial Expression Synthesis on Robots”, In Proceedings of the HRI Workshop on Applications for Emotional Robots at the 8th ACM/IEEE International Conference on Human-Robot Interaction (HRI), 2013.

1.4 Ethical Procedures

Human subject experiments described in this dissertation were formally reviewed by the Institutional Review Board (IRB) at the University of Notre Dame. The IRB found the experiments straightforward to review as they did not involve minors, cause any harm to participants, or alter anything with regard to participants’ normal working environments. Informed consent was collected from all participants in all experiments. Participants in all the experiments were compensated $5-10 for their participation in our studies. All participant data was stored securely and was anonymized.

1.5 Dissertation Overview

This dissertation is organized as follows:

• Chapter 2: presents different methodologies and concepts for synthesizing facial expressions on robots, and describes the motivation behind my research.

• Chapter 3: discusses our proposed method for evaluating synthesized facial expressions on robots by incorporating both subjective as well as computationally-based evaluation techniques.

• Chapter 4: presents our work on synthesizing patient-driven facial expressions of pain on virtual avatars and on studying the effect of the avatar’s gender on pain perception.

• Chapter 5: presents our work on synthesizing facial expressions of pain on a humanoid robot. It also describes our study on how a clinical education background affects pain perception.

• Chapter 6: outlines development of the software for our bespoke robotic head, which includes an ROS module for synthesizing facial expressions on a wide range of robots and virtual avatars.

• Chapter 7: presents the development of a computational model of atypical facial expressions and the use of this model to perform masked synthesis on a virtual avatar. It also describes the development of a shared control system for controlling a robotic head.

• Chapter 8: describes the main contributions of this dissertation and discusses plans for future work. It also outlines key open questions that need to be addressed in order to improve realism in patient simulation by using facially expressive robots.

CHAPTER 2

BACKGROUND

Robotics is a rapidly growing industry. Robots are becoming more common in our daily lives, and will continue to do so over the next decade. In the near future, robots and people will be working side-by-side, e.g., they will have proxemic interaction [86, 87, 144, 145]. There is a desire among robotics researchers and designers to enable robots to communicate quickly and effectively with humans [78, 79, 176].

While there is some debate in the HRI community about the most appropriate and adequate form of communication between robots and humans (e.g., human-like communication versus non-human-like), non-verbal behaviors seem to be a very fast way of communicating between humans [145]. Humans are expert and very fast at reading each other's nonverbal behaviors. Therefore, it is worth thinking about enabling robots to express human-like non-verbal behaviors.

Humans employ several kinds of informative signals for communicating with each other, including changing the prosody of their voice, performing gestures, shifting their posture, signaling with their head and eye movements, and creating facial expressions. Of all of these communication modalities, the face is the most spontaneous channel for the rapid communication of social and emotional displays. Facial expressions can be used to enhance conversation, show empathy, and acknowledge the actions of others [13]. They can be used to convey not only basic emotions such as fear, sadness, and disgust, but also complex cognitive mental states, such as confusion, contemplation, and nostalgia.

Having a human-like face may allow robots to communicate and interact with humans more effectively [108]. In contrast, robots with static faces and uniform facial expressions

may be less engaging and seem unnatural. Robots with realistic facial movements can be useful across several domains, such as film animation, teleconferencing, patient simulation, virtual collaboration work, virtual learning and teaching, and clinical and biomedical studies (e.g., stress/pain monitoring and autism therapy).

To synthesize a natural face for robots, one may need to model human facial expressions; head and eye movements; gaze behaviors and blinking frequency; teeth, lip, and tongue motion; and even skin. There are several issues that make facial expression synthesis complicated. The face has a complicated geometry, and each part, including the texture, color, and wrinkles, plays a role in communicating an expression.

Facially expressive robots are usually built based on the structure of a human face. As figure 2.1 shows, facially expressive robots vary considerably in their morphology, ranging from toy-like faces to childlike faces to very realistic adult human faces [17, 77, 130, 162]. A robot face might be as simple as having two eyes with moveable eyelids, tilting eyebrows, and upper and lower lips that can be moved up or down from the center of the mouth (see figure 2.1, B). Such a robot is able to show basic simple expressions like happiness and sadness. However, as we move toward the more human-like robot faces in figure 2.1, deeper knowledge about the structure of the human face and its muscle functions is required.

The Facial Action Coding System (FACS) proposed by Ekman is a guide for studying human facial expressions based on the anatomical properties of the face. Action Units (AUs) are the building blocks of FACS. Each AU refers to a single muscle or a combination of several muscles. FACS distinguishes different movements of the face based on the activity of facial muscles. Ekman writes that each facial expression is generated by one or more AUs [55]. For devising a facial expression synthesis system based on muscle actions, it is vital to identify and classify AUs. The groups of AUs that are activated for each of the basic facial expressions are known in the literature. In designing a human-like face, roboticists try to position the servo motors based on the AUs according to FACS. Figure 2.2 shows an example of a robot with 32 motors on its face that are positioned based on action units [116].

Figure 2.1. Examples of robots with facial expressivity used in HRI research with varying degrees of freedom. A: EDDIE, B: Sparky, C: Kismet, D: Nao, E: M3-Synchy, F: Bandit, G: BERT2, H: KOBIAN, F: Flobi, K: Diego, M: ROMAN, N: Eva, O: Jules, P: Geminoid F, Q: Albert HUBO, R: Repliee Q2.
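To make this mapping concrete, the sketch below writes down commonly cited EMFACS-style AU prototypes for the six basic expressions (exact AU sets vary slightly across sources) together with a hypothetical AU-to-servo table of the kind used when motors are positioned according to FACS; the servo indices are placeholders, not the wiring of any particular robot.

```python
# Illustrative sketch: EMFACS-style AU prototypes for the six basic expressions
# (exact AU sets vary slightly across sources) and a hypothetical AU-to-servo
# table of the kind used when motors are placed according to FACS.
BASIC_EXPRESSION_AUS = {
    "happiness": [6, 12],               # cheek raiser, lip corner puller
    "sadness":   [1, 4, 15],            # inner brow raiser, brow lowerer, lip corner depressor
    "surprise":  [1, 2, 5, 26],         # brow raisers, upper lid raiser, jaw drop
    "fear":      [1, 2, 4, 5, 20, 26],  # brow raisers/lowerer, upper lid raiser, lip stretcher, jaw drop
    "anger":     [4, 5, 7, 23],         # brow lowerer, lid raiser/tightener, lip tightener
    "disgust":   [9, 15, 16],           # nose wrinkler, lip corner depressor, lower lip depressor
}

# Hypothetical mapping from AUs to servo indices on a robot whose motors are
# positioned according to FACS (placeholder indices, not a real robot's wiring).
AU_TO_SERVOS = {1: [0, 1], 2: [2, 3], 4: [4], 5: [5, 6], 6: [7, 8], 7: [9],
                9: [10], 12: [11, 12], 15: [13], 16: [14], 20: [13, 14],
                23: [15], 26: [15]}

def servos_for(expression):
    """Return the sorted set of servo indices driven for a basic expression."""
    return sorted({s for au in BASIC_EXPRESSION_AUS[expression]
                     for s in AU_TO_SERVOS.get(au, [])})

print(servos_for("happiness"))  # [7, 8, 11, 12] with the placeholder table above
```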

2.1 Emotional Space

Ekman's FACS led to a theory of categorical expressions. Based on this theory, all of the expressions are combinations of happiness, sadness, anger, fear, disgust, and surprise [55]. Some robots have been designed based on the categorical nature of expressions [38, 189]. The expressivity of a robotic face is commonly evaluated by measuring the recognition rate of Ekman's six basic emotions [99, 165].

Figure 2.2. A humanoid robot with 32 motors which are positioned based on FACS [116].

Russell and Fernandez-Dols suggested another theory for explaining emotions with a two-dimensional space that has valence and arousal as its axes [158]. Schiano et al. proposed that a three-dimensional space is the most robust model of facial expressions [163]. Breazeal referred to this third axis as “stance”, and used it to present her Kismet robot [31]. Breazeal suggested that the role of the third dimension is to disambiguate surprise, fear, disgust, and anger. For both human and robotic faces, these are expressions that are difficult to recognize [14, 17]. Breazeal proposed that the stance dimension is defined by two expressions: “openness” in the positive direction and “closedness/sternness” in the negative direction [33]. In this way anger, fear, and disgust are each given their own location in the affect space. They all have negative valence and high arousal, which is why the third axis is required to distinguish them.

Figure 2.3. Left: two-dimensional emotion space defined by Arousal and Valence. Right: three-dimensional emotion space defined by Arousal, Valence, and Stance.

Figure 2.3 shows a two-dimensional emotional space versus a three-dimensional emotional space. Each emotion in this three-dimensional space is a vector (A, V, S) (Arousal, Valence, Stance). This vector represents a point in the emotional space, and the neutral face is the origin of this space. Each of the six basic facial expressions is also a point in this cubical space.
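As a simple illustration of this representation, the sketch below places a few expressions at assumed coordinates in the (Arousal, Valence, Stance) cube (the exact coordinates are placeholders, not values from the literature) and scales an expression's intensity by moving along the vector from the neutral origin.

```python
import numpy as np

# Illustrative (Arousal, Valence, Stance) coordinates; the neutral face is the
# origin. The specific numbers are placeholders, not values from the literature.
EMOTIONS = {
    "happiness": np.array([ 0.5,  0.8,  0.6]),
    "sadness":   np.array([-0.5, -0.6,  0.0]),
    "surprise":  np.array([ 0.8,  0.1,  0.7]),
    "anger":     np.array([ 0.7, -0.7, -0.6]),
    "fear":      np.array([ 0.8, -0.7,  0.3]),
    "disgust":   np.array([ 0.6, -0.6, -0.4]),
}
NEUTRAL = np.zeros(3)

def scaled(emotion, intensity):
    """Move along the vector from neutral toward an emotion (0 = neutral, 1 = full)."""
    return NEUTRAL + intensity * (EMOTIONS[emotion] - NEUTRAL)

def blend(e1, e2, w):
    """Linear blend of two emotion points, e.g., a surprised-happy mixture."""
    return (1.0 - w) * EMOTIONS[e1] + w * EMOTIONS[e2]

print(scaled("anger", 0.5))           # half-intensity anger
print(blend("happiness", "surprise", 0.3))
```

Note that in these placeholder coordinates anger, fear, and disgust share negative valence and high arousal and are separated only along the stance axis, which is exactly the role Breazeal proposed for the third dimension.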

2.2 Common Techniques for Synthesizing Facial Expressions

2.2.1 Facial Feature Tracking and Extraction

In most facial expression synthesis systems, there is a need to detect facial points and track them over time. Tracking, extraction, and recognition might be generally useful to serve two different purposes. The first purpose is to enable a robot to imitate expressions of a performer’s face [73, 130]. This purpose requires that the robot detects the user’s face, tracks the face over time, extracts facial points, and translates these features to its own servo movements. The second purpose is to enable an interacting robot to identify the user’s mode and show one of its predefined expressions with an equivalent mode. Basically, in the sec- ond case, tracking and extraction algorithms are not used to synthesize expressions on the

11 robot’s face but for detecting the user’s mood. Here the robot interacts with users and per- ceives input information from the user via a camera [17, 34]. In this scenario, the robot detects a user’s face, tracks it over time, extracts features, and detects the user’s emotional state or mode (e.g. happy, sad, etc.). Then, based on the perceived emotion, the robot shows one of the limited number of its predefined expressions (e.g. smiles). Tracking can be achieved with a high degree of accuracy by placing markers on the user’s face [183]. However, from a practical perspective, this can be invasive, time con- suming, and cumbersome. The ideal facial feature tracking system is markerless, fully automatic, and non-invasive. The two most common methods used for markerless auto- matic facial feature extraction are geometry-based and model based. Geometry-based approaches involve tracking some facial feature points. In these ap- proaches, one tries to model the shape of the face by choosing some critical points on the face and extracting geometric features from them. First attempts for facial expression recognition and synthesizing were based on geometric approaches [136]. Model based approaches such as Active Appearance Models (AAMs) and Constrained Local Models (CLMs) are other commonly used techniques and are an improvement on both the accuracy and computational efficiency of geometric-based approaches. Active Appearance models (AAM) are statistical methods for matching the model of a user’s face to an unseen face [44]. In this method, the shape of the face is estimated by labeling some feature points on some facial images in the training set [4]. Constrained Local Models (CLM) are person-independent techniques for facial feature tracking similar to AAMs except CLMs do not requires manual labeling [47]. There are also several extensions of CLM [47]. Baltrusitus et al. introduced a 3D Constrained Local Model (CLM-Z) for detecting facial features and tracking them using a depth camera, such as Kinect [12]. Baltrusitus et al. also proposed a new tracking algorithm based on a combination of CLM-Z and a generalized adaptive view-based appearance model (GAVAM) which combines rigid and non-rigid facial tracking [12, 125].

12 In the training step of most tracking algorithms, point distribution models (PDM) are used. This consists of non-rigid shape and rigid global transformation parameters, and tries to learn the model of some labeled samples. Principal component analysis (PCA) is the most common approach used in this step. In the test phase, a fitting process is used to estimate the rigid and non-rigid parameters that produce an unseen face image. Constrained Local Model (CLM) techniques also use this general approach, but like active shape model approaches they only try to model the appearance of the local patches around the landmarks, not the whole image [47].

2.2.2 Keyframe Synthesis

Keyframe synthesis, also known as interpolation animation, is the conventional method of animation. Keyframe synthesis basically means hand-made emotions [40]. In this technique, the expressions are synthesized at specified points in time by manually adjusting the servo motors of a robot, and interpolation is used for the intermediate frames. Researchers may also ask a psychologist to supervise making the expressions, especially for more complicated emotions like fatigue and depression.

This method has been commonly used in the literature for designing interactive robots. For a given robot, researchers adjust the robot's degrees of freedom to make different facial expressions and save these states for the robot. Later, when the robot is interacting with users, these expressions are activated based on the interaction mode [34, 162]. An example is a tour guide robot in a museum that smiles at visitors, or Sparky, which can move around a room and smile at people [162].

Furthermore, this method has been extensively used in the literature to study different aspects of the perception of robotic facial expressions [18, 38, 165]. In this case, the researchers manually adjust the servo motors to synthesize different expressions on the robot's face. By showing the robot, or recorded data from it, to participants, researchers study different questions about facial expressions.
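A minimal sketch of keyframe synthesis is shown below: each keyframe stores a hand-tuned set of servo positions at a point in time, and intermediate frames are produced by linear interpolation. The servo counts, times, and values are illustrative placeholders.

```python
import numpy as np

# Hand-made keyframes: (time in seconds, servo positions in degrees).
# Three servos are shown; the times and values are illustrative.
keyframes = [
    (0.0, np.array([90.0, 90.0, 90.0])),   # neutral
    (1.0, np.array([70.0, 110.0, 95.0])),  # hand-tuned "smile" pose
    (2.5, np.array([90.0, 90.0, 90.0])),   # back to neutral
]

times = np.array([t for t, _ in keyframes])
poses = np.vstack([p for _, p in keyframes])

def pose_at(t):
    """Linearly interpolate servo positions for any time between keyframes."""
    return np.array([np.interp(t, times, poses[:, j]) for j in range(poses.shape[1])])

# Sample the animation at 30 frames per second.
for t in np.arange(0.0, 2.5, 1.0 / 30.0):
    command = pose_at(t)   # a real system would send this to the servo controller
print(pose_at(0.5))        # halfway between neutral and the smile keyframe
```

In practice the keyframe poses themselves are tuned by hand, by an animator or researcher, as in the studies described below.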

For example, Becker-Asano and Ishiguro performed keyframe facial expression synthesis on their robot Geminoid F to study inter-cultural differences in perceiving emotions. The researchers manually adjusted the actuators through a software interface to make six facial expressions (happy, angry, sad, fearful, surprised, and neutral) on its face. Then, they showed pictures of the robot's face to participants. They also used these keyframe-synthesized facial expressions to compare recognition rates when the expressions were shown by the robot versus by Geminoid F's model person [18].

Shayganfar et al. used keyframe synthesis to study static facial expressions versus dynamic facial expressions. They proposed an algorithm based on FACS to perform keyframe emotion synthesis for any given humanoid robot. They tested their method with Melvin, a humanoid robot with a cartoon-like face and no skin [165].

One disadvantage of keyframe synthesis is that it is not very accurate for synthesizing dynamic expressions, because one may lose the temporal ordering of details. In addition, this method is time and data intensive [178]. However, keyframe synthesis is simple and does not require a live performer.

2.2.3 Performance Driven Synthesis

Performance driven synthesis was proposed as an improvement to keyframing [35]. Performance driven synthesis is commonly used for a robot to mimic a performer's face. This technique includes recording motions from a live actor and mapping them to a robot's degrees of freedom. Marker-based motion capture systems were proposed as the first method to capture facial expressions of a live performer [25, 177, 183]. Later, marker-less approaches were proposed as an improvement [19, 26]. In most marker-less systems, the input data are images or videos. The general idea is that a face tracker is used to track some facial points from the performer. Then, the tracked points are mapped to control points of the robot. Figure 2.4 shows a marker-based versus a marker-less synthesis system.

Figure 2.4. Right: an example of markerless face tracking, left: an example of marker based face tracking [183].

Usually, a training step is required for performance driven synthesis. In the training step, a face tracker is used to track some facial points in the training frames. Then, these tracked points are manually mapped to the control points of the robot. This means that an animator moves the control points of the robot to find the best match with the training frames. Subsequently, these frames are used to find the mapping criteria between the feature points and the control points of the robot. After the training phase, for a new input video sequence, the mapping learned during training is used to find the movement of the servo motors and animate the robot.

Jaeckel et al. presented a performance driven synthesis method for a robotic head to mimic a person's facial expressions. Their method maps 2D facial feature points of a person to the servo motors of a robot's face. The authors employed Active Appearance Models (AAMs) to track the coordinates of facial features in each frame within a video. In the training step, the tracked AAM facial features are translated into servo commands for controlling the robot's pose. The training step includes keyframe animation for some frames of the input video, manually setting the servo motors of the robot [90]. An actress expresses some face poses and an Active Appearance Model is fitted to the input frames. This fitting gives 2D coordinates of 25 landmarks plus 16 parameters related

to texture. Scaling and normalizing is used to compensate for the global head translation. Then, an animator performs keyframe animation on the robot head to match its face with the training frames of the person. They apply multilinear Partial Least Squares regression to compute the parameters for a model that translates the 25 facial landmarks to the 34 servo values. After the training step, the researchers use the obtained model to predict robot poses for any unseen input AAM shapes obtained from the test video sequence. The researchers used the humanoid robot “Jules”, with 34 servo motors on its face and neck, to evaluate their synthesis method.

One advantage of performance driven synthesis compared to keyframe synthesis is preserving the timing and order of activating different facial parts. Human facial expressions, speech, gaze, and gestures are synchronized and psychologically linked [39]. For a robot to appear natural and human-like, representing the synchronization and temporal order of facial parts is important. In keyframe animation, providing this synchrony is harder than in performance driven animation. This is because in performance driven approaches the data comes from a live performer and is thus naturally synchronized. However, in performance driven approaches, a performer is required for at least the training phase of the system. Keyframe animation is simpler and might be enough for studies about different aspects of static facial expression synthesis of basic emotions.

The majority of robots use fixed emotional expressions, and they are animated by repetitive routines [17, 18, 32, 101, 134]. However, psychologists suggest that fixed “snapshots” of facial expressions do not necessarily represent the emotional state, since different emotional states overlap [70, 90]. Some researchers suggested that there is no significant difference between evaluation of static (pictures) and dynamic (videos) expressions by participants [23]. Others argued that participants are more accurate in recognizing some dynamic expressions such as “happiness” and “disgust” [99].
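The regression step in this kind of pipeline can be sketched as follows: training pairs of tracked landmark coordinates and manually keyframed servo values are fed to Partial Least Squares regression, and the fitted model then predicts servo values for unseen frames. The array sizes below follow the 25-landmark, 34-servo setup described above, but the data is random placeholder data and scikit-learn's PLSRegression stands in for the multilinear PLS used by the authors.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)

# Training data: each row pairs one video frame's tracked landmarks with the
# servo values an animator chose when matching the robot to that frame.
# Random placeholders stand in for the real tracked/keyframed data.
n_train_frames = 120
X_train = rng.normal(size=(n_train_frames, 25 * 2))   # 25 (x, y) landmarks per frame
Y_train = rng.normal(size=(n_train_frames, 34))       # 34 servo values per frame

# Fit a PLS model mapping landmark coordinates to servo values.
pls = PLSRegression(n_components=12)
pls.fit(X_train, Y_train)

# At run time, landmarks tracked in new frames map directly to servo commands,
# so the robot can mimic the performer without further keyframing.
X_test = rng.normal(size=(5, 25 * 2))
servo_commands = pls.predict(X_test)                  # shape (5, 34)
print(servo_commands.shape)
```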

To synthesize dynamic expressions, there are two main approaches. One is mapping the facial expressions of a subject to the robot's face, and the other is to define a limited number of emotional states and transition between them with computational techniques such as interpolation [89, 90, 115, 130].

2.3 Synthesizing Facial Expressions on Patient Simulators

In my research, I use different techniques for synthesizing facial expressions on robots and apply them to one application: patient simulation.

Many researchers in the fields of affective computing and clinical education are interested in patient simulation (c.f., [94, 117, 164]). Simulated patients provide safe experiences for clinical trainees, where they can practice communication, assessment, and intervention skills, without fear of harming a real patient (see Figure 2.5). Although this technology is in widespread use today, commercial patient simulators lack sufficient realism. They have static faces with no capability to convey facial expressions, despite the vital importance of these non-verbal expressivity cues in how clinicians assess and treat patients [81, 159].

This is a critical omission, because almost all areas of health care involve face-to-face interaction [114]. Furthermore, there is overwhelming evidence that providers who are skilled at decoding communication cues are better healthcare providers: they have improved patient outcomes, higher patient compliance and satisfaction, greater patient safety, and experience fewer malpractice lawsuits [10, 37]. In fact, communication errors are the leading cause of avoidable patient harm: they are the root cause of 70% of sentinel events, 75% of which lead to a patient's death [106].

In studying how individuals, teams, and operators interact with inexpressive simulators, our work suggests that commercially available systems are inadequate for the task of training students due to their complete inability to provide human communication cues [84, 91]. In particular, these simulators cannot convey visual cues such as pain to medical trainees, even though perceiving a patient's nonverbal cues is an exceptionally important factor in how clinicians make decisions. Existing systems may be preventing students from picking up on patients' non-verbal signals, possibly inculcating poor safety habits due to a lack of realism in the simulation [2, 80].

My work focuses on making patient simulators more realistic by enabling them to convey realistic, patient-driven facial expressions to clinical trainees. We are designing a new type of physical patient simulator with a wider range of expressivity, including the ability to express pain and other pathologies in its face. In our work, I use different techniques for synthesizing patient-driven expressions and pathologies, such as pain and stroke, onto patient simulators to make them more naturalistic. Our work can help develop an educational tool for clinical learners to practice face-to-face communication with patients using patient simulation.

Figure 2.5. Left: A team of clinicians treat a simulated patient, who is conscious during the simulation, but has no capability for facial expression. Right: A commonly used inexpressive mannequin head.

2.4 Chapter Summary

This chapter provided a brief background on different techniques for synthesizing facial expressions on robots and described the motivation behind my work. Facially expressive robots can be applied to many different domains, but I particularly focus on patient simulation. The remainder of this document expands upon this application. We begin in the next chapter by presenting our method for evaluating various facial expression synthesis algorithms.

CHAPTER 3

EVALUATING FACIAL EXPRESSION SYNTHESIS ON ROBOTS

In the previous chapter we discussed common techniques in the literature for performing facial expression synthesis on robots. However, in the robotics literature, less attention has been paid to developing a standard evaluation method to compare different synthesis systems. Currently, the two most common ways to evaluate facial expression synthesis are to perform a subjective or a computationally-based evaluation. In a subjective evaluation, subjects judge the synthesized expressions; in a computationally-based evaluation, each synthesized expression is quantitatively compared with a predefined computer model. In this chapter, we investigate the strengths and weaknesses of these methods and propose a new model that combines the advantages of both.

3.1 Background

In the broader field of human-computer interaction, several researchers have studied the classification of evaluation methods. For example, Benoit et al. proposed that based on the evaluation goal, there are three types of evaluations for a multi-modal synthesis system [21]:

• Adequacy evaluation: evaluates how well the system meets the requirements determined by users' needs. For example, a consumer report is a kind of adequacy evaluation.

• Diagnostic evaluation: evaluates system performance against a taxonomy, considering possible usage in the future. System developers usually use this evaluation.

• Performance evaluation: evaluates the system in a specific area. To do so, a well-defined performance baseline is required for comparison, for example, an older version of the same system or a different system which supports the same functionality.

Of the three techniques, performance evaluation is the most commonly used in the literature [21]. Performance evaluation for a system requires defining the characteristics one is interested in evaluating (e.g., accuracy, error rate, or recognition rate) as well as an appropriate method for measuring them. Criteria for the performance evaluation of facial expression synthesis systems include task completion time, complexity, realism of expressions, static versus dynamic expressions, and recognition rate. Generally, there are three ways of conducting a performance evaluation for a synthesis technique [174]:

• User-based: This evaluation method involves users completing some predefined tasks, which match the goals the system is designed for. By analyzing data collected from users, the system performance is evaluated.

• Expert-based: This evaluation method is similar to user-based evaluation, except that the users are experts in the area. For example, in the case of facial expression synthesis, users may have expertise in decoding facial expressions (e.g., certified Facial Action Coding System (FACS) coders).

• Theory-based: This evaluation method does not require live human-machine interaction. Instead, it tries to predict the user's behavior or performance based on a theoretically derived model.
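Recognition rate, mentioned above as a common performance criterion, is straightforward to compute from a forced-choice study. The sketch below tallies a confusion matrix and per-expression recognition rates from hypothetical participant responses; the responses are invented solely for illustration.

```python
from collections import Counter

# Hypothetical forced-choice responses: (intended expression, participant's label).
responses = [
    ("happiness", "happiness"), ("happiness", "happiness"), ("happiness", "surprise"),
    ("anger", "disgust"), ("anger", "anger"), ("anger", "anger"),
    ("fear", "surprise"), ("fear", "fear"), ("fear", "fear"),
]

confusion = Counter(responses)                 # (intended, chosen) -> count
shown = Counter(intended for intended, _ in responses)

# Per-expression recognition rate: fraction of trials labeled as intended.
for expression in sorted(shown):
    rate = confusion[(expression, expression)] / shown[expression]
    print(f"{expression:10s} recognition rate: {rate:.2f}")

# Off-diagonal counts show which expressions are conflated (e.g., fear vs. surprise).
print({pair: n for pair, n in confusion.items() if pair[0] != pair[1]})
```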

From another point of view, evaluative methods can be divided into two classes: quantitative and qualitative [102]. Quantitative evaluations compare synthesized values with real values to investigate whether the synthesized movements are correct [21]. For example, one can compare the values of lip height and lip width in a synthetic smile with the corresponding values from a human subject. By using an image-based FACS analysis, one can compare muscle contraction in a real image with a synthetic expression. However, this is a complicated task, since different facial parameters may have different levels of importance in conveying an emotion, and one needs to find an appropriate weight for each of them [21]. For example, Bassili [15] and Gouta and Miyamoto [69] found that negative emotions are usually expressed by the upper part of the face, while positive emotions are

expressed in the lower part [131]. Computing appropriate weightings of each facial part in an emotional expression is still an open question [169].

A qualitative evaluation compares the intelligibility of a synthesized expression with the intelligibility of the same emotion expressed by a human. In other words, a qualitative evaluation checks how the synthesized expressions are perceived [21]. Berry et al. [24] proposed that evaluations should be done on different levels of a synthesis system, including:

• Micro level: observes just one aspect of the robot separately from other parts (e.g., just lip movements).

• User level: evaluates the reaction of users to the system.

• Application level: evaluates the system within a specific application.

The most common evaluation techniques used by researchers in the field of facial expression synthesis are subjective evaluation, expert evaluation, and computational evaluation.
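The quantitative comparison described earlier, checking synthesized values against values measured from a human expression with different weights for different facial parts, can be sketched as a weighted landmark distance. In the sketch below, the landmark grouping, the weights, and the landmark data itself are illustrative placeholders, not values from any of the studies cited above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Landmark coordinates (N x 2) measured from a human expression and from the
# synthesized robot/avatar expression; random placeholders are used here.
n_landmarks = 66
human = rng.normal(size=(n_landmarks, 2))
synthesized = human + rng.normal(scale=0.05, size=(n_landmarks, 2))

# Illustrative grouping of landmark indices into facial regions, with a weight
# per region (e.g., weighting the upper face more heavily for negative emotions).
regions = {
    "brows_eyes": (range(17, 48), 0.5),
    "mouth":      (range(48, 66), 0.3),
    "jaw":        (range(0, 17),  0.2),
}

score = 0.0
for name, (idx, weight) in regions.items():
    idx = list(idx)
    region_error = np.linalg.norm(human[idx] - synthesized[idx], axis=1).mean()
    score += weight * region_error
    print(f"{name:10s} mean landmark error: {region_error:.3f}")

print(f"weighted error (lower is closer to the human expression): {score:.3f}")
```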

3.1.1 Subjective Evaluation

In the literature, subjective evaluation is the most commonly used approach to evaluate facial expression synthesis systems. Subjects observe synthesized expressions and then answer a predefined questionnaire. By analyzing the collected answers, researchers evaluate the expressions of their robot or virtual avatar. Although user studies are costly and time consuming, they provide valuable information about the acceptability and believability of a robot. Moreover, humans are very sensitive to errors in facial movements. A wrong movement, a wrong duration, or a sudden transition between facial expressions can be detected by a human. This is especially true when the robot or virtual agent is more realistic in appearance [138].

A subjective evaluation requires a well-defined experimental procedure and a careful analysis of the results. Several methodological issues should be considered when choosing participants, such as their average age, their educational level, their cultural background, their native language, and the gender ratio of participants [142]. A second experimental design concern is that some facial expressions have different meanings in different cultures; for example, the same head nod used to express agreement in India may convey disagreement in a Western country [148]. Becker-Asano and Ishiguro, with the robot Geminoid F, studied the effects of intercultural differences in perceiving the robot's facial expressions [18]. They found that Asian participants showed low agreement on what constituted a fearful or surprised expression, in contrast to American and European participants. A third experimental design concern is how to best select subjects to participate in the evaluation. Christoph did research in this area, suggesting that an ideal group of subjects consists of those most likely to use the robot in the future [42]. For example, a robot designed for autism therapy should be evaluated by autistic children, since the robot is intended to be useful for them. Another important aspect of subjective evaluation experimental design is adequately preparing subjects for interaction with the robot. For example, Riek et al. showed subjects a picture of an unusual-looking zoomorphic robot before the experiment in order to help them adequately habituate to it and avoid uncanny effects [150]. This step may prove particularly important for subjective evaluation of very realistic looking robots, so as not to shock participants.

Several examples of strong experimental design for subjective synthesis evaluation are in the literature. For example, Becker-Asano and Ishiguro performed a subjective study to compare the expressivity of Geminoid F's five basic facial expressions with the expressions of the real model person [18]. After watching each image of the robot expressing different facial expressions, subjects could choose one of six labels: angry, fearful, happy, sad, surprised, or neutral. To study intercultural differences in recognition accuracy, they performed the experiment in German, Japanese, and English. The interface the researchers used in their subjective evaluation, as well as the model person, can be seen in Figure 3.1.

Figure 3.1. Subjective evaluation example from Becker-Asano and Ishiguro [18]. Left: Geminoid F (left) and its model person (right). Right: the interface.

In another study, Mazzei et al. used a subjective method to evaluate the robotic face FACE in conveying emotion to children with Autism Spectrum Disorders (ASD) [116]. They recruited participants from both the ASD and non-ASD population. Participants were asked to recognize and label the emotions expressed by the robot, including happiness, anger, sadness, disgust, fear, and surprise. The correctness of the labels was determined by a therapist. The interface they used for their subjective experiment is depicted in Figure 3.2.

Figure 3.2. Subjective evaluation example from Mazzei et al. [116].

Another kind of subjective evaluation is a side-by-side comparison, or "copy synthesis," in which researchers try to synthesize facial expressions which best match a recorded video or image [3, 190]. Subjective evaluation of the quality of the synthesis is done through a visual comparison between the original videos and the synthesized ones. A sample of this side-by-side comparison is shown in Figure 3.3.

Subjective evaluation is also very common for analyzing human perception of virtual agents [24, 132]. However, in the case of physical robots, their physicality changes both the physical generation of the synthesis and its evaluation. Moving motors on a robot is different and more complicated than animating a virtual character. The number of motors, their speed, their range of motion, and the synchronization between them are among the factors which make the evaluation more complicated. Several evaluation metrics should be defined. We are concerned with questions such as: does motor noise affect immersion by being distracting? Are we more sensitive to perfection, speed, or realism in physical motion versus virtual motion?

3.1.2 Expert Evaluation

In an expert evaluation, a trained person, usually a psychologist, evaluates the synthesized facial expressions to determine whether the system meets predefined criteria. However, recruiting a sufficient number of experts is not easy. Moreover, the robot is often ultimately designed for users from the general population, not highly trained experts. A sole reliance on experts could create acceptability problems for the robot down the road.

Figure 3.3. An example of a side-by-side comparison. On the left is the real image, and on the right is the synthesized expression [190].

3.1.3 Computational Evaluation

A computational evaluation aims to provide a fast and fully automatic evaluation method which does not require human judges. Designing such an evaluation can remove many of the aforementioned problems with subjective evaluations. This evaluation method requires one to develop an accurate computer model of the facial parts for each facial expression. Based on this model, the system compares the synthesized expression with the computer model. Some researchers have worked on developing computational models of facial parts, such as muscle and skin models or lip shapes [71, 104, 131]. However, designing an accurate computational model for each facial part is very complicated. Moreover, one cannot use the same model for different robot platforms, because each robot has its own characteristics, such as the number of control points. For example, the physical robot Doldori does not have any control points for cheek movements [103].

3.2 Proposed Method Combining Subjective and Computational Evaluation Methods

An ideal evaluation method should incorporate the advantages of both subjective and computational evaluations. In order to develop a method incorporating subjective and computational evaluation, we consider the following two important facts. First, social robots are designed to interact with people. Therefore, it is important to investigate how people perceive the robot by involving them in the evaluation, to ensure that the robot is visually appealing. Second, different synthesis methods should be compared on the same robot platform in order to have a fair evaluation.

Our model [119] is based on a model proposed by Ochs et al. [133]. The researchers developed the "E-smiles-creator," a web application for analyzing the facial parameters of a virtual character for different kinds of smiles (amused, polite, and embarrassed). In E-smiles-creator, the user is asked to create different kinds of smiles on the face of a virtual agent by manipulating its control points. Subjects also choose their degree of satisfaction at the end. The interface designed by Ochs et al. is shown in Figure 3.4 [133]. In this interface, the kind of smile that the user should produce (e.g., a polite smile) is in the upper quadrant, and all the parameters that the user can change (e.g., duration) are shown in the left quadrant. Each time a user changes a parameter, a new video of the expression is played, and the user again chooses a degree of satisfaction. Considering all possible parameters of their virtual agent, one may produce as many as 192 different kinds of smiles. Then, by using a decision tree, they characterized the parameters for each kind of smile. In the decision tree, they accounted for the degree of satisfaction for each user-produced smile, so that smiles created with a higher degree of satisfaction count as more important. We use this technique in the subjective part of our evaluation model. Our evaluation method has three steps:

Figure 3.4. E-smiles-creator [133].

Step 1: We need a baseline to compare all the methods against. This baseline should be the best expression that a human can perceive from that robot, with regard to all the characteristics and limitations of the robot. In our proposed evaluation model, this baseline is determined with a subjective study. This step requires designing appropriate software as an interface between the participant and the robot. Mazzei et al. designed such an interface to allow subjects to control the motors of the physical robot FACE [116]. In this step, subjects are asked to change the control points of the robot to create a list of different expressions. For each expression, they are asked to create the best expression they think they can create with that robot. At the end, they also choose their degree of satisfaction on a predefined scale.

Step 2: In the second step, a decision tree is used to analyze the parameters which subjects chose for each expression. For each of them, we also consider a weighting based on satisfaction degree. Therefore, expressions created with a higher degree of satisfaction will have a greater effect.

Step 3: The third step is comparing the synthesis method we want to evaluate with the model from the second step. This part is done computationally, by comparing each parameter in the synthesis output with its corresponding parameter from Step 2. In other words, we want to compare the synthesized expression computationally with the baseline we produced in Step 1 via a subjective study. Step 3 follows the same procedure as the computationally-based evaluation in Section 3.1.3, except that instead of having a computer model of expressions, we develop the baseline from a user-based study.

Our proposed evaluation model can be used for evaluating a synthesis method on both the micro level and the application level [24]. For the micro level, one can perform all three steps of our model on just one aspect of the robotic face (e.g., just lip movements). For the application level, one can perform all three steps of our model within a specific application (e.g., autism therapy). However, for a user-level evaluation, traditional subjective methods should be employed, since the focus of this evaluation is solely on people's reaction to the robot.
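As an illustration only, the sketch below shows one way Steps 2 and 3 could be realized, assuming each subject-created expression from Step 1 is stored as a vector of normalized control-point values together with a satisfaction rating. The data, variable names, and the use of scikit-learn are assumptions made for the sake of the example, not a description of a finished implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data from the Step 1 subjective study: each row is one
# user-created expression, expressed as normalized control-point values.
X = np.array([
    [0.8, 0.2, 0.1],   # e.g., [brow_raise, lip_corner_pull, jaw_drop]
    [0.7, 0.3, 0.0],
    [0.1, 0.9, 0.4],
    [0.2, 0.8, 0.5],
])
labels = np.array(["sad", "sad", "happy", "happy"])   # target expression per trial
satisfaction = np.array([0.9, 0.4, 1.0, 0.7])         # user satisfaction in [0, 1]

# Step 2: characterize the parameters per expression with a decision tree,
# weighting each sample by its satisfaction rating.
tree = DecisionTreeClassifier(max_depth=3)
tree.fit(X, labels, sample_weight=satisfaction)

# A satisfaction-weighted mean of each expression's parameters can then serve
# as the per-expression baseline for the computational comparison.
def baseline(expression):
    mask = labels == expression
    return np.average(X[mask], axis=0, weights=satisfaction[mask])

# Step 3: compare a synthesized parameter vector against the baseline, and
# check which expression the weighted tree assigns to it.
def evaluate(synthesized, expression):
    vec = np.asarray(synthesized, dtype=float)
    distance = float(np.linalg.norm(vec - baseline(expression)))
    predicted = tree.predict(vec.reshape(1, -1))[0]
    return distance, predicted

dist, label = evaluate([0.15, 0.85, 0.45], "happy")
print(f"distance to baseline: {dist:.3f}, tree label: {label}")
```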

3.3 Challenges in Synthesis Evaluation

In evaluating synthesized facial expressions, there are several essential characteristics that must be captured: realism, naturalness, pleasantness, believability, and acceptability. Thus, evaluation of facial expression synthesis for both robots and virtual avatars is challenging. One reason is that defining the evaluation criteria is not easy, since qualitative aspects play an important role [21]. Comparing different synthesis methods is difficult without a standard set of evaluation criteria.

There are several other issues to be considered in evaluating facial expressions. Having an aesthetically appealing robotic face is not sufficient for obtaining realism [49, 64]. To synthesize realistic and believable expressions, robotic faces should be consistent, have smooth transitions, and match the context, state of mind, and perceived personality of the robot [49]. Facial expressions of the robot should also be synchronized with other behaviors, such as head and body movements.

Modeling all of the above parameters for a computational evaluation method is complicated. Additionally, computational evaluation requires determining a computer model for different facial expressions (e.g., a neutral, happy, or sad face) at different intensities, as a baseline against which to compare different expressions. Defining such models is challenging. Therefore, a computational evaluation method that works without any human involvement, across all robotic platforms and all synthesis methods, remains an open question.

The appearance of the robot may also affect judgments about synthesized expressions. Kidd et al. performed a subjective study with a physical robot and an animated character and found that a physical robot is perceived as more informative by human subjects than a virtual character [96]. Numerous characteristics may affect a person's judgment, such as whether the robot is physically co-located with the participant, the robot's morphology, its perceived gender, and its perceived age. Moreover, choosing an appropriate synthesis method greatly depends on the robot one wants to implement the synthesis on. The number of motors of a physical robot, or of control points of a virtual avatar, has a great effect on the choice of synthesis method.

To evaluate a synthesis system, synthesis quality may be measured computationally as well as by human judgment. Related to human evaluation, several psychological issues should be considered, such as a person's underlying attitude toward robots, as well as individual differences which may affect their judgment [149, 151]. Also, there are always human errors in subjective studies. For example, separating fear from disgust is hard for human subjects, especially on robotic faces without nose wrinkles [103].

From this set of challenges, one can see the diversity of factors which should be considered when selecting the best synthesis method. It is necessary to carefully evaluate these choices and their effect on the robot. Evaluation of synthesis systems is still relatively new, and no benchmarks or standard evaluation procedures exist in the literature [24].

3.4 Chapter Summary

For social robots to become acceptable in our daily lives, realistic human-robot interaction is important. Therefore, it is important to make robots capable of generating appropriate expressions that are understandable by humans. Designing a standardized method for evaluating synthesized behavior is an open research question, and relatively unexplored in the literature.

In this chapter, we presented techniques for facial expression synthesis evaluation and outlined major challenges in this field in order to guide researchers. We also proposed a model for evaluating facial expression synthesis methods which incorporates both subjective and computationally-based evaluation techniques. In the next chapter, we propose a performance-driven synthesis method for mapping facial expressions of real patients onto three virtual avatars, and we evaluate our synthesis method based on the techniques presented in this chapter.

CHAPTER 4

PAIN SYNTHESIS ON VIRTUAL AVATARS

As I discussed in Chapter 2, my research focuses on making patient simulators more realistic by enabling them to convey realistic, patient-driven facial expressions to clinical trainees. This chapter presents one aspect of our work, which includes research questions surrounding synthesizing realistic painful faces on a virtual avatar and evaluating how they are perceived. Here, naturalistic refers to non-acted, real-world data obtained "in the wild" [53]. This research fills a gap in the virtual patient problem domain, because although there is a growing body of literature on automatic pain recognition [8, 75, 112], there is little published work on automatic pain synthesis, particularly using realistic data. This work will also enable medical educators to improve their face-to-face communication skills and pain recognition skills, which, according to the literature, are both in need of attention [43, 81, 159].

In this chapter, we describe our technique for realistic pain synthesis on virtual patients, and report on several perceptual studies validating the quality of the synthesis. For synthesis, we used a constrained local model (CLM)-based facial feature tracker applied to examples from the UNBC-McMaster Pain Archive [113].

We modeled our perceptual experiment on work by Riva et al., in which participants classified videos of three types of synthesized facial expressions (pain, anger, and disgust) across three avatar genders: male, female, and androgynous [152]. Riva et al. considered anger and disgust to be reasonable expressions for comparison because of their "negative valence and threat-relevant nature." They explored avatar gender variations because previous work in the field suggested a relationship between actor gender and pain detection accuracy [82]. In their work, the facial expressions were manually created by an animator using FaceGen 3.1, and reviewed by experts trained in the Facial Action Coding System (FACS). Riva et al. [152] had two findings of note. First, they found participants were less accurate in decoding expressions of pain compared to anger and disgust (similar to other work, cf. [17, 23, 93]). Second, regardless of gender, participants had better pain detection accuracy for male avatars. Given a realistic approach to synthesis, we wondered if the findings by Riva et al. would hold, and thus replicated their experiment.

We had two main research questions. First, are participants able to distinguish expressions of pain from anger and disgust, and how do their accuracies differ? Based on findings by Riva et al. [152], we predicted that overall, participants would be more accurate at detecting disgust compared with pain, and more accurate at detecting anger compared with disgust. Second, how does an avatar's gender affect pain detection accuracy? In addition to being curious whether we could replicate the findings by Riva et al. [152], we were also interested in evidence-based insights into how to design our physical robotic patient. We eventually need to make decisions about the apparent gender of the robot, and this will require careful weighing of our findings. Based on findings by Riva et al., we predicted that pain detection accuracy would be lower overall when pain is expressed on a female avatar compared to a male avatar. In this work, we did not explore the effect of participant gender: despite findings about how it affects accuracy in detecting some aspects of expressivity (e.g., arousal and valence) [22, 166], there is no evidence to suggest it affects overall categorization.

Our methodology, described in Section 4.2, addresses these research questions through a 3×3 online study in which subjects labeled videos of male, female, and androgynous avatars displaying pain, anger, and disgust. Our results, discussed in Section 4.3, showed that participants were able to distinguish facial expressions of pain from anger and disgust synthesized with our realistic approach, and were less accurate in decoding disgust compared to pain and anger. Furthermore, we did not find support for the avatar gender finding by Riva et al.: in our data, avatar gender did not have a significant effect on pain detection accuracy. Finally, our results suggested that realistic pain synthesis on virtual avatars is comparable to manual pain synthesis, and arithmetically may be better. We discuss the implications of our findings in Section 4.4.

4.1 Background

Similar to other expressions of emotion, facial expressions of pain are an important non-verbal communication signal, particularly in healthcare [74, 139]. Until recently, self-reporting and clinical observation were the primary ways to detect pain. However, these methods have several issues. For example, self-report cannot be used for children or for patients with communication challenges (e.g., cognitive impairments, unstable states of consciousness or lucidity, etc.). Moreover, there are differences between how clinicians and patients conceptualize pain, which can lead to problems [8, 112, 139, 186].

Psychologists propose that pain is expressed by certain facial movements. While there is some research suggesting pain can be idiosyncratic [9, 154], we approached our research with respect to the Facial Action Coding System (FACS), which holds that pain can be interpreted universally from the face. FACS uses 46 action units (AUs) as its building blocks to code facial expressions. This system was developed initially by Ekman [56] to code basic emotions based on facial muscle activities, and was later applied to pain, notably by Prkachin et al. [112] and Craig et al. [167].

To date, several research groups have worked on techniques for automatic pain detection, a process that involves automatic facial feature extraction and training classifiers to detect pain. For example, Ashraf et al. classified videos from the UNBC-McMaster Shoulder Pain Expression Archive Database into pain/no-pain categories using machine learning approaches [8, 113]. Their feature extraction was based on Active Appearance Models (AAM), which we describe in more detail in Section 7.6.2. The researchers decoupled shape and appearance parameters from the facial images and used Support Vector Machines (SVM) for classification. Others have also explored automatic pain recognition [75, 118, 140].

Despite the aforementioned work on automatic pain detection, there is little work on automatic pain synthesis. In our work, we studied features that have been used in the literature for detecting facial expressions of pain and instead employed them to synthesize facial expressions of pain.

4.2 Methodology

We employed performance-driven synthesis of pain, anger, and disgust on three virtual avatar faces (female, male, and androgynous) to answer the aforementioned research questions. In Chapter 2, we explained that performance-driven synthesis is a commonly used animation technique that tracks motions from either a live or recorded actor and maps them to an embodied agent, such as a virtual avatar or physical robot [183, 187]. This technique has been used in the literature to synthesize a wide range of realistic facial expressions, but not pain [19, 26].

In our work, the source videos we used for pain synthesis came from the UNBC-McMaster Pain Archive [113], and the source videos for anger and disgust came from the MMI database [137]. We used ten source videos for each expression type. Our stimuli creation process included four steps. First, we used a Constrained Local Model (CLM) based tracker to extract 66 feature points frame-by-frame from each source video. Next, we mapped the extracted feature points to the virtual character control points for animation in the Steam Source SDK. Then, we animated three different virtual characters (female, male, androgynous) for each expression type, resulting in 90 stimuli videos. Finally, we ran several pilot studies to label both gender and expression, and to establish which videos to include in our main study. This resulted in 27 stimuli videos.

4.2.1 Source Video Acquisition

For the source videos depicting painful expressions, we used the UNBC-McMaster Shoulder Pain Expression Archive Database [113]. This is a fully labeled, realistic data set of 200 video sequences from 25 participants suffering from shoulder pain (52% female). Participants performed range-of-motion tests on both their affected and unaffected limbs under the instruction of a physiotherapist. At the frame level, each frame was coded using the Facial Action Coding System (FACS) and contains 66-point AAM landmarks. Each frame also received a pain score ranging from 0 to 12. At the sequence level, each video has both self-report and observer ratings of pain, the latter ranging from zero to five. In our study, we only included videos in which pain was present. Similar to Ashraf et al., we considered pain to be present in a sequence if its observer rating was three or greater, and pain to be absent if its observer rating was zero [8].

For the source videos depicting anger and disgust, we used the MMI database [137]. This is a database of posed expressions from 19 participants (44% female) who were instructed by a FACS expert to express six basic emotions (surprise, fear, happiness, sadness, anger, and disgust). Each video begins with a neutral expression and then transitions into the target expression.

We included source videos from these two databases that were determined by two human judges to be accurately tracked by our face tracker. Judges watched the source videos with the CLM mesh drawn on the face and rated all videos on a scale from one to four, depending on how well the mesh aligned with the face throughout the video. We only included videos to which both judges gave a tracking score of one. This resulted in 10 source videos from each expression category (pain, anger, disgust). Figure 4.2 (top) shows some sample frames from these databases.
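A minimal sketch of these selection rules is shown below. The metadata layout (a list of dictionaries with hypothetical sequence IDs) is an assumption made purely for illustration; the actual database annotations are stored differently.

```python
# Hypothetical per-sequence metadata: observer pain rating (0-5) and the
# tracking-quality scores (1 = best) assigned by the two human judges.
pain_sequences = [
    {"id": "p001", "observer_rating": 4, "tracking_scores": (1, 1)},
    {"id": "p002", "observer_rating": 2, "tracking_scores": (1, 2)},
    {"id": "p003", "observer_rating": 5, "tracking_scores": (1, 1)},
]

def usable(seq):
    """Pain present (observer rating >= 3) and both judges rated tracking as 1."""
    return seq["observer_rating"] >= 3 and all(s == 1 for s in seq["tracking_scores"])

selected = [seq["id"] for seq in pain_sequences if usable(seq)]
print(selected)  # -> ['p001', 'p003']
```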

4.2.2 Feature Extraction

For tracking facial features, we used Constrained Local Models (CLMs), which we explained in detail in Chapter 2. To our knowledge, CLM-based models have mostly been used in the literature for face tracking or expression detection, not for synthesis [12, 41]. In our work, we used this technique to synthesize facial expressions of pain, anger, and disgust. As we discussed in Chapter 2, Baltrusaitis et al. introduced a 3D Constrained Local Model (CLM-Z) method for detecting facial features and tracking them using a depth camera, such as a Kinect [12]. This method is robust to light variations and head-pose rotations, and is thus an improvement over the traditional CLM method. For our work, we were able to create our stimuli from a wide range of source videos using the CLM-Z tracker [12]. However, because our source videos were pre-recorded and did not contain depth information, we were not able to take full advantage of the CLM-Z tracker. In the future, when we transition to performing real-time facial synthesis on a robot, we will employ depth information to increase synthesis validity.
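For readers who want to experiment with a similar per-frame landmark extraction step, the sketch below shows its general shape. It substitutes dlib's off-the-shelf 68-point landmark model for the CLM-Z tracker used in this work, purely as a stand-in; the model file and the video path are placeholders.

```python
import cv2
import dlib

# Stand-in tracker: dlib's 68-point landmark model (downloaded separately),
# NOT the CLM-Z tracker actually used in this dissertation.
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

cap = cv2.VideoCapture("source_video.avi")  # placeholder path
landmarks_per_frame = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    if faces:
        # Take the first detected face and record its landmark coordinates.
        shape = predictor(gray, faces[0])
        points = [(shape.part(i).x, shape.part(i).y) for i in range(shape.num_parts)]
        landmarks_per_frame.append(points)
cap.release()
print(f"tracked {len(landmarks_per_frame)} frames")
```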

4.2.3 Avatar Model Creation

We used three avatar models in this work: female, male, and androgynous. When creating the avatar models, we aimed to remove any effects of age and ethnicity. To generate our avatars, we extracted characters from the video game Half-Life 2 via the Steam Source SDK. We used the program GCFScape to extract the textures from the video game files, and a program called VTFEdit to convert the textures to modifiable TARGA files (a raster graphics file format).

For the purposes of this experiment, we used the character "Alyx" as the base for our virtual avatars, to ensure consistency of ethnicity and age [1]. Within Adobe Photoshop, certain areas were darkened or lightened to exhibit qualities normally attributed to male or female characteristics. For example, we enlarged the jaw-line, cheekbones, and chin during the creation of male-looking avatars. The androgynous textures employed similar changes, but to a smaller degree. We also changed other areas of the face to create variation among these textures. We created a total of twenty different texture variations employing these changes for our pilot to determine our androgynous, female, and male avatars. Similar to the stimuli created by Riva et al. [152], we cropped the face to remove any neck, clothing, or hair visibility, to avoid any unintentional conveyance of gender cues.

Figure 4.1. The spectrum of avatar genders we created, with the final three highlighted. From right to left: male, androgynous, female.

We ran a pilot study to establish ground truth labels for each avatar's gender, following the methodology of Riva et al. [152]. The goal of this pilot was to select three avatar models as distinctly female, male, and androgynous out of the 20 models we made. We had 16 American participants (11 female, mean age 44.5 years), recruited using Amazon MTurk. Participants viewed 20 still images of the avatars in a random order and labeled their gender using an 11-point Discrete Visual Analogue Scale (DVAS). A zero on the scale corresponded to "completely masculine" and ten to "completely feminine." We used these results to select our female, male, and androgynous avatar models: we chose the female model with a score of 9.13, the male model with a score of 1.81, and the androgynous model with a score of 4.88. See Figure 4.1 for the final three avatar models. The average score for each of the three chosen avatars ensured that we could use these three avatar models as female, male, and androgynous in our main experiment. The next step was to animate each of these three avatars using the 30 source videos.

4.2.4 Stimuli Creation and Labeling

We initially created 90 stimuli videos. We had three avatar models (female, male, and androgynous), three expression types (pain, anger, and disgust), and ten samples of each expression. We tracked 66 facial feature points frame-by-frame in each of our 30 source videos. We removed rotation, translation, and scaling based on the eye corner positions, and measured the movement of each point in relation to a normalized frame that was calculated at runtime.

In order to map our facial points to our avatars, we generated source files that the Source SDK is capable of understanding. To do this, we ran the CLM tracker twice on each video. In the first run, we calculated the maximum movement of each point on the face, giving the Source SDK a maximum scale factor to use as a reference point. In the second run, we computed the movement of each feature point in relation to this maximum scale factor: we divided the movement measured in the second run by the maximum movement for the same point measured in the first run. This gave us a ratio value between zero and one (zero being neutral, and one being the maximum amount a given point can move) for use with the feature mapping in the Source SDK.
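The sketch below illustrates this two-pass computation, assuming the landmarks have already been aligned using the eye corners. The array shapes and function names are illustrative assumptions; the actual pipeline wrote these 0-1 ratios into animation files for the Source SDK.

```python
import numpy as np

def max_displacements(frames, neutral):
    """First pass (sketch): the per-point maxima are the largest motions observed."""
    return np.linalg.norm(frames - neutral, axis=2).max(axis=0)

def normalized_displacements(frames, neutral, max_displacement):
    """Second pass: express each landmark's motion as a 0-1 fraction of the
    maximum motion observed for that landmark in the first pass.

    frames:           array of shape (n_frames, n_points, 2), already aligned
                      (rotation, translation, and scale removed via the eye corners)
    neutral:          array of shape (n_points, 2), the reference frame
    max_displacement: array of shape (n_points,), per-point maxima from pass one
    """
    disp = np.linalg.norm(frames - neutral, axis=2)      # per-frame, per-point motion
    ratios = disp / np.maximum(max_displacement, 1e-8)   # avoid division by zero
    return np.clip(ratios, 0.0, 1.0)                     # 0 = neutral, 1 = maximum
```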

We recorded each of the three avatar types enacting the expressions. The play-back duration for each expression was manually adjusted to be 0.3 times slower, to compensate for slight variations in how some of the source videos were tracked. After generating the recordings using CamStudio, we cropped the stimuli videos to be three to five seconds long to ensure consistency. Figure 4.2 shows example frames of the created stimuli videos.

Figure 4.2. Sample frames from the stimuli videos and their corresponding source videos, with CLM meshes. The pain source videos are from the UNBC-McMaster Pain Archive [113]; the others are from the MMI database [137].

We ran a second pilot study to decide which stimuli videos would be included in our main study, and to establish their ground truth expression labels. Each video was modified before being used in the pilot. A black screen with a white crosshair was added to the beginning of each video and appeared for exactly 2.5 seconds to prepare participants for the stimulus video. Then, a facial expression on a virtual character was presented for 2-4 seconds, followed by a black screen. We hosted the videos on Vimeo, and used an HTML 5 video player to remove all logos and player options. The pilot was conducted on SurveyMonkey. The source and stimuli videos had slightly different lengths, because the source videos of pain came from a different database than the videos of disgust and anger. We cut these videos to be nearly equal in length without removing informative frames from each video.

Twenty participants were recruited using Amazon MTurk, 11 female and 9 male. All participants were American, and their ages ranged from 22-58 years, with a mean age of 38.05 years. Participants who had taken part in our previous pilot were excluded from this pilot. Participants were only allowed to view each video once and were allowed two 30-second breaks during which a nature video was shown. Participants watched the 90 stimuli videos in a random order and labeled the avatar's gender and expression.

For gender labeling, we used the same scale as in Pilot 1 (an 11-point DVAS). We aimed to ensure that gender classification remained the same as in the first pilot when expressions were actually animated on the avatars. We found this was the case: Cronbach's α = 0.961, indicating high inter-rater reliability on gender labeling. Expression labels were fixed choice: anger, disgust, pain, and none of the above. This labeling approach was based on work by Tottenham et al., who found a semi-forced choice method was less strict than a forced choice method (to which Russell objects [157]), while being easier to interpret than a free-choice method [181].

We calculated the accuracy of each of our videos across our participants, and chose the three best videos of each expression, i.e., those with the highest average accuracy across our three genders. We had average accuracies of 80%, 75%, and 63.33% for pain; 80%, 71.67%, and 63.33% for anger; and 33.33%, 31.67%, and 28.33% for disgust. We were not surprised by the low detection accuracies for disgust, since it is known to be a poorly distinguishable facial expression in the literature [17, 23].
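For reference, Cronbach's α for a stimuli-by-raters matrix of gender ratings can be computed as follows. The toy ratings are made up for illustration, and treating raters as the "items" of the scale is an assumption of this sketch.

```python
import numpy as np

def cronbach_alpha(ratings):
    """Cronbach's alpha for a (stimuli x raters) matrix.

    alpha = k/(k-1) * (1 - sum of per-rater variances / variance of the row totals)
    """
    ratings = np.asarray(ratings, dtype=float)
    k = ratings.shape[1]                       # number of raters ("items")
    rater_vars = ratings.var(axis=0, ddof=1).sum()
    total_var = ratings.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - rater_vars / total_var)

# Toy example: 4 stimuli rated by 3 raters on the 11-point DVAS.
toy = [[9, 8, 9], [1, 2, 1], [5, 5, 6], [10, 9, 9]]
print(round(cronbach_alpha(toy), 3))
```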

4.2.5 Main Experiment

Following the pilot, we selected the three videos of each expression with the highest accuracy across our three avatar genders to use in our main experiment, resulting in 27 videos. Videos were prepared and presented in the same format as in our previous pilot and randomized accordingly. We recruited 50 participants using Amazon MTurk. Again, participants were eligible only if they had not participated in our previous studies. Participant ages ranged from 20-57 (mean age = 38.6 years). Participants were of mixed heritage, and had each lived in the United States for at least 17 years.

Participants were asked to label the avatar's expression in each of the 27 videos. The results from the main experiment are described in the subsequent sections. We measured accuracy (correct or incorrect) across our two independent variables (gender and expression type). We describe the statistical details of our analysis below, but first present a brief summary.

4.3 Results

4.3.1 Summary of Key Findings

Our first research question explored whether participants are able to distinguish pain from expressions of anger and disgust. Table 4.3 shows that expression is a significant predictor of accuracy. Therefore, we found participants were able to distinguish these three expressions. The results further showed that participants were more accurate in detecting pain than the two other expressions. Thus, we did not find the same accuracy ordering as Riva et al. [152]; i.e., disgust > pain > anger.

Our second research question concerned the effect of the avatar's gender on pain detection accuracy. As seen in Table 4.3, gender is not a significant predictor of accuracy, as the p-values for all three genders are greater than .05. This suggests that there is no significant relation between an avatar's gender and pain detection accuracy. Thus, we were not able to replicate the findings by Riva et al. suggesting that people are more accurate at detecting pain when it is expressed on a male face [152].

4.3.2 Regression Method

We had one dependent variable and two independent variables. The dependent variable, derived from the expression classification task, was accuracy (i.e., classification of the expressions as pain, anger, or disgust). Accuracy is based on the ground truth that we obtained from our pilot studies. We had two categorical independent variables: expression, with three levels (pain, anger, and disgust), and gender, with three levels (androgynous, male, and female). The dependent variable was analyzed using a within-subjects binary logistic regression, since the dependent variable is binary (1: accurate, 0: inaccurate).

In the following analyses, significant effects are those with p-values < .05. Table 4.1 shows the details regarding the exact number of errors each participant made in the classification of the 27 videos. In this classification, we considered an answer correct if the participant's label matched the source video label. The percentage of correct classifications was computed across each of the three expression types within each of the three genders, following a 3 (Expression: pain, anger, or disgust) × 3 (Gender: androgynous, male, or female) design.

Table 4.1 details the errors for each expression and each gender. Participants' answers were classified as either accurate or inaccurate. Overall, participants labeled 622 videos incorrectly, representing 46.07% of our responses.

The independent variables (gender and expression) were significant predictors of the dependent variable (see Table 4.2). We compared the full model, with two predictors (gender and expression), against the restricted model with just a constant factor. The results of the analysis of the full model with the two predictors (independent variables) suggest a significant effect of the set of predictors on the correct identification rate as the dependent variable.

The Wald test in Table 4.3 shows the degree to which each expression affected accuracy. While the chi-square value in Table 4.2 shows that the predictors together have a significant effect on the model, the Wald test is the significance test for each individual predictor separately. Table 4.3 shows the effect of each individual independent variable on the classification rate. The standardized Beta value represents the weight that each predictor has in the final model; a negative weight indicates a negative relation. Since the regression was run with pain as the reference value, it does not have a Beta value. SPSS considers pain as the base expression and androgynous as the base gender; thus, these columns are empty for these two predictors. The results of the Wald test suggest that disgust and the three genders could be dropped from the model for accuracy prediction. The Wald test suggests that pain by itself is a significant predictor of accuracy, W = 151.609, p < .001, and that disgust by itself is not a significant predictor of accuracy, W = .600, p > .05.

TABLE 4.1
FREQUENCIES AND PERCENTAGES OF HITS AND ERRORS IN THE MAIN STUDY

                                   Overall         Female          Male            Androgynous
Pain
  Total number of responses        450             150             150             150
  Correct answers                  303 (67.33%)    100 (66.67%)    98 (65.33%)     105 (70%)
  Judged as anger                  2 (0.44%)       0 (0%)          1 (0.67%)       1 (0.67%)
  Judged as disgust                31 (6.89%)      10 (6.67%)      7 (4.67%)       14 (9.33%)
  Judged as none of the above      114 (25.33%)    40 (26.67%)     44 (29.33%)     30 (20%)
Anger
  Total number of responses        450             150             150             150
  Correct answers                  292 (64.89%)    101 (67.33%)    95 (63.33%)     96 (64%)
  Judged as disgust                84 (18.67%)     30 (20%)        25 (16.67%)     29 (19.33%)
  Judged as pain                   44 (9.78%)      10 (6.67%)      20 (13.33%)     14 (9.33%)
  Judged as none of the above      30 (6.67%)      9 (6%)          10 (6.67%)      11 (7.33%)
Disgust
  Total number of responses        450             150             150             150
  Correct answers                  133 (29.56%)    47 (31.33%)     43 (28.67%)     43 (28.67%)
  Judged as anger                  120 (26.67%)    40 (26.67%)     42 (28%)        38 (25.33%)
  Judged as pain                   91 (20.22%)     26 (17.33%)     34 (22.67%)     31 (20.67%)
  Judged as none of the above      106 (23.56%)    37 (24.67%)     31 (20.67%)     38 (25.33%)

TABLE 4.2
OMNIBUS TESTS OF MODEL COEFFICIENTS

                    Chi-square   df   Sig.
Step 1   Step       165.646      4    0.000
         Block      165.646      4    0.000
         Model      165.646      4    0.000

TABLE 4.3
VARIABLES IN THE REGRESSION EQUATION

                         B        S.E.    Wald      df   Sig.    Exp(B)   95% C.I. for Exp(B)
                                                                          Lower      Upper
Step 1   Pain                             151.609   2    0.000
         Disgust         -0.109   0.141   0.600     1    0.438   0.897    0.680      1.182
         Anger           -1.593   0.144   122.017   1    0.000   0.203    0.153      0.270
         Androgynous                      0.759     2    0.684
         Male            0.041    0.143   0.082     1    0.775   1.042    0.787      1.378
         Female          -0.081   0.142   0.324     1    0.569   0.922    0.697      1.219
         Constant        0.737    0.130   32.118    1    0.000   2.090

Anger by itself is a significant predictor of accuracy, W = 122.017, p < .001. None of the three genders is a significant predictor of accuracy. Pain has the largest effect on accuracy prediction, followed by anger.
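A minimal sketch of a comparable analysis is shown below, assuming the responses are stored in a long-format table with one row per rating. It uses statsmodels with treatment coding, with pain and androgynous as the reference levels, mirroring the setup described above, but it ignores the repeated-measures structure for brevity; the data and variable names are hypothetical.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format data: one row per participant response, with the
# binary accuracy outcome and the two categorical predictors.
df = pd.DataFrame({
    "accuracy":   [1, 1, 0, 1,  1, 0, 1, 0,  0, 0, 1, 0],
    "expression": ["pain"] * 4 + ["anger"] * 4 + ["disgust"] * 4,
    "gender":     ["male", "female", "androgynous", "androgynous",
                   "male", "female", "androgynous", "male",
                   "male", "female", "androgynous", "female"],
})

# Binary logistic regression with treatment (dummy) coding, using pain and
# androgynous as the reference levels.
model = smf.logit(
    "accuracy ~ C(expression, Treatment(reference='pain'))"
    " + C(gender, Treatment(reference='androgynous'))",
    data=df,
).fit(disp=0)
print(model.summary())
```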

4.4 Discussion

Participants were able to distinguish pain from anger and disgust in virtual patients created using automatic realistic synthesis. Thus, we found support for our first research question. Our results support Riva et al.'s finding that participants are able to distinguish pain from anger and disgust when these expressions are displayed on a virtual avatar face. Our results do not support the claim by Riva et al. that participants are less accurate in decoding the facial expression of pain compared to anger and disgust. Our results instead reflect the opposite: participants are more accurate in decoding facial expressions of pain compared with anger and disgust. Also, participants are more accurate at detecting anger compared to disgust.

We also found support for the claim that naturalistically driven pain synthesis is comparable to FACS-animated pain synthesis. To the best of our knowledge, this is the first work on naturalistic, performance-driven pain synthesis. Riva et al. manually synthesized facial expressions on a virtual avatar and found an overall pain labeling accuracy rate of 60.4% [152]. Our pain labeling accuracy rate was 67.33%. While we cannot statistically compare these results due to variability between the two experiments, arithmetically they are encouraging. These findings suggest that our method may be used for automatic pain expression synthesis without requiring a FACS-trained animator to manually synthesize painful expressions. For practitioners and researchers in the clinical education community without such resources, this may prove beneficial.

Our results do not support the previous finding by Riva et al. that pain expression recognition is a function of the gender of the avatar displaying it [152]. We did not find any significant relation between the avatar's gender and the participants' accuracy in detecting pain, anger, or disgust. This further lends support to the idea that automatic pain expression synthesis from realistic sources may, in some cases, be preferable to FACS/animator-generated synthesis.

One limitation of this study was that the avatar model from the Source SDK had neither wrinkles nor control points around the nose area. Therefore, we could not map AU9, which is important in expressing disgust and pain [140]. Adding this action unit could help improve the detection accuracy for both disgust and pain. Another limitation was that the source videos for pain came from a realistic dataset, whereas the anger and disgust videos were acted and exaggerated. At the time this work was published, we were not aware of any realistic databases for these expressions, but in the future this would be good to explore.

Similar to Riva et al., we ran our experiments with lay participants [152]. The literature suggests that expressions of pain can be clearly recognized and discriminated by lay participants [167]. However, in the future, we are also interested in how different populations perceive painful facial expressions, for example, whether clinicians at different stages of training perceive an avatar's pain differently (see Chapter 5). Prkachin et al. showed that clinical experience with patients results in underestimating patients' pain [139]. It would be exciting to test this hypothesis more thoroughly with our avatars and explore whether interventions can be designed.

4.5 Chapter Summary

In this chapter, we presented our work on synthesizing realistic pain facial expressions on virtual avatars, and we discussed its application in clinical education. To synthesize pain, we used the performance-driven synthesis method discussed in Chapter 2, and evaluated it using the subjective evaluation approach discussed in Chapter 3. We compared our work with the study run by Riva et al., in which the researchers used key-frame synthesis to manually synthesize pain on virtual avatars [152]. Our results showed that our method can be used to automatically synthesize pain on virtual patients.

In the next chapter, we expand our work to synthesizing pain on a humanoid robot. We also describe our ongoing study of how clinicians perceive pain on robots compared with avatars.

4.6 Acknowledgment of Work Distribution

I thank Michael Gonzales for helping to create the stimuli videos for the study described in this chapter and for helping to revise the paper we published from the results. I also thank Dr. Roya Ghiassedin for her help with the analysis of the collected results.

CHAPTER 5

PAIN SYNTHESIS ON ROBOTS

5.1 Introduction

Our work focuses on making patient simulators more realistic by enabling them to convey patient-driven facial expressions. As part of our work, we synthesize patient-driven facial expressions onto patient simulators. In Chapter 4, we explained our work on synthesizing realistic painful faces on a virtual avatar and ran a study to evaluate how they were perceived [65, 120, 121]. In this chapter, we focus on synthesizing realistic pain expressions on a humanoid robot and studying differences in pain perception when it is expressed by a robot compared to an avatar. We also study how clinicians might perceive pain differently from lay participants. This research has implications for both the HRI community and the clinical community. Our research also fills a gap in the patient simulation domain, because although there is a strong literature on automatic pain recognition [8, 75, 112], there is little published work on automatic pain synthesis, particularly using patient-driven data.

5.1.1 Our Work and Contribution

Clinicians must often judge how much pain patients are suffering. However, research shows that clinicians are likely to underestimate the intensity of patients' pain [139]. Ignoring patients' emotional needs can lead to incorrect diagnostic decisions or poor health outcomes [92, 192]. Improving emotion perception in clinicians is important, especially since the literature suggests that clinicians might have lower accuracy in facial expression interpretation [63]. Furthermore, Hojat et al. [83] showed that medical students' empathy decreases over time. Emotional cues and signs of distress in patients are also frequently missed by clinicians, especially when patients are not able to verbally express their distress [72, 107, 126].

Therefore, there is a need to develop tools to improve clinical learners' skills in decoding patients' nonverbal cues. The psychology literature suggests that training can be effective in improving one's ability to detect facial expressions [27, 72]. Patient simulators can be used as a training tool to improve observers' skills in reading pained facial expressions. We are building a new generation of patient simulators with an expressive face able to convey facial expressions of pain. We wondered if the type of simulator, mannequin or virtual patient, matters for training clinicians to decode facial expressions of pain. This will be important for clinical trainers when choosing the most effective educational tool for their students in order to improve their ability in pain assessment and management.

In the previous chapter, we explained our work on synthesizing realistic pain facial expressions on virtual patients, and studied how people perceive pain compared with commonly conflated expressions (anger and disgust) [121]. In this chapter, we expand our work to synthesizing realistic pain facial expressions on a humanoid robot. We explore differences in perceiving pain expressed by a virtual patient compared with pain expressed by a humanoid robot. We also explore whether clinical background affects the intensity of pain that participants perceive from a humanoid robot and its virtual model.

For synthesis, we used the same approach as in Chapter 4. In summary, we applied a constrained local model (CLM)-based facial feature tracker to source videos from the Binghamton-Pittsburgh 4D Spontaneous Expression Database (BP4D-Spontaneous) [191]. We modeled our perceptual study on our previous work [121] described in Chapter 4, where participants classified videos of three types of synthesized facial expressions: pain, anger, and disgust. The motivation behind including anger and disgust in our study is their negative valence and threat-relevant nature, which make them reasonable expressions for comparison [121, 152].

Unlike our previous work in Chapter 4, we did not vary the avatar's gender in this study, because we previously found no significant relation between the avatar's gender and participants' accuracy in detecting facial expressions of pain. In our previous work in Chapter 4, we had two important findings. First, participants are more accurate in detecting pain compared to commonly conflated expressions (anger and disgust) when it is expressed by an avatar. Second, there is no significant relation between the avatar's gender and participants' accuracy in detecting pain.

In this chapter, we describe our work on studying how participants perceive pain expressed by a humanoid robot differently from an avatar. Additionally, while we ran our previous study with lay participants, in this work we include clinicians to study whether having a previous clinical education background affects the accuracy of pain detection and the intensity of perceived pain. Thus, we had three main research questions.

First, are people more accurate in detecting pain expressed on a humanoid robot compared to a comparable virtual avatar? While other researchers in the HRI community have explored the robot vs. avatar question with somewhat conflicting results [14, 20, 99], that work has often been conducted on zoomorphic or other non-anthropomorphic platforms, and has usually focused on just the six basic emotions. Both our application space (high-fidelity patient simulation) and our intended user population necessitate a higher-degree-of-freedom (DOF), human-like, facially expressive robot, and also necessitate the synthesis of more nuanced, varied-intensity expressions (such as pain, neurological impairment, etc.).

Second, does having a clinical background affect synthesized pain detection accuracy? The literature suggests that clinicians might have lower accuracy in overall facial expression detection compared to laypersons [63]. Additionally, other researchers have found that clinicians tend to miss emotional cues and signs of distress in patients [72, 107, 126]. Based on these findings, we predict that clinicians will be overall less accurate in detecting synthesized painful facial expressions compared to lay participants.

Third, do participants with a clinical background perceive pain intensity on robots and avatars differently than lay participants do? Some researchers have shown that clinical experience with patients results in underestimating patients' pain [45, 139, 141]. Based on these findings, we predict that the clinical trend toward an underestimation bias will also be present when rating the intensity of synthesized pain (e.g., clinicians will not be as good as laypersons).

We address these research questions through an experiment in which participants rated the expression and the intensity of synthesized facial expressions (pain, anger, and disgust) depicted on either a humanoid robot or its virtual model. We describe our methodology in Section 5.2, results in Section 4.3, and contextualize our findings in Section 6.3.

5.2 Methodology

5.2.1 Background

Facial expressions of pain are important non-verbal communication signals, particularly in healthcare [74, 139]. Communicating pain via facial expressions is important for sufferers. While self-reporting is the main method of assessing pain, there are issues with this method. The most important issue is that self-report cannot be used for children or for patients with communication challenges, such as patients with cognitive impairments or unconscious patients. In such cases, an observer must assess the patient's pain. However, there are differences between how clinicians and how patients conceptualize pain, which lead to problems when making diagnostic decisions for these patients [8, 112, 139, 186].

Prkachin et al. compared pain assessment by three methods, including self-reporting, observer judgment, and FACS-coded facial actions, among patients with shoulder injuries [140]. The results suggested that when the pain is severe, all three ratings are highly correlated. In contrast, when the pain is submaximal, observer judgment is not correlated with the other two methods. Prkachin et al. showed that observers tend to underestimate pain, and that clinical experience with patients increases this underestimation [139, 140]. This underestimation matters when critical decisions need to be made about patients. Therefore, research on non-verbal cues of pain has important medical implications, and there is a need in the clinical community to improve learners' skills in assessing pain.

Although a fair number of studies are available on automatic pain detection, there are only a few on pain facial expression synthesis on virtual avatars, and none on facial expression synthesis on physical robots. Riva et al. used keyframe synthesis to manually synthesize facial expressions of pain, anger, and disgust on a virtual avatar using FaceGen 3.1 with the help of an animator [152]. Synthesized faces were reviewed and validated by experts trained in the Facial Action Coding System (FACS). They had two main findings. First, participants were more accurate in detecting anger and disgust compared with pain. Second, participants had higher accuracy in detecting pain for male avatars.

In pioneering work described in Chapter 4, we extended the state of the art by automatically synthesizing facial expressions of pain, anger, and disgust on virtual avatars. We used a Constrained Local Model (CLM) based face tracker [12] to extract 66 feature points frame-by-frame from videos of the UNBC-McMaster Pain Archive [113] and the MMI database [137]. We mapped the extracted feature points to the control points of a virtual avatar from the Steam Source SDK. We had two findings of note. First, we found that participants had higher accuracy in detecting pain compared with anger and disgust using naturalistic performance-driven synthesis. Second, we did not find any significant relation between the avatar's gender and participants' accuracy in detecting pain. Our results showed that realistic pain synthesis on virtual avatars is comparable with manual pain synthesis, and is even arithmetically better.

There are only a few studies on pain facial expression synthesis, and all of them were done with avatars [121, 152]. Moreover, none of these studies included clinicians, even though the literature suggests that clinicians perceive pain differently from lay people. Thus, in pioneering work, we decided to explore pain facial expression synthesis on a humanoid robot. In our work, we studied realistic pain synthesis on a humanoid robot in addition to its virtual model. We included both lay participants and clinicians in our study to investigate whether having a clinical education background affects pain perception.

Our work gives new insights into developing the next generation of patient mannequins. The application of robots in healthcare is growing quickly [146, 147], and we believe that expressive robots can considerably improve the realism of patient simulation technology.

5.2.2 Overview of Our Work

Using performance-driven synthesis we mapped source videos of pain, anger, and dis- gust on a humanoid robot and its identical virtual model to answer the three aforementioned research questions. As we explained in Chapter 2, performance driven synthesis includes tracking motions from a live or recorded performer and mapping them onto a virtual avatar or a physical robot [74, 183]. Until recently, this technique had been used for a wide range of expressions, but not pain. In our previous work, we used this technique for the first time to automatically synthesize pain facial expressions on a virtual avatar [121]. In our current work, we use this technique to synthesize pain on a physical robot in addition to its identical virtual model. In our work, the source videos for pain, anger, and disgust came from the Binghamton- Pittsburgh 4D Spontaneous Expression Database (BP4D-Spontaneous) [191]. We used five videos for each expression type. Our stimuli creation process is similar to our previous work described in Chapter 4 which included three steps. First, using a Constrained Local Model (CLM) based face tracker, we extracted 66 feature points frame-by-frame from each source video. Second, for each source video we mapped the extracted feature points to servo motors of Philip K.D. [73, 77], a humanoid robot from Hanson Robotics and control points of Alyx, a virtual character from Steam Source SDK. We mapped five source videos of three expressions (pain, anger, and disgust) to two agents (PKD and avatar) resulting in 30 stimuli videos. Finally, to decide which stimuli videos should be included in our main

study, we ran a pilot, which resulted in 18 stimuli videos.

5.2.3 Source Video Acquisition

At the time of running our previous experiment with virtual avatars, described in Chapter 4, there was no realistic database for anger and disgust. Therefore, we used acted anger and disgust source videos from the MMI database. However, at the time of our current study, a new database including realistic source videos of pain, anger, and disgust had been released [191]. Therefore, for the experiment in the current chapter we used the Binghamton-Pittsburgh 4D Spontaneous Expression Database (BP4D-Spontaneous) [191]. BP4D-Spontaneous is a fully labeled database of realistic expressions [191]; Figure 5.1 shows some sample frames from it. The authors designed a series of tasks for naturally eliciting eight facial expressions (happiness, sadness, surprise, embarrassment, fear, pain, anger, and disgust) from 41 participants (23 women, 18 men). The first task was to talk to the experimenter and listen to a joke to evoke happiness or amusement. Participants watched a documentary video about a real emergency situation, which triggered sadness. Participants experienced surprise by suddenly being exposed to a loud siren. Embarrassment was triggered by playing a game in which the subjects improvised a silly song. Participants played a game with a danger of physical injury, which stimulated fear. Pain was experienced when the participants submerged their hands in ice water. Subjects were insulted by the experimenter to stimulate anger. Last, subjects smelled an unpleasant odor to evoke disgust. The researchers captured both 2D and 3D videos for each task. At the frame level, each video has metadata including manually annotated facial action units (FACS AUs), automatically tracked head poses, and 2D/3D facial features. In 3D feature tracking, 83 feature points were tracked using a method that involved fitting a multi-frame constrained 3D temporal deformable shape model (3D-TDSM). For 2D feature tracking, the researchers used a CLM-based tracker to automatically track 49 feature points frame by frame in each of the videos.

Figure 5.1. Sample frames from the BP4D database. From left to right: disgust, happiness, pain, anger.

All the 2D tracked videos were reviewed offline and labeled as "Good tracking," "Multiple errors," "Jawline off," "Occlusion," or "Face out of frame." The researchers computed a tracking score for each frame, and a frame was reported as lost-track if its score fell below a threshold. At the sequence level, each video has participants' self-report ratings and observer ratings. After each task, participants labeled the intensity of their emotions using a 6-point Likert-type scale (0-5). Participants often experienced more than one emotion for each task; however, the results showed that for all the expressions except anger, the intended emotion was the one most frequently labeled by the participants. In the subjective experiment, five naive observers were asked to watch all the videos and label each video with two expressions for all the tasks. Each observer was also asked to label their confidence in each choice using a 3-point scale indicating high, moderate, or low confidence. Their results showed that the intended emotion for each task was the one most frequently labeled by observers. Since participants' self-reports and observer ratings of each expression's intensity were not released with the database, we analyzed the labeled action units to decide which source

videos we should include in our experiment. Five action units were released with the database: AUs 6, 10, 12, 14, and 17. For the pain videos, we computed the average intensity of AU6 and AU10 across all the frames of each video and chose the five videos that had the highest average intensity across all their frames. We chose AU6 and AU10 because they are important in expressing pain [140]. Since AU10 is important in expressing disgust, we computed the average intensity of AU10 across all the frames of each disgust source video and chose the five videos that had the highest average intensity across all their frames. For expressing anger, AU4 would have been ideal to use, but it was not available in the database. Therefore, we asked two naive observers to watch the anger videos one by one and rate their perceived intensities using a Discrete Visual Analogue Scale (DVAS), which ranged from "none" to "very intense." We chose the five anger videos that had the highest average intensities across the two observers.
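As a rough illustration of this selection step, the sketch below ranks videos by their mean AU intensity; the data layout, variable names, and the example call at the bottom are illustrative assumptions rather than the database's actual format.

```python
import numpy as np

def top_k_by_au_intensity(videos, au_ids, k=5):
    """Rank videos by the mean intensity of the given AUs across all frames.

    `videos` maps a video id to per-frame AU intensity arrays, e.g.
    {"video_01": {6: np.array([...]), 10: np.array([...])}, ...}.
    """
    scores = {}
    for vid, au_tracks in videos.items():
        # Average each selected AU over every frame, then average across AUs.
        per_au_means = [np.mean(au_tracks[au]) for au in au_ids if au in au_tracks]
        scores[vid] = float(np.mean(per_au_means)) if per_au_means else 0.0
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Pain: rank by AU6 and AU10 together; disgust: rank by AU10 alone.
# pain_top5 = top_k_by_au_intensity(pain_videos, au_ids=(6, 10), k=5)
# disgust_top5 = top_k_by_au_intensity(disgust_videos, au_ids=(10,), k=5)
```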

5.2.4 Feature Extraction

For feature extraction, we used the same method described in Chapter 4. In summary, we used a CLM face tracker, which is robust to lighting variations and head-pose rotations [12].

5.2.5 Humanoid Robot

We used Philip K. Dick (PKD), a humanoid robot modeled after the famous science fiction writer Philip K. Dick, who authored many well-known novels such as The Man in the High Castle and A Scanner Darkly. The robot was built by Hanson Robotics [77], and is an expressive robot with 24 DOFs in its face, head, and neck. Its skin is made of Frubber, a flesh-like rubber that gives PKD a very realistic appearance [48, 171]. This spongy material can be stretched to mimic human-like facial expressions. PKD has 24 servo motors, which enable it to express a wide range of facial expressions, as well as head and eye movements [73]. These servo motors are connected to the skin through

tethers, which push and pull the skin to create facial expressions. Figure 5.2 (right) shows PKD's face.

5.2.6 Avatar Model Creation

To ensure a fair comparison of pain detection accuracy between the robot and the avatar, we created a similar virtual model of the PKD robot. This was to mitigate appearance difference effects. To generate the avatar, we first extracted a base avatar model, "Breen", from the Source SDK [1], which enables FACS-based virtual character animation. We then used GCFScape to extract textures from the model, and used VTFEdit to convert the textures to modifiable TARGA files (a raster graphics file format). We then had an artist use Adobe Photoshop to modify Breen's texture to look like PKD. This included enlarging the jawline, cheekbones, and chin. Figure 5.2 shows the final avatar model side-by-side with PKD.

5.2.7 Stimuli Creation and Labeling

We initially created 30 stimuli videos. We had two types of embodiment (robot and avatar), three expression types (pain, anger, and disgust), and five videos of each expression. In our work, we used the 66 CLM facial feature points that were tracked frame by frame from each of the 15 videos from the BP4D-Spontaneous database. We used these feature points as the raw features in our work. We first removed rotation, translation, and scaling based on the eye corner positions. Then, we measured the distance of each of the 66 CLM feature points to the tip of the nose as a reference point and saved the 66 distances in a vector. The tip of the nose stays static regardless of the expression being made, and in-plane rotations and translations do not affect the distances of facial points to the tip of the nose. Therefore, any change in the distance of a facial point to the tip of the nose means a change in facial expression is occurring. In each frame, we kept track of the changes each facial point had relative to the tip of the nose.

Figure 5.2. Left: the created avatar model similar to PKD; right: the PKD robot.

We then mapped these changes to servo motors of the PKD robot and control points of its virtual model. We performed this in three steps. First, for each of PKD's servo motors and each of the avatar's control points, we found a group of one or multiple CLM facial points whose movement would affect that motor or control point. For example, PKD has two motors to control each of its eyebrows: one to control the inner eyebrow and another to control the outer eyebrow. However, as Figure 5.3 shows, our face tracker tracks five points on each eyebrow (e.g., 46, 47, 48, 49, 50 for the left eyebrow). Therefore, we considered one group of points (48, 49, 50) for the left inner eyebrow and another group (46, 47, 48) for the left outer eyebrow. If a performer moves her left inner eyebrow (e.g., in an anger expression), the distance of these three facial points (48, 49, 50) to the tip of the nose (point 32) decreases. Second, for each servo motor or control point, we averaged the movements of all the points in that group to calculate a single number as the command for that motor or control point. Servo motors of PKD and control points of the avatar have a different actuation range of values than the feature points. Therefore, in the third step we created a conversion between these values. We found the minimum, maximum, and default values of each group of CLM facial points, each servo motor, and each control point. We then used linear interpolation to map movements of each group of CLM facial points to each servo motor of PKD or control point of the avatar.

Figure 5.3. The CLM-based face tracker extracts 66 facial features.

We recorded PKD and its virtual model expressing each of the 15 source videos. After generating the 30 raw stimuli videos, we processed them before running our study. The stimuli videos had different lengths because their source videos from the BP4D-Spontaneous database had different lengths. We cropped the videos to be three seconds long to ensure consistency. We added two seconds of a black screen with a white crosshair to the beginning of each video to prepare participants for the stimulus video. Then, a facial expression on PKD or the virtual character was presented for three seconds, followed by a black screen. Figure 5.4 shows example frames of the created stimuli videos.
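To make the three-step mapping concrete, here is a minimal numpy sketch; the point groups, calibration dictionary, and function names are illustrative assumptions under the description above, not our actual implementation.

```python
import numpy as np

NOSE_TIP = 32  # reference point; it stays static across expressions

# Illustrative grouping: which CLM points drive which motor / control point
# (indices follow Figure 5.3; the full mapping depends on the target face).
POINT_GROUPS = {
    "left_inner_eyebrow": [48, 49, 50],
    "left_outer_eyebrow": [46, 47, 48],
}

def nose_distances(points):
    """points: (66, 2) array of CLM coordinates with rotation, translation,
    and scale already removed using the eye corner positions."""
    return np.linalg.norm(points - points[NOSE_TIP], axis=1)

def map_frame(points, calibration):
    """Map one frame of CLM points to actuator commands.

    calibration[name] = (feat_min, feat_max, out_min, out_max), where the
    output range is in servo units or avatar control-point values.
    """
    d = nose_distances(points)
    commands = {}
    for name, group in POINT_GROUPS.items():
        feat = d[group].mean()                       # step 2: average the group
        f_lo, f_hi, o_lo, o_hi = calibration[name]
        t = np.clip((feat - f_lo) / (f_hi - f_lo), 0.0, 1.0)
        commands[name] = o_lo + t * (o_hi - o_lo)    # step 3: linear interpolation
    return commands
```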

5.2.8 Pilot Study

We ran an online pilot study to determine which stimuli videos should be included in our main study and to establish their ground truth labels, following established methods in the literature [121, 151]. Twenty participants were recruited via Amazon Mechanical Turk (MTurk; 12 female, 8 male). Participants were all native English speakers, aged between 23 and 46 years (mean age 35.6 years).

Figure 5.4. Sample frames from the stimuli videos and their corresponding source videos, with CLM meshes. From left to right: pain, anger, disgust. From top to bottom: source videos, PKD robot, and avatar.

Participants were asked to watch each video once and label the expression they perceived and its intensity. The expression labels were semi-forced choice: pain, anger, disgust, and none of the above. We employed this paradigm to align with the methodological labeling approach proposed by Tottenham et al. [181]. They suggested that a semi-forced choice method is less strict than a forced choice method [157], and that findings are easier to interpret in a semi-forced choice paradigm than in a free-choice paradigm. For labeling the intensity of the perceived expression in each video, we used a 6-point DVAS. We decided to have an even number of choices (0 to 5) for intensity because, with an odd number of choices, participants tend to default to the middle option.

Zero was associated with none and five was associated with very intense. We calculated the accuracy of each of our videos across our participants, and chose the three best videos of each expression, i.e., those with the highest average accuracy across the two embodiments (robot and avatar). The selected videos had average accuracies of 82%, 65%, and 67% for pain, 62%, 35%, and 32% for anger, and 27%, 25%, and 17% for disgust. We were not surprised by the low detection accuracies for disgust, since it is known to be a poorly distinguishable facial expression in the literature [14, 17, 23]. One reason for the low detection accuracies for anger was that the source videos came from a naturalistic dataset (BP4D) and therefore had very low intensities. Zhang et al. designed activities to naturally elicit different expressions, including anger and disgust, in their participants [191]. Therefore, the naturally elicited anger videos had very low intensity compared to an acted dataset such as the one we used in our previous work [121], and were hard to recognize. However, since our research questions were focused on pain, we decided to continue with the study and include anger and disgust in addition to pain, so that participants were not asked to watch and label only one type of expression (pain).

5.2.9 Main Experiment

Following the pilot study, we selected three videos of each expression to use in our main experiment, resulting in a total of 18 stimuli videos. The videos were prepared and presented in the same format as in our pilot and randomized accordingly. Participants were asked to select a semi-forced choice expression label (pain, anger, disgust, or none of the above) and an expression intensity (on a 6-point DVAS). Our main experiment was a video-based study. In HRI, such studies are commonly employed in order to include a larger number of participants [185] and to reach a more

diverse audience. In video-based HRI studies, participants typically view videos of human-robot interactions and answer questions about them. In our work, instead of watching a human and robot interact, participants watched a robot or its avatar model expressing pain, anger, and disgust, and were asked to label the perceived expression and its intensity. While some researchers argue that video-based studies are not suitable for answering some research questions, considering that participants do not actually interact with the robot, other HRI research findings suggest that video-based studies have results comparable to live studies [95, 97, 135, 182, 188]. For example, Kidd did not find any significant differences between subjects' ratings of personality traits for live and video-based studies of an interaction with a robotic head [95]. Furthermore, Kiesler et al. studied the subjective perception of live and recorded agents (robots and avatars) and found the results were similar [97]. The more interaction between robot and participant a study involves, the less suitable a video-based study becomes, because aspects of embodiment grow in importance [182]. Because our work is focused on how people perceive robot expressions, with a particular focus on validating our synthesis algorithms and on their effect on clinical trainees, we hoped to still gain valuable findings about visual pain perception on an RPS at this stage of our research program. Furthermore, we aimed to recruit a large number of interprofessional clinicians from across the United States, which would have been very challenging if we had wanted to bring all of them to our lab. A future experiment could replicate this work as a live study. We recruited 102 participants for our experiment: 51 clinicians and 51 non-clinicians. We recruited clinicians by sending an advertisement email to over a dozen different medical and nursing schools across the United States. The clinicians were from multidisciplinary backgrounds: 37% were registered nurses (RNs), 41% were physicians, and the rest (approximately 20%) included pharmacists, medical technicians, clinical researchers, and therapists.

Lay participants were recruited via Amazon MTurk. Participants were eligible only if they had not participated in our previous studies, were over 18, and were native English speakers. Additionally, each participant completed a demographic questionnaire at the beginning of the study. In this questionnaire, participants were asked if they had ever received a clinical education or held a clinically related job. If they answered affirmatively, they were unable to continue the study. Overall, participant ages ranged from 19 to 64 years (mean age = 33.6 years); they were of mixed heritage and were all native English speakers. In terms of gender, the clinicians were 39% male and the non-clinicians were 54% male. Each participant watched the 18 videos (6 pain, 6 anger, and 6 disgust) in a random order. Nine videos were from the robot and nine were from the avatar. Participants were asked to watch each video once and label the expression that they perceived and its intensity. Expression labels were semi-forced choice: pain, anger, disgust, and none of the above. For labeling the intensity of the perceived expression in each video, we used a 6-point DVAS (0 = none to 5 = very intense). The results from the main experiment are described in the subsequent sections. We measured accuracy (correct or incorrect) across our three independent variables (embodiment type, educational background, and expression type). We describe the statistical details of our analysis below.

5.3 Results

5.3.1 Regression Method

To answer the first and second research questions, we used binary logistic regression. We had accuracy as the single dependent variable, derived from the expression classification task (i.e., classification of the expressions as pain, anger, or disgust). Accuracy is based on the ground truth that we gained from our pilot study. We had three independent categorical variables: expression with three levels (pain, anger, and disgust), embodiment

type with two levels (robot and avatar), and clinical education background with two levels (clinician and non-clinician). The dependent variable was analyzed using a within-subjects binary logistic regression, since the dependent variable was binary (1: accurate, 0: inaccurate). In the following analyses, significant effects are those with p-values < .05. Table 5.1 shows the details regarding the exact number of errors that participants made in the classification of the 18 videos. In this classification, we considered an answer correct if the participant's label matched the source video label. The percentage of correct classifications was computed across each of the three expression types within each embodiment and each education background: 3 (Expression: pain, anger, or disgust) × 2 (Embodiment: robot or avatar) × 2 (Background: clinician or non-clinician). The three independent variables (expression, embodiment type, and clinical education background) were significant predictors of the dependent variable (see Table 5.3). We compared the full model with three predictors (expression, embodiment, and background) with the restricted model with just a constant factor. The results of the analysis of the full model with three predictors (independent variables) suggest a significant effect of the set of predictors on the correct identification rate as the dependent variable.

The Wald test in Table 5.3 shows the degree to which each predictor affected accuracy. While the chi-square value in Table 5.2 shows that the predictors together have a significant effect on the model, the Wald test is the significance test for each individual predictor. Table 5.3 shows the effect of each individual independent variable on the classification rate. The standardized Beta value represents the weight that each predictor has in the final model; a negative weight indicates a negative relation. Since the regression was run with pain as the reference value, pain does not have a Beta value. The Wald test suggests that all three expressions are significant predictors of accuracy (p < .001). Medical background is also a significant predictor of accuracy, W = 8.92, p < .05. Embodiment type is a significant predictor of accuracy as well, W = 8.33, p < .05.
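For readers who wish to reproduce this style of analysis, the snippet below shows one plausible way to fit such a model in Python with statsmodels; the file name, column names, and reference levels are assumptions, and a plain logit ignores the repeated-measures structure that our within-subjects analysis accounts for.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format data: one row per response, with columns
# accurate (0/1), expression, embodiment, and background.
df = pd.read_csv("responses.csv")

# Binary logistic regression with pain, avatar, and clinician as reference levels.
model = smf.logit(
    "accurate ~ C(expression, Treatment(reference='pain'))"
    " + C(embodiment, Treatment(reference='avatar'))"
    " + C(background, Treatment(reference='clinician'))",
    data=df,
).fit()

print(model.summary())       # coefficients (B), z statistics, p-values
print(np.exp(model.params))  # odds ratios, i.e., Exp(B)
# Note: the squared z statistic corresponds to the Wald chi-square values
# reported in Table 5.3.
```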

TABLE 5.1

FREQUENCIES AND PERCENTAGES OF HITS AND ERRORS IN THE MAIN STUDY

Overall (Clinician | Non-clinician)   Avatar (Clinician | Non-clinician)   Robot (Clinician | Non-clinician)

Pain

Total number of responses 306 306 153 153 153 153

Correct answers 158 (51.63%) 205 (66.99%) 85 (54.48%) 127 (83%) 73 (47.71%) 78 (50.98%)

Judged as anger 68 (22.22%) 15 (4.90%) 35 (22.43%) 9 (5.88%) 33 (21.56%) 6 (3.92%)

Judged as disgust 52 (16.99%) 35 (11.43%) 26 (16.66%) 10 (6.53%) 26 (16.99%) 25 (16.33%)

Judged as none of the above 28 (9.15%) 51 (16.66%) 7 (4.48%) 7 (4.57%) 21 (13.72%) 44 (28.75%)

Anger

Total number of responses 306 306 153 153 153 153

Correct answers 89 (29.08%) 104 (33.98%) 51 (33.33%) 69 (45.09%) 38 (24.83%) 35 (22.87%)

Judged as pain 89 (29.08%) 57 (18.62%) 26 (16.99%) 18 (11.76%) 63 (41.17%) 39 (25.49%)

Judged as disgust 67 (21.89%) 70 (22.87%) 34 (22.22%) 18 (11.76%) 33 (21.56%) 52 (33.98%)

Judged as none of the above 61 (19.93%) 75 (24.50%) 42 (27.45%) 48 (31.37%) 19 (12.41%) 27 (17.64%)

Disgust

Total number of responses 306 306 153 153 153 153

Correct answers 73 (23.85%) 73 (23.85%) 30 (19.60%) 18 (11.76%) 43 (28.10%) 55 (35.94%)

Judged as pain 101 (33.00%) 98 (32.02%) 53 (34.64%) 62 (40.52%) 48 (31.37%) 36 (23.52%)

Judged as anger 71 (23.20%) 47 (15.35%) 33 (21.56%) 25 (16.33%) 38 (24.83%) 22 (14.37%)

Judged as none of the above 61 (19.93%) 88 (28.75%) 37 (24.18%) 48 (31.37%) 24 (15.68%) 40 (26.14%)

TABLE 5.2

OMNIBUS TESTS OF MODEL COEFFICIENTS

                   Chi-square    df    Sig.
Step 1   Step      198.035       4     0.000
         Block     198.035       4     0.000
         Model     198.035       4     0.000

5.3.2 ANOVA Method

Our third research question explored the effect of a clinical education background on the perceived intensity of pain. To answer this question, we ran a multi-factor ANOVA. We had one dependent variable in this test, intensity, which was a scalar variable between 0 and 5. We had two categorical independent variables: clinical education background with two levels (clinician and non-clinician), and embodiment type with two levels (robot and avatar). Table 5.4 shows that a clinical background is not a significant predictor of perceived intensity. Therefore, we found no significant difference between clinicians and non-clinicians in overall pain intensity perception. However, Table 5.4 shows that embodiment type is a significant predictor of perceived pain intensity. Therefore, we found that participants tend to perceive a lower intensity of pain expressed by a robot compared to its similar avatar.
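A sketch of this kind of analysis in Python with statsmodels is shown below; the file and column names are illustrative assumptions, not our actual data files.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical data: one row per pain-video rating, with columns
# intensity (0-5), background (clinician/non-clinician), embodiment (robot/avatar).
ratings = pd.read_csv("pain_intensity_ratings.csv")

# Two-way ANOVA on perceived intensity, including the interaction term.
model = smf.ols("intensity ~ C(background) * C(embodiment)", data=ratings).fit()
print(sm.stats.anova_lm(model, typ=2))  # table of main effects and interaction
```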

TABLE 5.3

INDEPENDENT VARIABLES IN THE REGRESSION EQUATION AND THEIR INDIVIDUAL EFFECTS ON THE CLASSIFICATION TASK

                                B        S.E.    Wald      df   Sig.    Exp(B)   95% C.I. for Exp(B)
                                                                                 Lower    Upper
Step 1   Embodiment             -0.294   0.102   8.337     1    0.004   0.745    0.610    0.910
         Education background   -0.304   0.102   8.928     1    0.003   0.738    0.604    0.901
         Pain                                    172.874   2    0.000
         Anger                  1.553    0.126   151.089   1    0.000   4.727    3.690    6.056
         Disgust                0.381    0.129   8.683     1    0.003   1.464    1.136    1.886
         Constant               -0.873   0.116   56.372    1    0.000   0.418
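For reference, the Exp(B) column is simply the exponentiated coefficient, i.e., the odds ratio; for example, using the values in Table 5.3:

\[
\mathrm{Exp}(B) = e^{B}, \qquad e^{1.553} \approx 4.73 \ \text{(anger)}, \qquad e^{-0.294} \approx 0.745 \ \text{(embodiment)}.
\]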

TABLE 5.4

UNIVARIATE ANALYSIS OF VARIANCE (TESTS OF BETWEEN SUBJECTS EFFECTS)

Source                              Sum of squares   df    Mean square   F         Sig.
Corrected model                     56.660           3     18.88         11.49     0.000
Intercept                           5568.16          1     5568.16       3388.23   0.000
Embodiment                          45.02            1     45.02         27.39     0.000
Education background                3.16             1     3.16          1.92      0.16
Embodiment * Education background   8.47             1     8.47          5.15      0.02
Error                               999.17           608   1.64
Total                               6624.00          612
Corrected total                     1055.83          611

5.4 Discussion

We synthesized naturalistic painful facial expressions on a humanoid robot and its virtual model, and answered several research questions regarding their perception [124]. This work explores new applications of social robots in patient simulation, and gives some insights into how simulator type (avatar or robot) and educational background (clinician or non-clinician) may affect the perception of pain. We found support for our first research question, consistent with earlier findings in the literature that clinicians are less accurate than laypersons in decoding pain [45, 63, 72, 83, 107, 126]. Our work showed that this effect is found regardless of simulation embodiment (robot or virtual character), which suggests that even the novelty of the embodiment is not sufficient to elicit improved decoding skills. However, since training has been shown to help mitigate these clinical habituation effects, expressive, interactive RPSs may be useful additions to the medical education curriculum alongside procedural skills training. For example, rather than having infrequent and contextless patient empathy training sessions [161], clinical learners could continually practice pain decoding skills as they practice other critical care skills on patient simulators (e.g., performing physical exams, assessing patient state, etc.). Some of these applications have recently been explored in the HRI community, for example, using robots for a range of health communication interventions to empower people with facial masking or facial movement disorders (e.g., Parkinson's disease, cerebral palsy, dystonia) when interacting with their clinicians [36, 146, 149, 180]. The next chapter in fact explores masked synthesis on RPSs, and we believe these kinds of interventions could also be useful in a clinical education context to help with destigmatization and to improve provider empathy. We did not find support for our prediction that participants with a clinical education background tend to miss lower levels of pain intensity compared to laypersons. This is in contrast to findings by Prkachin et al. [45, 139, 141], who showed that clinicians tend to

vastly underestimate patient pain intensity compared to laypersons. One reason for this finding could involve the type of pain studied: prior work has focused on chronic pain perception by clinicians, whereas we employed acute pain models in our work. Further work is needed to explore this finding, preferably longitudinally and in situ in the clinic. It also may be interesting to explore whether clinicians underestimate pain intensity on expressive RPSs when a patient history is provided to the clinical learner. It is well documented that clinicians underestimate expressed pain in patients who exhibit risky lifestyle behaviors (e.g., drug use), those who come from low socio-economic strata, culturally underrepresented minorities, older adults, and women. Lok et al. [184] recently showed that clinicians can achieve self-awareness of their pain perception biases within this context, which suggests that there may be a place for RPS training programs to help mitigate these effects. We also found that all participants (clinicians and laypersons) are more accurate in detecting pain from a virtual avatar than from a similar robot, and perceive higher pain intensities on the avatar. These results lend support to some recent findings in the HRI community. For example, Bennett and Sabanovic [20] found that detection accuracy is higher for facial expressions on an avatar compared to a physical robot. Also, Li et al. [109] found that when a human lecturer giving a presentation was replaced with an expressive virtual avatar, knowledge recall was higher than when the lecturer was replaced by an expressive social robot. Bennett and Sabanovic argued that the higher detection accuracy for virtual avatars might arise because it is easier to maintain FACS fidelity on an avatar than on a physical robot [20]. This may also be true for our findings; however, we conducted dozens of small pilot studies to ensure FACS similarity and fidelity between the robot and the virtual character. It is still possible that expressivity differences existed between the two embodiments that were too subtle to detect. Additional follow-on studies are required to further explore this idea.

This finding relates to a limitation of our work, which was that PKD did not have any range of motion or wrinkles around the nose area or cheeks. Therefore, we could not map AU9 or AU10, which are both important for expressing pain and disgust [140]. (We also disabled these degrees of freedom in the virtual avatar to ensure a fair comparison.) Adding these action units to both the robot and the avatar could help improve overall detection accuracy and intensity perception. In the development of our own robot, we have used these findings to improve its design. For example, it is now clear to us that AU9, AU10, and AU4 are critical action units to have on our robot. Upon its completion, further studies can be conducted to explore the implications of these AUs, and also to replicate this experiment to see if our results differ. Another reason for the low detection accuracies of some facial expressions may be that our videos came from a naturalistic dataset (BP4D). It is well understood in the affective computing community that naturally evoked emotions overall have far lower intensities compared to acted datasets (such as the MMI corpus [137]). It is worth noting that the recognition rate itself is not the key finding with regard to the path to applicability; rather, it is the significant difference in accuracy between clinicians (51%) and non-clinicians (67%). This result provides direct support for other findings in the clinical literature that as clinicians progress through their training they lose their expression decoding abilities, which affects how well they can provide care. This research offers a new paradigm for training that may enable clinicians to overcome these decoding deficits through repeated interactions and exposure. Expressive RPSs are well suited to providing such training. Despite their high clinical relevance, commercially available RPSs do not currently support human-like face-to-face interactions with learners. This is problematic, as clinicians need to learn how to start reading patient facial expressions from their very first day of training, and continue to hone their facial expression decoding skills throughout their training. These face-to-face communication skills should not be kept separate from

other bedside training skills (e.g., taking vital signs, inserting IV lines), but rather should be well-ingrained into the simulation curriculum in a manner that maintains simulation fidelity. This work has implications for the HRI community because it explores a novel application of facially expressive robots in healthcare: using these robots as expressive patient mannequins. We also explored how synthesizing facial expressions on robotic patient simulators can make their interactions with clinical learners more realistic. Our work can help in designing more effective social robots for healthcare applications by studying how these robots are perceived by different groups, and how different factors, such as physical embodiment or educational background, can affect perception of these robots. The clinical community can benefit from this work because we are developing new training tools for clinical learners to help them improve their skills in reading patients' faces. Our results supported findings by other researchers that, compared to lay participants, clinicians have a lower accuracy in detecting pain in patients. However, to the best of our knowledge, our work is the first to replicate these findings through simulation with an expressive robot instead of a real patient. Considering that both robotic patient mannequins and virtual avatars can be used as part of a training tool for improving learners' skills in decoding visual signals of illness and pain in patients, our results can help clinical professionals understand how choosing different simulation systems (virtual or robotic) can affect the perception of expressions.

5.5 Chapter Summary

In this chapter, we discussed our work on synthesizing patient-driven facial expressions of pain, anger, and disgust on a humanoid robot and its virtual model. To synthesize pain, we used the performance-driven synthesis method discussed in Chapter 2, and evaluated it using the subjective evaluation approach discussed in Chapter 3. We found support for earlier findings in the literature that clinicians are less accurate than laypersons in decoding pain. We

also found that all participants (clinicians and laypersons) are more accurate in detecting pain from a virtual avatar than from a similar robot, and perceive higher pain intensities on the avatar. In the next chapter, I present our work on the development of an ROS module for synthesizing facial expressions on a wide range of robots and virtual avatars.

5.6 Acknowledgment of Work Distribution

I thank Sumit Das and Dan Popa from the University of Louisville for their contribution to creating the stimuli videos for the study described in this chapter. I also thank Michael Gonzales for his help in creating the avatar model for the study described in this chapter. I also would like to thank Dr. Roya Ghiassedin for providing input for the statistical methods used in Section 7.6.6 of this chapter.

CHAPTER 6

DESIGNING AND CONTROLLING AN EXPRESSIVE ROBOT

HRI researchers have used a range of facially expressive robots in their work, such as the ones shown in Figure 2.1. These robots offer a great range in their expressivity, facial degrees of freedom (DOFs), and aesthetic appearance. Because different robots have different hardware, it is challenging to develop transferable software for facial expression synthesis. Currently, one cannot reuse the code used to synthesize expressions on one robot's face on another [6]. Instead, researchers are developing their own software systems, customized to their specific robot platforms, effectively reinventing the wheel. Another challenge in the community is that many researchers need to hire animators to generate precise, realistic facial expressions for their robots. This is expensive in both cost and time, and is rather inflexible for future research. A few researchers use commercial off-the-shelf systems for synthesizing expressions on their robots, but these are typically closed source and expensive as well. Thus, there is a need in the community for an open-source software system that enables low-cost, realistic facial expression synthesis. Regardless of the number of a robot's facial DOFs, from EDDIE [168] to Geminoid F [18], the ability to easily and robustly synthesize facial expressions would be a boon to the research community. Researchers would be able to more easily implement facial expressions on a wide range of robot platforms, and focus more on exploring the nuances of expressive robots and their impact on interactions with humans, and less on laborious animation practices or the use of expensive closed-source software. In this chapter, we describe a generalized software framework for facial expression

synthesis. To aid the community, we have implemented our framework as a module in the Robot Operating System (ROS). Our synthesis method is based on performance-driven animation, which we discussed in Chapter 2. However, in addition to enabling live puppeteering or "play-back", our system also provides a basis for more advanced synthesis methods, such as shared Gaussian process latent variable models or interpolation techniques [111, 54]. We describe our approach and its implementation in Section 6.1, and its validation in both simulation and on a multi-DOF robot in Section 6.2. Our results show that our framework is robust enough to be applied to multiple types of faces, and we discuss these findings in Section 6.3.

6.1 Proposed Method

Our method is described in detail in the following sections, but briefly, our process was as follows: we designed an ROS module with five main nodes to perform performance-driven facial expression synthesis for any physical or simulated robotic face. These nodes include:

S, a sensor, capable of sensing the performer’s face (e.g., a camera)

P, a point streamer, which extracts some facial points from the sensed face

F, a feature processor, which extracts some features from the facial points coming from the point streamer

T, a translator which translates the extracted features from F to either the servo motor commands of physical platforms or the control points of a simulated head

C: C1 ... Cn, a control interface, which can be either an interface to control the animation of a virtual face or the motors on a robot. These five nodes form the core of our synthesis module; if desired, new nodes can be added to provide additional functionality.

Figure 6.1: Overview of our proposed method.

Figure 6.1 gives an overview of our proposed method. Assume one has some kind of sensor, S, which senses some information from a person's (pr) face. This information might consist of video frames, facial depth data, or the output of a marker-based or markerless tracker. pr can be either a live or recorded performer. In our general method, we are not concerned with identifying the expressions on pr's face; we are concerned with how to use those expressions to perform animation/synthesis on the given simulated/physical face. S senses pr, and we aim to map the sensed facial expressions onto the robot's face. Basically, we use a point streamer, P, to publish information from the sensed face. Any other ROS node can subscribe to the point streamer to synthesize expressions for a simulated/physical face. A feature processor, F, subscribes to the information published by the point streamer and processes it, extracting useful features from all of the facial information published by the point streamer. Then, a translator, T, translates the extracted features to control points of a physical/simulated face. Finally, a control interface

C : C1 ...Cn moves the physical/simulated face to a position which matches pr’s face.

6.1.1 ROS Implementation

Figure 6.1 depicts the required parts of our proposed method. The software in our module is responsible for three tasks: (1) obtaining input, (2) processing input, and (3) actuating motors/control points accordingly.

These responsibilities are distributed over a range of hardware components; in this case, a webcam, an Arduino board, a servo shield, and servo motors. A local computer performs all processing tasks and collects user input. The data is then passed to the control interface, C: C1 ... Cn, which can either move actuators on a physical robot or control points on a virtual face. While Figure 6.1 shows the most basic version of our system architecture, other functionality or services can be added as nodes. Below, we describe each of these nodes in detail, as well as the ROS flow of our method. S, the sensor node, is responsible for collecting and publishing the sensor's information. This node organizes the incoming information from the sensor and publishes its message to the topic /input over time. The datatype of the message that this node publishes depends on the sensor. For example, if the sensor is a camera, this node publishes all incoming camera images. Examples of possible sensors include a camera, a Kinect, or a motion capture system. This node can also publish information from pre-recorded data, such as all frames of a pre-recorded video. P, the point streamer node, subscribes to the topic /input, extracts facial points from the messages it receives, and publishes them to the topic /points. F, the feature processor node, subscribes to the topic /points and processes all the facial points published by P. F extracts useful features from these points that can be used to map the facial expressions of a person to the physical/simulated face. This node publishes feature vectors to the topic /features. T, the translator node, subscribes to the topic /features and translates the features to the DOFs available on the robot's face. Essentially, this node processes the message received from the topic /features and produces corresponding movements for each of the control points on a robotic or virtual character face. This node publishes its output to the topic /servo_commands.

C: C1 ... Cn, the control interface node, subscribes to the topic /servo_commands

and actuates the motors of a physical robot or the control points of a simulated face. We show

C: C1 ... Cn because a control interface might consist of several parts. For example, in the case of a physical robotic head, the control interface might include a microcontroller, a servo shield, etc. We show the combination of all of these pieces as a single node because they cooperate to actuate the motors. C: C1 ... Cn subscribes to the topic /servo_commands, which contains information about the exact movement for each of the control points of the robotic/simulated face. This node then produces a file readable by the robot containing the movement information and sends it to the robot.
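As a rough sketch of how one of these nodes might be wired up in rospy, the hypothetical translator node below subscribes to /features and publishes to /servo_commands; the message types, the 1000-2000 output range, and map_features_to_servos are illustrative assumptions, not our module's actual interfaces.

```python
#!/usr/bin/env python
import rospy
from std_msgs.msg import Float32MultiArray, UInt16MultiArray

def map_features_to_servos(features):
    # Placeholder for the linear mapping described in Section 6.1.2:
    # clamp each normalized feature to [0, 1] and scale it into servo units.
    return [int(1000 + 1000 * min(max(f, 0.0), 1.0)) for f in features]

class TranslatorNode:
    def __init__(self):
        rospy.init_node("translator")
        self.pub = rospy.Publisher("/servo_commands", UInt16MultiArray, queue_size=10)
        rospy.Subscriber("/features", Float32MultiArray, self.on_features)

    def on_features(self, msg):
        # Convert the incoming feature vector into one command per servo motor.
        self.pub.publish(UInt16MultiArray(data=map_features_to_servos(msg.data)))

if __name__ == "__main__":
    TranslatorNode()
    rospy.spin()
```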

6.1.2 An Example of Our Method

There are various ways to implement our ROS module. In our implementation, we used a webcam as S. We chose the CLM face tracker as P. In F, we measured the movement of each of the facial points coming from the point streamer over time. In T, we converted the features to servo commands for the physical robot and to slider movements for the simulated head. In C, we used an Arduino Uno and a Renbotic Servo Shield Rev2 for sending commands to the physical head. For the simulated faces, C generates source files that the Source SDK is capable of processing. We intended to use this implementation in two different scenarios: a physical robotic face as well as a simulated face. For the physical robot, we used our bespoke robotic head with 16 servo motors. For the simulated face, we used "Alyx", an avatar from the video game Half-Life 2 in the Steam Source SDK. We describe each subsystem in detail in the following subsections. Point streamer P: We employed a Constrained Local Model (CLM)-based face tracker as the point streamer in our example implementation. As we described in Chapter 2, CLMs are person-independent techniques for facial feature tracking similar to Active Appearance Models (AAMs), with the exception that CLMs do not require manual labeling [46]. In our work, we used an open source implementation of CLM developed by Saragih et al.

[41, 160]. We ported the code to run within ROS. In our implementation, ros_clm (the point streamer) is an ROS implementation of the CLM algorithm for face tracking. The point streamer ros_clm publishes one custom message to the topic /points; this message includes the 2D coordinates of 66 facial points and is used to stream the CLM output data to any node that subscribes to it. As shown in Figure 6.1, when the S node (webcam) receives a new image, it publishes a message containing the image data to the /input topic. The master node then takes the message and distributes it to the P node (ros_clm), because it is the only node that subscribes to the /input topic. This initiates a callback in the ros_clm node, causing it to begin processing the data, which essentially means tracking a mesh of 66 facial points over time. The ros_clm node then sends its own message on the /points topic with the 2D coordinates of the 66 facial feature points. Feature processor F: The F node subscribes to the topic /points and receives these facial points. Using the positions of the two eye corners, F removes the effects of rotation, translation, and scaling. Next, in each frame, F measures the distance of each facial point to the tip of the nose as a reference point and saves the 66 distances in a vector. The tip of the nose stays static in the transition from one facial expression to another, and if the face has any in-plane translation or rotation, the distances of facial points from the tip of the nose are not affected. Therefore, any change in the distance of a facial point relative to the tip of the nose over time means a facial expression is occurring. F publishes its calculated features to the topic /features. Translator T: The T node subscribes to the topic /features and produces appropriate commands for the servos of a physical robot or the control points of a simulated face. F keeps track of any changes in the distances of each facial point to the tip of the nose and publishes them to the /features topic.

Figure 6.2: Left: the 66 facial feature points of the CLM face tracker. Right: an example robotic eyebrow with one degree of freedom and an example robotic eyeball with two degrees of freedom.

The T node has the responsibility of mapping these features to the corresponding servo motors and servo ranges of a physical face, or to the control points of a simulated head. T performs this task in three steps. The general idea of these steps is similar to the steps Moussa et al. used to map the MPEG-4 Facial Action Parameters of a virtual avatar to a physical robot [127]. In the first step, for each of the servo motors, we found a group of one or multiple CLM facial points whose movement significantly affected the motor in question. For example, as Figure 6.2 shows, the CLM tracker tracks five feature points on the left eyebrow (22, 23, 24, 25, 26). However, the robot face shown in Figure 6.2 has only one motor for its left eyebrow. Therefore, the corresponding feature group for the robot's left eyebrow would be points 22, 23, 24, 25, and 26. T converts the movement of each group of CLM feature points to a command for the corresponding servo motor of a physical robot or control point of a simulated face. We used two examples in this chapter, one with a simulated face and one with our bespoke robot. As an example, Table 6.1 shows the corresponding group of CLM points for each of the 16 servo motors of our bespoke robot.

We averaged the movements of all of the points within a given group to compute a single number as the command for each motor/control point. To demonstrate this principle, our bespoke robot has a single motor for the right eyebrow. However, as Figure 6.2 shows, the CLM face tracker tracks five feature points on the right eyebrow. If a performer raises their right eyebrow, the distance of these five points to the tip of the nose increases. We average the movements of these five points and use that value to determine the servo command for the robot's right eyebrow. Servo motors have a different range of values than that of the feature points. Therefore, in the second step, we created a conversion between these values. The servos in our robot accept values between 1000 and 2000. To find the minimum and maximum movement of each group of points associated with each servo, we asked a performer to make a wide range of extreme facial movements while seated in front of a webcam connected to a computer running CLM. For example, we asked the performer to raise their eyebrows to their extremities, or open their mouth to its maximum. Then, we manually modified the robot's face to match the extreme expressions on the subject's face and recorded the value of each motor. This way, we found the minimum and maximum movement for each group of facial feature points as well as for each servo motor. In the last step, we mapped the minimum, maximum, and default values of the CLM facial points and the servo motors. Some servo motors had a reversed orientation with respect to the facial points; for those servos, we flipped the minimum and maximum. In order to find values for a neutral face, we measured the distance of the feature points to the tip of the nose while the subject had a neutral face. We also manually adjusted the robot's face to look neutral and recorded the servo values. Using the recorded maximum and minimum values, we applied linear mapping and interpolation to find the criteria for mapping facial distances to servo

values. These criteria are then used to translate the facial distances in each unseen incoming frame into servo values. The T node publishes the resulting set of servo values to the topic /servo_commands.
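Concretely, for a feature-point group whose calibrated distance range is $[d_{\min}, d_{\max}]$ and a servo whose range is $[s_{\min}, s_{\max}]$ (1000 to 2000 for our servos), this linear mapping can be written as

\[
s \;=\; s_{\min} + \frac{d - d_{\min}}{d_{\max} - d_{\min}}\,\bigl(s_{\max} - s_{\min}\bigr),
\]

with $s_{\min}$ and $s_{\max}$ swapped for servos whose orientation is reversed relative to the facial points.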

Control interface C: C1 ... Cn: The C node subscribes to the topic /servo_commands and sends the commands to the robot. The servo motors of our robot are controlled by an interface consisting of an Arduino Uno connected to a Renbotic Servo Shield Rev2. ROS has an interface that communicates with the Arduino through the rosserial stack [155]. By using rosserial_arduino, a subpackage of rosserial, one can add libraries to the Arduino source code to integrate Arduino-based hardware into ROS. This allows communication and data exchange between the Arduino and ROS. Our system architecture uses rosserial to publish messages containing servo motor commands to the Arduino in order to move the robot's motors. The control interface receives the desired positions for the servo motors at 24 frames per second (fps). For sending commands to the simulated face, C generates source files that the simulated face is capable of processing.

6.2 Validation

To ensure our system is robust, we performed two evaluations. First, we validated our method using a simulated face (we used “Alyx”, an avatar in the Steam Source SDK). Then, we tested our system on a bespoke robot with 16 DOFs in its face.

6.2.1 Simulation-based Evaluation

We conducted a subjective evaluation (perceptual experiment) in simulation to validate our synthesis module. This is a common method for evaluating synthesized facial expressions [18, 116]. Typically, participants observe synthesized expressions and then either answer questions about their quality or generate labels for them. By analyzing the collected answers, researchers evaluate different aspects of the expressions of their robot or virtual avatar.

TABLE 6.1

THE FACIAL PARTS ON THE ROBOT AND THEIR CORRESPONDING SERVO MOTORS AND CLM TRACKER POINTS

Facial Part Servo Motor # CLM Points

Right eyebrow 1 17,18,19,20

Left eyebrow 2 23,24,25,26

Middle eyebrow 3 21,22

Right eye 4 (x direction), 5 (y direction) 37,38,40,41

Left eye 6 (x direction), 7 (y direction) 43,44,46,47

Right inner cheek 8 49,50

Left inner cheek 9 51,52

Right outer cheek 10 49,50,51

Left outer cheek 11 51,52,53

Jaw 12 56,57,58

Right lip corner 13 48

Left lip corner 14 54

Right lower lip 15 57,58

Left lower lip 16 55,56

Method: In Chapter 4, we explained our perceptual study in detail. We extracted three source videos each of pain, anger, and disgust (nine videos in total) from the UNBC-McMaster Pain Archive [113] and the MMI database [137], and mapped them to a virtual face. Using our synthesis module, we mapped these nine facial expressions to three virtual characters from the video game Half-Life 2 in the Steam Source SDK. With three different virtual avatars, we created 27 stimuli videos overall: 3 (Expression: pain, anger, or disgust) × 3 (source videos per expression) × 3 (Gender: androgynous, male, and female). Figure 6.3 shows example frames of the created stimuli videos. In order to validate people's ability to identify expressions synthesized using our performance-driven synthesis module, we conducted an online study with 50 participants on Amazon MTurk. Participants' ages ranged from 20 to 57 (mean age = 38.6 years). They were of mixed heritage, and all had lived in the United States for at least 17 years. Participants watched the stimuli videos in randomized order and were asked to label the avatar's expression in each of the 27 videos. Results and discussion: We explained our findings in detail in Chapter 4. In summary, we found that people were able to identify expressions when expressed by a simulated face using our performance-driven synthesis module (overall accuracy: 67.33%, 64.89%, and 29.56% for pain, anger, and disgust, respectively) [121]. The low disgust accuracies are not surprising; disgust is known to be poorly distinguishable in the literature [17]. Riva et al. manually synthesized painful facial expressions on a virtual avatar with the help of facial animation experts, and found an overall pain labeling accuracy rate of 60.4% [152]. Although we did not set out to conduct a specific test comparing our findings to those of manual animation of the same expressions (cf. Riva et al.), we found that our synthesis method achieved arithmetically higher labeling accuracies for pain. These results are encouraging, and suggest that our synthesis module is effective in conveying realistic expressions. The next evaluation is to see how well it does on a robot.

Figure 6.3. Sample frames from the stimuli videos and their corresponding source videos, with CLM meshes. The pain source videos are from the UNBC-McMaster pain archive [113]; the others are from the MMI database [137].

6.2.2 Physical Robot Evaluation

To test our synthesis method with a physical robot, we used a bespoke robot with 16 facial DOFs. Evaluating facial expressions on a physical robot is more challenging than on a simulated face, because the physical embodiment changes how the synthesis is generated: moving motors on a robot in real time is a far more complex task due to the number of motors, their speed, and their range of motion. We needed to understand whether our robot's motors were moving in real time to their intended positions. Since the skin of our robot was still under development, we did not run a complete perceptual study similar to the one we ran in simulation. However, as we were testing how the control points on the robot's head moved in a side-by-side comparison to a person's face, we do not believe this was especially problematic for this evaluation. Method: We ran a basic perceptual study with 12 participants to test both the real-time nature of the system and the similarity between the expressions of a performer and the robot. We recorded videos of a human performer and of the robot mimicking the performer's face.

The human performer could not see the robot; however, facial expressions made by the performer were transferred to the robot in real time. The performer sat in front of a webcam connected to a computer. During the study, the performer was instructed to perform 10 face-section expressions, two times each (yielding a total of 20 videos). The computer instructed the performer to express each of the face-section expressions step by step. The face-section expressions were: neutral, raise eyebrows, frown, look right, look left, look up, look down, raise cheeks, open mouth, and smile. We recorded videos of both the performer and the robot mimicking the performer's face. Each video was between 3 and 5 seconds in length. We ran a basic perceptual study using side-by-side comparison, or "copy synthesis", which we described in Chapter 3. In a side-by-side comparison, one shows participants synthesized expressions on a simulated/physical face side-by-side with the performer's face, and asks them to answer some questions [3, 190]. We showed side-by-side face-section videos of the performer and the robot to participants, who viewed the videos in a randomized order. We asked participants to rate the similarity and synchrony between the performer's expressions and the robot's expressions using a 5-point Discrete Visual Analogue Scale (DVAS). A five on the scale corresponded to "similar/synchronous" and a one to "not similar/synchronous". Results and discussion: Participants were all American and students at the University of Notre Dame. Their ages ranged from 20 to 28 years old (mean age = 22 years). Eight female and four male students participated. The overall average score for similarity between the robot's and the performer's expressions was 3.85 (s.d. = 1.28). The overall average score for synchrony between the robot's and the performer's expressions was 4.30 (s.d. = 1.01). Table 6.2 reports the full results for each of the 10 face-section expressions. The relatively high overall scores for similarity and synchrony between the performer's and the robot's expressions suggest that our method can accurately map the facial expressions of a performer onto a robot in real time. However, as the table shows, we had a low average similarity

TABLE 6.2

FULL RESULTS FOR EACH OF THE 10 FACE-SECTION EXPRESSIONS

Face-section expression   Average similarity score   s.d.   Average synchrony score   s.d.

Neutral 4.12 1.07 4.16 1

Raise eyebrows 4.33 0.86 4.25 1.13

Frown 4 1.02 4.08 1.24

Look right 4.54 0.5 4.66 0.74

Look left 4.5 0.77 4.37 1.08

Look up 2.83 1.29 3.7 1.31

Look down 3.54 1.41 4.45 0.77

Raise cheeks 3.79 1.4 4.25 0.85

Open mouth 4.12 1.16 4.62 0.56

Smile 2.79 1.14 4.41 0.91

Overall 3.85 1.28 4.3 1.01

score for the "look up" and "smile" expressions. One reason might be that the CLM tracker that we used in our experiment does not accurately track vertical movements of the eyes; therefore, we could not accurately map the performer's vertical eye movements to the robot. Also, at the time of this experiment our robot still did not have skin, so its lips did not look very realistic. This might be a reason why participants did not find the robot's lip movements to be similar to the performer's lip movements.

6.3 General Discussion

In this chapter, we described a generalized solution for facial expression synthesis on robots, its implementation in ROS using performance-driven synthesis, and its successful

evaluation with a perceptual study. Our method can be used both to map facial expressions from live performers to robots and virtual characters, and to serve as a basis for more advanced animation techniques. Our approach is robust in that it neither requires nor is limited to a specific number of degrees of freedom. Because the code is abstracted as a ROS module, other researchers may later upgrade the software and extend its functionality by adding new nodes. Our work also benefits the robotics, HRI, and affective agents communities, as it does not require a FACS-trained expert or animator to synthesize facial expressions, which reduces researchers' costs and saves them significant amounts of time.

One limitation of our work was that we could not conduct a complete evaluation on a physical robot, since its skin was still under development. Once the robot's skin is completed, we will run a full perceptual test. A second limitation was that the eye-tracking capabilities of the CLM are poor, which may have caused the low similarity scores between the robot and the performer. In the future, as eye-tracking technology advances (for example, with novel wearable cameras), we look forward to conducting our evaluation again.

6.4 Chapter Summary

Robots that can convey intentionality through facial expressions are desirable in HRI, since these displays can lead to increased trust and more efficient interactions with users. Researchers have explored this domain, though in a somewhat fragmented way due to variations in robot platforms that require custom synthesis software. In this chapter, we introduced a real-time, platform-independent framework for synthesizing facial expressions on both virtual and physical faces. To the best of our knowledge, our work is the first attempt to develop an open-source, generalized, performance-driven facial expression synthesis system. In the next chapter, I discuss my work on masked synthesis and shared control.

6.4.1 Acknowledgment of Work Distribution

I thank Cory C. Hayes for his assistance in running the experiment and revising the paper published from this work [123].

CHAPTER 7

MASKED SYNTHESIS

Patient mannequins have been used for decades in the United States to train clinical learners. However, learners cannot practice face-to-face interaction with commercially available patient simulators because these simulators have static faces. Our goal is to use our robotics knowledge to make face-to-face interaction between learners and patient simulators more natural by enabling simulators to express facial cues that are important for making diagnostic decisions.

To achieve this goal, I first provided a brief background describing the motivation behind my research. Then, I proposed a new method for evaluating synthesized facial expressions on robots that incorporates both subjective and computationally-based evaluation techniques. Next, I used performance-driven synthesis to synthesize patient-driven facial expressions of pain on a humanoid robot and three virtual avatars, and studied several research questions regarding pain perception, including whether having a clinical background affects pain perception and whether the simulation embodiment (virtual avatar or physical robot) affects how participants perceive pain. I then described our work on developing a generalized ROS-based software framework for facial expression synthesis, which we used as the software of our bespoke robotic head.

In this chapter, I describe my research on developing a computational model of atypical facial expressions. We used this model to perform masked synthesis on a virtual avatar, which involves overlaying a mask model of a pathology (e.g., stroke) onto standard performance-driven synthesis.

Additionally, I present a shared control method that automates control of a robot's face while still giving the operator the option to take control in unpredictable situations. This shared control method has four modalities: manual control, live puppeteering, pre-recorded puppeteering, and masked synthesis. In this chapter, we also report an exploratory pilot study that evaluates the masked synthesis modality of our shared control system by using it to synthesize Bell's palsy on an avatar face.

7.1 Background

An estimated 127,000 Americans develop or are born with facial paralysis (FP) each year [29]. One out of every sixty people in the world will suffer from facial nerve palsy [16]. These individuals have atypical expressions (ATE), in which the left and right sides of the face are asymmetric. FP can occur for different reasons, such as stroke, Bell's palsy, Parkinson's disease, birth trauma, infection, and damage to the facial nerves [29].

Patients with atypical expressions face barriers in expressing their emotions facially, which affects the quality of their social interactions. For example, Tickle-Degnen et al. found that even when Parkinson's patients try to convey happiness, observers tend to perceive depression or deception [7, 180]. Bogart et al. [29] found that people with severe FP are judged as less happy than people with mild FP. Both lay participants and clinicians can form an inaccurate first impression of people with atypical expressions [28]. Tickle-Degnen et al. found that healthcare providers tend to form negatively biased impressions of patients who have atypical expressions [179]. These biased impressions affect the quality of care that people with atypical expressions receive [149, 153]. If a clinician has the impression that the person they are working with is depressed or uninterested, this can affect both rapport with the patient and the clinician's therapeutic decisions.

In addition, what people with atypical expressions feel is often different from what clinicians think they are feeling [170]. Accurate recognition of how patients feel is critical

to ensure that communication between clinicians and patients is effective. If a patient and a caregiver cannot communicate effectively, there is a higher chance that treatment will break down [7, 170].

Therefore, there is a desire within the healthcare community to improve cultural competency, particularly with regard to people with disabilities that cause atypical expressivity. Cultural competency is "the ability to provide care effectively for patients with diverse values, beliefs and behaviors, including tailoring delivery to meet patients' social, cultural, and linguistic needs" [98]. Simulation can be used as a training tool to improve clinicians' cultural competency in interacting with and treating patients with atypical expressions [149]. This may help to improve communication between patients and clinicians and to strengthen their interaction.

In an effort to develop an educational tool for clinical learners to improve their skills in understanding the feelings and emotions of patients with atypical expressions, we introduced the concept of masked synthesis. This involves building computational models (masks) of different pathologies, such as stroke, facial palsy, and Parkinson's disease, and transferring them onto a patient simulator's face.

Our work on masked synthesis consists of two parts: the first part focused on exploring masked synthesis itself, and the second part focused on developing a shared control system for facially expressive robots. Masked synthesis uses pre-built models of a patient's face, which can be overlaid onto standard performance-driven synthesis. In masked synthesis, a simulation operator has the option of choosing a mask. Once a mask is chosen (e.g., stroke), the prebuilt model of that pathology is applied to the simulator's face; the chosen mask is then overlaid on top of the stream of facial expressions (see Figure 7.1).

In order to support both unmasked and masked synthesis, we needed to develop a method for clinicians to control the patient simulator's face. The second part of our research on masked synthesis therefore focused on developing a shared control system for facially expressive robots.

Figure 7.1. The setting for performing masked synthesis.

Because simulation operators are not robotics experts, we needed to ensure that the control system for a simulator's face is simple and adds minimal additional responsibility. A simulation center has a setup similar to Figure 7.2. In this setup, an operator sits at a control station. The control system displays the robotic patient simulator's (RPS) physiological signals and video from cameras within the simulation room. RPS systems can have more than 1,000 physiological signals (e.g., heart rate and respiratory rate), which can be modified by the simulation operator. In current RPS systems, the operator manually changes most of the patient simulator's parameters. For example, if a student injects medicine into the simulated patient, the operator may manually change some signals, such as the heart rate.

Simulation operators need to change many parameters on-the-fly, depending entirely on what the students are doing during the simulation. Additionally, simulation operators run many different types of simulations. Because there are so many variables, it is not possible to make RPS systems fully autonomous. However, using our knowledge of engineering and computer science, we can help ease the operator's burden.

Figure 7.2. An example of a control station at a simulation center.

We need to add new functionality to the RPS interface to enable an operator to control the patient simulator's facial expressions. One simple approach would be to add one slider to the simulation interface for each of the robot's degrees of freedom. The operator could then manually adjust these sliders to synthesize different expressions on the robot; for example, if the simulated patient becomes unconscious, the operator closes the robot's eyes using the slider corresponding to the eyes. However, this would be impractical, since it is time-consuming and adds significant additional responsibility to an operator who is already performing multiple tasks to modify the patient's state.

Another approach would be to add a camera to the setup facing the operator. This camera could capture the operator's facial expressions and map them in real time to the robot's face using our performance-driven synthesis algorithm described in Chapter 5 (live puppeteering). The problem with this option is that the operator may not always be facing the camera: the operator might need to turn to look at monitors, or might be busy performing other tasks.

Ideally, a patient simulator would be controlled fully autonomously and would express the appropriate facial expressions based on feedback from the simulator's other bio-signals. For example, if the simulator's heart stops, its face would automatically

stop moving and its eyes would close to express unconsciousness. Since simulation scenarios are predefined, a fully autonomous control system may seem feasible. However, even in predefined scenarios, many unpredicted conditions can arise. For example, students might inject the wrong medicine, in which case, instead of conveying the pain expressions predicted for an injection, the patient simulator might need to express unconsciousness. Therefore, generating facial expressions for a robot's face requires a mixture of human supervision and automatic control [59]. The second part of our research on masked synthesis focused on developing a shared control system that satisfies this need.

The term "shared control" refers to sharing autonomy between humans and robots [62, 156]. For robots performing simple tasks in a controlled laboratory environment, fully autonomous control systems are often appropriate. However, for complex and unpredictable scenarios, semi-autonomous control systems, supervised or partially controlled by human operators, are a more appropriate solution [62]. In these scenarios, the presence of a human is necessary to take responsibility for making critical decisions in high-risk situations [128]. In an ideal shared control system, the user provides high-level instructions and the control system is intelligent enough to perform the tasks autonomously [62, 156]. In the literature, shared control systems have been explored in a wide variety of applications [62, 105, 156]. However, little work outside of our own has focused on shared control for facially expressive robots and avatars.

7.2 Impact

The contribution of our work is a method for modeling facial expressions of atypical faces and the use of these models to develop a new shared control model for operators of expressive robots and virtual avatars. We applied this work to a specific medical application: increasing the realism of patient simulators. Our work will enhance the understanding of different pathologies and help clinicians to

assess them effectively and better interact with patients. In particular, our work on masked synthesis can help in developing a tool for training clinicians to better understand the feelings and expressions of patients with different pathologies such as stroke, Bell's palsy, cerebral palsy, and cancer. This work not only helps improve how clinicians are trained on patient simulators, but is also transformative across a variety of other fields such as computer vision, virtual characters, biometrics, and security.

Developing a shared control model for facially expressive robots allows operators to easily and effectively control facial expressions on a patient simulator. Our shared control system automates the job of controlling a robot's face, but also gives the operator the option to take control in unpredictable situations. As in previous chapters, we focused on the application of patient simulation; however, the control system that we designed can be used in other applications such as entertainment (e.g., film animation and video games) and education (e.g., virtual learning and teaching).

Our work is also important for the greater robotics and affective computing communities because it allows researchers to explore new methods of facial expression synthesis. For example, our work can be used in teleconferencing and film animation to control and manipulate the expressions and identity of a virtual character. It can also be used in computer games to mask the identity or expressions of a user, and it has applications for biomedical studies (e.g., stress/pain monitoring and autism therapy).

7.3 Research Questions

In this work, we aimed to address two research questions:

RQ1 How can we computationally model facial characteristics of asymmetrical expressions?

RQ2 How can these asymmetric models be used by a simulation operator to effectively control an RPS system?

To address the first research question, we focused on one pathology which exhibits

asymmetrical expressions: Bell's palsy (BP). Although we focus on BP, our approach is generalizable to other pathologies such as stroke, Parkinson's disease, or cerebral palsy. To address the second research question, we designed a shared control system for a robotic head that has different levels of autonomy and modes of operation. This control system allows the operator to command the robot at a high level of abstraction, instead of sending low-level commands to control every single servo motor. We addressed these research questions by completing the following steps. RQ1:

• Step 1: Acquire videos of patients with BP and extract facial features using CLMs

• Step 2: Build computational models of facial characteristics of BP

• Step 3: Test and validate the developed mask models

RQ2:

• Step 4: Design a shared control system with different functionality modes using the mask models developed in RQ1

7.4 Acquire Videos of Patients with Stroke and BP and Extract Facial Features Using CLMs

We focused on one example pathology, BP, which affects the facial nerve, causing facial weakness and an inability to control the affected side of the face. The facial weakness usually involves the eyes, mouth, and forehead (see Figure 7.3).

To build computational models of the facial characteristics of atypical expressions, the first step was to acquire source videos of patients expressing a wide range of atypical facial expressions. To acquire realistic source videos of BP patients, one option would be to use an existing database recorded from BP patients, similar to the database we used for pain, anger, and disgust in our previous work (see Chapter 4). However, to the best of our knowledge, a database of facial expressions recorded from people with BP does not exist in the literature.

Figure 7.3. The facial weakness in BP usually involves the eyes, mouth, and forehead.

Therefore, as an alternative, we used self-recorded, publicly available videos from BP patients on YouTube. Many people who have experienced BP have recorded videos of their experience, from diagnosis to recovery. In these videos, patients convey a wide range of expressions, including raising their eyebrows, furrowing their brows, smiling, and closing their affected eye, to show how BP affects these expressions and how the weakness changes over time. Many patients have posted these expressive videos on YouTube to share their experience with others who might go through the same situation. Figure 7.4 presents some example frames from a source video downloaded from YouTube, showing how BP affects different facial movements.

We downloaded ten source videos from YouTube that present BP patients conveying the four expressions (raising the eyebrows, furrowing the brow, smiling, and closing the eye) required for assessing BP. To ensure that the people in these videos had BP, we only downloaded videos in which either the patient verbally states their BP diagnosis or the video contains a textual tag indicating BP. I screened all the source videos to include only high-quality ones in our work. Additionally, since the source videos had different lengths, we pre-processed them by cutting them to the same length and including only the parts in which patients were conveying facial expressions.

Figure 7.4. Sample frames from a source video presenting different facial expressions of a patient with Bell's palsy.

To extract facial features, we used a recently released CLM-based face tracker [11], a newer version of the tracker that I used in Chapters 4 and 5. We used this open-source CLM face tracker, implemented by Baltrusaitis et al. [11], to track 68 facial points frame-by-frame from each of the source videos. As in our prior work (Chapter 4), we asked two human judges to watch all the videos while the face tracker was running on them and to label the videos that were accurately tracked across all of their frames. We only included the well-tracked videos in our work.
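For illustration, a minimal per-frame landmark extraction loop is sketched below. It uses dlib's publicly available 68-point shape predictor as a stand-in for the CLM tracker cited above, since both produce a comparable 68-point annotation; the model file name and video path are placeholders, not artifacts from our pipeline.

import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def track_video(path):
    """Return a list of (68, 2) arrays, one per frame in which a face is found."""
    frames = []
    cap = cv2.VideoCapture(path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = detector(gray)
        if not faces:
            continue  # skip frames where no face is detected
        shape = predictor(gray, faces[0])
        pts = np.array([[p.x, p.y] for p in shape.parts()], dtype=float)
        frames.append(pts)
    cap.release()
    return frames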

7.5 Build Computational Models of Stroke and Bell’s palsy

We built a mask model for BP inspired by methods used in the virtual animation literature. In particular, we developed our model based on the work of Boker et al. [30], who performed research on manipulating facial expressions and head movements in real time during face-to-face video-conference conversations. Their experiment included two conversants who sat in separate rooms, each facing a camera. The researchers used Active Appearance Models (AAMs) to track the facial features of the first conversant. They modified the tracked facial expressions of the first conversant and used them to re-synthesize an avatar with the modified expressions. The avatar interacted with the second conversant in a video conference.

The second conversants were asked whether they noticed anything unusual during the conversation. None of the participants noticed they were talking to a computer-generated avatar, and none guessed that the expressions were manipulated. To modify the expressions, Boker et al. [30] used an AAM-based face tracker. In Active Appearance Models, the shape of the face is defined by a number of landmarks and their interconnectivity, which form a triangulated mesh [44]. In other words, the shape s is

defined by the 2D coordinates of n facial landmarks: s = [x_1, y_1, x_2, y_2, ..., x_n, y_n]^T. These facial landmarks are the vertices of the triangulated mesh. The training step of an AAM requires manually labeling facial landmarks in some frames and running Principal Component Analysis (PCA) to find the shape model of a face. After the training step, the shape model is defined as:

s = s_0 + Σ_{i=1}^{m} p_i s_i    (7.1)

s_0 is the mean shape and the s_i are the m vectors that describe the variation of the facial landmarks (mesh vertices) from the mean. The coefficients p_i are the shape parameters, which define the contribution of each shape vector s_i to the shape s. Therefore, scaling the p_i parameters will exaggerate or attenuate the facial expressions.

s = s_0 + Σ_{i=1}^{m} p_i s_i β_i    (7.2)

Each β_i is a scalar that exaggerates the corresponding expression component when it is greater than one and attenuates it when it is less than one. Boker et al. [30] used β values smaller than one to reduce the expressiveness of a conversant; they wanted to study whether changing expressiveness would affect behavior during conversation. We used a similar approach to build a mask model for BP. However, in our work, we used a CLM-based face tracker to track 68 facial points [11]. Constrained Local Models (CLMs) are similar to Active Appearance Models (AAMs) except that they do not need manual

labeling and are therefore an improvement over AAMs. Similar to Boker et al. [30], we scaled the CLM parameters using Equation 7.2. However, instead of attenuating the parameters uniformly as Boker et al. [30] did, we scaled them using the scaling parameters that we calculated for BP in the following section.
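To make the shape-parameter scaling concrete, the sketch below evaluates Equation 7.2 for a given set of β values. The mean shape, shape basis, and parameters here are placeholder inputs standing in for a trained shape model and a tracked frame; they are illustrative, not part of our implementation.

import numpy as np

def scaled_shape(s0, S, p, beta):
    """Reconstruct a face shape from a PCA shape model (Equation 7.2).

    s0   : (2n,) mean shape, [x1, y1, ..., xn, yn]
    S    : (2n, m) matrix whose columns are the shape vectors s_i
    p    : (m,) shape parameters p_i for the current expression
    beta : (m,) per-parameter scale factors (beta_i > 1 exaggerates,
           beta_i < 1 attenuates the corresponding component)
    """
    return s0 + S @ (beta * p)

# Example: attenuate every component to 70% of its original contribution.
# s0, S, and p would come from a trained shape model and a tracked frame:
#   beta = np.full(m, 0.7)
#   s = scaled_shape(s0, S, p, beta)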

7.5.1 Find the Scaling Parameters βi for Bell’s palsy

The CLM-based face tracker that we used tracks 68 facial points frame-by-frame. These 68 facial points are the vertices of a triangulated mesh. To build the mask model of a pathology (e.g., stroke or BP), we found the scaling parameters for the 68 facial feature points for that pathology. We did this in three steps, explained below.

First, we picked a well-tracked source video that includes a wide range of facial expressions expressed by a patient with the desired pathology (BP). From the chosen video, we picked 20 frames (f_1, f_2, ..., f_20) in which the person exhibits the maximum severity of the pathology.

Second, using the CLM-based face tracker, we tracked the 68 facial points in each of the 20 chosen frames. We preprocessed the feature points to remove the effects of translation, rotation, and distance to the camera. Then, using the 2D coordinates of the facial features on the unaffected side of the face, we calculated the 2D coordinates of the other side of the face, assuming that the person did not have atypical expressions. To do so, we used the tip of the nose as a reference point, since the tip of the nose stays static while transitioning from one facial expression to another. In a person without FP, each facial feature on one side of the face has the same distance to the tip of the nose as its corresponding point on the other side of the face; for example, the left eye corner and the right eye corner are equidistant from the tip of the nose. In a patient with atypical expressions, this is not true. Therefore, I used the distance from the tip of the nose to calculate the 2D coordinates of the feature points on the affected side of the face, assuming the person did not have FP.

Without loss of generality, assume that the left side of the face is affected by FP. In each frame we tracked the 2D coordinates of 68 facial features. As Figure 7.5 shows, there are 29 feature points on each side of the face and 10 feature points on the line of symmetry of the face. Feature points (0, ..., 28) are the 29 points on the right side of the face, feature points (29, ..., 38) are the 10 points on the line of symmetry, and feature points (39, ..., 67) are the 29 points on the left side of the face. I used feature point 33 (the tip of the nose) as the reference point.

After removing the effects of translation and rotation in each frame, R_x and R_y in Equation 7.3 are the two arrays of x and y coordinates of the 29 feature points on the right (unaffected) side of the face, and L_x and L_y in Equation 7.4 are the two arrays of x and y coordinates of the 29 feature points on the left (affected) side of the face.

R_x = [x_0, x_1, ..., x_{28}],  R_y = [y_0, y_1, ..., y_{28}]    (7.3)

L_x = [x_{39}, x_{40}, ..., x_{67}],  L_y = [y_{39}, y_{40}, ..., y_{67}]    (7.4)

L'_x and L'_y are the two arrays with the estimated x and y coordinates of the left (affected) side of the face, assuming the person did not have atypical expressions.

L'_x = [x'_{39}, x'_{40}, ..., x'_{67}],  L'_y = [y'_{39}, y'_{40}, ..., y'_{67}]    (7.5)

As Figure 7.5 shows:

y'_i = y_{i-39},    38 < i < 68    (7.6)

x'_i = (x_{33} - x_{i-39}) + x_{33} = 2x_{33} - x_{i-39},    38 < i < 68    (7.7)

Third, I compared the calculated coordinates of the points from the second step (L'_x and L'_y) with the actual coordinates of the feature points on the affected side of the face (L_x and L_y).

Figure 7.5. Assuming the person did not have atypical expressions, I calculated the 2D coordinates of feature points on the affected side of the face.

Dividing the actual coordinates by the calculated coordinates gave the scaling parameters for the x and y coordinates of each facial point. Therefore, the scaling parameters β_{i,x} and β_{i,y} for the x and y coordinates of each facial point on the left side of the face were calculated as:

β_{i,x} = x_i / x'_i ,   β_{i,y} = y_i / y'_i ,    38 < i < 68    (7.8)
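For concreteness, the sketch below computes these per-point scales from a single tracked frame, following Equations 7.6-7.8: the right side is mirrored across the vertical line through the nose tip to predict a symmetric left side, and the actual left-side coordinates are divided by the predicted ones. The indexing follows Figure 7.5 (points 0-28 on the right, 29-38 on the midline, 39-67 on the left, point 33 at the nose tip), and the input array is assumed to already be normalized as described above.

import numpy as np

def bp_scaling_parameters(landmarks):
    """Per-point scaling factors (beta_x, beta_y) for the affected (left) side.

    landmarks : (68, 2) array of tracked points for one frame, already
                normalized for translation, rotation, and distance to camera.
    Returns two (29,) arrays for points 39..67.
    """
    nose_x = landmarks[33, 0]
    right = landmarks[0:29]          # unaffected side, points 0..28
    left = landmarks[39:68]          # affected side, points 39..67

    # Predicted "healthy" left side: mirror the right side across the
    # vertical line through the nose tip (Equations 7.6 and 7.7).
    pred_x = 2.0 * nose_x - right[:, 0]
    pred_y = right[:, 1]

    # Equation 7.8: actual coordinates divided by calculated coordinates.
    beta_x = left[:, 0] / pred_x
    beta_y = left[:, 1] / pred_y
    return beta_x, beta_y

# In practice the betas are averaged over the 20 highest-severity frames:
#   betas = [bp_scaling_parameters(f) for f in frames]
#   beta_x = np.mean([b[0] for b in betas], axis=0)
#   beta_y = np.mean([b[1] for b in betas], axis=0)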

All of the β_i scales are one for the 29 facial points on the unaffected side of the face, and all are less than one for the 29 facial points on the affected side. This is because FP causes weakness on the affected side of the face, so the feature points on that side move less. For each feature point on the affected side of the face, we calculated its scaling parameter in each of the 20 frames and averaged the results. As the FP improves, the scaling parameters

get closer to one; however, for our work we considered the frames with the highest intensity to build the model of a pathology. Possible future work could be building models of different pathologies at different stages of the disease and with different intensities.

Additionally, applying the same method to different patients may output different scales and yield different mask models. This is because BP has some idiosyncratic facial characteristics: although every case of BP includes some degree of weakened mobility on one side of the face, the severity varies between patients. I built three models using source videos of three patients with BP; Figure 7.6 shows some expressions from these three patients. In this work we approached our research with respect to the Facial Action Coding System (FACS), which describes common characteristics of facial expressions across all human faces.

Figure 7.6. Sample frames from three patients with Bell's palsy. We used data from these patients to develop our three mask models.

In the next section I describe an exploratory study that serves as an initial evaluation of our masked synthesis module. Additionally, this study compares the three developed mask

models to explore which one creates more realistic atypical expressions.

7.6 Test and Validate the Developed Mask Models

In Chapter 3 we discussed various ways of evaluating synthesized facial expressions and explained that, from one point of view, evaluation methods can be divided into two classes: quantitative and qualitative. From another point of view, evaluation can be subjective, expert-based, or computational.

To obtain an initial evaluation of our masked synthesis module and gather feedback to further refine it, we ran a qualitative, expert-based pilot study. The experts in our study were four clinicians familiar with assessing facial paralysis in BP patients. Getting initial feedback from clinicians is valuable because it provides insights for improving the mask model and for exploring its potential as part of an educational tool for clinicians. Additionally, the future users of our masked synthesis module are clinicians, and developing an educational tool for clinicians should happen through an interdisciplinary collaboration involving medical educators and computer scientists [50, 67, 68].

Using the three BP mask models developed in Section 7.5.1, we performed masked synthesis on a virtual avatar to obtain an initial evaluation of our masked synthesis module. We were also interested in which of the three developed mask models for BP creates expressions most similar to those of real patients. In Section 7.1, we defined masked synthesis as using pre-built models of a patient's face and overlaying them onto a standard performance-driven synthesis signal. In this work, we used source videos of a performer without BP. These source videos included the basic expressions required for assessing BP, as well as natural-looking expressions that occur during interaction between a BP patient and a clinician.

Our stimuli creation process included three steps. First, we used a Constrained Local Model (CLM) based tracker to extract 68 feature points frame-by-frame from each source video [11]. Next, we used our prebuilt models of BP and overlaid them on the stream of

facial expressions (see Figure 7.1). We then mapped the masked feature points to a virtual character's control points for animation in the Steam Source SDK. For each source video, we repeated this process three times, once with each of our three developed mask models. Finally, we ran an exploratory case study with four clinicians to evaluate our mask models.
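A compressed sketch of this three-step pipeline follows. The apply_mask helper multiplies the affected-side coordinates by the averaged β scales from Section 7.5.1, while map_to_avatar is a placeholder for the retargeting onto the avatar's control points, whose details depend on the animation backend (here the Steam Source SDK) and are omitted.

import numpy as np

def apply_mask(landmarks, beta_x, beta_y):
    """Overlay a BP mask on one frame of tracked points (affected side only).

    landmarks is a (68, 2) array, assumed normalized as in Section 7.5.1;
    beta_x and beta_y are the averaged (29,) scale arrays for points 39..67.
    """
    masked = landmarks.copy()
    masked[39:68, 0] *= beta_x   # scale x coordinates of points 39..67
    masked[39:68, 1] *= beta_y   # scale y coordinates of points 39..67
    return masked

def create_stimulus(source_frames, mask, map_to_avatar):
    """Track -> mask -> retarget, one frame at a time."""
    beta_x, beta_y = mask
    for landmarks in source_frames:                       # step 1: tracked frames
        masked = apply_mask(landmarks, beta_x, beta_y)    # step 2: overlay the mask
        map_to_avatar(masked)                             # step 3: drive the avatar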

7.6.1 Source Video Acquisition

We aimed to get feedback on how realistic our mask model is when applied to a virtual character to synthesize signs of BP, and on how similar the synthesized expressions are to those of real BP patients. For this purpose, we recorded two sets of source videos: one showing the expressions required for diagnosing BP, and one showing everyday, natural expressions of patients with BP. We recorded these two sets because clinicians need to be competent in understanding and recognizing both sets of expressions in BP patients in order to provide care for them effectively.

For the first set of expressions, a performer without BP sat in front of a webcam and was instructed to perform five expressions required for assessing paralysis of different facial parts in BP patients [60]: closing the eyes, raising the eyebrows, furrowing the brow, smiling, and raising the cheeks. The performer was instructed to repeat each expression five times. This gave us five source videos, each about 10 seconds long.

For the second set of source videos, a performer without BP sat in front of a webcam and answered a list of questions that a clinician asks when diagnosing a BP patient [60]. Our goal was to record the natural expressions that occur during interaction between a BP patient and a clinician. The naturalistic video was 40 seconds long. The performer in our naturalistic video did not have BP, but we intended to apply our mask models to the performer's expressions before transferring them to a virtual avatar. Therefore, we expected the synthesized expressions to be similar to the expressions that a

clinician would perceive from a BP patient while asking the diagnostic questions. At the end of this step we had six source videos in total (five videos of diagnostic expressions and one video of natural expressions).

7.6.2 Feature Extraction

To extract facial features, we used a face tracker [11] to track 68 facial points frame-by-frame in each of the source videos. We then applied the scaling parameters from Section 7.5.1 to the extracted feature points to mask them. For each video we extracted three groups of masked features, one with each of our three developed mask models.

7.6.3 Stimuli Creation

We created 18 stimuli videos. We had six source videos (five of diagnostic expressions and one of natural-looking expressions) and three mask models. For each source video, we ran our ROS module from Chapter 6 for performance-driven synthesis, overlaid our BP mask models, and transferred the masked expressions to an avatar model. At the end of this step we had 18 stimuli videos (six source videos × three mask conditions): 15 videos (five expressions × three mask conditions) of expressions required for assessing BP, and three naturalistic videos (one source video × three mask conditions).

7.6.4 Exploratory Case Study

We needed initial feedback from experts in assessing facial paralysis to further refine and improve our mask model of BP. We recruited four clinicians (one male and three female) from the University of California, San Diego. Participants were all native English speakers and were aged between 35 and 59 years (mean age 48.7 years). Three of the clinicians were physicians (MDs) and one was a clinical nurse. These clinicians had on average 21 years of face-to-face interaction with patients and had all encountered BP patients

in their careers. We built a structured questionnaire in SurveyMonkey to show them our stimuli videos and to ask for their feedback on various aspects of our synthesis results.

At the beginning of the survey, clinicians were given instructions and a brief summary of our project. Next, they watched a video of a real BP patient from Section 7.4 performing various expressions; we showed them this video to refresh their memory of the facial expressions of patients with BP. This video was not used in developing any of the three mask models. The clinicians then answered questions in two parts.

In the first part, they watched the 15 diagnostic stimuli videos in a random order. In each video, a cross-hair was shown for three seconds, followed by a virtual avatar conveying one of the expressions required for assessing BP. Clinicians were then asked to rate the similarity of the avatar's expressions to those of real BP patients, both for the entire face and for the particular facial part involved (e.g., the eyebrows for the eyebrow-raising videos). For rating, we used a 4-point Discrete Visual Analogue Scale (DVAS), where one corresponded to "not at all similar to real patients" and four corresponded to "very similar to real patients." They could watch each video as many times as they needed. Figure 7.7 shows sample frames from the first part of the study.

In the second part of the study, participants watched the three stimuli videos of naturalistic expressions in a random order. In each video, a cross-hair was shown for three seconds, followed by a virtual avatar conveying natural expressions that would occur during interaction with a clinician. Participants were then asked to rate the similarity of the avatar's expressions to those of real Bell's palsy patients overall (entire face) and for each facial part (eyes, eyebrows, cheeks, and mouth) on the same 4-point DVAS.

Additionally, after watching each naturalistic video, they were asked to provide feedback by answering two questions: "Based on the video you just watched, what aspects did you think were the most realistic?" and "What areas did you think could be improved?"

Figure 7.7. Sample frames from the stimuli videos for the first part of the study. The upper row shows unmasked expressions and the lower row shows masked expressions. From left to right: raising cheeks, furrowing brow, closing eyes, smiling, and raising eyebrows.

After completing both parts of the study, clinicians were asked to answer two open-ended questions: "How useful could this kind of simulation tool be in clinical education?" and "Please share any additional feedback or comments."

7.6.5 Data Analysis

We ran a qualitative study in which we collected data through a structured survey to find out how clinicians perceive the atypical expressions of our avatar. We ran a qualitative pilot study because we were not looking for a 'yes' or 'no' answer to a hypothesis; rather, we were interested in exploring the perceived quality of our three mask models and identifying which parts of them we need to improve. We aimed to gather detailed descriptions of how clinicians perceive our synthesized atypical expressions and of what we need to improve in our mask models.

In our case study, we focused on a small group of clinicians (four) to produce a detailed report on how similar our mask model is to real patients and how we can improve it. We aimed to identify which

parts of our mask needed improvement before running a bigger study with a larger number of clinicians in the future. Moreover, we wanted to compare our three mask models and find out which one to use in a future study with a large group of participants.

7.6.6 Results

The overall average similarity score between the synthesized masked expressions and real patients for the diagnostic expressions was 2.66 (s.d. = 0.98). Table 7.1 reports the full results for each of the five diagnostic expressions and for the entire face. The table suggests that, overall, clinicians found closing the eyes and smiling to be the expressions most similar to real patients. The cheeks had the lowest overall similarity scores, suggesting that we need to improve our mask model around the cheeks. These similarity scores also show that, overall, the second mask created the most realistic diagnostic expressions. For all of the diagnostic expressions (closing eyes, smiling, furrowing brow, and raising cheeks) except raising the eyebrows, the second mask outperformed (or was as good as) the other two mask models. This suggests that before using the second mask model in a larger study, we need to improve it around the eyebrows.

For the naturalistic expression videos, we were only able to include ratings from three of the clinicians, as one reported that he was not able to watch the videos for the second part of the study. Table 7.2 reports the full results for the naturalistic expressions. For the entire face, and for all facial parts except the eyes, the third mask created natural expressions that were more similar to those of real BP patients. The table also suggests that, overall, clinicians found the eyes to be the part of the face most similar to real patients, and the cheeks the least similar.

TABLE 7.1

FULL RESULTS FOR EACH FACIAL PART AND THE ENTIRE FACE FOR THE DIAGNOSTIC STIMULI VIDEOS ON A 4-POINT DVAS

Facial part (expression)            Statistic                   Overall (all three masks)   Mask 1   Mask 2   Mask 3
Eyes (closing eyes)                 Average similarity score    3.00                        3.25     3.25     2.50
                                    s.d.                        0.95                        0.95     0.50     1.29
Inner-eyebrows (furrowing brow)     Average similarity score    2.75                        2.75     2.75     2.75
                                    s.d.                        1.13                        1.25     0.95     1.50
Outer-eyebrows (raising eyebrows)   Average similarity score    2.50                        2.75     2.50     2.25
                                    s.d.                        0.90                        0.95     1.29     0.50
Cheeks (raising cheeks)             Average similarity score    2.08                        1.75     2.75     1.75
                                    s.d.                        0.90                        0.95     0.95     0.50
Mouth (smiling)                     Average similarity score    3.00                        2.50     3.50     3.00
                                    s.d.                        0.85                        0.57     0.57     1.15
Overall (all facial parts)          Average similarity score    2.66                        2.60     2.95     2.45
                                    s.d.                        0.98                        0.99     0.88     1.05

TABLE 7.2

FULL RESULTS FOR SYNTHESIZED NATURALISTIC EXPRESSIONS ON A 4-POINT DVAS

Facial part             Statistic                   Overall (all three masks)   Mask 1   Mask 2   Mask 3
Eyes                    Average similarity score    3.44                        3.66     3.33     3.33
                        s.d.                        0.72                        0.57     0.57     1.15
Eyebrows                Average similarity score    2.66                        2.33     2.66     3.00
                        s.d.                        0.70                        0.57     0.57     1.00
Cheeks                  Average similarity score    2.11                        2.00     2.00     2.33
                        s.d.                        0.78                        1.00     1.00     0.57
Mouth                   Average similarity score    2.33                        2.33     2.00     2.66
                        s.d.                        0.86                        0.57     1.00     1.15
Overall (entire face)   Average similarity score    2.88                        2.66     2.66     3.33
                        s.d.                        0.60                        0.57     0.57     0.57

7.6.7 Discussion

We used descriptive statistics to describe, summarize, and highlight patterns in our data. Descriptive statistics cannot be used to draw conclusions or to confirm or disprove a hypothesis; they help us report our results and find patterns in them. For example, clinicians gave low similarity ratings to the avatar's cheeks for both the diagnostic and naturalistic expressions in our pilot study. This showed that the avatar's cheeks do not move like those of real patients and need further work. We think this happened because the control point responsible for creating wrinkles around the nose and cheeks did not have separate controls for the left and right sides of the face; therefore, we could not show asymmetry well around the cheeks. In the future, we may need to use a different avatar to create better asymmetry around the cheeks.

One of our goals in running this pilot study was to compare the three mask models we developed for BP and decide which one to use in a future study with more participants. Our results suggested using the second mask for synthesizing diagnostic expressions and the third mask for synthesizing naturalistic expressions.

Additionally, feedback from the clinicians suggested that we need to add more static asymmetry (facial drooping) to the avatar to make it more similar to real patients. One of the clinicians mentioned that we should add "puffing out the cheeks" to the diagnostic expressions, as this is useful for assessing asymmetry of the mouth; in the future, we plan to add this expression to those used in creating our mask model. Another clinician recommended adding forehead lines, as they are important for understanding the intensity of Bell's palsy and for creating a more realistic furrowed brow. Clinicians also noted that in BP patients the affected eyeball often rolls up into the head when the patient tries to blink, since the affected eyelid cannot close. We will add this effect to our synthesized expressions in future experiments.

This exploratory case study helped us find directions for further improving our mask model before using it in a study with a large group of participants. We wanted

to explore the strengths and weaknesses of our mask model for each facial part and gather recommendations on how to improve it. Overall, we found that smiling and closing the eyes were the most realistic expressions, and that raising the cheeks needed the most improvement. This told us where to focus when improving our mask model for a future experiment.

7.7 Develop a Shared Control System with Different Functionality Modes Using the Models from RQ1

We designed a shared control system that allows operators either to directly puppeteer the facial expressions of an RPS system (unmasked synthesis) or to use the models generated in RQ1 to simulate different pathologies (masked synthesis). This shared control system also allows an operator to intervene and manually change an expression if an unpredicted situation arises during a simulation (manual control). This control system makes simulation in healthcare significantly more natural while adding minimal additional responsibility for the operator. The proposed system has four modes of operation:

• Manual control

• Live puppeteering

• Pre-recorded puppeteering

• Masked synthesis

Below, we explain each of the modalities in detail.

7.7.1 Manual Control

In the manual control mode, the operator can directly control each motor of the robot: each motor has a slider that the operator can move manually. For example, the operator can manually close the simulator's eyes or open its mouth. Figure 7.8 shows an example interface for a manual control mode.

Figure 7.8. An example of a manual control interface for a virtual avatar from the Steam Source SDK.

The operator can manually synthesize a facial expression on the robot using this mode. While the manual mode is not practical for synthesizing complicated expressions and pathologies, such as boredom or pain, it is useful for synthesizing very simple facial movements, such as closing the eyes or opening the mouth. This mode also allows the operator to intervene and override an expression if an unpredicted situation occurs. For example, if a student decides to give the patient simulator medicine that was not in the pre-defined scenario, the operator can manually open the simulator's mouth to receive the medicine.

7.7.2 Live Puppeteering

If the operator selects the live puppeteering mode, the camera facing the operator turns on and the simulator mimics the operator's facial expressions in real time, using our performance-driven synthesis method described in Chapter 4. For example, if a student asks the simulator a question, the operator can answer it, and her mouth movements will be transferred to the robot in real time; for the simulator to speak, the operator can speak via a microphone. Figure 7.9 shows our performance-driven synthesis algorithm for live puppeteering.

Figure 7.9. Live puppeteering algorithm in our shared control system.

7.7.3 Pre-recorded Puppeteering

The operator also has the option of mapping facial expressions from a pre-recorded video onto the simulator's face (pre-recorded puppeteering). This mode also uses our performance-driven synthesis from Chapters 4 and 5. For example, if the operator wants to map realistic, dynamic facial expressions of pain onto the robot, she can select a pre-recorded video of a patient in pain. While pre-recorded puppeteering is running, the operator can take care of other tasks and does not need to focus on controlling the robot's face.

7.7.4 Masked Synthesis

Masked synthesis involves transferring the computational models of different pathologies from RQ1 onto a robot's face. Once a mask is chosen, the prebuilt model of that pathology (e.g., Bell's palsy) is applied to the simulator's face. If the masking option is on, the operator can choose from a predefined list of pathologies. The operator has the option

of turning a mask model on and off at any point during the simulation. For example, if the live or pre-recorded puppeteering mode is running and the operator turns on the BP mask, the simulator will show signs of BP while still mimicking the facial expressions of the live or recorded performer. In this case, the CLM tracker tracks 68 facial points in each frame, the scaling parameters found in Section 7.5.1 are applied to the CLM feature points, and the scaled feature points are mapped onto the robot. Figure 7.1 shows masked synthesis with the live puppeteering mode running. If the live or pre-recorded puppeteering mode is off and a mask is chosen, the scaling parameters found in Section 7.5.1 are applied to the CLM feature points of a neutral (static) face, which is available in the control interface. The interface of a shared control system should have a setting to turn masking on and off. In addition to our developed mask model for BP, the algorithms we developed can be used to model other pathologies and add them to the list of masking options.
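The sketch below shows how the modes and the mask toggle described in this section could fit together in a single per-frame control cycle. The mode names follow this chapter; masked synthesis is represented as a toggle that can overlay live puppeteering, pre-recorded puppeteering, or a neutral face. The helper callables (track_operator_frame, play_recorded_frame, neutral_face, apply_mask, send_to_robot) are hypothetical placeholders for the tracker, video playback, masking, and robot interfaces, not names from our actual system.

from enum import Enum, auto

class Mode(Enum):
    MANUAL = auto()
    LIVE_PUPPETEERING = auto()
    PRERECORDED_PUPPETEERING = auto()
    NEUTRAL = auto()   # no puppeteering source; a mask alone drives the face

def update(mode, mask, slider_values,
           track_operator_frame, play_recorded_frame, neutral_face,
           apply_mask, send_to_robot):
    """One control cycle of the shared control system (sketch)."""
    if mode is Mode.MANUAL:
        # The operator drives each degree of freedom directly via sliders.
        send_to_robot(slider_values)
        return

    if mode is Mode.LIVE_PUPPETEERING:
        landmarks = track_operator_frame()     # CLM points from the operator camera
    elif mode is Mode.PRERECORDED_PUPPETEERING:
        landmarks = play_recorded_frame()      # CLM points from a stored patient video
    else:
        landmarks = neutral_face()             # static neutral face

    if mask is not None:                       # masked synthesis: overlay the pathology
        landmarks = apply_mask(landmarks, *mask)

    send_to_robot(landmarks)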

7.8 Chapter Summary

This chapter described our method for computationally modeling pathologies that cause atypical expressions. These models are useful not only for developing a new facial expression synthesis method for robots (masked synthesis) but also as part of an educational tool to improve learners' skills in understanding and assessing the expressions of patients with different pathologies. Using our method for masked synthesis, we synthesized BP on a virtual avatar and ran an exploratory pilot study with four clinicians as an initial evaluation of our mask model. We will use the feedback gathered from the clinicians to improve our mask model before running a larger study. Additionally, we used our mask model to design a shared control system for controlling an expressive robotic face. The proposed shared control allows the operator to select different modes of functionality with different levels of autonomy.

We have been designing a new type of physical patient simulator with a wider range

of expressivity. This chapter presented our work on developing a control system for this robotic head. The future users of our robotic head are simulation operators who are not robotics experts and who need to perform multiple tasks while running a simulation. Therefore, we designed a control system that does not require a high cognitive load while making the simulation more realistic. One avenue of future work is designing a control interface that allows users to choose any of the modalities in our shared control system. Such an interface should have a menu for users to select the desired modality, and the operator should be able to receive visual feedback from the robot's face by seeing it in the control interface via the camera in the simulation room.

7.9 Acknowledgment of Work Distribution

I thank Margot Hughan for her contribution to create the avatar videos for this project.

CHAPTER 8

CONCLUSION

This chapter discusses the main contributions of this dissertation for the human-robot interaction and healthcare simulation communities. Additionally, it outlines open research questions discovered during the course of my research that should be addressed when using facially expressive robots to improve realism in patient simulation.

8.1 Contributions

8.1.1 Developed a Method for Evaluating Facial Expression Synthesis Algorithms on Robots and Virtual Avatars

Chapter 3 focused on developing a method for evaluating facial expression synthesis algorithms on robots and virtual avatars. Appropriate evaluation is needed to ensure that the expressions synthesized using our algorithms are perceived as intended. Even though many methods have been proposed in the HRI literature for synthesizing facial expressions on robots and avatars, less work has been done on developing a standard method for evaluating the expressions these algorithms synthesize. Currently, subjective evaluation and computationally-based evaluation are the two most commonly used methods for evaluating synthesized expressions; in Chapter 3, we discussed the strengths and weaknesses of both.

Subjective evaluation includes user surveys, which provide valuable information about how the synthesized expressions are perceived by humans. However, there are many

different methodological issues that researchers should carefully consider when performing a subjective evaluation. For example, the average age of participants, their educational level, their cultural background, their native language, and the gender ratio of participants can all affect the results of an evaluation.

In computationally-based evaluation, synthesized expressions are quantitatively compared with predefined computer models. A fully computational evaluation model would eliminate the challenges of subjective evaluation. However, developing an accurate computational model for each facial part is very complicated. Even though some researchers have worked on developing computational models for some facial parts, these models may not generalize to different robot platforms, since each robot has its own characteristics and DOFs.

We created a new model that combines the advantages of both subjective and computationally-based evaluation methods. Our proposed method uses subjective studies to establish a baseline against which all methods are compared: the best expression that a human can perceive from that robot, given all its characteristics and limitations. We then computationally compare the synthesized expression with the baseline produced by the subjective study. Our evaluation model is one of the first attempts toward designing a standard method for evaluating facial expression synthesis algorithms.

Designing a standardized method for evaluating synthesized expressions on robots is important for ensuring that robots are capable of expressing appropriate expressions that humans can understand. Our work contributes to HRI because it highlights the need to carefully consider multiple aspects and challenges when choosing an evaluation method.

8.1.2 Developed an ROS-based Module for Performing Facial Expression Synthesis on Robots and Virtual Avatars

HRI researchers have been using a wide range of social robots in their work. These robots have different hardware and different facial DOFs, which makes it challenging to

develop facial expression synthesis software that is transferable to other robots. Therefore, researchers have mostly developed their own software systems, customized to a specific robot.

Chapter 6 described my development of a platform-independent framework for synthesizing facial expressions on robots and avatars. This module, which also serves as the software of our bespoke robotic head, can be used to map facial expressions from live or recorded performers to robots and avatars. Using ROS as a software abstraction, other researchers can upgrade the software or extend its functionality.

The work described in this dissertation contributes to the HRI and robotics communities by being one of the first attempts to develop a platform-independent framework for performance-driven synthesis. This work also benefits the computer graphics and affective computing communities by reducing the time it takes to create new synthesized expressions.

8.1.3 Synthesized Patient Driven Facial Expressions on Patient Simulators

My research focused on one application of facially expressive robots: increasing realism in patient simulation by enabling robotic patient simulators to express a wide range of expressions on their faces. We worked on building a humanoid robot as a next generation of patient simulators. While finishing the hardware of our robot, we began to test our algorithms in simulation with virtual avatars and with another humanoid robot.

We introduced a new method for patient-driven pain synthesis and tested how well synthesized expressions of pain were perceived using a virtual patient simulator. This study replicated an experiment by Riva et al. [152], who used manual Facial Action Coding System (FACS) animation to synthesize pain on virtual avatars. We found that participants were able to accurately distinguish pain from commonly conflated expressions (anger and disgust), and also showed that automatic, patient-driven pain synthesis is a viable alternative to laborious key-frame animation techniques.

Our work contributed to the HRI and clinical communities by supporting the notion that patient-driven pain synthesis is comparable to FACS-animated pain synthesis. To the best of our knowledge, this was the first work to use performance-driven synthesis to map pain expressions from real patients onto avatars. Our results also showed that automatic pain synthesis can yield numerically higher detection accuracy than manual synthesis by FACS-animation experts. This finding suggests that our method can be used as part of a training tool for clinicians to improve their pain-reading skills, which, according to the literature, are in need of attention.

8.1.4 Investigated Usage of Facially Expressive Robots to Calibrate Clinical Pain Perception

Our research introduced a novel application of facially expressive social robots in healthcare and explored their usage within a clinical experimental context. Current commercially available RPSs, the most commonly used humanoid robots worldwide, are substantially limited in their usability and fidelity because they lack one of the most important clinical interaction and diagnostic tools: an expressive face [124]. Using automatic facial expression synthesis techniques, we synthesized pain on a humanoid robot and on a comparable virtual avatar.

We conducted an experiment with a large group of clinicians and laypersons to explore differences in pain perception across the two groups, and studied the effects of embodiment (robot or avatar) on pain perception. Our results suggested that clinicians have lower overall accuracy in detecting synthesized pain than lay participants, showing that there is a need to improve clinical learners' skills in decoding expressions. We also found that all participants were overall less accurate in detecting pain from a humanoid robot than from a comparable virtual avatar, supporting other recent findings in the HRI community.

This work contributes to the HRI and clinical communities by revealing new insights

122 into the use of RPSs as a training tool for calibrating clinicians’ pain detection skills. Our work also gives some insights on how simulator type (avatar or robot) or educational back- ground (clinician or non-clinician) may affect perception of pain. We found support for earlier findings in the literature, that clinicians are less accurate than laypersons in decod- ing pain. Our work showed that this effect is found regardless of simulation embodiment (robot or virtual character). However, since training has been shown to help improve one’s ability in decoding facial expressions, expressive, interactive RPSs may be useful in the medical education curriculum to improve clinicians’ skills in decoding pain facial expres- sions [124].

8.1.5 Created a Computational Model of Atypical Facial Expressions

We developed a computational model of atypical facial expressions and used this model to perform masked synthesis on a virtual avatar. We also used our mask model to build a shared control system for controlling a robotic head. In an exploratory pilot study, we used our masked synthesis module to synthesize BP on a virtual avatar, and we evaluated our mask model with the help of two clinicians familiar with diagnosing BP.

Our work on atypical expressions contributes to enhancing the understanding of different pathologies, and helps clinicians assess them effectively and better interact with patients. In particular, masked synthesis can help develop a tool for training clinicians to better understand the feelings and expressions of patients with different pathologies such as stroke, Bell's palsy, cerebral palsy, and Parkinson's disease. We believe these kinds of interventions could also be useful in a clinical education context to help with destigmatization and to improve provider empathy.
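One way to picture masked synthesis is as an elementwise attenuation of a typical expression, so that control channels on the affected side of the face move less than on the unaffected side. The sketch below uses a hypothetical channel layout and hand-picked mask values chosen purely for illustration; it is not our clinically derived model.

    # Minimal sketch of the masking idea: an elementwise attenuation vector
    # scales a typical expression before rendering, so channels on the affected
    # side of the face move less. Channel names and values are illustrative.
    TYPICAL_SMILE = {
        "lipCorner_L": 0.8, "lipCorner_R": 0.8,
        "cheekRaiser_L": 0.6, "cheekRaiser_R": 0.6,
        "browRaiser_L": 0.2, "browRaiser_R": 0.2,
    }

    # Mask for a hypothetical right-sided facial palsy: right-side channels are
    # strongly attenuated, left-side channels pass through unchanged.
    RIGHT_PALSY_MASK = {
        "lipCorner_L": 1.0, "lipCorner_R": 0.1,
        "cheekRaiser_L": 1.0, "cheekRaiser_R": 0.1,
        "browRaiser_L": 1.0, "browRaiser_R": 0.0,
    }

    def apply_mask(expression, mask):
        """Attenuate each expression channel by its mask weight."""
        return {ch: value * mask.get(ch, 1.0) for ch, value in expression.items()}

    print(apply_mask(TYPICAL_SMILE, RIGHT_PALSY_MASK))

In our work, the mask itself comes from modeling patients' atypical expressions (Chapter 7); the hand-picked values above stand in for that model only to show where it plugs in.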

8.2 Future Work

8.2.1 Design a Control Interface for Our Shared Control Model

In my work, I developed a shared control model for controlling a robotic head. A next step for this research is to design a control interface that allows users to choose any of the four modalities (manual control, live puppeteering, pre-recorded puppeteering, and masked synthesis) in our shared control model described in Chapter 7. This control interface should have a menu for users to select the desired modality; a sketch of one possible selector appears at the end of this section. The operator should also receive visual feedback from the robot's face by seeing it in the control interface via the camera in the simulation room.

Furthermore, a range of sensing modalities and other electronic components may be placed in the robot or in the simulation room to make the RPS more human-like and responsive. For example, a heating element can be used to make the robot feel warm to the touch to simulate a fever, or a camera could be used to sense what actions learners are taking and make the robot react accordingly. All of these modalities would need to be carefully integrated into the operator's control system and the general operation of the RPS.
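As a rough illustration of how the interface might dispatch among the four modalities, the sketch below wires a menu selection to placeholder handlers. The handler names and their behavior are hypothetical; only the modality names come from Chapter 7.

    # Sketch of an operator-facing modality selector for the shared control
    # model. Handlers are illustrative placeholders, not our implementation.
    from enum import Enum

    class Modality(Enum):
        MANUAL = "manual control"
        LIVE = "live puppeteering"
        RECORDED = "pre-recorded puppeteering"
        MASKED = "masked synthesis"

    def handle_manual():   print("sliders -> direct servo commands")
    def handle_live():     print("tracker stream -> retargeting -> servos")
    def handle_recorded(): print("stored performance -> retargeting -> servos")
    def handle_masked():   print("tracker or recording -> mask model -> servos")

    DISPATCH = {
        Modality.MANUAL: handle_manual,
        Modality.LIVE: handle_live,
        Modality.RECORDED: handle_recorded,
        Modality.MASKED: handle_masked,
    }

    def on_menu_selection(choice: Modality):
        """Called by the interface when the operator picks a modality."""
        DISPATCH[choice]()

    on_menu_selection(Modality.MASKED)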

8.2.2 Conduct Additional Experiments to Evaluate Masked Synthesis and Shared Control Model

Chapter 7 presented the development of a mask model for atypical expressions. We used our masked synthesis method to synthesize BP on a virtual avatar. The next step of this project will involve an iterative research process to further test, validate, and refine our developed model of atypical expressions. Additionally, more mask models can be developed from a larger group of patients with different pathologies (e.g., stroke, Parkinson's disease).

We ran an exploratory case study with four clinicians to evaluate our mask model and received feedback for improving it. This feedback should be used to improve the mask model. Then, an experiment with a larger group of participants can be run to compare how clinicians perceive masked expressions on an avatar or a robot compared to videos of real patients.

Additionally, the literature suggests that both clinicians and lay participants can have a negatively biased first impression of people with masked expressions. What patients with atypical expressions feel is often different from what clinicians think they are feeling. Future work may include exploring whether practicing in simulation with a robot or avatar with masked expressions can help clinicians better understand what patients with masked expressions feel.

8.2.3 Design New Clinical Educational Paradigms That Include Practicing Face-to-Face Communication Skills Using Our Bespoke Robotic Head

One way of continuing this work is to design new educational paradigms for clinicians that include practicing face-to-face communication skills using our bespoke robotic head. We have worked on building a bespoke robotic head as a next-generation robotic patient simulator, able to convey a wide range of expressions and pathologies on its face. This robot has 16 DOFs and costs significantly less than commercially available humanoid heads. Our robot can be used in research areas such as human-robot interaction, visual speech synthesis, and simulation in healthcare. We ran some initial tests with our robot, including a pilot study in Chapter 6, but we need to run more experiments to fully understand the potential usage and effectiveness of our robot as an expressive patient simulator.

Our findings in Chapter 5 suggest that clinicians have a lower accuracy than laypersons in detecting facial expressions of pain in patients. The literature suggests that training can be effective in improving one's ability to decode facial expressions [27, 72]. Therefore, developing new clinical educational paradigms that involve practicing face-to-face communication skills can help clinicians achieve greater competency at decoding and understanding patients' feelings and expressions.

Most of the work described in this dissertation involved studying clinicians outside of their daily work environment. This revealed valuable insights on using facially expressive robots in patient simulation. However, a number of factors were controlled in our studies that might not hold in clinicians' daily work environment in hospitals [66]. For example, there were no distractions during our experiments, and clinicians focused on performing only one task after being given detailed instructions. To increase the ecological validity of our findings, the next phase of this work would require taking our robotic head to hospital simulation centers and studying clinicians' interactions with it in their daily work environment. This will provide valuable insights for designing interactive patient simulators that better meet clinicians' needs and the constraints of their workspace.

8.2.4 Synthesize Speech for Our Bespoke Robotic Head

This dissertation focused mainly on using facially expressive robots to improve realism in patient simulation. However, there are other aspects of RPS systems that can be improved as well to make simulation in healthcare more realistic. For example, in the current setup of a simulation center, an operator sits at a control station and controls an RPS's physiological signals in addition to speaking on its behalf via a microphone. Speaking on behalf of an RPS not only adds to the simulation operator's workload, but also makes the simulation less realistic. One future project could be developing a speech recognition and synthesis system for RPSs to make them more realistic and autonomous.

8.2.5 Improving Clinical Educators’ Cultural Competency Using Simulation with Our Bespoke Robotic Head

Clinicians are required to be culturally competent. Cultural competency is “the ability to provide care effectively for patients with diverse values, beliefs and behaviors, including tailoring delivery to meet patients’ social, cultural, and linguistic needs” [98]. Chapter 7 described our work on masked synthesis, which can be used as part of a training tool for improving clinicians' cultural competency with regard to people with disabilities that cause atypical expressivity.

Future work may include designing an experiment with our robotic head to examine how providing patient history can impact clinicians' perceptions of a patient's expressions. Additionally, we can test how visual cues that relate to a patient's background or identity (e.g., gender, age, or ethnicity) affect clinicians' perception of their expressions. In the design of our bespoke robotic head, we have developed a skin attachment mechanism that makes it possible to easily change the skin, and we have created various skins to simulate different ages, genders, and ethnicities. By replacing the robot's skin, we can design a range of experiments to help clinicians recognize their biases. Identifying these biases through experiments can inform the design of future simulation lessons that help clinicians become culturally competent.

8.3 Open Questions

8.3.1 Is Clinicians’ Lower Accuracy in Overall Facial Expression Detection a Result of the Decline in Empathy or Practicing with Inexpressive Patient Simulators?

Some researchers have found that clinicians tend to miss emotional cues and signs of distress in patients [72, 107, 126]. Our results in Chapter 5 also found support for earlier findings in the literature that clinicians are less accurate than laypersons in decoding expressions [63, 83]. Some researchers have suggested that this is due to the decline in empathy as clinicians progress through their training [83]. Being overexposed to patients over time might cause clinicians to become less responsive to visual cues and expressions in patients. Years of training with inexpressive patient simulators might also be a reason for clinicians' lower accuracy in decoding expressions.

Each year, thousands of clinicians in the US are trained on inexpressive patient simulators [143]. Simulation has become a core part of the clinical education curriculum to let learners safely practice various skills without the fear of harming real patients. However, practicing with patient simulators with static faces might gradually cause clinicians to fall into the poor habit of not paying attention to a patient's face, and decrease their ability to decode visual signs of pain and distress. The research described in this dissertation focused on using our robotics knowledge to build more expressive patient simulators able to have more realistic interactions with learners. Moving forward, it will be useful for the research community to explore and test the underlying causes of clinicians' lower accuracy in decoding expressions. Further research is required to test each potential hypothesis and to explore whether practicing with expressive patient simulators can improve clinicians' skills in decoding facial expressions in patients.

8.3.2 What Is the Long-Term Effect of Practicing with Expressive Patient Simulators on Patient Outcomes and Preventable Harm?

The research in this dissertation covers several methods for synthesizing patient-driven expressions and pathologies on patient simulators. Based on our findings and the research of others in the literature, we believe that practicing with facially expressive RPSs can improve clinicians' interactions with patients and therefore improve patient outcomes. Further work is needed to understand the long-term effect of practicing with expressive patient simulators on improving patient outcomes and reducing preventable medical errors. This will require a wide range of studies at several institutions, where data can be collected to assess clinicians' skills in face-to-face interaction with patients after practicing with expressive RPSs. Additionally, the collected data can be used to measure whether there is any change in patient outcomes when clinicians practice with expressive RPSs.

8.3.3 Is It Possible to Make Robotic Patient Simulators Fully Autonomous?

Another question that arose during my research is the possibility of making RPSs fully autonomous. In Chapter 7 we discussed that in the current setup of simulation centers, the operator who sits at the control station is in charge of manually changing most of the patient simulator's parameters and even speaking on its behalf. Simulation operators need to change many parameters in real time based on what the students are doing in the simulation. Ideally, we would want a patient simulator to be fully autonomous, so that it can sense what students are doing and show the correct facial expressions and bio-signals. Designing a fully autonomous system for RPSs would be very beneficial for the clinical and HRI communities. However, designing such a system would be very challenging, considering that even in pre-defined simulation scenarios many unpredicted events can happen. Developing such an autonomous system requires researchers to account for the many possible variations that can occur during a simulation.

8.4 Closing Remarks

This chapter summarizes the main contributions of my research, with a focus on introducing a novel application of social robotics in healthcare: high-fidelity, facially expressive robotic patient simulators (RPSs), and exploring their usage within a clinical context. One of the goals of HRI is to enable social robots to interact more naturally with people using different verbal and nonverbal channels. Throughout my research, I focused on synthesizing facial expressions, a type of nonverbal behavior, on social robots.

In particular, using autonomous facial synthesis techniques, I explored the synthesis of various expressions and pathologies on RPSs. My research lays the groundwork for increasing realism in patient simulation by incorporating robots that are capable of having realistic face-to-face interactions with clinical learners. This work can help develop better training tools for clinicians to improve their skills in decoding visual signs of disease, pain, and distress in patients, and to increase their cultural competency so they are better prepared to interact with patients from a wide range of backgrounds. My hope is that this research can be fruitful for HRI and clinical researchers working on novel technologies to improve how clinicians are trained in simulation.

BIBLIOGRAPHY

1. Source SDK, Valve Software. URL: http://source.valvesoftware.com/sourcesdk.php, 2013.

2. A. Rzepcynski, T. Martin, and L. Riek. Informed consent and haptic actions in interdisciplinary simulation training. In Proceedings of the American Public Health Association (APHA), 2012.

3. B. Abboud and F. Davoine. Bilinear factorisation for facial expression analysis and synthesis. In IEEE Proceedings of Vision, Image and Signal Processing, volume 152, pages 327–333, 2005.

4. B. Abboud, F. Davoine, and M. Dang. Facial expression recognition and synthesis based on an appearance model. Signal Processing: Image Communication, 19(8): 723–740, 2004.

5. H. Admoni, A. Dragan, S. S. Srinivasa, and B. Scassellati. Deliberate delays during robot-to-human handovers improve compliance with gaze communication. In ACM/IEEE International Conference on Human-Robot Interaction (HRI), pages 49–56. ACM, 2014.

6. A. Araújo, D. Portugal, M. S. Couceiro, and R. P. Rocha. Integrating Arduino-based educational mobile robots in ROS. In IEEE Autonomous Robot Systems (Robotica), pages 1–6, 2013.

7. R. C. Arkin and M. J. Pettinati. Moral emotions, robots, and their role in managing stigma in early stage Parkinson's disease caregiving. In Workshop on New Frontiers of Service Robotics for the Elderly, RO-MAN, 2014.

8. A. B. Ashraf, S. Lucey, J. F. Cohn, et al. The painful face: pain expression recognition using active appearance models. In ACM International Conference on Multimodal Interaction, 2007.

9. M. Aung, B. Romera-Paredes, A. Singh, S. Lim, N. Kanakam, A. de C Williams, and N. Bianchi-Berthouze. Getting rid of pain-related behaviour to improve social and self perception: a technology-based perspective. In 14th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS), 2013.

10. A. L. Back et al. Efficacy of communication skills training for giving bad news and discussing transitions to palliative care. ARCH INTERN MED, 167(5), 2007.

11. T. Baltrušaitis, P. Robinson, and L.-P. Morency. OpenFace: an open source facial behavior analysis toolkit. In IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1–10, 2016.

12. T. Baltrusaitis, P. Robinson, and L. Morency. 3d constrained local model for rigid and non-rigid facial tracking. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2610–2617, 2012.

13. S. Baron-Cohen, A. Cox, G. Baird, J. Swettenham, N. Nightingale, K. Morgan, A. Drew, and T. Charman. Psychological markers in the detection of autism in infancy in a large population. The British Journal of Psychiatry, 168(2):158–163, 1996.

14. C. Bartneck, J. Reichenbach, and A. van Breemen. In your face, robot! The influence of a character's embodiment on how users perceive its emotional expressions. In Proceedings of the Design and Emotion, pages 32–51, 2004.

15. J. N. Bassili. Emotion recognition: the role of facial movement and the relative importance of upper and lower areas of the face. Journal of personality and social psychology, 37(11):2049, 1979.

16. R. F. Baugh, G. J. Basura, L. E. Ishii, S. R. Schwartz, C. M. Drumheller, and R. Burkholder. Clinical practice guideline: Bell's palsy. Otolaryngology–Head and Neck Surgery, 149:S1–S27, 2013.

17. D. Bazo, R. Vaidyanathan, A. Lentz, and C. Melhuish. Design and testing of a hybrid expressive face for a humanoid robot. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5317–5322, 2010.

18. C. Becker-Asano and H. Ishiguro. Evaluating facial displays of emotion for the an- droid robot Geminoid F. In IEEE Workshop on Affective Computational Intelligence (WACI), pages 1–8, 2011.

19. T. Beeler, F. Hahn, D. Bradley, B. Bickel, P. Beardsley, C. Gotsman, R. W. Sumner, and M. Gross. High-quality passive facial performance capture using anchor frames. In ACM Transactions on Graphics (TOG), volume 30, page 75, 2011.

20. C. C. Bennett and S. Šabanović. Deriving minimal features for human-like facial expressions in robotic faces. International Journal of Social Robotics, 6(3):367–381, 2014.

21. C. Benoit, J.-C. Martin, C. Pelachaud, L. Schomaker, and B. Suhm. Audio-visual and multimodal speech systems. Handbook of Standards and Resources for Spoken Language Systems-Supplement, 500:12–36, 2000.

22. S. F. Bernardes and M. L. Lima. On the contextual nature of sex-related biases in pain judgments: The effects of pain duration, patients' distress and judges' sex. European Journal of Pain, 15(9), 2011.

23. K. Berns and J. Hirth. Control of facial expressions of the humanoid robot head ROMAN. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 3119–3124, 2006.

24. D. Berry, L. Butler, F. de Rosis, J. Laaksolathi, C. Pelachaud, and M. Steedman. Final evaluation report, 2003.

25. B. Bickel, M. Botsch, R. Angst, W. Matusik, M. Otaduy, H. Pfister, and M. Gross. Multi-scale capture of facial geometry and motion. In ACM Transactions on Graphics (TOG), volume 26, page 33. ACM, 2007.

26. B. Bickel, P. Kaufmann, M. Skouras, B. Thomaszewski, D. Bradley, T. Beeler, P. Jackson, S. Marschner, W. Matusik, and M. Gross. Physical face cloning. ACM Transactions on Graphics (TOG), 31(4):118, 2012.

27. D. Blanch-Hartigan, S. A. Andrzejewski, and K. M. Hill. The effectiveness of training to improve person perception accuracy: a meta-analysis. Basic and Applied Social Psychology, 34(6):483–498, 2012.

28. K. R. Bogart and L. Tickle-Degnen. Looking beyond the face: A training to improve perceivers' impressions of people with facial paralysis. Patient education and counseling, 98(2):251–256, 2015.

29. K. R. Bogart, L. Tickle-Degnen, and N. Ambady. Communicating without the face: Holistic perception of emotions of people with facial paralysis. Basic and Applied Social Psychology, 36(4):309–320, 2014.

30. S. M. Boker, J. F. Cohn, B.-J. Theobald, I. Matthews, T. R. Brick, and J. R. Spies. Effects of damping head movement and facial expression in dyadic conversation using real-time facial expression tracking and synthesized avatars. Philosophical Transactions of the Royal Society of London B: Biological Sciences, 364(1535):3485–3495, 2009.

31. C. Breazeal. Affective interaction between humans and robots. Springer, 2001.

32. C. Breazeal. Designing social robots. The MIT Press, 2002.

33. C. Breazeal. Emotion and sociable humanoid robots. International Journal of Human-Computer Studies, 59(1):119–155, 2003.

34. C. Breazeal and B. Scassellati. Infant-like social interactions between a robot and a human caregiver. Adaptive Behavior, 8(1):49–74, 2000.

35. S. E. Brennan. Caricature generator: The dynamic exaggeration of faces by computer. Leonardo, 18(3):170–178, 1985.

36. P. Briggs, M. Scheutz, and L. Tickle-Degnen. Are robots ready for administering health status surveys? First results from an HRI study with subjects with Parkinson's disease. In Proceedings of the Tenth Annual ACM/IEEE International Conference on Human-Robot Interaction, pages 327–334, 2015.

37. J. Brown. How clinical communication has become a core part of medical education in the UK. Medical Education, 42(3), 2008.

38. D. Cameron, S. Fernando, E. Collins, A. Millings, R. Moore, A. Sharkey, V. Evers, and T. Prescott. Presence of life-like robot expressions influences children's enjoyment of human-robot interactions in the field. In Proceedings of the Symposium on New Frontiers in Human-Robot Interaction (AISB), 2015.

39. J. Cassell, C. Pelachaud, N. Badler, M. Steedman, B. Achorn, T. Becket, B. Douville, S. Prevost, and M. Stone. Animated conversation: rule-based generation of facial expression, gesture & spoken intonation for multiple conversational agents. In Proceedings of the 21st Annual Conference on Computer Graphics and Interactive Techniques, pages 413–420. ACM, 1994.

40. S.-C. Chen, T.-S. Kao, W.-K. Chen, J.-I. Hsu, et al. Retrieval of motion capture data aids efficient digital learning. Journal of Education and Management Engineering, 2011.

41. S. W. Chew, P. Lucey, S. Lucey, J. Saragih, J. F. Cohn, and S. Sridharan. Person-independent facial expression detection using constrained local models. In IEEE International Conference on Automatic Face and Gesture Recognition (FG), 2011.

42. N. Christoph. Empirical evaluation methodology for embodied conversational agents. In From brows to trust, pages 67–90. Springer, 2005.

43. M.-P. Coll, M. Grégoire, M. Latimer, F. Eugène, and P. L. Jackson. Perception of pain in others: implication for caregivers. Pain Management, 1(3):257–265, 2011.

44. T. F. Cootes, G. J. Edwards, and C. J. Taylor. Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6):681–685, 2001.

45. K. D. Craig. The social communication model of pain. Canadian Psychology/Psychologie canadienne, 50(1):22, 2009.

46. D. Cristinacce and T. F. Cootes. Feature detection and tracking with constrained local models. In British Machine Vision Conference, volume 2, page 6, 2006.

47. D. Cristinacce and T. F. Cootes. Feature detection and tracking with constrained local models. In British Machine Vision Conference, volume 17, pages 929–938, 2006.

48. S. K. Das. Realistic interaction with social robots via facial expressions and neck-eye coordination. Master’s thesis, The University of Texas at Arlington, 2015.

49. F. De Rosis, C. Pelachaud, I. Poggi, V. Carofiglio, and B. De Carolis. From Greta's mind to her face: modelling the dynamics of affective states in a conversational embodied agent. International journal of human-computer studies, 59(1):81–118, 2003.

50. A. M. Deladisma, M. Cohen, A. Stevens, P. Wagner, B. Lok, T. Bernard, C. Oxendine, L. Schumacher, K. Johnsen, R. Dickerson, et al. Do medical students respond empathetically to a virtual patient? The American Journal of Surgery, 193(6):756–760, 2007.

51. M. Díaz-Boladeras, D. Paillacho, C. Angulo, O. Torres, J. González-Diéguez, and J. Albo-Canals. Evaluating group-robot interaction in crowded public spaces: A week-long exploratory study in the wild with a humanoid robot guiding visitors through a science museum. International Journal of Humanoid Robotics, 12(04):1550022, 2015.

52. P. F. Dominey and F. Warneken. The basis of shared intentions in human and robot cognition. New Ideas in Psychology, 29(3):260–274, 2011.

53. E. Douglas-Cowie, R. Cowie, I. Sneddon, C. Cox, O. Lowry, M. Mcrorie, J.-C. Martin, L. Devillers, S. Abrilian, A. Batliner, et al. The HUMAINE database: addressing the collection and annotation of naturalistic and induced emotional data. In Affective computing and intelligent interaction, pages 488–500. Springer, 2007.

54. C. H. Ek, P. Jaeckel, N. Campbell, N. D. Lawrence, and C. Melhuish. Shared Gaussian process latent variable models for handling ambiguous facial expressions. In American Institute of Physics Conference Series, volume 1107, pages 147–153, 2009.

55. P. Ekman and W. V. Friesen. Facial action coding system. Consulting Psychologists Press, Stanford University, Palo Alto, 1977.

56. P. Ekman and E. L. Rosenberg. What the face reveals: Basic and applied studies of spontaneous expression using the Facial Action Coding System. Oxford, 1997.

57. A. Eksiri and T. Kimura. Restaurant service robots development in thailand and their real environment evaluation. Journal of Robotics and Mechatronics, 27(1):1, 2015.

58. R. El Kaliouby and P. Robinson. Generalization of a vision-based computational model of mind-reading. In Affective computing and intelligent interaction, pages 582–589. Springer, 2005.

59. T. Erez, K. Lowrey, Y. Tassa, V. Kumar, S. Kolev, and E. Todorov. An integrated system for real-time model predictive control of humanoid robots. In 13th IEEE-RAS International Conference on Humanoid Robots (Humanoids), pages 292–299, 2013.

60. T. J. Eviston, G. R. Croxson, P. G. Kennedy, T. Hadlock, and A. V. Krishnan. Bell’s palsy: aetiology, clinical features and multidisciplinary care. Journal of Neurology, Neurosurgery & Psychiatry, 86(12):1356–1361, 2015.

61. F. Eyssel, D. Kuchenbrandt, and S. Bobinger. Effects of anticipated human-robot interaction and predictability of robot behavior on perceptions of anthropomorphism. In ACM/IEEE International Conference on Human-Robot Interaction (HRI), pages 61–68. ACM, 2011.

62. A. Franchi, C. Secchi, M. Ryll, H. H. Bülthoff, and P. R. Giordano. Shared control: Balancing autonomy and human assistance with a group of quadrotor UAVs. IEEE Robotics & Automation Magazine, 19(3):57–68, 2012.

63. A. J. Giannini, J. D. Giannini, and R. K. Bowman. Measurement of nonverbal receptive abilities in medical students. Perceptual and motor skills, 90(3c):1145–1150, 2000.

64. J. Goetz, S. Kiesler, and A. Powers. Matching robot appearance and behavior to tasks to improve human-robot cooperation. In The 12th IEEE International Workshop on Robot and Human Interactive Communication (ROMAN), pages 55–60, 2003.

65. M. J. Gonzales, M. Moosaei, and L. D. Riek. A novel method for synthesizing naturalistic pain on virtual patients. Simulation in Healthcare, volume 6, 2013.

66. M. J. Gonzales, M. F. O’Connor, and L. D. Riek. Contextual constraints for the design of patient-centered health IT tools. Stud Health Technol Inform, 194:75–81, 2013.

67. M. J. Gonzales, V. C. Cheung, and L. D. Riek. Designing collaborative healthcare technology for the acute care workflow. In Proceedings of the 9th International Conference on Pervasive Computing Technologies for Healthcare, 2015.

68. M. J. Gonzales, J. M. Henry, A. W. Calhoun, and L. D. Riek. Visual TASK: A collaborative cognitive aid for acute care resuscitation. 10th EAI International Conference on Pervasive Computing Technologies for Healthcare (Pervasive Health), pages 1–8, 2016.

69. K. Gouta and M. Miyamoto. Emotion recognition: facial components associated with various emotions. Shinrigaku kenkyu: The Japanese journal of psychology, 71(3):211–218, 2000.

70. L. Gralewski, N. Campbell, B. Thomas, C. Dalton, and D. Gibson. Statistical synthesis of facial expressions for the portrayal of emotion. In International Conference on Computer Graphics and Interactive Techniques in Australasia and South East Asia, pages 190–198. ACM, 2004.

71. B. Guenter, C. Grimm, D. Wood, H. Malvar, and F. Pighin. Making faces. In Proceedings of the 25th Annual Conference on Computer Graphics and Interactive Techniques, pages 55–66. ACM, 1998.

72. P. Gulbrandsen, B. F. Jensen, A. Finset, and D. Blanch-Hartigan. Long-term effect of communication training on the relationship between physicians' self-efficacy and performance. Patient education and counseling, 91(2):180–185, 2013.

73. A. Habib, S. K. Das, I.-C. Bogdan, D. Hanson, and D. O. Popa. Learning human-like facial expressions for Android Phillip K. Dick. In IEEE International Conference on Automation Science and Engineering (CASE), pages 1159–1165, 2014.

74. T. Hadjistavropoulos, K. D. Craig, and S. Fuchs-Lacelle. Social influences and the communication of pain. Pain: Psychological Perspectives, 2004.

75. Z. Hammal and J. F. Cohn. Automatic detection of pain intensity. ACM International Conference on Multimodal Interaction, 2012.

76. P. A. Hancock, D. R. Billings, K. E. Schaefer, J. Y. Chen, E. J. De Visser, and R. Parasuraman. A meta-analysis of factors affecting trust in human-robot interaction. Human Factors, 53(5):517–527, 2011.

77. D. Hanson. Exploring the aesthetic range for humanoid robots. In Proceedings of the ICCS/CogSci-2006 Long Symposium: Toward Social Mechanisms of Android Science, pages 39–42. Citeseer, 2006.

78. C. J. Hayes, C. R. Crowell, and L. D. Riek. Automatic processing of irrelevant co-speech gestures with human but not robot actors. In 8th ACM/IEEE International Conference on Human-Robot Interaction (HRI), 2013.

79. C. J. Hayes, M. Moosaei, and L. D. Riek. Exploring implicit human responses to robot mistakes in a learning from demonstration task. In IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), 2016.

80. E. A. Henneman, J. P. Roche, D. L. Fisher, H. Cunningham, C. A. Reilly, B. H. Nathanson, and P. L. Henneman. Error identification and recovery by student nurses using human patient simulation: Opportunity to improve patient safety. APPL NURS RES, 23(1), 2010.

81. S. G. Henry, A. Fuhrel-Forbis, M. A. Rogers, and S. Eggly. Association between nonverbal communication during clinical interactions and outcomes: A systematic review and meta-analysis. Patient Education and Counseling, 86(3), 2012.

82. A. T. Hirsh, A. F. Alqudah, L. A. Stutts, and M. E. Robinson. Virtual human technology: Capturing sex, race, and age influences in individual pain decision policies. Pain, 140(1), 2008.

83. M. Hojat, M. J. Vergare, K. Maxwell, G. Brainard, S. K. Herrine, G. A. Isenberg, J. Veloski, and J. S. Gonnella. The devil is in the third year: a longitudinal study of erosion of empathy in medical school. Academic Medicine, 84(9):1182–1191, 2009.

84. A. Huus and L. D. Riek. An Expressive Robotic Patient to Improve Clinical Communication. In 7th ACM International Conference on Human-Robot Interaction (HRI), Pioneers Workshop, 2012.

85. T. Iqbal and L. D. Riek. A method for automatic detection of psychomotor entrainment. IEEE Transactions on Affective Computing, 7(1):3–16, 2016.

86. T. Iqbal and L. D. Riek. Coordination dynamics in multi-human multi-robot teams. IEEE Robotics and Automation Letters, 2017.

87. T. Iqbal, M. Moosaei, and L. D. Riek. Tempo adaptation and anticipation methods for human-robot teams. In Proceedings of the Robotics: Science and Systems (RSS), Planning for Human-Robot Interaction: Shared Autonomy and Collaborative Robotics Workshop, 2016.

88. T. Iqbal, S. Rack, and L. D. Riek. Movement coordination in human-robot teams: A dynamical systems approach. IEEE Transactions on Robotics, 32(4):909–919, 2016.

89. P. Jaeckel, N. Campbell, and C. Melhuish. Towards realistic facial behaviour in humanoids: mapping from video footage to a robot head. In IEEE International Conference on Rehabilitation Robotics (ICORR), pages 833–840, 2007.

90. P. Jaeckel, N. Campbell, and C. Melhuish. Facial behaviour mapping from video footage to a robot head. Robotics and Autonomous Systems, 56(12):1042–1049, 2008.

91. A. Janiw, L. Woodrick, and L. D. Riek. Patient situational awareness support appears to fall with advancing levels of nursing student education. Simulation in Healthcare, 2013.

92. J. Jansen, J. C. van Weert, J. de Groot, S. van Dulmen, T. J. Heeren, and J. M. Bensing. Emotional and informational patient cues: the impact of nurses' responses on recall. Patient education and counseling, 79(2):218–224, 2010.

93. J. Kappesser and A. C. d. C. Williams. Pain and negative emotions in the face: judgements by health care professionals. Pain, 99(1), 2002.

94. P. Kenny, T. D. Parsons, J. Gratch, A. Leuski, and A. A. Rizzo. Virtual patients for clinical therapist skills training. In Intelligent Virtual Agents. Springer, 2007.

95. C. D. Kidd. Sociable robots: The role of presence and task in human-robot interaction. PhD thesis, 2003.

96. C. D. Kidd and C. Breazeal. Effect of a robot on user perceptions. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3559– 3564, 2004.

97. S. Kiesler, A. Powers, S. R. Fussell, and C. Torrey. Anthropomorphic interactions with a robot and robot-like agent. Social Cognition, 26(2):169, 2008.

98. L. Kirmayer. Rethinking cultural competence. Transcultural Psychiatry, 49(2):149, 2012.

99. T. Kishi, T. Otani, N. Endo, P. Kryczka, K. Hashimoto, K. Nakata, and A. Takanishi. Development of expressive robotic head for bipedal humanoid robot. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4584– 4589, 2012.

100. H. Knight. Eight lessons learned about non-verbal interactions through robot theater. In Social Robotics, pages 42–51. Springer, 2011.

101. H. Kozima, C. Nakagawa, and H. Yano. Using robots for the study of human social development. In AAAI Spring Symposium on Developmental Robotics, pages 111– 114, 2005.

102. B. Le Goff. Synthèse à partir du texte de visage 3D parlant français. PhD thesis, 1997.

103. H. S. Lee, J. W. Park, and M. J. Chung. A linear affect–expression space model and control points for mascot-type facial robots. IEEE Transactions on Robotics, 23(5): 863–873, 2007.

104. Y. Lee, D. Terzopoulos, and K. Waters. Realistic modeling for facial animation. In Proceedings of the 22nd Annual Conference on Computer Graphics and Interactive Techniques, pages 55–62. ACM, 1995.

105. D. Leidner, P. Birkenkampf, N. Y. Lii, and C. Borst. Enhancing supervised autonomy for extraterrestrial applications by sharing knowledge between humans and robots. In Workshop on How to Make Best Use of a Human Supervisor for Semi-Autonomous Humanoid Operation, IEEE-RAS International Conference on Humanoid Robots (Humanoids), 2014.

106. M. Leonard. The human factor: the critical importance of effective teamwork and communication in providing safe care. QUAL SAF HEALTH CARE, 13, 2004.

107. W. Levinson, R. Gorawara-Bhat, and J. Lamb. A study of patient clues and physician responses in primary care and surgical settings. Jama, 284(8):1021–1027, 2000.

108. A. X. Li, M. Florendo, L. E. Miller, H. Ishiguro, and A. P. Saygin. Robot form and motion influences social attention. In Proceedings of the Tenth Annual ACM/IEEE International Conference on Human-Robot Interaction, pages 43–50. ACM, 2015.

109. J. Li, R. Kizilcec, J. Bailenson, and W. Ju. Social robots and virtual agents as lecturers for video instruction. Computers in Human Behavior, 55:1222–1230, 2016.

110. X. Li, B. MacDonald, and C. I. Watson. Expressive facial speech synthesis on a robotic platform. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5009–5014, 2009.

111. J. Lieberman and C. Breazeal. Improvements on action parsing and action interpolation for learning through demonstration. In IEEE Humanoids '04, volume 1, pages 342–365, 2004.

112. P. Lucey, J. F. Cohn, S. Lucey, I. Matthews, S. Sridharan, and K. M. Prkachin. Automatically detecting pain using facial actions. In 3rd International Conference on Affective Computing and Intelligent Interaction (ACII), 2009.

113. P. Lucey, J. F. Cohn, K. M. Prkachin, P. E. Solomon, and I. Matthews. Painful data: The UNBC-McMaster shoulder pain expression archive database. In IEEE International Conference on Automatic Face & Gesture Recognition, 2011.

114. L. R. Martin and H. S. Friedman. Nonverbal communication and health care. Applications of Nonverbal Communication, 2005.

115. D. Mazzei, L. Billeci, A. Armato, N. Lazzeri, A. Cisternino, G. Pioggia, R. Igliozzi, F. Muratori, A. Ahluwalia, and D. De Rossi. The face of autism. In IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), pages 791–796, 2010.

116. D. Mazzei, N. Lazzeri, D. Hanson, and D. De Rossi. HEFES: An hybrid engine for facial expressions synthesis to control human-like androids and avatars. In IEEE RAS & EMBS International Conference on Biomedical Robotics and Biomechatronics (BioRob), pages 195–200, 2012.

117. S. E. Mitchell et al. Developing virtual patient advocate technology for shared decision making. In 34th Annual Meeting of the Society for Medical Decision Making, 2012.

118. M. M. Monwar and S. Rezaei. Pain recognition using artificial neural network. In IEEE Symposium on Signal Processing and Information Technology, 2006.

119. M. Moosaei and L. D. Riek. Evaluating facial expression synthesis on robots. In Proceedings of the HRI Workshop on Applications for Emotional Robots at the 8th ACM/IEEE International Conference on Human-Robot Interaction (HRI), 2013.

120. M. Moosaei and L. D. Riek. Pain synthesis for humanoid robots. In Midwest Robotics Workshop (MWRW) at the Toyota Technical Institute of Chicago (TTIC), 2016.

121. M. Moosaei, M. J. Gonzales, and L. D. Riek. Naturalistic pain synthesis for virtual patients. In Intelligent Virtual Agents, pages 295–309. Springer, 2014.

122. M. Moosaei, C. J. Hayes, and L. D. Riek. Facial expression synthesis on robots: An ROS module. In Proceedings of the Tenth Annual ACM/IEEE International Conference on Human-Robot Interaction Extended Abstracts, pages 169–170, 2015.

123. M. Moosaei, C. J. Hayes, and L. D. Riek. Performing facial expression synthesis on robot faces: A real-time software system. In Proceedings of the 4th International Symposium on New Frontiers in Human-Robot Interaction (AISB), 2015.

124. M. Moosaei, S. K. Das, D. O. Popa, and L. D. Riek. Using facially expressive robots to calibrate clinical pain perception. In Proceedings of the ACM/IEEE International Conference on Human-Robot Interaction (HRI), pages 32–41, 2017.

125. L.-P. Morency, J. Whitehill, and J. Movellan. Generalized adaptive view-based appearance model: Integrated framework for monocular head pose estimation. In IEEE International Conference on Automatic Face & Gesture Recognition, pages 1–8, 2008.

126. D. S. Morse, E. A. Edwardsen, and H. S. Gordon. Missed opportunities for interval empathy in lung cancer communication. Archives of Internal Medicine, 168(17): 1853–1858, 2008.

127. M. B. Moussa, Z. Kasap, N. Magnenat-Thalmann, and D. Hanson. MPEG-4 FAP animation applied to humanoid robot head. Proceeding of Summer School Engage, 2010.

128. R. R. Murphy, J. Kravitz, S. L. Stover, and R. Shoureshi. Mobile robots in mine rescue and recovery. IEEE Robotics & Automation Magazine, 16(2):91–103, 2009.

129. B. Mutlu, F. Yamaoka, T. Kanda, H. Ishiguro, and N. Hagita. Nonverbal leakage in robots: communication of intentions through seemingly unintentional behavior. In ACM/IEEE International Conference on Human-Robot Interaction (HRI), pages 69–76. ACM, 2009.

130. S. Nakaoka, F. Kanehiro, K. Miura, M. Morisawa, K. Fujiwara, K. Kaneko, S. Kajita, and H. Hirukawa. Creating facial motions of cybernetic human HRP-4C. In IEEE-RAS International Conference on Humanoid Robots (Humanoids), pages 561–567, 2009.

131. M. Ochs, R. Niewiadomski, C. Pelachaud, and D. Sadek. Intelligent expressions of emotions. In Affective Computing and Intelligent Interaction, pages 707–714. Springer, 2005.

132. M. Ochs, C. Pelachaud, and D. Sadek. An empathic virtual dialog agent to improve human-machine interaction. In 7th International Joint Conference on Autonomous Agents and Multiagent Systems, pages 89–96, 2008.

133. M. Ochs, R. Niewiadomski, and C. Pelachaud. How a virtual agent should smile? In Intelligent Virtual Agents, pages 427–440. Springer, 2010.

134. J.-H. Oh, D. Hanson, W.-S. Kim, Y. Han, J.-Y. Kim, and I.-W. Park. Design of android type humanoid robot Albert HUBO. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1428–1433, 2006.

135. A. Paiva, J. Dias, D. Sobral, R. Aylett, P. Sobreperez, S. Woods, C. Zoll, and L. Hall. Caring for agents and agents that care: Building empathic relations with synthetic agents. In Proceedings of the Third International Joint Conference on Autonomous Agents and Multiagent Systems-Volume 1, pages 194–201, 2004.

136. M. Pantic and L. J. Rothkrantz. Facial action recognition for facial expression analysis from static face images. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 34(3):1449–1461, 2004.

137. M. Pantic, M. Valstar, R. Rademaker, and L. Maat. Web-based database for facial expression analysis. In IEEE International Conference on Multimedia and Expo (ICME), 2005.

138. C. Pelachaud. Some considerations about embodied agents. In Proc. of the Workshop on Achieving Human-Like Behavior in Interactive Animated Agents, in The Fourth International Conference on Autonomous Agents, 2000.

139. K. M. Prkachin and K. D. Craig. Expressing pain: The communication and interpretation of facial pain signals. Journal of Nonverbal Behavior, 19(4), 1995.

140. K. M. Prkachin, S. Berzins, and S. R. Mercer. Encoding and decoding of pain expressions: a judgement study. Pain, 58(2), 1994.

141. K. M. Prkachin, P. E. Solomon, and J. Ross. Underestimation of pain by healthcare providers: towards a model of the process of inferring pain in others. Canadian Journal of Nursing Research (CJNR), 39(2):88–106, 2007.

142. L. Riek, N. Mavridis, S. Antali, N. Darmaki, Z. Ahmed, M. Al-Neyadi, and A. Alketheri. Ibn Sina steps out: Exploring Arabic attitudes toward humanoid robots. In Proceedings of the 2nd International Symposium on New Frontiers in Human-Robot Interaction, AISB, 2010.

143. L. D. Riek. Next generation patient simulators. In NSF CAREER award, Grant No. IIS-1253935.

144. L. D. Riek. Expression synthesis on robots. Ph.D. Dissertation, University of Cambridge, 2011.

145. L. D. Riek. The social co-robotics problem space: Six key challenges. Robotics Challenges and Vision (RCV), 2013.

146. L. D. Riek. Robotics technology in mental health care. In D. Luxton (Ed.), Artificial Intelligence in Behavioral Health and Mental Health Care. Elsevier, pages 185–203, 2015.

147. L. D. Riek. Healthcare robotics. Communications of the ACM, 2017.

148. L. D. Riek and P. Robinson. Challenges and opportunities in building socially intelligent machines. IEEE Signal Processing Magazine, 28(3):146–149, 2011.

149. L. D. Riek and P. Robinson. Using robots to help people habituate to visible disabilities. In IEEE International Conference on Rehabilitation Robotics (ICORR), pages 1–8, 2011.

150. L. D. Riek, P. C. Paul, and P. Robinson. When my robot smiles at me: Enabling human-robot rapport via real-time head gesture mimicry. Journal on Multimodal User Interfaces, 3(1-2):99–108, 2010.

151. L. D. Riek, T.-C. Rabinowitch, P. Bremner, A. G. Pipe, M. Fraser, and P. Robinson. Cooperative gestures: Effective signaling for humanoid robots. In ACM/IEEE International Conference on Human-Robot Interaction (HRI), pages 61–68, 2010.

152. P. Riva, S. Sacchi, L. Montali, and A. Frigerio. Gender effects in pain detection: Speed and accuracy in decoding female and male pain expressions. European Journal of Pain, 2011.

153. K. L. Robey, P. M. Minihan, L. M. Long-Bellil, J. E. Hahn, J. G. Reiss, G. E. Eddey, Alliance for Disability in Health Care Education, et al. Teaching health care students about disability within a cultural competency context. Disability and Health Journal, 6(4):271–279, 2013.

154. B. Romera-Paredes et al. Transfer learning to account for idiosyncrasy in face and body expressions. In IEEE Face and Gesture, 2013.

155. rosserial. URL: http://wiki.ros.org/rosserial, 2014.

156. M. Rueben and W. D. Smart. A shared autonomy interface for household devices. In ACM/IEEE International Conference on Human-Robot Interaction, Extended Abstracts, pages 165–166, 2015.

157. J. A. Russell. Is there universal recognition of emotion from facial expressions? a review of the cross-cultural studies. Psychological bulletin, 115(1), 1994.

158. J. A. Russell and J. M. Fernández-Dols. The psychology of facial expression. Cambridge University Press, 1997.

159. K. F. Ryan. Human simulation for medicine. Human Simulation for Nursing and Health Professions, 2011.

160. J. M. Saragih, S. Lucey, and J. F. Cohn. Face alignment through subspace constrained mean-shifts. In IEEE International Conference on Computer Vision (ICCV), 2009.

161. J. M. Satterfield and E. Hughes. Emotion skills training for medical students: a systematic review. Medical education, 41(10):935–941, 2007.

162. M. Scheeff, J. Pinto, K. Rahardja, S. Snibbe, and R. Tow. Experiences with sparky, a social robot. Socially Intelligent Agents, pages 173–180, 2002.

163. D. J. Schiano, S. M. Ehrlich, and K. Sheridan. Categorical imperative not: facial affect is perceived continuously. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 49–56. ACM, 2004.

164. H. Shah, B. Rossen, B. Lok, D. Londino, S. D. Lind, and A. Foster. Interactive virtual-patient scenarios: An evolving tool in psychiatric education. Academic Psychiatry, 36(2), 2012.

165. M. Shayganfar, C. Rich, and C. L. Sidner. A design methodology for expressing emotion on robot faces. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4577–4583, 2012.

166. D. Simon, K. D. Craig, W. H. Miltner, and P. Rainville. Brain responses to dynamic facial expressions of pain. Pain, 126(1), 2006.

167. D. Simon, K. D. Craig, et al. Recognition and discrimination of prototypical dynamic expressions of pain and emotions. Pain, 135, 2008.

168. S. Sosnowski, A. Bittermann, K. Kuhnlenz, and M. Buss. Design and evaluation of emotion-display EDDIE. In IEEE/RSJ International Conference on Intelligent Robots and Systems, (IROS), pages 3113–3118, 2006.

169. D. G. Stork and M. E. Hennecke. Speechreading by humans and machines: models, systems, and applications. Springer Science & Business Media, 1996.

170. H. Sunvisson, B. Habermann, S. Weiss, and P. Benner. Augmenting the cartesian medical discourse with an understanding of the person’s lifeworld, lived body, life story and social identity. Nursing Philosophy, 10(4):241–252, 2009.

171. Y. Tadesse, S. Priya, H. Stephanou, D. Popa, and D. Hanson. Piezoelectric actuation and sensing for facial robotics. Ferroelectrics, 345(1):13–25, 2006.

172. Y. Takase, H. Mitake, Y. Yamashita, and S. Hasegawa. Motion generation for the stuffed-toy robot. In Proceedings of IEEE SICE Annual Conference, pages 213–217, 2013.

173. L. Takayama, D. Dooley, and W. Ju. Expressing thought: improving robot readability with animation principles. In ACM/IEEE International Conference on Human-Robot Interaction (HRI), pages 69–76. ACM, 2011.

174. A. Takeuchi and T. Naito. Situated facial displays: towards social interaction. In ACM SIGCHI Conference on Human Factors in Computing Systems, pages 450–455, 1995.

175. A. Taylor and L. D. Riek. Robot perception of human groups in the real world: State of the art. In Proceedings of the AAAI Fall Symposium on Artificial Intelligence in Human-Robot Interaction (AI-HRI), 2016.

176. A. Taylor and L. D. Riek. Robot affiliation perception for social interaction. In Proceedings of Robots in Groups and Teams Workshop at ACM Conference on Computer-Supported Cooperative Work and Social Computing, 2017.

177. D. Terzopoulos and K. Waters. Analysis and synthesis of facial image sequences using physical and anatomical models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(6):569–579, 1993.

178. N. M. Thalmann, P. Kalra, and M. Escher. Face to virtual face. Proceedings of the IEEE, 86(5):870–883, 1998.

179. L. Tickle-Degnen, L. A. Zebrowitz, and H.-i. Ma. Culture, gender and health care stigma: Practitioners' response to facial masking experienced by people with Parkinson's disease. Social Science & Medicine, 73(1):95–102, 2011.

180. L. Tickle-Degnen, M. Scheutz, and R. C. Arkin. Collaborative robots in rehabilitation for social self-management of health. 2014.

181. N. Tottenham et al. The NimStim set of facial expressions: judgments from untrained research participants. Psychiatry Research, 168(3), 2009.

182. M. L. Walters, M. Lohse, M. Hanheide, B. Wrede, D. S. Syrdal, K. L. Koay, A. Green, H. Hüttenrauch, K. Dautenhahn, G. Sagerer, et al. Evaluating the robot personality and verbal behavior of domestic robots using video-based studies. Advanced Robotics, 25(18):2233–2254, 2011.

183. X. Wan and X. Jin. Data-driven facial expression synthesis via laplacian deformation. Multimedia Tools and Applications, 58(1):109–123, 2012.

184. L. D. Wandner, J. E. Letzen, C. A. Torres, B. Lok, and M. E. Robinson. Using virtual human technology to provide immediate feedback about participants’ use of demographic cues and knowledge of their cue use. The Journal of Pain, 15(11):1141 – 1147, 2014.

185. A. Weiss. Creating service robots for and with people: A user-centered reflection on the interdisciplinary research field of human-robot interaction. In 15th Annual STS Conference Graz, Critical Issues in Science, Technology, and Society Studies, 2016.

186. A. C. d. C. Williams, H. T. O. Davies, and Y. Chadury. Simple pain rating scales hide complex idiosyncratic meanings. Pain, 85(3), 2000.

187. L. Williams. Performance-driven facial animation. In ACM SIGGRAPH Computer Graphics, volume 24, pages 235–242. ACM, 1990.

188. S. N. Woods, M. L. Walters, K. L. Koay, and K. Dautenhahn. Methodological issues in hri: A comparison of live and video-based methods in robot to human approach direction trials. In The 15th IEEE International Symposium on Robot and Human Interactive Communication (ROMAN), pages 51–58, 2006.

189. Y. Wu, Q. Huang, X. Chen, Z. Yu, L. Meng, G. Ma, P. Zhang, and W. Zhang. Design and similarity evaluation on humanoid facial expression. In IEEE International Conference on Mechatronics and Automation (ICMA), pages 1150–1155, 2015.

190. Q. Zhang, Z. Liu, G. Guo, D. Terzopoulos, and H.-Y. Shum. Geometry-driven photorealistic facial expression synthesis. IEEE Transactions on Visualization and Computer Graphics, 12(1):48–60, 2006.

191. X. Zhang, L. Yin, J. F. Cohn, S. Canavan, M. Reale, A. Horowitz, P. Liu, and J. M. Girard. BP4D-Spontaneous: a high-resolution spontaneous 3D dynamic facial expression database. Image and Vision Computing, 32(10):692–706, 2014.

192. C. Zimmermann, L. Del Piccolo, and A. Finset. Cues and concerns by patients in medical consultations: a literature review. Psychological bulletin, 133(3):438, 2007.

This document was prepared & typeset with pdfLaTeX, and formatted with NDdiss2ε classfile (v3.2013[2013/04/16]) provided by Sameer Vijay and updated by Megan Patnott.
