
LEARNING BASED VISUAL ENGAGEMENT AND SELF-EFFICACY

by

SVATI DHAMIJA

B.Tech., Electronics and Communications Engineering, Punjab Technical University, India, 2007

M.Tech., Electronics and Communications Engineering, Guru Gobind Singh Indraprastha University, India, 2011

A dissertation submitted to the Graduate Faculty of the

University of Colorado Colorado Springs

in partial fulfillment of the

requirements for the degree of

Doctor of Philosophy

Department of Computer Science

2018

This dissertation for the Doctor of Philosophy degree by

Svati Dhamija

has been approved for the

Department of Computer Science

by

Terrance E. Boult, Chair 1

Charles C. Benight 2

Manuel Günther 1

Rory A. Lewis 1

Jonathan Ventura 3

Walter J. Scheirer 4

Date: 12-03-2018

1 T. Boult, R. Lewis and M. Günther are with the University of Colorado Colorado Springs
2 C. Benight is with the Psychology Department at the University of Colorado Colorado Springs
3 J. Ventura is with California Polytechnic State University
4 W. Scheirer is with the University of Notre Dame

Dhamija, Svati (Ph.D., Computer Science)

Learning Based Visual Engagement and Self-Efficacy

Dissertation directed by El Pomar Professor and Chair Terrance E. Boult

Abstract

With the advancements in the fields of Affective Computing and Human-Computer Interaction, along with the exponential development of intelligent machines, the study of the human mind and behavior is emerging as an unrivaled mystery. A plethora of experiments are carried out each day to make machines capable of sensing subtle verbal and non-verbal human behaviors and understanding human needs in every sphere of life. Numerous applications, ranging from virtual assistants for online learning to socially-assistive robots for health care, are being designed. A frequent challenge in such applications is creating scalable systems that are tailored to an individual’s capabilities, preferences and response to interaction.

Moreover, digital interventions that are tailored to induce cognitive and behavioral changes in individuals are common, but they often lack tools for automated analysis and monitoring of user behavior while interacting with the application. The lack of automated monitoring is especially detrimental for societal problems like mental illnesses and disorders, where it is important to determine the psychological wellbeing of populations, to detect people at risk, and to personalize and deliver interventions that work in the real world. Prior research indicates that web-interventions have the potential to advance patient-centered mental health care delivery and can estimate patient involvement by recognizing human emotions and affect. Personalized computing systems with capabilities of perception and rational, intelligent behavior can also be developed by observing bio-physiological responses and facial expressions. Though such systems are slowly improving the prediction and detection of stress levels, they remain far from adapting to an individual’s specific needs for treatment and advising coping methods. Given the stigma associated with mental disorders, people may not always seek assistance. Engagement or disengagement with applications can therefore be an important indicator of an individual’s desire to be treated and whether they are interested in coping with their situation.

A scalable, adaptive, person-centered approach is therefore essential to non-invasively monitor user engagement and response to the application. In this work, we present a novel vision- and learning-based approach for web-based interventions with the use case of trauma-recovery. This thesis comprises my recent research, which develops contextual engagement, mood-aware engagement and multi-modal algorithms in a non-intrusive setting with EASE (Engagement, Arousal & Self-Efficacy), by calibrating Engagement from face videos and physiological Arousal leading towards a hidden state of Self-Efficacy.

Dedicated to my family.

Thank you for always being there for me.

Acknowledgements

First and foremost, I would like to express my sincerest regards to my advisor Dr. Terrance E. Boult. Without his guidance and supervision, this thesis would not have been possible. His unique ways of mentoring and vision of the big picture inspired me to think objectively and taught me the fundamentals of research. I would also like to extend gratitude to my committee members: Dr. Benight, Dr. Günther, Dr. Ventura, Dr. Scheirer and Dr. Lewis, for setting high expectations and propelling me in fruitful directions. I gratefully acknowledge the financial support of NSF Research Grant SCH-INT 1418520, "Learning and Sensory-based Modeling for Adaptive Web-Empowerment Trauma Treatment", for this work.

I am grateful to numerous VAST lab members who made the workplace fun and engaging: Archana, Abhijit, Chloe, Steve, Manuel, Lalit, Gauri, Dubari, Khang, Ankita, Dan R., Caleb, James, Ethan, Marc, Bill, Yousef, Adria and Akshay. I would also like to acknowledge the support of my fellow researchers from psychology, Kotaro, Carrie, Austin, Pamela, Amanda and Shaun, who worked alongside me through the tardy and laborious process of data collection. Special thanks to Ginger for the numerous adventures we had together, including ski trips, hikes, city tours, fun-filled dinners, board games, etc. She made Colorado Springs a home away from home. My encouraging friends Pooja, Reetu, Komal, Cherry, Radhika, Silvee, Himani and Archana make life happier in many ways despite the long distances and different time-zones that separate us.

Last but not least, I thank my parents and in-laws for being the source of inspiration and instilling me with courage. My grandparents for their unconditional love and blessings. Mumma and Papa for their countless sacrifices and prayers over the years. Thank you for keeping me grounded, making me aware of glass ceilings and trusting my abilities to break them. My brother, Akshay, for being my decisive and reassuring agony-uncle. My husband, Abhijit, for being my pillar of strength, friend and mentor. Thank you for braving this journey with me. Finally, I thank the almighty for watching over me through the activities of mind, senses and emotions.

Contents

1 Introduction 1

1.1 Why Engagement? ...... 2

1.2 Why Self-Efficacy? ...... 3

1.3 Background of application space: Trauma-Recovery ...... 3

1.4 Limitations of prior work ...... 5

1.5 Scope of this research ...... 7

1.6 Contributions from this thesis ...... 8

1.6.1 Dataset Collection ...... 8

1.6.2 Contextual Engagement Prediction from video ...... 9

1.6.3 Mood-aware Engagement Prediction ...... 9

1.6.4 Automated Action Units vs. Expert Raters ...... 9

1.6.5 Self-Efficacy prediction: Problem formulation ...... 10

1.6.6 Predicting Change in Self-Efficacy using EA features ...... 10

1.7 Publications from this work ...... 11

1.7.1 First author publications ...... 11

1.7.2 Other work ...... 11

2 EASE Dataset Collection 13

2.1 Introduction ...... 13

2.2 Related Work ...... 13

2.3 EASE Data Collection Setup ...... 15

2.4 Participants and Demographics ...... 16

2.5 The EASE space variables ...... 17

2.5.1 Engagement measures ...... 17

2.5.2 Arousal data ...... 19

2.5.3 Trauma-recovery outcomes and dependent variables ...... 20

2.6 Challenges ...... 21

3 Contextual Engagement Prediction from video 23

3.1 Introduction ...... 23

3.2 Related work ...... 26

3.3 Algorithms for Sequence Learning ...... 29

3.3.1 Recurrent Neural Network ...... 30

3.3.2 Gated Recurrent Units ...... 31

3.3.3 Long Short-Term Memory ...... 32

3.4 Experiments ...... 33

3.4.1 Facial action units extraction ...... 33

3.4.2 SVC model ...... 35

3.4.3 Sequence learning models and tuning ...... 35

3.4.4 Context-specific, Cross-context and Combined models ...... 36

3.4.5 Varying data-segment length ...... 36

3.5 Evaluation results ...... 38

3.5.1 Contextual engagement results ...... 38

3.5.2 Comparison results for sequence learning algorithms ...... 38

3.6 Discussion ...... 42

3.6.1 Performance of Contextual Engagement Models ...... 42

3.6.2 Incorporating Broader Range of Data ...... 44

4 Mood-Aware Engagement Prediction 46

4.1 Introduction ...... 46

4.2 Related Work ...... 49

4.3 Experiments ...... 50

4.3.1 Profile of Mood States (POMS) ...... 50

4.3.2 Facial action units extraction ...... 51

4.3.3 LSTM Engagement and Mood Models ...... 52

4.3.4 Cross-Validation and Train-test pipeline ...... 54

4.4 Evaluation Results ...... 56

4.5 Broader Impact ...... 57

4.5.1 Robust Mood and Mood-Aware Engagement Models ...... 59

5 Automated Action Units Vs. Expert Raters 61

5.1 Introduction ...... 61

5.2 Related Work ...... 63

5.3 Sequential learning model using RNN: automated, expert raters ...... 65

5.4 EASE Subset & Annotations ...... 66

5.5 Machine Learning Models ...... 68

5.6 Model Training/Tuning ...... 71

5.7 Results ...... 73

5.8 Discussion ...... 75

6 Self-Efficacy: Problem formulation and Computation 77

6.1 Introduction ...... 77

6.2 EASE Coping Self-Efficacy Measures ...... 79

6.3 Effectiveness of CSE in estimating PTSD symptom severity ...... 80

6.4 Observations from related work ...... 84

6.4.1 Engagement and Self-Efficacy ...... 86

6.4.2 Heart Rate Variability and Self-Efficacy ...... 87

6.5 Defining CSE as a Machine Learning problem ...... 89

6.6 CSE Prediction Pipeline ...... 90

6.7 EASE subset and Data distribution ...... 91

6.8 Experiments ...... 92

6.9 Results & Discussion ...... 94

6.10 Broader Impact ...... 96

7 Conclusion and Insights 98

Appendices 101

A Psychometric Tests & Questionnaires 102

A.1 Profile of Mood States-Short Form : EASE version ...... 102

A.2 Coping Self-Efficacy for Trauma ...... 104

A.3 PTSD Checklist (PCL-5 Questionnaire) ...... 104

B Institutional Review Board Approval 107

B.1 Certificate: Conduct of Research for Engineers ...... 108

B.2 Certificate: Human, Social and Behavioral Research ...... 109

List of Figures

2.1 Setup for EASE data collection ...... 15

2.2 Self reported engagement ...... 18

2.3 EASE Data Distribution ...... 19

2.4 Engagement annotation from CARMA ...... 20

3.1 Contextual Engagement Prediction Pipeline ...... 24

3.2 Action Units from OpenFace ...... 34

3.3 AU signal inputs for engagement prediction ...... 37

3.4 Confusion matrices from engagement models ...... 39

3.5 Comparison of RNN, LSTM and GRU boxplots ...... 43

4.1 Automated Mood Prediction for Mood-Aware Context Sensitive Engagement Prediction . . 47

4.2 Changes in subject moods pre/post module in each session ...... 52

4.3 Video segment selection ...... 53

4.4 Mood-aware models Convergence Plot ...... 54

4.5 Automated Mood Predictors ...... 58

5.1 Machine learning models that estimate user engagement ...... 63

5.2 Engagement annotations from external observations ...... 67

5.3 Automated AUs, rater annotations and subject self-reports ...... 69

5.4 Machine Learning Models using RNNs ...... 71

5.5 Model tuning for FAVE ...... 72

6.1 Self-Reports collected in EASE: ...... 81

6.2 CSE as a predictor of self-reported engagement ...... 86

6.3 Changes in CSE with HRV ...... 88

6.4 CSE prediction system ...... 90

6.5 EASE subset details ...... 92

6.6 Data distribution of CSEt ...... 92

6.7 Preprocessing of signals ...... 94

6.8 Convergence plot for CSEt estimation ...... 95

7.1 System Diagram for Trauma Symptom Severity Prediction ...... 100

List of Tables

2.1 EASE dataset content summary ...... 17

2.2 PTSD specific variables and measurements ...... 21

3.1 Engagement prediction accuracy ...... 36

3.2 Contextual and Cross-contextual engagement prediction results ...... 38

3.3 Performance results RNN, LSTM and GRU ...... 40

3.4 Results of different time-windows ...... 40

4.1 Effect of POMS on contextual engagement prediction ...... 56

4.2 Mood-aware engagement results ...... 56

5.1 Data distribution from EASE subset ...... 68

5.2 Results from RA, RAV, AU and FAVE models ...... 75

6.1 MSEs comparison of trauma-severity models at a global-level ...... 82

6.2 MSEs comparison of trauma-severity models at a symptom-level ...... 83

6.3 Cross-validation results of CSEt estimation ...... 95

Chapter 1

Introduction

The popularity of smart devices has led many people to turn to the web for instant answers and advice. The internet is filled with numerous self-help resources and clinically approved applications on health-related topics, ranging from managing weight-loss 1 to coping with mental stress 2. Although such applications are easy to access and significantly reduce health-care costs [33, 183], they are built on generic models designed for specific treatment outcomes and do not take into account the mental state of users. Moreover, such applications or web-interventions measure treatment effects from standard psychometric questionnaires, answered by users themselves. Even though questionnaires are simple and inexpensive, they distract the user from the task at hand or increase cognitive load. Furthermore, answering a long list of questions, i.e., providing self-reports, is problematic and intrusive, especially for people suffering from mental disorders whose symptoms include, but are not limited to, reduced ability to concentrate, trouble understanding and frequent mood swings.

In this thesis, we create a machine-learning based pipeline to continuously monitor changes in user mental state without affecting user experience. We employ a series of signal processing, computer vision and machine learning techniques on video and sensory data collected for applications in Trauma-Recovery, elaborated in Section 1.3. Throughout the thesis, we focus on critical elements of web-interventions, namely "Engagement" and "Self-Efficacy", that are relevant to most self-help/self-regulated applications.

1 https://www.dashforhealth.com/
2 https://thiswayup.org.au/

The layout of this document is as follows: in Chapter 1 we provide a brief overview of the motivation for this research, background on trauma-recovery and the scope of the technical problem. Chapter 2 presents the dataset that is used for all experiments in this thesis. Chapters 3, 4 and 5 dig deeper into the problem of continuous engagement prediction along with the solutions developed to date. Then, in Chapter 6, we build on the well-founded Social Cognitive Theory (SCT) to define the concept of Self-Efficacy and formulate it as a machine-learning problem, along with implementing and evaluating a computational model of self-efficacy. Chapter 7 summarizes the research work of this thesis, its shortcomings and future directions. Appendix A comprises the psychometric questionnaires used in this work and, lastly, Appendix B includes the mandatory certifications for research on human subjects.

1.1 Why Engagement?

Engagement is an indispensable part of user experience and interaction with applications. It is abstract, complex and changes over time. Self-care websites often suffer from disengagement issues and lack the tools to measure user engagement in a reliable and automated manner [161, 76, 133]. A significant section of the mental-health community is trying to determine and enhance user-specific engagement [104, 84, 204].

Moreover, lack of engagement is often associated with increased dropout rates [64, 103, 152]. Web-based interventions, therefore, must not only satisfy the needs of a patient but also be adequately engaging to increase patient retention.

Interventions should also have reliable methods of forecasting user engagement and be capable of adaptation based on user engagement levels. Continuous prediction of engagement can help create adaptive websites, maximize outcomes and reduce dropouts. Continuous engagement prediction can have far reaching effects in not just web-based mental health intervention techniques, but also in online student learning [58, 141], intelligent student tutoring [73] and creating socially intelligent human-robots [126].

1.2 Why Self-Efficacy?

The term Self-Efficacy was coined by Bandura in 1977 [12], with his publication of “Self-efficacy: Toward a Unifying Theory of Behavioral Change”. Since then, self-efficacy has been a well-known construct in psychology, with a widespread impact on psychological states, behavior, motivation, etc.

Self-Efficacy has been shown to have strong predictive value for many types of activities, including health [163], stress [140], commitment [5], leadership [195] and education [213], because it emphasizes important values such as achievement and performance [147]. State-of-the-art treatment in trauma recovery is developed from SCT [26]. We seek to build machine learning techniques on this proven theory to automate efforts in trauma treatment.

Efficacy beliefs can determine whether people will invest effort and how long they will persist in that effort in the face of obstacles and aversive experiences. People who have low self-efficacy have low self-belief and weak commitment to the goals they choose to pursue. When faced with difficult tasks, they dwell on their personal deficiencies, on the obstacles they will encounter, and on all kinds of adverse outcomes rather than concentrating on how to perform successfully [18]. They slacken their efforts and give up quickly in the face of difficulties. On the contrary, people who believe they can exercise control over potential threats do not conjure up apprehensive thoughts and hence are not distressed by them [22]. The detailed construct of self-efficacy is elaborated in Chapter 6.

1.3 Background of application space: Trauma-Recovery

Mental trauma following disasters, military service, accidents, domestic violence and other traumatic events is often associated with adversarial symptoms like avoidance of treatment, mood disorders, and cognitive impairments. Each year, over 3 million people in the United States are affected by post-traumatic stress, a chronic mental disorder. Lack of treatment for serious mental health illnesses costs $193.2 billion annually in lost earnings [111]. Providing proactive, scalable and cost-effective web-based treatments to traumatized people is, therefore, a problem with significant societal impact [32]. Though self-help websites for trauma-recovery exist, they are usually generic, based on one-size-fits-all models, instead of being person-specific.

Interventions are required that can be used repeatedly without losing their therapeutic power and that can reach people even when local health care systems do not provide the needed care, recommend costly procedures, or are simply unapproachable. Such interventions require adaptive models that are tailored to individual needs and can be shared globally without taking resources away from the populations where the interventions were developed. Research suggests that personalization and automated adaptation in self-help websites can positively aid people with mental health issues and advance mental health computing [40].

In this section, before we proceed to explain the specifics of Engagement and Self-Efficacy for trauma-recovery, we first introduce Social Cognitive Theory (SCT), which is the base of the trauma-recovery website’s framework. SCT was introduced by renowned psychologist Albert Bandura [16] and developed further by a number of researchers such as Charles Benight, Aleksandra Luszczynska [26, 142] and others. SCT is an interpersonal-level theory that emphasizes the dynamic interaction between people, their behavior and their environments [92]. SCT prescribes mastery as the principal means of personality change [18]. Besides trauma-recovery, SCT has also been widely applied to various fields, like learning computer skills [61], media attendance [131], improving academic performance [132], gifted education [37] and health promotion [20].

One special application/web-intervention of SCT that is available to us for research was designed for the treatment of trauma (http://ease.vast.uccs.edu/). The determinants of this website were based on a study that applied multiple controls for diverse sets of potential contributors to posttraumatic recovery. In these different multivariate analyses, perceived Coping Self-Efficacy (CSE), explained in Chapter 6, emerged as a focal mediator of posttraumatic recovery [26]. This finding of CSE being critical to trauma-recovery makes it generalizable for trauma treatments irrespective of the kind of trauma, whether from natural disasters, technological catastrophes, terrorist attacks, military combat, or sexual or criminal assaults.

Social learning analysis of SCT also suggests that expectations of personal efficacy are based on four major sources of information: performance accomplishments, vicarious experience, verbal persuasion and physiological states [18]. Amongst these four sources, physiological states can be monitored through body-sensors and are found to be predictive of stress control/arousal. Arousal, although not the focus of this thesis, is a fundamental component of changes in coping self-efficacy for post-traumatic stress [30].

Arousal mechanisms are exciting and important to understand because at the deepest level they impact all human behaviors [173]. The pattern of arousal defines physiological toughness and, in interaction with psychological coping, corresponds with positive performance in even complex tasks, with emotional stability, and with immune system enhancement [70].

Unfortunately, measuring some form of arousal from a video feed proved to be ineffective with the captured videos of users during trauma-web sessions. Moreover, visual estimates of arousal are challenging and lack the precision of arousal measures from body-sensors [107]. Since quantifying self-efficacy through a video feed has not been explored before, the work in Chapter 6 will be the first machine-learning approach to fuse visual and physiological data to predict changes in self-efficacy. As we show that visual and physiological sensor data can predict changes in self-efficacy, a future possibility is to explore robust vision-based techniques for physiological measurements [214, 136].

Change in CSE is key for trauma-recovery, as it promotes engagement and positive adaptation. Since no two people interact with the same content in the same way, engaged users have distinctive approaches to trauma web-interventions that are based on internal self-regulation of emotional and physical distress. Building a smart system that reduces the interaction overhead of individuals by combining sensing and machine learning to predict user mental states of perceived self-efficacy and engagement offers the first steps towards creating an adaptive web-based trauma-recovery intervention.

1.4 Limitations of prior work

A recent review on web-based interventions in psychiatry outlines the challenges in replacing face-to-face interactions with web-based interventions, like the unavailability of non-verbal cues, infrequent responses, disinhibition, and patient emotion/role management [153, 164]. In face-to-face psychotherapy sessions, psychologists monitor a series of complex processes such as spontaneous facial expressions, emotions, mood swings, engagement and other behaviors when working with patients undergoing treatment.

With recent advances in the field of affective computing, it is now possible to recognize facial expressions, detect spontaneous emotions and infer affective states of an individual. Computer vision and deep learning have made it possible to assess the current emotional state of an individual with a webcam [71, 24, 121].

Contrary to self-reports, analyzing facial expressions is non-intrusive and does not interrupt user experience.

However, while much has been done in the space of the six basic expressions (happiness, sadness, anger, disgust, fear, surprise) and the facial action units associated with them [118], very little has been done to create face-based engagement models. Most affective studies have elicited emotional responses via an external stimulus [102, 201, 127] and focussed on the two basic dimensions of valence and arousal [174]; prototypical expressions of basic emotions occur relatively infrequently in natural interaction and even less so while people operate on a web-based intervention. This makes modeling of facial expressions for engagement prediction during web-based interventions a challenging task. Moreover, existing engagement models in affective computing are confined to dyadic human-robot interactions [82], social interactions [218] and student learning [159, 34], and cannot be assumed to be applicable directly to trauma-recovery. Apart from face-based methods, non-visual parameters have also been used to measure engagement in student learning, like analyzing website logs [57], time on task [129], body postures [75], etc. However, interactions with a web-application change over time and are composed of a series of concurrent and sequential dynamics. People with mental-health issues are not necessarily as compliant as students are in online learning. A continuous and dynamic representation of how the engagement level of a user changes over time, obtained by monitoring their facial expressions, has the potential of giving insights about user mental state that are impossible to derive from non-face-based solutions.

As mentioned earlier, self-efficacy has proven effective in diverse domains, and the current gold-standard to measure self-efficacy is via questionnaires, which is an infrequent and intrusive measurement. The most recent study that targeted user engagement and changes in self-efficacy for depression was done by Morris et al. using a crowd-sourced application of peer support networks on the Panoply platform [160]. The aim of the Panoply project was to create an evidence-based application to enhance user engagement, but it lacked the availability of any visual information of users. Morris et al. hypothesized that repeated use of the platform would reduce depression symptoms and that the social, interactive design would promote engagement. However, there were no depression status criteria for participant inclusion and the app was advertised as a stress reduction app, which raises concerns about the generalizability of treatment effects. Other methods for evidence-based interventions include creating virtual agents that empathize with subjects; however, to date, there is no evidence that these empathetic virtual coaches work better than a neutral virtual coach [192]. To summarize, most web-based applications assume changes in efficacy as an intermediate effect of treatment and do not take into account users' affective or cognitive state. In our research, we aim to predict self-efficacy changes while monitoring facial actions and physiological responses. Learning-based self-efficacy prediction models have never been developed to date, and the exploratory approach on an evidence-based website for trauma-recovery, with models built on real data from trauma subjects, will guarantee generalizability for mental-health interventions.

According to social psychology, human experience is an intertwined outcome of behavior (interactions), cognition (thoughts) and affect (feelings) [116]. Behavior and affect are generally detected through a series of facial expressions, gestures, body movements, speech, and other physiological signals, such as heart rate, respiration, sweat, etc. The purpose of this research is to explore a machine learning based approach towards analyzing and predicting cognitive human behavior through behavioral modeling from face videos. Based on

SCT, we target a closed space of EASE (Engagement, Arousal and Self Efficacy) for the proposed research.

1.5 Scope of this research

The aim of the research presented in this thesis is to develop and analyze machine learning techniques for building automated models of engagement and self-efficacy. We develop algorithms that use both voluntary and involuntary feedback encapsulating the interactions of brain and body in a non-intrusive setting, in order to monitor coping capabilities and engagement levels. We build on the well-established social-cognitive theory, where arousal and self-efficacy are critical elements of self-regulation. We measure engagement, which is critical in self-directed web-based treatment. However, the goal of this thesis is not to invent a new psychology theory that claims to solve the problem of web-interventions, nor to modify SCT. Rather, the goal of the thesis is to address the issues faced by self-help applications, identify the problems in building a learning based model, and propose and evaluate solutions. As mentioned earlier, self-reports are limited by the frequency at which the user can be asked for a response without significantly annoying them. Self-reports are impossible to collect on a per-second basis. Through this work we aim to formulate the first steps towards an adaptive, learning-based, self-regulated intervention applied to the domain of trauma-recovery. Listed below are the research questions that drive our work:

1. How do we reduce arduous psychometric measures and create machine learning models to automatically estimate user response?

2. Can self-efficacy be predicted from videos or physiological data or both?

3. Is engagement or change in self-efficacy dependent on the task at hand?

4. How do we integrate a priori information or meta-data about the user to predict mental state?

5. What features should we use to predict engagement from video?

6. How do we evaluate the created models for generalizability?

1.6 Contributions from this thesis

1.6.1 Dataset Collection

One of the contributions of this thesis is the collection, organization and sharing of data for behavioral research, discussed in Chapter 2. It is termed the Engagement, Arousal and Self-Efficacy (EASE) dataset and consists of data gathered from 110 subjects. The duration of recordings is about one hour per session. The video recordings in the EASE dataset for all subjects are taken at multiple visits with 1-2 weeks between each recording. Subjects provide self-reports for engagement, arousal, self-efficacy, mood, depression and other measurements (see Table 2.2). Continuous annotations cover 864K seconds of video, considering a minimum of 80 subjects for 3 sessions of 1 hour each. Affect/behavioral research datasets typically consist of approximately 25-40 subjects with recordings lasting 3-15 minutes each. Thus, the collected EASE dataset is a significantly larger effort compared to the current state-of-the-art in the field (see Chapter 2 for details). The EASE dataset will be publicly released to facilitate further development in this area 3.

1.6.2 Contextual Engagement Prediction from video

A wide range of research has used face data to estimate a person’s engagement, in applications from advertising to student learning. An interesting and important question not addressed in prior work is whether face-based models of engagement are generalizable and context-free, or whether engagement models depend on context and task. We show that engagement in a given situation can be predicted through subtle facial movements and fine-grained actions such as changes in facial expressions. The proposed prediction model incorporates context and temporal variations of input features. This study is the first of its kind to define engagement as context-sensitive. In Chapter 3, we describe the problem of engagement prediction from video in detail and conduct thorough experiments to verify our claims.

1.6.3 Mood-aware Engagement Prediction

We conjecture that a subject’s current mood might have an effect on their engagement level with the task at hand. We verify our claim empirically in Chapter 4. While a significant amount of work in recent years has focussed on advancing machine learning techniques for affect recognition and affect classification, the prediction of mood from facial analysis and the usage of mood data have received less attention. Questionnaires for psychometric measurement of mood-states are common, but using them during interventions that target the psychological well-being of people is arduous and may burden an already troubled population. In Chapter 4, we present mood prediction as a sequence learning problem and use the predicted mood-scores to aid our contextual engagement pipeline.

1.6.4 Automated Action Units vs. Expert Raters

Several data collection techniques have been explored in the past that capture different modalities for engagement, from obtaining self-reports to gathering external observations via crowd-sourcing or even trained expert raters.

3 We have IRB approval to release only unidentifiable data, including Action Units extracted from facial videos. We cannot display identifiable data, including video frames from the dataset.

Traditional machine learning approaches discard annotations from inconsistent raters, use rater averages or apply rater-specific weighting schemes. Such approaches often end up throwing away expensive annotations.

We introduce a novel approach in Chapter 5 that exploits the inherent confusion and disagreement in raters' annotations to build a scalable engagement estimation model that learns to appropriately weigh subjective behavioral cues. We show that actively modeling the uncertainty, either explicitly from expert raters or from automated estimation with facial action units, significantly improves prediction over prediction from just the average engagement ratings. Our approach performs significantly better than or on par with experts in predicting engagement for the trauma-recovery application.

1.6.5 Self-Efficacy prediction: Problem formulation

Benight et al. reviewed substantial empirical evidence on the role of perceived Coping Self-Efficacy (CSE) in recovery from traumatic experiences within the framework of SCT [26]. In psychology studies, CSE [32] is quantified from individual responses to questionnaires that assess users' perception of their own capabilities to cope with the situational demands of trauma-recovery. We lay the foundations of a machine learning model using the EASE dataset for self-efficacy prediction by evaluating two things: (1) the effect of CSE changes on symptom severity and (2) the relationship of engagement and arousal to self-efficacy. The findings of the aforementioned evaluation strengthen the basis for feature selection, or predictors, in our learning-based self-efficacy model.

1.6.6 Predicting Change in Self-Efficacy using EA features

Neither the absolute value of self-efficacy nor the changes in self-efficacy are directly observable from visual inferences. In Chapter 6 we implement and evaluate a learning technique that can predict self-efficacy using the pre-training self-efficacy value along with derived predictions of engagement from visual data and physiological arousal. To the author's knowledge, this is the first attempt to infer changes in self-efficacy, during treatment, from visual data, physiological signals and infrequent questionnaires.

1.7 Publications from this work

1.7.1 First author publications

1. S. Dhamija, T. Boult, "Exploring Contextual Engagement for Trauma Recovery", Workshop on Deep Affective Learning and Context Modelling (DALCOM), CVPR 2017.

2. S. Dhamija, T. Boult, "Automated Mood-aware Engagement Prediction", Affective Computing and Intelligent Interaction (ACII) 2017.

3. S. Dhamija, "Learning based Visual Engagement and Self-Efficacy", Affective Computing and Intelligent Interaction (ACII) 2017, Doctoral Consortium.

4. S. Dhamija, T. Boult, "Automated Action Units Vs. Expert Raters: Face off", IEEE Winter Conference on Applications of Computer Vision (WACV) 2018.

5. S. Dhamija, T. Boult, "Learning Visual Engagement for Trauma Recovery", Workshop on Computer Vision for Active and Assisted Living (CV-AAL), WACV 2018.

1.7.2 Other work

6. A. M. Ragolta, S. Dhamija, T. Boult, "A Multimodal approach for predicting changes in PTSD symptom severity", ACM International Conference on Multimodal Interaction (ICMI) 2018.

7. K. Shoji, A. Devane, S. Dhamija, T. Boult, C. C. Benight, "Recurrence of Heart Rate and Skin Conductance During a Web-Based Intervention for Trauma Survivors", International Society for Traumatic Stress Studies (ISTSS), 2017.

8. C. C. Benight, K. Shoji, C. Yeager, A. Mullings, S. Dhamija and T. Boult, "The importance of self-appraisals of coping capability in predicting engagement in a web intervention for trauma", International Society for Research on Internet Interventions (ISRII), 8th Scientific Meeting, 2016.

9. C. C. Benight, K. Shoji, C. Yeager, A. Mullings, S. Dhamija, T. Boult, "Changes Self-Appraisal and Mood Utilizing a Web-based Recovery System on Posttraumatic Stress Symptoms: A Laboratory Experiment", International Society for Traumatic Stress Studies (ISTSS), 2016.

10. K. Shoji, C. C. Benight, A. Mullings, C. Yeager, S. Dhamija, T. Boult, "Measuring Engagement into the Web Intervention by the Quality of Voice", International Society for Research on Internet Interventions (ISRII), 2016.

Chapter 2

EASE Dataset Collection

2.1 Introduction

In this chapter we present the EASE (Engagement, Arousal and Self-Efficacy) dataset that is used for all experiments in Chapters 3, 4, 5 and 6. Before diving into the data collection procedure, we first provide a brief overview of the existing engagement and arousal datasets in Section 2.2. In the sections following related work, we elaborate on the EASE space variable measures and the challenges faced during data collection.

2.2 Related Work

Behavioral and physiological arousal has been studied extensively in the affective computing domain [102, 85]. There are numerous datasets which elicit emotional responses and tag videos with arousal and valence information [201, 127, 128]. However, such datasets have no measures of engagement and self-efficacy. For example, the RECOLA multimodal database was designed with the objective of developing a computer-mediated communication tool and studying remote collaborative and affective interactions. The database consists of 9.5 hours of audio, visual, and physiological (electrocardiogram and electrodermal activity) recordings of online dyadic interactions between 46 French-speaking participants, who were solving a task in collaboration [180]. The data focussed on annotations for valence and arousal for each participant and did not include explicit labels to study engagement and self-efficacy of the subjects.

The largest available dataset for student engagement currently comes from Massive Open Online Courses (MOOCs), which is collected from online student-learning websites like Coursera, Khan Academy, etc. The drawback of MOOCs, however, is the unavailability of student face videos [101]. In order to study engagement prediction for student learning, Kamath et al. proposed a dataset inspired by MOOCs, consisting of videos of students viewing an online lecture. The dataset used by the authors for their experiments [118], however, did not include the entire videos but only very few frames, which were annotated for engagement via crowdsourcing. The raters were asked to rate student engagement level as one of three: Not engaged, Nominally engaged and Very Engaged. Since the data did not include video, but individual image frames, it is of limited use for modeling changes in interactions over time. Furthermore, the total number of annotated frames was only 4500, which is significantly less than needed for building deep-learning models [135, 52].

The MMDB dataset was proposed by Tsatsoulis et al. to study the interactions between a trained examiner and a child aged 15-30 months [212]. The dataset consists of audio/video recordings of individual sessions along with ratings of engagement and responsiveness at the substage level. The ratings consisted of frame-level, continuous annotation of relevant child behaviors. While the dataset is similar in spirit for studying face-based engagement predictions, it does not contain any data corresponding to self-efficacy. Furthermore, the children are not expected to perform a specific task, but merely interact with the trained examiner.

Due to the lack of standardized large-scale datasets [175] and engagement data confined to classroom experiments [227, 158], there is a need to collect datasets with efficacy and engagement measures. In this chapter, we explain the data collection setup for the EASE space and the various measures.



Figure 2.1: Setup for EASE data collection: Subjects interacted with the website while performing self-regulation exercises. Face data was captured using an external webcam; voice data was captured using a microphone. Additional data such as skin-conductance, respiration, and ECG signals were also recorded using wearable sensors. All the interactions were recorded using an external camera.

2.3 EASE Data Collection Setup

Based on SCT, “My Trauma Recovery” website developed by Dr.Benight and others [27] has divided trauma recovery into six modules i.e., relaxation, triggers, social-support, self-talk, unhelpful coping and professional help. The goal of the website is to reduce hospitalization rates and health care costs by providing free self-guided trauma-recovery treatment.

There are 6 modules on the My Trauma Recovery website. Each participant was assigned two out of the six modules in each visit. The first two visits were restricted to the Relaxation and Triggers modules only, and in the third visit, the participants were free to choose from the remaining four modules. In the first visit, subjects were randomly allocated Relaxation or Triggers as the first module, with the reverse order during the second visit. At the beginning of each session, the subjects watch a neutral introductory video for 2 minutes and 32 seconds, after which they are asked to close their eyes and relax for one minute. The physiological data measurements during this duration of quiet-time are considered as the baseline for the subject. The relaxation module presents the user with video demonstrations of various exercises like breathing, progressive muscle relaxation, etc. The triggers module educates the user about trauma symptoms and prevention. Each visit (from here on referred to as a session) lasted approximately 30 minutes to 1.5 hours.
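As an illustration of how such a quiet-time baseline can be used, the sketch below expresses a physiological signal relative to the subject's own baseline statistics. The normalization choice (z-scoring) and the synthetic data are assumptions made for this example and are not the exact preprocessing used in this thesis.

```python
# Illustrative sketch (assumed convention, not the thesis' exact preprocessing): express a
# physiological signal relative to the subject's one-minute quiet-time baseline, so that
# later arousal features are person-specific rather than absolute.
import numpy as np

def baseline_normalize(signal, baseline_segment):
    """Z-score a signal using the mean/std of the subject's quiet-time baseline."""
    mu, sigma = baseline_segment.mean(), baseline_segment.std()
    return (signal - mu) / (sigma + 1e-8)   # small epsilon guards against a flat baseline

# Example with synthetic skin-conductance values sampled at 256 Hz.
quiet_time = np.random.rand(60 * 256)         # one minute of baseline samples
module_signal = np.random.rand(30 * 60 * 256) # a 30-minute module
normalized = baseline_normalize(module_signal, quiet_time)
```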

During these sessions, a Logitech webcam with a resolution of 640x480 at 30 fps recorded the participant's face video along with audio. Physiological data was also recorded for the entire session, as shown in Figure 2.1. The participants could freely interact with the trauma recovery website, and their interactions were recorded in the form of a Picture-in-Picture video using a Camtasia recorder (with screen and webcam recording simultaneously). During the module, participants provided self-reports about their engagement level, coping capabilities, moods, etc.

2.4 Participants and Demographics

Subjects with potential Post-Traumatic Stress Disorder (PTSD) were recruited from three Colorado Springs health service providers: the TESSA domestic violence center; Peak Vista Community Health Partners, which has 21 centers for seniors, the homeless, a women's health center and six family health centers; and the Veterans Health and Trauma Clinic, a clinic that provides services to veterans, active duty service members, and their families who are struggling with combat trauma; as well as the student online research system (SONA). The subject inclusion criteria were as follows [28]:

1. Having experienced at least one traumatic event in the last 24 months based on the Diagnostic and Statistical Manual of Mental Disorders (DSM-V A) [9] criteria.

2. Be over 18 years of age.

Currently, we have collected data for 110 subjects, of which 86 subjects completed all 3 sessions. We have 10 control subjects, and data collection is ongoing to add more control subjects to the set. The participants are in the age range of 18-79 years, with 80% of the population aged 46 or younger.

The demographic distribution of the participants is shown in Table 2.1.

Participants and Modalities
# Participants: Total 110 (88 Female, 17 Male, 5 did not specify)
Session 1: Group 1 (65 subjects): Module 1 Triggers, Module 2 Relaxation; Group 2 (45 subjects): Module 1 Relaxation, Module 2 Triggers
Session 2: Group 1 (65 subjects): Module 1 Relaxation, Module 2 Triggers; Group 2 (45 subjects): Module 1 Triggers, Module 2 Relaxation
Session 3: Module 1: 25 Unhelpful Coping, 24 Self Talk, 24 Social Support, 27 Professional Help; Module 2: 25 Unhelpful Coping, 22 Self Talk, 30 Social Support, 23 Professional Help
Recorded Signals: ElectroCardioGraph (ECG) at 256 Hz; Skin Conductance at 256 Hz; Respiration at 256 Hz; Face Video, Audio and Monitor at 30 fps
Ethnicity: 7 American Indian or Alaskan Native; 6 Asian or Pacific Islander; 15 Black or African American; 12 Hispanic or Latino; 77 White or Caucasian; 1 Caribbean Islander

Table 2.1: EASE dataset content summary: The table above shows the demographics of participants and the distribution of modules taken by them in each session.

2.5 The EASE space variables

Engagement information is collected in two ways: subjects' self-assessments and annotated data from psychologists. Arousal is gathered through physiological signals and behavioral data annotations. Standard questionnaires are used to evaluate treatment effects on the user.

2.5.1 Engagement measures

There were two types of engagement responses collected through this experiment: data from self-reports and external annotations.

Real-time self-reports were collected while the subject worked on the trauma recovery module. The original website was modified to include an engagement question at specific pages within the module. Subjects were asked to enter their response to a question about their level of engagement on a scale of 1-5, with 1 being the lowest and 5 the highest, as shown in Figure 2.2. The self-reports appear approximately three times within each module. The website's SQL database stores the user response with the current date, timestamp, module details and page information.

Figure 2.2: Self reported engagement: During the experiment the subjects are presented with a screen asking them about their engagement level on a scale of 1-5, where 1 is “Very Disengaged” and 5 “Very Engaged”. The user is not allowed to proceed to the next page on the website unless the question has been answered. The response is stored in a SQL database with timestamp.

Figure 2.3 shows the details about the data. As mentioned earlier, each participant came in for three visits. The first two (controlled) visits are used for all experiments in this work. In each visit, the subjects undergo the relaxation and triggers tasks. While there were more subjects, some dropped out during the study. Some video segments could not be used due to synchronization issues and missing markers (refer to Section 2.6 for more details). Hence, for the first session we have data from 95 subjects, and from the second session we use data from 80 subjects. Some of the collected data was unusable due to either system issues, data corruption or a lack of engagement self-reports.

After the sessions, trained psychologists annotate the videos using the CARMA (Continuous Affect Rating and Media Annotation) software [90] on a scale of -100 to +100. Each video is rated by 3 different raters based on visual cues, as shown in Figure 2.4. A rater pool of three is used to allow inter-rater and intra-class agreement measures for more reliable values. The per-subject ratings are stored in comma-separated (CSV) files with per-second annotations, along with the time of rating and the video filename. Per-second annotation is a costly process which includes the training of psychologists and the actual rating time. The majority of the relaxation videos for each session were around 30-45 minutes; therefore, to reduce rating costs, we segmented each original video longer than 12 minutes into 4-minute segments from the start, middle and end of the module for CARMA annotations. Segmenting relaxation videos into 3 segments ensured minimal loss of valuable information and reduced time and costs.
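To make the inter-rater variability mentioned here concrete, the sketch below shows one simple way such agreement could be checked. It is an illustrative example only, not the rating pipeline used in this work: the column names (rater1-rater3) and the synthetic data stand in for the per-second CARMA exports described above.

```python
# Illustrative sketch (hypothetical column names, not the actual CARMA export schema):
# check inter-rater agreement on per-second engagement ratings via pairwise Pearson r.
from itertools import combinations

import numpy as np
import pandas as pd

def pairwise_rater_agreement(ratings):
    """Mean pairwise Pearson correlation across the rater columns of a DataFrame."""
    rater_cols = [c for c in ratings.columns if c.startswith("rater")]
    corrs = {(a, b): ratings[a].corr(ratings[b]) for a, b in combinations(rater_cols, 2)}
    return corrs, float(np.mean(list(corrs.values())))

# In practice the per-second ratings would be loaded from the CSV files described above,
# e.g. pd.read_csv("subject01_relax_seg1.csv"); here we use 240 seconds of synthetic data.
rng = np.random.default_rng(0)
base = rng.normal(0, 40, 240)  # a shared underlying engagement trace on the -100..+100 scale
ratings = pd.DataFrame({f"rater{i}": np.clip(base + rng.normal(0, 25, 240), -100, 100)
                        for i in (1, 2, 3)})
pair_corrs, mean_r = pairwise_rater_agreement(ratings)
print(pair_corrs, mean_r)
```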

Figure 2.3: EASE Data Distribution: This figure displays information about participants and the distribution of modules taken by them in each session considered for engagement analysis in this work. Participants consisted of 110 subjects in total (88 Female, 17 Male, 5 did not specify), in the age group of 18-79 years, with 80% being under the age of 46. Engagement self-reports and the corresponding availability of videos reduced the data size from 110 subjects to 95 subjects for the first session and 80 subjects for the second session.

2.5.2 Arousal data

We used a BioGraph Infiniti TT-AV Sync Sensor (T7670) and a TT-USB interface unit with an encoder. An Infiniti encoder is a multiple-channel, battery-operated device for real-time psychophysiology, biofeedback and data acquisition. We use three sensors: a 3-lead EKG, a skin conductance sensor (SA9309M) and a respiration belt.

The physiological signals are sampled at 256 Hz and stored in a text file. 24 event markers (12 marker pairs) were entered by research assistants during the session in the BioGraph datafile. Each marker pair represented the start and end of: the introductory video, quiet-time, module 1, module 2, vocal responses (6) and surveys (2).
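As an illustration of how such marker pairs can be turned into usable segment boundaries, the sketch below pairs start/end markers and converts sample indices to seconds at 256 Hz. The list-of-tuples input format, the labels and the sample indices are assumptions made for this example, not the actual BioGraph export format.

```python
# A minimal sketch (hypothetical file format, not the actual BioGraph export schema):
# pair up start/end event markers to recover the duration of each recorded segment.
# Markers are assumed to be a chronological list of (sample_index, label) tuples,
# with each label appearing exactly twice (start, then end), sampled at 256 Hz.

SAMPLE_RATE_HZ = 256

def pair_markers(markers):
    """Group start/end markers into {label: (start_sec, end_sec)} in seconds."""
    opened, segments = {}, {}
    for sample_idx, label in markers:
        if label not in opened:
            opened[label] = sample_idx                  # first occurrence = start marker
        else:
            start = opened.pop(label)                   # second occurrence = end marker
            segments[label] = (start / SAMPLE_RATE_HZ, sample_idx / SAMPLE_RATE_HZ)
    return segments

# Example with made-up sample indices for one session:
markers = [(0, "intro_video"), (38912, "intro_video"),
           (40000, "quiet_time"), (55360, "quiet_time"),
           (60000, "module1"), (520000, "module1")]
print(pair_markers(markers))
# {'intro_video': (0.0, 152.0), 'quiet_time': (156.25, 216.25), 'module1': (234.375, 2031.25)}
```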

In the past, speech analysis has proven that voice carries enough information to determine the stress levels of a subject [105, 197, 60]. With the same intent in mind, after completion of each module, users were asked for vocal responses to the following questions:

1. How will the module help or not help in dealing with the emotional challenges related to your trauma?

2. Comment about the usability of the module?

3. How do you feel the module will be beneficial in helping you cope?

Continuous annotations by psychologists estimating arousal from videos, was explored but is still considered questionable (Please refer to the challenges Section 2.6).

19 Figure 2.4: Engagement annotation from CARMA: The above figure shows engagement ratings provided by multiple raters (psychologists) for a relaxation segment of 4 minute duration. The ratings are available on a per second basis throughout the video segment. The y-axis shows the range of the ratings, -100 (disengaged subject) to 100 (highly engaged subject). It can be observed from the ratings that there is significant amount of inter-rater variability in the collected annotations. The face images are shown for illustrative purpose, and are not directly related to the engagement ratings.

2.5.3 Trauma-recovery outcomes and dependent variables

The subjects were presented with multiple questionnaires targeted at measuring PTSD-related affect. The questionnaires broadly comprise general outcome variables, module affect response assessments and control variables. Various psychology studies measure these variables from standardized questionnaires in the form of surveys.

Participants for the EASE study were chosen based on a screening test using Trauma Exposure Scale (TES).

TES is used to measure the types of trauma a person has been exposed to, or the degree of severity of the traumatic event someone experienced. PTSD-Checklist (PCL) was used to measure trauma symptom severity.

Table 2.2 shows the list of variables and their corresponding point scales. PCL, the Center for Epidemiologic Studies Depression Scale (CES), Confidence in use of Technology (TCONF), Coping Self-Efficacy General (CSEg) and Chesney's CSE (CSEc) are general outcomes measured before and after the experiment, i.e., at the beginning of session 1 and the end of session 3. Profiles of Mood States (POMS), Perceived Distress Quotient (PDQ), Distress (DIST) and CSE-Trauma (CSEt) were taken before and after each module to capture module effects.

Self-reported data from standardized questionnaires
Name | Description | Point Scale
TES | Trauma Exposure Scale (TES), used to assess trauma exposure levels | 10
PCL | PTSD Checklist-Civilian Version (PCL-C) or the Military Version (PCL-M), used to assess post-traumatic stress symptoms | 5
CES | Center for Epidemiologic Studies Depression Scale (CES-D), a 20-item questionnaire assessing depressive symptoms in the past week | 4
CSEt | Trauma Coping Self-Efficacy Scale, a 9-item questionnaire evaluating belief in the capacity to manage demands related to trauma recovery | 7
POMS | Very Short Profiles of Mood States-Short Form (POMS-SF), used to measure reactive changes in mood from 24 questions | 5
PDQ | Responses to Script-Driven Imagery Scale-Dissociative Experience Subscale, measures severity of dissociative experience provoked by a task related to trauma | 7
DIST | Subjective Units of Distress Scale, used to assess changes in perceived distress | 5
CSEg | General Self-Efficacy Scale, a 10-item questionnaire to assess optimistic self-beliefs to cope with a variety of difficult demands in life | 4
CSEc | Chesney's Coping Self-Efficacy Scale, a 13-item questionnaire evaluating confidence in performing coping behaviors when faced with life challenges | 11
TCONF | Confidence in use of Technology | 5

Table 2.2: PTSD-specific variables and measurements: This table shows specific measures collected from questionnaires given to participants to assess post-traumatic stress symptoms, depressive symptoms, mood scores, general distress, self-efficacy for trauma, trauma exposure, and confidence in web and mobile support use. The measures were segmented into general outcome measures, module affect response assessments, and control variables. The general outcome measures were taken by the subject at baseline (before the start of the experiment) and again at the end of the experiment. The module affective response was assessed before and after each module for every session.

2.6 Challenges

It is important to point out several challenges with this dataset collection. First, gathering and analyzing behavioral data in naturalistic scenarios is a cumbersome task. Second, sensors are always prone to motion artifacts, and the accuracy of the sensor is limited by the accuracy of sensor placement, the amount of gel applied on the probe, etc. Also, due to limitations in facial feature tracking caused by face occlusions, our methods will not be able to extract features from some video segments. For example, some subjects were too close to the screen while working on the website, and their face (or most of it) was not in the range of the webcam. Third, synchronization of video recordings (30 Hz) with sensors (256 Hz) is a challenging task due to the varying data rates of these modalities. Moreover, event markers in the physiological datafile were entered manually by research assistants during sessions, by tapping on a spacebar. Since these markers were used for video synchronization in an automated process, the reliability of the manual markers determined the correctness of video synchronization. Fourth, in case a start/end event marker (from a marker pair) was missing in the BioGraph file, the duration for the event could not be determined, resulting in a missing video. Fifth, it has been observed that annotating arousal from video is challenging even for trained clinical experts. The behavioral arousal ratings had very low inter-rater reliability, and hence no validity. The psychologists finally ended up classifying their annotations into Activity/Inactivity. Another strategy to save rater time was to rate videos sped up 2x, so the engagement annotations were reduced from 1 annotation per 30 frames to 1 annotation per 60 frames.
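For concreteness, the following sketch shows one simple way the 30 fps video stream and the 256 Hz sensor stream could be brought onto a common per-frame timeline once a shared start marker is known. It is an assumed illustration of the alignment step, not the exact synchronization code used for EASE; in practice the shared start comes from the manually entered markers, which is why marker reliability matters.

```python
# Hypothetical sketch of the alignment step described above: map each 30 fps video frame
# to the corresponding window of 256 Hz sensor samples, assuming both streams start at the
# same event marker. Names and array shapes are illustrative.
import numpy as np

VIDEO_FPS = 30
SENSOR_HZ = 256

def sensor_value_per_frame(sensor_signal, n_frames):
    """Average the 256 Hz signal over each 1/30 s video-frame interval."""
    per_frame = np.empty(n_frames)
    for f in range(n_frames):
        start = int(round(f * SENSOR_HZ / VIDEO_FPS))      # first sensor sample of frame f
        end = int(round((f + 1) * SENSOR_HZ / VIDEO_FPS))  # first sample of frame f+1
        per_frame[f] = sensor_signal[start:end].mean()
    return per_frame

# Example: 10 seconds of synthetic skin-conductance data aligned to 300 video frames.
signal = np.random.randn(10 * SENSOR_HZ)
aligned = sensor_value_per_frame(signal, 10 * VIDEO_FPS)
print(aligned.shape)  # (300,)
```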

Chapter 3

Contextual Engagement Prediction from video

3.1 Introduction

Engagement is a critical component of student learning, web-based interventions, marketing, etc., and face-based analysis is the most successful non-invasive approach for engagement estimation [146, 144, 202, 157, 226]. Techniques that involve face-based estimation of the six basic emotions are considered to apply to any situation, i.e., to be nearly universal. An interesting and important question not addressed in prior work is whether face-based models of engagement are generalizable and context-free, or whether engagement models depend on context and task. This work examines the role of “context” in engagement, in particular, engagement in trauma treatment. As shown in Figure 3.1, depending on the task, the same facial expression can be interpreted as engaged or disengaged.

Emotions (anger, fear, surprise, disgust, sadness and happiness) are well studied, with multiple products/systems to estimate emotional response from face data in a context-free manner. While named emotions are important, the basic emotion categories fail to capture the complete richness of facial behavior. Furthermore,

23 Facial General Engaged Feature Engagement Extraction Prediction Disengaged Traditional Engagement Prediction Pipeline Context Facial switch Context 1 (LSTM) Engaged Feature Extraction Context 2 Disengaged (LSTM) Context Sensitive Engagement Prediction Pipeline

Figure 3.1: Consider the images on the left. Which subjects are engaged and which are disengaged? Would you change your answer if you knew one had a task of doing a relaxation exercise? What if it was reading web content, watching a video or taking a test? We contend that face-based engagement assessment is context sensitive. Traditional engagement prediction pipelines based on facial feature extraction and machine learning techniques learn a generic engagement model, and would consider the face in lower left disengaged. In trauma recovery, individuals are often advised to do particular exercises, e.g. self-regulation exercises where the task involves the subject to “close your eyes, relax and breathe”. The image on the lower left is a highly engaged subject. Hence, there is a need to re-visit existing facial-expression-based engagement prediction techniques and augment them with the context of the task at hand. As shown in bottom right, this work develops context-sensitive engagement prediction methods based on facial expressions and temporal deep learning. most studies have elicited emotional responses via a stimulus; prototypical expressions of basic emotions occur relatively infrequently in natural interaction and even less so while people operate on a web-based intervention. Facial analysis gives strong clues about the internal mental state, e.g., a slight lowering of the brows or tightening of the mouth may indicate annoyance.

The ground truth for the development of engagement prediction systems generally comes in one of two forms: sparse labels obtained from self-reports [72] or observational estimations from external human observers [226]. Generally, self-reports are collected from questionnaires in which subjects report their level of engagement, attention, distraction, excitement, or boredom; in observational estimates, human observers use facial and behavioral cues to rate the subject's apparent engagement.

D'Mello et al. [158, 168, 75, 10] and Whitehill et al. [226, 227] have a longstanding line of research into learning-centered student engagement and the facial expressions associated with it. Recently, correlations between specific fine-grained facial movements and self-reported aspects of engagement, frustration, and learning were identified [96]. Thus, to create an engagement prediction model for trauma recovery, we use subtle facial movements or action units rather than emotional responses as an intermediate representation.

In this chapter, we show that computer vision and deep-learning-based techniques can be used to predict user engagement from webcam feeds. We further show that context-specific modeling is more accurate than a generic model of user engagement. We evaluate our machine learning models on the EASE dataset, from sessions 1 and 2, described in detail in Chapter 2. Our research in visual engagement also addresses two key questions: how much time do we need to analyze facial behavior for accurate engagement prediction, and which deep learning approach provides the most accurate predictions.

The contributions of this work are as follows:

1. The first exploration of engagement in two contextually different tasks within the recovery regime: “Relaxation” and “Triggers”.

2. Developing automated engagement prediction methods based on automatically computed Action Units (AUs) using Long Short-Term Memory (LSTM). We train/test this on both subtasks within trauma recovery.

3. We build context-specific, cross-context and combined prediction models and show the importance of context in predicting engagement from facial expressions.

4. We use sequences of AU features for engagement prediction and compare different types of sequence-learning algorithms, namely, Recurrent Neural Networks (RNN), Gated Recurrent Units (GRU) and LSTMs.

5. We compare learning using different length segments of AUs and compare prediction accuracy when using anywhere between 15 and 90 seconds of data.

While this work addresses engagement in the context of treatment of mental trauma, we conjecture that our observations of contextual engagement hold not just for trauma recovery but also for other domains. For example, in student learning, a student looking up could be engaged if the current task is listening to a lecture or watching something, or could be disengaged if the task is reading or taking a test, as shown in Figure 3.1.

3.2 Related work

The work presented in this chapter is related to research from multiple communities such as context-models in vision, web-based intervention methods, trauma recovery, facial action units and expression analysis, automatic engagement recognition for student learning, deep-network-based sequence learning methods and others.

Context in vision research: Context-sensitive models have a wide range of uses in computer vision.

Context, in general, can be grouped into feature level context, semantic relationship level context, and prior/priming information level context [224]. The feature and semantic levels address context within the scene/image, i.e., “context” implies spatial or visual context, and often a goal is to estimate that local/global context from the data and to use it to help with the primary vision task, e.g. [178, 232, 240, 209, 25, 134, 162]. There are also non-visual semantic-relationship level contexts, such as camera/photo metadata, geographical and temporal information, event labels, and social relationships pulled from social media tags [134].

Contextual priors are different as they are inputs to the system, e.g., using label/attribute priors [83, 188], 3D geometry priors [231], and scene/event priors [229]. In almost all of these works, using context improved vision-based estimation, and we hypothesize that context will improve engagement estimation as well. The role of context for user engagement has been explored in the affective computing literature [143] for designing intelligent user interfaces for office scenarios. In human-robot interaction, [184, 44, 43] have shown that additional knowledge of user context can better predict the affective state of the user, demonstrating that engagement prediction is contextual and task dependent.

In this work, “context” is not a feature or semantic data in the image; contextual engagement is about the expected behaviors the system should be observing in engaged subjects. Because the system determines what is displayed and expects a particular user activity, the “context” is known a priori, i.e., at a high level, we have a contextual prior with probability 0 or 1. Thus, our goal is not to estimate context for engagement or to incorporate prior probabilities into a model, but rather to learn context-specific models and use them to predict user engagement. Inferring “context” for engagement may still be possible from the video, but it is beyond the scope of this work and likely unnecessary in web-based settings.

Web-based trauma recovery: Researchers such as Bunge et al. [36] and Schueller et al. [194] have explored Behavioral Intervention Technologies (BITs) to augment traditional methods of delivering psychological interventions, i.e., face-to-face (F2F) one-to-one psychotherapy sessions, in order to expand delivery models and/or increase the outcomes of therapy. Such work has led to the development of various web-based intervention platforms with an aim to provide cost-effective, large-scale services and quality healthcare. Notable among these are the works of Macea et al. [144], Mehta et al. [153] and Strecher et al. [205]. In the domain of web-based interventions for trauma recovery, the works of Benight et al. [28] and Shoji et al. [197] explored the role of engagement using voice analysis. The psychology study by Yeager et al. [235] explored in detail the role of engagement in web-based trauma recovery. Influenced by these works, we postulate the need for the development of computer vision and machine learning-based methods for automated engagement prediction in the domain of web-based trauma recovery.

Student learning: The majority of research on automated engagement prediction has been limited to the field of education, where learning algorithms are built to determine student engagement from behavioral cues like facial expressions, gaze, head-pose, etc. [169, 157, 226, 87]. These works primarily rely on extracting facial features and developing machine-learning-based approaches to identify the engagement activity of students in a classroom performing various tasks, e.g., reading/writing, etc. The subjects are often assumed to be co-operative, with control over their emotions, and monitored by an external actor, e.g. the teacher. One notable difference between these works and data collected from trauma subjects is that subject co-operativeness varies significantly, depending on the severity of mental illness and the task (self-regulation exercises) that they are assigned, leading to multiple challenges in applying methods from student learning directly [189].

Emotion recognition: The popularity of interactive social networking and the exhaustive use of webcams have facilitated the understanding of human affect or emotions by analyzing responses to internet videos [211, 150, 81] and video-based learning [148, 239]. However, these methods are primarily geared towards identifying spontaneous emotions (e.g. “happy”, “sad”, etc.). Due to their relevance to digital advertising and the availability of advanced learning techniques that are capable of exploring temporal dependencies, facial expression analysis to recognize emotional states is emerging as a new area of exploration [149].

Facial features: Extracting facial features for expression analysis, facial action unit coding, face recognition, and face tracking has a rich history with some notable works [63, 114, 62]. Automatic detection of facial action units has proved to be an important building block for multiple non-verbal and emotion analysis systems [130]. For this work, we rely on the leading facial landmark and action unit detection work, OpenFace, proposed by Baltrušaitis et al. [11]. OpenFace is the first open source tool capable of facial landmark detection, head pose estimation, facial action unit recognition, and eye-gaze estimation. Some other representations, such as raw pixels, Gabor filters or LBP features, could also have been considered for this work [199]. However, the work of [199, 226] in affective computing suggests AUs are a better choice for an intermediate representation of facial data.

Deep Learning for Affect Detection: In recent years, significant advances in deep learning have led to the development of various affect detection methods based on deep learning. More specifically, researchers have applied deep learning techniques to problems such as continuous emotion detection [199], facial expression analysis [221], facial action unit detection [52] and others. Deep learning methods have used video data, sensor data or multi-modal data [228]. As noted earlier, engagement is a rational response unlike the spontaneity of emotions and, hence, we model context and temporal variations in the input feature representation with AUs. Such problems are often modeled as sequence learning problems.

Sequence learning has a rich history in signal processing, machine learning and video classification, with a number of notable techniques such as Hidden Markov Models (HMM), Dynamic Time Warping (DTW) and Conditional Random Field (CRF)-based approaches being commonplace [115]. Recurrent neural networks initially gained prominence in computer vision and deep learning since they take into account both the observation at the current time step and the representation from the previous one. More recently, in order to address the gradient vanishing problem of vanilla RNNs when dealing with longer sequences, LSTMs [109] and Gated Recurrent Units (GRU) were proposed [100].

Most existing face-based engagement prediction methods make little use of the temporal information in the video sequence. Unlike more ephemeral emotions, which can be elicited from short stimuli, engagement while doing web-based treatment/education is a long-term process, requiring long-term integration of information. Our proposed prediction model includes context as well as a temporal sequence of facial Action Units (AUs).

Using AUs for engagement prediction is a relatively nascent area of research. In this work, we advocate the use of deep-learning techniques adapted for sequential data [236, 52, 97]. In particular, we first explore using Long Short-Term Memory (LSTM), a specialized form of recurrent neural networks that is well suited for sequence learning problems and has emerged as a leading method for a number of similar tasks such as video classification [236], video-affect recognition [228], as well as non-visual tasks such as speech, handwriting and music analysis [100]. Relative insensitivity to gap length gives an advantage to LSTMs over alternatives such as RNNs, HMMs and other sequence learning methods. Secondly, we evaluate the effect of the length of video segments used for engagement prediction on the deep sequence learning model and, lastly, we analyze this effect across multiple leading deep sequence learning algorithms to understand the relationship between network complexity and prediction performance.

3.3 Algorithms for Sequence Learning

In recent years, a wide range of sequence learning problems have been modeled using recurrent neural networks (RNN). They learn a representation for each time step by incorporating the observation from the previous step and the current step. This process allows RNNs to preserve information over time using recurrent mechanisms.

RNNs were developed by Schmidhuber et al. [190] and have found applications in multiple sequence learning problems such as handwriting recognition, visual question answering, image generation, speech recognition and others. Hochreiter et al. [109] proposed long short-term memory (LSTM), a variant of RNNs consisting of a memory cell. LSTMs have been widely used in sequence learning problems such as language modeling, image captioning, translation and others. More recently, Gated Recurrent Units (GRU) were proposed by Cho et al. [50] to make each recurrent unit adaptively capture dependencies at different time scales. Chung et al. [55] performed a detailed comparison of LSTM and GRU in the domain of polyphonic music modeling and speech signal modeling and found GRU to be comparable to LSTM. Jozefowicz et al. [117] performed a similar but more extensive empirical analysis of RNN, LSTM and GRU variants for language modeling and music modeling. However, a detailed comparison of various sequence learning algorithms in the domain of facial video analysis, especially for affective computing, has not been actively explored in the past.

An RNN has connections between units that form a directed cycle, allowing it to exhibit temporal dynamic behavior. RNNs have internal memory to process sequences of inputs. LSTM and GRU share similarities in terms of having gating units that modulate the flow of information inside the unit and dynamically balance the information flow from the previous time step and the current time step. However, a GRU does not require a separate memory cell for this purpose, making its internal structure simpler than an LSTM.

3.3.1 Recurrent Neural Network

An RNN is a neural network that consists of a hidden state h and an optional output y, and operates on a variable-length sequence x = (x_1, \ldots, x_T). At each time step t, the hidden state h_{\langle t \rangle} of the RNN is updated by

h_{\langle t \rangle} = f\big(h_{\langle t-1 \rangle}, x_t\big), \qquad (3.1)

where f is a non-linear activation function. f may be as simple as an element-wise logistic sigmoid function and as complex as a long short-term memory (LSTM) unit [109].

Our implementation is based on TensorFlow, which in turn is based on [51], and we follow their notation.

We let subscripts denote timesteps and superscripts denote layers. An RNN can learn a probability distribution over a sequence by being trained to predict the next symbol in the sequence. In that case, the output at each timestep t is the conditional distribution p(x_t \mid x_{t-1}, \ldots, x_1). For example, a multinomial distribution (1-of-K coding) can be output using a softmax activation function

p(x_{t,j} = 1 \mid x_{t-1}, \ldots, x_1) = \frac{\exp\big(w_j h_{\langle t \rangle}\big)}{\sum_{j'=1}^{K} \exp\big(w_{j'} h_{\langle t \rangle}\big)}, \qquad (3.2)

for all possible symbols j = 1, \ldots, K, where w_j are the rows of a weight matrix W. By combining these probabilities, we can compute the probability of the sequence x using

p(x) = \prod_{t=1}^{T} p(x_t \mid x_{t-1}, \ldots, x_1). \qquad (3.3)

From this learned distribution, it is straightforward to sample a new sequence by iteratively sampling a symbol at each time step. The RNN reads each symbol of an input sequence x sequentially. As it reads each symbol, the hidden state of the RNN changes according to Eq. (3.1). After reading the end of the sequence

(marked by an end-of-sequence symbol), the hidden state of the RNN is a summary c of the whole input sequence.
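To make the formulation above concrete, the following is a minimal NumPy sketch of a vanilla RNN used autoregressively: the hidden state is updated as in Eq. (3.1), a softmax over the hidden state gives a per-step conditional as in Eq. (3.2), and the sequence log-probability is accumulated as in Eq. (3.3). The parameter names (W_h, W_x, W_y, b) and shapes are illustrative assumptions, not the TensorFlow implementation used later in this work.

import numpy as np

def rnn_step(h_prev, x_t, W_h, W_x, b):
    # Eq. (3.1): non-linear update of the hidden state from the previous state and current input.
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def sequence_log_prob(xs, symbols, W_h, W_x, b, W_y):
    # Eqs. (3.2)-(3.3): accumulate log p(x) as a sum of per-step conditionals over K symbols.
    h = np.zeros(W_h.shape[0])
    log_p = 0.0
    for x_t, sym in zip(xs, symbols):
        p = softmax(W_y @ h)          # distribution over the K possible symbols at this step
        log_p += np.log(p[sym])       # conditional log-probability of the observed symbol
        h = rnn_step(h, x_t, W_h, W_x, b)
    return log_p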

3.3.2 Gated Recurrent Units

GRUs were proposed by Cho et al. [51] to enable RNNs to adaptively capture time dependencies at different time scales. GRUs lie in between RNNs and LSTMs in terms of complexity. Similar to an RNN, the hidden state of a GRU is updated at each time step, as a linear interpolation between the previous hidden state and a candidate activation. GRUs share similarities with LSTM units as they have gating units that modulate the flow of information inside the unit, but they do not require separate memory cells.

The activation h_t^j of the j-th unit of the GRU at time t is a linear interpolation between the previous activation h_{t-1}^j and the candidate activation \tilde{h}_t^j:

h_t^j = (1 - z_t^j)\, h_{t-1}^j + z_t^j\, \tilde{h}_t^j \qquad (3.4)

where z_t^j is the update gate, which is computed by

z_t^j = \sigma\big(W_z x_t + U_z h_{t-1}\big)^j \qquad (3.5)

The candidate activation is computed similarly to that of a traditional recurrent unit:

\tilde{h}_t^j = \tanh\big(W x_t + U (r_t \odot h_{t-1})\big)^j \qquad (3.6)

where ⊙ is element-wise multiplication and r_t is a set of reset gates, computed using

r_t^j = \sigma\big(W_r x_t + U_r h_{t-1}\big)^j \qquad (3.7)

In this formulation, when the reset gate is close to 0, the hidden state is forced to ignore the previous hidden state and reset with the current input only. This effectively allows the hidden state to drop any information that is later found to be irrelevant, thus allowing a more compact representation.
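For clarity, a minimal NumPy sketch of a single GRU step following Eqs. (3.4)-(3.7) is given below; the parameter names and shapes are illustrative assumptions rather than the TensorFlow cell used in our experiments.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W_z, U_z, W_r, U_r, W, U):
    z = sigmoid(W_z @ x_t + U_z @ h_prev)         # update gate, Eq. (3.5)
    r = sigmoid(W_r @ x_t + U_r @ h_prev)         # reset gates, Eq. (3.7)
    h_cand = np.tanh(W @ x_t + U @ (r * h_prev))  # candidate activation, Eq. (3.6)
    return (1.0 - z) * h_prev + z * h_cand        # linear interpolation, Eq. (3.4)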

3.3.3 Long Short-Term Memory

Due to the advantages of LSTMs in preserving information over time and their ability to handle longer sequences, for the first set of experiments we use LSTMs to model long-term dependencies of AUs for engagement prediction. We model the problem of engagement prediction as a sequence learning problem where the input consists of a sequence x_i of AUs computed from facial video data of a particular length. Each sequence is associated with a label y_i which relates to the engagement self-reports provided by trauma subjects. Our implementation is based on TensorFlow, which in turn is based on [98, 237], and we follow their notation.

All our states are in \mathbb{R}^n, with n equal to the number of AUs tracked. Let h_t^l \in \mathbb{R}^n be the hidden state in layer l at time-step t. Let T_{n,m} : \mathbb{R}^n \to \mathbb{R}^m be an affine transform from n to m dimensions, i.e. T_{n,m} x = Wx + b for some W and b. Let ⊙ be element-wise multiplication and let h_t^0 be the input data vector at time-step t. We use the activations h_t^L to predict y_t, where L is the number of layers in our deep LSTM.

The LSTM has complicated dynamics that allow it to easily “memorize” information for an extended number of time-steps using memory cells c_t^l \in \mathbb{R}^n. Although many LSTM architectures differ in their connectivity structure and activation functions, all LSTM architectures have explicit memory cells for storing information for long periods of time, with weights that control how to update the memory cell, retrieve it, or keep it for the next time step. The LSTM architecture used in our experiments is given by the following equations [99], as implemented in the TensorFlow basic LSTM cell:

i sigm          l−1 f sigm ht   =   T2n,4n       o   hl    sigm  t−1         g     tanh

l l ct = f ⊙ ct−1 + i ⊙ g

l l ht = o ⊙ tanh(ct)

where sigm is the sigmoid function; sigm and tanh are applied element-wise; and i, f, o, c, h are the input gate, forget gate, output gate, cell activation vector and hidden vector, respectively. In this work, we assume the length of the sequence is known a priori and hence use LSTMs with static RNN cells.
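As an illustration of the cell dynamics above, the following is a minimal NumPy sketch of one LSTM step; the single weight matrix W (of shape 4n x 2n) and bias b stand in for the affine map T_{2n,4n}, and the names are assumptions rather than the TensorFlow basic-cell internals.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W, b):
    n = h_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev]) + b   # T_{2n,4n} applied to the stacked inputs
    i = sigmoid(z[0:n])                         # input gate
    f = sigmoid(z[n:2 * n])                     # forget gate
    o = sigmoid(z[2 * n:3 * n])                 # output gate
    g = np.tanh(z[3 * n:4 * n])                 # candidate cell update
    c = f * c_prev + i * g                      # memory cell update
    h = o * np.tanh(c)                          # hidden state for the next layer/time step
    return h, c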

3.4 Experiments

We now describe the AU computation procedure to extract the intermediate feature representation, followed by the methodology for tuning the linear SVC and sequence learning models. Section 3.4.4 elaborates on the contextual, cross-context and mixed LSTM models, assuming an a priori known length of video sequence (30 seconds). Section 3.4.5 contains the procedure for varying the segment length of the video sequence from 15 seconds to 90 seconds on three sequence learning algorithms, namely RNN, GRU and LSTM. Finally, we discuss in detail the results obtained for engagement prediction across a variety of tasks and its task specificity.

3.4.1 Facial action units extraction

As noted earlier in Section 2.5.1, the collected dataset consisted of a large number of facial videos and engagement self-reports. While there are a number of software tools available for extracting facial landmark points and facial action units (e.g. [63, 114]), we use the recent work on OpenFace proposed by Baltrušaitis et al. [11]. It is an open-source tool which has shown state-of-the-art performance on multiple tasks such as head-pose, eye-gaze, and AU detection. For our work, we primarily focus on facial action units. The AUs extracted consisted of both intensity-based and presence-based AUs. Presence-based AUs have a value of 1 if the AU is visible in the face and 0 otherwise; intensity-based AUs range from 0 through 5 (not present to present with maximum intensity). The list of AUs used in this work is as follows: Inner Brow Raiser, Outer Brow Raiser, Brow Lowerer (intensity), Upper Lid Raiser, Cheek Raiser, Nose Wrinkler, Upper Lip Raiser, Lip Corner Puller (intensity), Dimpler, Lip Corner Depressor (intensity), Chin Raiser, Lip Stretched, Lips Part, Jaw Drop, Brow Lowerer (presence), Lip Corner Puller (presence), Lip Corner Depressor (presence), Lip Tightener, Lip Suck, Blink.

[Figure 3.2 panels: Upper Face Action Units, Lower Face Action Units, Classification Action Units.]

Figure 3.2: Action Units from OpenFace: The figure above shows 17 AUs derived from OpenFace. AU4, AU12 and AU15 are both intensity and presence based, and are therefore counted twice, making a total of 20 AUs that are used for engagement prediction. Amongst these 20, 14 AUs are intensity based and 6 are presence based. The 6 presence-based AUs have a value of 1 if the AU is visible on the face and 0 otherwise, and are also called classification AUs. The 14 intensity-based AUs comprise 5 AUs from the upper face and 9 from the lower face. Movements of these individual facial actions are used in the form of intensity values varying from 0 to 5, i.e., not present to present with maximum intensity.
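As an illustration, the snippet below sketches how the 20-dimensional per-frame AU feature vector could be assembled from an OpenFace output CSV, assuming OpenFace's usual column naming (AUxx_r for intensity, AUxx_c for presence). The AU-number mapping of the names listed above and the exact column names are assumptions and may need adjustment for a particular OpenFace version.

import pandas as pd

# 14 intensity AUs (Inner/Outer Brow Raiser, ..., Jaw Drop) and 6 presence AUs
# (Brow Lowerer, Lip Corner Puller/Depressor, Lip Tightener, Lip Suck, Blink); assumed mapping.
INTENSITY_AUS = ["AU01_r", "AU02_r", "AU04_r", "AU05_r", "AU06_r", "AU09_r", "AU10_r",
                 "AU12_r", "AU14_r", "AU15_r", "AU17_r", "AU20_r", "AU25_r", "AU26_r"]
PRESENCE_AUS = ["AU04_c", "AU12_c", "AU15_c", "AU23_c", "AU28_c", "AU45_c"]

def load_au_features(csv_path):
    # Returns a (num_frames, 20) array: 14 intensity values (0-5) followed by 6 presence flags (0/1).
    df = pd.read_csv(csv_path)
    df.columns = [c.strip() for c in df.columns]   # OpenFace CSV headers are often space-padded
    return df[INTENSITY_AUS + PRESENCE_AUS].to_numpy()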

3.4.2 SVC model

As a baseline for engagement prediction, we use a Support Vector Classifier (SVC). Linear SVCs were trained using the 18000-dimensional feature vector (20 AUs for each of 900 time samples) for the hundreds of training segments for RX and TR. For training/testing, we used 10-fold cross-validation, and Table 3.1 reports average engagement prediction accuracy over the 10 folds of test data. The table also shows “chance”, but since the five engagement levels do not have uniform probability, we consider random chance the algorithm that guesses based on per-context, per-engagement-level priors. Hence chance is different for RX and TR. The SVC performance is not significantly different from chance, probably because the number of training samples (RX 334 / TR 437) is much smaller than the dimensionality (18000).
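A hedged sketch of this baseline using scikit-learn is shown below; the flattening of each 30-second segment into an 18000-dimensional vector follows the description above, while the variable names and library choice are assumptions.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def svc_baseline(segments, labels):
    # segments: (num_segments, 900, 20) AU array; labels: engagement levels 1-5 per segment.
    X = segments.reshape(len(segments), -1)                  # flatten to (num_segments, 18000)
    scores = cross_val_score(LinearSVC(), X, labels, cv=10)  # 10-fold accuracy, one score per fold
    return scores.mean(), scores.std()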

3.4.3 Sequence learning models and tuning

In our experiments, for all three classifiers (LSTM, GRU and RNN), we train the deep networks to minimize the post-softmax cross-entropy loss between the predicted and actual class. Each model is single layered with n units in sequence, where n is equal to the number of temporal samples (frames) t. For example, an a priori known sequence length of 30 seconds @ 30 fps means t = 1..900. During the training process, we use the Adam optimizer with a learning rate of 0.01 and fix training at 10 epochs. Since the Adam optimizer uses a larger effective step size, and our data has relatively sparse labels, we found that the Adam optimizer performs better compared to the gradient descent optimizer. The batch size used was 100 samples irrespective of context, i.e. for both Relaxation and Triggers. The forget-bias of the LSTM was set to 1.0 to remember all prior input weights. All gate activations were tanh for all three learning algorithms; the gradient computation for tanh is less expensive and converges faster than for a sigmoid. For the first test in Section 3.4.4, we use the same 10-fold modeling which was used for the SVC model. As seen in Table 3.1, the improvement in accuracy of the LSTM over the SVM supports the need for deep learning models for engagement prediction. Henceforth, for all further experiments in Section 3.4.4 and the analysis of contextual, cross-contextual and combined models, we use LSTMs only. In Section 3.4.5, we compare RNNs, GRUs and LSTMs to assess the effect of segment length on engagement prediction accuracy.

        RX              TR
Chance  32.8%           38.3%
SVC     31.4 ± 7%       35.2 ± 9%
LSTM    39.1 ± 8.8%     50.7 ± 11%

Table 3.1: The table above shows the enhanced engagement prediction accuracy obtained by using LSTMs over a Support Vector Classifier, highlighting the importance of a deep-learning model for this task.
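For concreteness, the following is a minimal sketch of the kind of training setup described above (a single recurrent layer, softmax cross-entropy, the Adam optimizer with a learning rate of 0.01, 10 epochs and a batch size of 100), written against the Keras API of TensorFlow. The hidden size and the exact graph construction are assumptions and differ from the original static-cell implementation.

import tensorflow as tf

def build_engagement_classifier(seq_len=900, num_aus=20, hidden_units=64, num_classes=5):
    model = tf.keras.Sequential([
        tf.keras.layers.LSTM(hidden_units, input_shape=(seq_len, num_aus)),  # 30 s @ 30 fps of AU vectors; swap in GRU or SimpleRNN to compare cells
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Usage (engagement labels 1-5 shifted to 0-4):
# model = build_engagement_classifier()
# model.fit(X_train, y_train - 1, epochs=10, batch_size=100)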

3.4.4 Context-specific, Cross-context and Combined models

We use AU data from subjects performing the “Triggers” task to create the Trigger (TR) model, and similarly, AU data from subjects performing the “Relaxation” task is used to create the Relaxation (RX) model. Once a model is trained, to study the effect of context, we test it using the context-specific technique (testing with the same task) and the cross-context technique (testing with the opposite task). The total number of engagement self-reports available for each session and task is shown in Figure 2.3. Due to the large number of video frames and sparse labels, we consider 30-second segments before each self-report for learning. For RX, we have 372 engagement self-reports leading to 372 segments of 900 samples (30 seconds @ 30 fps), totaling 334,800 frames of data.¹

With 20 AUs per frame and 900 frames per engagement sample, we have an 18000-dimensional feature vector.

We use 334 segments for training and 38 segments for testing. Similarly, for TR we have 485 segments of 900 samples (30 seconds @ 30 fps), totaling 436,500 frames of data. We use 437 segments for training and 48 segments for testing. The training/testing labels are a discrete set of engagement levels 1 (very disengaged) through 5 (very engaged). By doing this, each engagement level is treated as a separate and mutually exclusive output.
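A hedged sketch of the segment construction described above is given below: for each engagement self-report, the 900 frames (30 seconds at 30 fps) of AU features immediately preceding the report are collected. The frame-index bookkeeping is an illustrative assumption.

import numpy as np

def extract_segments(au_frames, report_frames, labels, seg_len=900):
    # au_frames: (num_frames, num_aus) per-frame AU features for one synchronized video.
    # report_frames: frame index at which each engagement self-report was given.
    X, y = [], []
    for frame_idx, label in zip(report_frames, labels):
        start = frame_idx - seg_len
        if start < 0:                       # skip reports without a full 30-second history
            continue
        X.append(au_frames[start:frame_idx])
        y.append(label)
    return np.stack(X), np.array(y)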

3.4.5 Varying data-segment length

Because we want consistent data for different timescales, we consider only those sequences that had a minimum of 90 seconds before the response, which reduces the dataset slightly from that used in Section 3.4.4. The total number of segments extracted from the EASE dataset was 443 for TR and 347 for RX. For all experiments in this sub-section, we use 20-fold cross-validation. For all 20-fold experiments, the TR module uses 421 segments for training and 23 segments for testing. Similarly, in the RX module, our training set consists of 330 segments and our test set of 17 segments. Each segment was accompanied by an engagement self-report.

¹ Some segments shown in Figure 2.3 are not used due to synchronization issues.

[Figure 3.3 plots: 30-second traces of “Outer Brow Raiser” AU intensity (y-axis: AU intensity level, 0-3) versus frame number (0-900); left: Triggers (SubjectID 098, Eng = 4; SubjectID 008, Eng = 4), right: Relaxation (SubjectID 098, Eng = 4; SubjectID 008, Eng = 3).]

Figure 3.3: The above figure shows the inherently noisy signals from action unit transitions for “Outer Brow Raiser”. Each plot shows 30 seconds of AU data for two different subjects, with the engagement level shown in the legend. The left plot is from the Triggers module, where the subjects have exactly the same self-reported engagement, yet the AU intensity is quite different. The AU signal on the right is from the Relaxation module, which shows similar signals for subjects with different engagement. Note that facial expressions for the same subjects vary in the different contexts of Relaxation and Triggers. This is a challenging prediction problem.

Next, we extracted varying segment lengths of intensity-based AUs preceding the engagement self-report for 15, 30, 60 and 90 seconds each. For our experiments in this sub-section, we focus only on intensity-based

AUs to reduce the effect of combining categorical and numeric features. This reduces the input feature dimensions from 20 to 14 and therefore changes the baseline accuracy relative to the contextual LSTM models in Section 3.4.4. Figure 3.3 shows the changes in Outer Brow Raiser (AU02) intensities as a function of frame number. The figure also shows the challenges associated with temporal modeling of facial AUs. We train separate contextual LSTMs, GRUs and RNNs for each engagement prediction model (Relaxation – RX and

Trigger – TR).

                   RX - Test        TR - Test
RX - Train         39.1 ± 8.8%      38.1 ± 4.4%
TR - Train         36.7 ± 3.3%      50.7 ± 11%
(RX + TR) - Train  39.5 ± 6.1%      49.1 ± 7.6%

Table 3.2: The first two rows above show contextual and cross-contextual engagement prediction results on EASE data obtained using LSTMs, in terms of prediction accuracy. If engagement were not contextual, the models would perform equally well in both cross-contexts, e.g. when the model is trained on Triggers (TR) and tested with Relaxation (RX) data. In the case of TR, the context-specific and cross-context models show significant performance differences, with the best accuracy obtained when training and testing are both TR. Thus we can reject the hypothesis that context does not matter; it is formally rejected using a two-tailed paired t-test at the .01 level. Furthermore, the combined model (trained on RX+TR) had the largest amount of training data and was still outperformed by the contextual models, showing again the importance of contextual modeling.

3.5 Evaluation results

3.5.1 Contextual engagement results

Table 3.2 shows the contextual engagement prediction results. We note interesting trends in prediction accuracy for context-specific and cross-context models. Training on RX and testing on RX yields 39.1% prediction accuracy, and training on RX and testing on TR yields similar accuracy at 38.1%; these are not statistically different (p=.44). When trained on TR data and tested on TR data, the model obtained 50.7% prediction accuracy; however, when the same model was tested on RX data, the accuracy dropped to 36.7%. This difference is statistically significant (using a two-sided heteroscedastic test, p=.006), allowing us to reject the hypothesis that context does not matter. We also find that the two different contexts, RX and TR, have marginally significant differences (p=.02) in performance. This suggests that engagement models benefit when the context is known and modeled, and that tasks differ in difficulty. The corresponding confusion matrices for context-specific and cross-context models are shown in Figure 3.4.
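The significance tests reported here compare the per-fold accuracies of two models evaluated on the same folds; a minimal SciPy sketch of such a two-tailed paired t-test is shown below (variable names are illustrative, and the heteroscedastic comparison would instead use an unpaired test with unequal variances, e.g. ttest_ind with equal_var=False).

from scipy.stats import ttest_rel

def compare_folds(fold_acc_a, fold_acc_b, alpha=0.01):
    # fold_acc_a / fold_acc_b: per-fold accuracies of two models on the same cross-validation folds.
    t_stat, p_value = ttest_rel(fold_acc_a, fold_acc_b)   # paired, two-sided by default
    return p_value, p_value < alpha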

3.5.2 Comparison results for sequence learning algorithms

In this section, we discuss in detail the results obtained for engagement prediction across a variety of time windows and sequence learning algorithms, along with its task specificity. The results are summarized in Table 3.3 and Table 3.4.

Table 3.3 shows the results for LSTM, GRU and RNN with 30 seconds of input data, i.e. 900 frames.

[Figure 3.4 panels: Relaxation testing (top row) and Triggers testing (bottom row) confusion matrices.]

Figure 3.4: The figure above shows confusion matrices for the results presented in Table 3.2. The top row represents confusion matrices for RX testing and the bottom row represents TR testing. The first column is within-context, the second column is cross-context. Notice that the cross-context result of Train-TR : Test-RX, with respect to the contextual model of Train-TR : Test-TR, shows visible confusion of the “Engaged” class (level 4) with the “Neutral” class (level 3).

LSTMs trained with intensity-based AUs on 30 seconds of data from the engagement segments serve as the baseline for the GRU-30 and RNN-30 predictions. We obtain 52.30 ± 16.94% average prediction accuracy across 20 folds on the TR module and 39.44 ± 10.51% average prediction accuracy for the RX module (margins of error correspond to the standard deviation computed over 20 folds). We noticed significant improvements in performance for the RX module in a two-sided paired t-test (p=.03), with an average prediction accuracy of 43.48 ± 11.26% for GRU-30; whereas TR accuracy increased to 53.90 ± 17.37%, there is only statistically weak evidence (p=.14) of improvement over the baseline. RNN-30 and GRU-30 were not statistically different, at p=.7 for RX and p=.6 for TR. RNN-30, however, was statistically significantly better than LSTM-30 for TR (p=.05) and RX (p=.016), highlighting the importance of the choice of sequence learning algorithm. The significant increase in accuracy from LSTM-30 to RNN-30 is likely due to the reduced number of parameters from LSTM to RNN, hence reducing overfitting with the simpler RNN model.

           Triggers          Relaxation
LSTM - 30  52.30 ± 16.94%    39.44 ± 10.51%
GRU - 30   53.90 ± 17.37%    43.48 ± 11.26%
RNN - 30   54.32 ± 17.74%    44.24 ± 13.09%

Table 3.3: The baseline, i.e. LSTM-30, shown in the first row, used 14 AUs compared to the 20 AUs in Table 3.2. LSTM-30 also had fewer segments for each context than shown in Table 3.2. The second and third rows show our results for GRU and RNN over 20-fold validation, with all three using exactly the same data/folds. It can be seen that RNNs are the most accurate for both contexts of Relaxation and Triggers. GRU performance is significantly better than LSTM at 30 seconds, with p=.03 for Relaxation at a 97% confidence interval using a two-tailed paired test. RNN-30 is significantly better than LSTM-30 for both Relaxation and Triggers, at p=.05 and p=.016 respectively.

           Triggers          Relaxation
LSTM - 15  52.30 ± 16.94%    39.44 ± 10.51%
LSTM - 60  52.30 ± 16.94%    39.44 ± 10.51%
LSTM - 90  52.30 ± 16.94%    39.44 ± 10.87%
GRU - 15   53.90 ± 17.37%    43.48 ± 11.26%
GRU - 60   51.84 ± 18.28%    42.82 ± 11.50%
GRU - 90   51.84 ± 18.28%    42.82 ± 11.50%
RNN - 15   54.32 ± 17.74%    44.24 ± 13.09%
RNN - 60   54.32 ± 17.74%    44.24 ± 13.09%
RNN - 90   54.32 ± 17.74%    44.24 ± 13.09%

Table 3.4: The table above displays the results of different time windows for all three sequence learning algorithms: LSTM, GRU and RNN. The data is relatively stable over different time scales, with only the decrease in accuracy from GRU-15 to GRU-60 being even weakly statistically significant (p=.10). Algorithm performance continues to differ, e.g. RNN-90 outperforms LSTM-90 significantly for both TR and RX (p=.05 and p=.02 respectively).

The results for varying time segment lengths are presented in Table 3.4. Notice that the average 20-fold accuracy for RNN and LSTM doesn't vary with segment length. Interestingly, for GRUs the average accuracy drops slightly when the segment length is increased from 30 seconds to 60 seconds: for TR the accuracy changes from 53.90 ± 17.37% to 51.84 ± 18.28% and for RX from 43.48 ± 11.26% to 42.82 ± 11.50%. There is statistically weak evidence (p=.10) for TR that GRU-30 is better than GRU-60. This drop in accuracy may just be random variation, or it may be due to the limited scale ranges in a GRU, i.e., its cells cannot accumulate large values even if there is a strong presence of a certain feature [177]. LSTM and RNN, on the other hand, have consistent accuracy for both RX and TR irrespective of the segment lengths. While each is self-consistent, there is a difference between them; a paired test of the per-fold accuracy of LSTM-90 with RNN-90 shows a significant difference, with p=.02 for RX and p=.05 for TR. The 20-fold validation results from both Table 3.3 and Table 3.4 are collectively summarized in Figure 3.5.

We notice that algorithms with simpler internal structure, like GRUs and RNNs, perform consistently better than LSTMs. Why does this happen? In the LSTM unit, the amount of memory content that is seen, or used by other units in the network, is controlled by the output gate. The ability to balance forgetting old and keeping new information, and the squashing of the cell state, is important in the LSTM architecture. The GRU, on the other hand, exposes its full content without any control. Coupling the input and the forget gate avoids the problem of the block output growing in an unbounded manner, thereby making the output non-linearities encoded in the LSTM less important.

Given that GRUs overall have fewer parameters, it is likely that the amount of training data available might allow GRUs to perform better compared to LSTMs, but the data does not show a statistically significant difference.

With the motivation of reducing the potential for overfitting, we also performed another experiment where we compared different amounts of training data. We observed that GRU-15 performs significantly better than GRU-90. This suggests that merely increasing the temporal training data is detrimental, potentially because of mixing engagement levels. However, the best performance was with an RNN, the simplest model, and the RNN did not show any difference across different time windows.

In many machine learning applications, especially in the domain of medical/health applications, it is very time-consuming and expensive to acquire large amounts of annotated training data. While the sensory data may appear to have long sequences that could be used, and there is a prevailing intuition that “more data is better”, these results show that shorter sequences are often at least as good, if not better. When annotations for training data are relatively limited, simpler units like GRUs or RNNs should be considered.

3.6 Discussion

3.6.1 Performance of Contextual Engagement Models

The primary goal of this work was to explore contextual engagement, rather than to create the best deep-learning-based classifiers. We developed contextual engagement models using AUs extracted from video segments as the intermediate representation and engagement self-reports as the associated predictive outcomes for the machine learning system. Across multiple experiments, the performance achieved on contextual tasks was on the order of 45-55%, with standard deviation in the range of 5-12%. Further, we found this performance to be consistent across various deep learning algorithms and varying lengths of video segments considered for learning. The absolute performance of any machine learning system depends on multiple factors such as the quality, variability and scale of the data, the representation and the learning algorithms. AUs are a leading representation for a wide range of face modeling and affective computing tasks and have been shown to perform better than low-level representations like LBP and Gabor filters [122, 69, 241]. Recent advances in deep learning have led to the development of better AU detectors that are robust to variations in pose and lighting [46, 222, 208]. AU detectors in the affective computing and computer vision literature have focused primarily on the analysis of various spatial regions in the face images. Recent work by Li et al. [135] has shown that facial action always has a temporal component and hence knowing the previous state of a facial expression can improve AU detection. Such approaches are likely to boost the performance of temporal engagement models as presented in this work. Our experiments in the earlier sections showed that it was difficult to get high-performing models with shallow learning algorithms such as support vector machines. Hence, we modeled the problem as a sequence learning problem and employed multiple recurrent neural network based algorithms (RNN, GRU, LSTM). A continued direction of this work will be incorporating recent advances in facial action unit detection and deep learning algorithms to improve the overall performance of the machine learning models.

[Figure 3.5 plots: accuracy (%) box plots for LSTM, GRU and RNN at 15/30/60/90-second time windows, for Triggers (top) and Relaxation (bottom).]

Figure 3.5: The figure above shows the box and whisker plots for LSTM, GRU and RNN in the different contexts of Relaxation and Triggers. Each subplot shows the Accuracy (%) on the y-axis and the Time Window (seconds) on the x-axis. Each boxplot is generated from the 20-fold validation results, showing the mean (green line), median (orange line), 25% (solid box) and 75% (whisker lines) percentile bands and outliers (circles). Notice that RNNs and LSTMs perform consistently across different temporal depths, whereas GRU accuracy decreases from 30 to 60 seconds for Triggers.


The facial action unit detectors used in this work were based on open-source classifiers that were trained on other publicly available datasets such as DISFA, FERA2011, SEMAINE and others. More recently, Chu et al. [53] noted that generic classifiers trained from multiple datasets often have impaired generalization ability because of individual differences among subjects. They hypothesized that person-specific bias causes standard generic classifiers to perform worse on some subjects than others. In our work, it is likely that generic AU detectors fail to model inter-subject variability adequately, leading to high-variance engagement models. In order to incorporate person-specific bias within engagement models, there is a need to personalize engagement models using person-specific data at various stages in the engagement prediction pipeline. Personalized models can be built either by adapting pre-trained AU detectors with person-specific data (as proposed by Chu et al., Yang et al. and Zen et al. [53, 233, 238]) or by incorporating person-specific temporal variabilities (e.g. if a person has a habit of nodding their head or raising their eyebrows) into generalized engagement models.

3.6.2 Incorporating Broader Range of Data

In the current work, we primarily focused on developing machine learning models using engagement self-reports from trauma subjects. The self-reports provided extremely sparse labels per video (typically 2-3 self-reports for a video session of 15-40 minutes). Traditional approaches to engagement prediction in the domain of affective computing use static representations of individual frames [45, 157, 95]. We used deep-sequence-learning models for engagement prediction in order to incorporate changing interactions over time and concurrent sequential dynamics. The most natural way of building better classifiers is training with an even larger dataset and performing parameter optimization of the sequence-learning models. Sophisticated methods like feature (AU) pooling over space and time could be another possibility, which would also likely improve prediction accuracy. More recently, advances in activity recognition [48, 23] and video classification [154] techniques have shown that Convolutional Neural Networks in combination with sequence-learning algorithms are capable of learning long-term temporal representations [230]. Generating video-level vector representations on top of the frame-level Convolutional Neural Network descriptors could be explored in the future.

Due to the sparsity of labels (engagement self-reports), we resorted to considering approximately 15-90 second video segments before the self-report was captured. In our work, we saw that increasing the length of the video segment did not change the performance drastically. Deep learning models often require large amounts of training data to learn powerful representations. As part of the dataset collection, we collected behavioral engagement annotations from trained psychologists to obtain per-second engagement annotations. In Chapter 5, we incorporate continuous annotations in our machine learning framework and compare it to the method described in this chapter. CARMA ratings will provide multiple annotations per frame, which will provide dense labels for sequence learning, thereby enabling us to consider longer temporal sequences and build continuous engagement predictors. A fusion mechanism that considers continuous annotations from external observations to get dense labels and provides validity of engagement through self-reports from trauma subjects is demonstrated (refer to Chapter 5) to provide a “best of both worlds” approach to improving the absolute performance of contextual engagement models.

Chapter 4

Mood-Aware Engagement Prediction

4.1 Introduction

Facial analysis gives strong clues to the internal emotional state of a person [130, 59]. In some scenarios, it is possible to judge the emotion of a person through facial expressions, but the internal mood states of a person cannot be observed through such methods [79]. This effect is elevated in the case of subjects suffering from trauma, where they are often reluctant to express themselves openly. Negative mood has been linked to poorer cognitive and independent functioning in trauma recovery subjects, leading to lower quality of life, higher mortality and greater declines in physical and mental health status [49, 125].

Various psychology studies have shown the effect of mood states on the cognitive abilities of an individual [155, 171, 138]. A series of user studies examined the effects of moods on depression [182], anxiety disorders [88], performance [47], learning-based attention [138], etc. Mood states have also been shown to have significant effects on user behavior [110] and self-evaluation [172, 170]. Since our experiments in Chapter 3 demonstrate that engagement prediction models are contextual, we take this a step further and ask the question: “Does using current mood as context improve engagement prediction for a given task?”. In order to answer this question, we use Profile Of Mood States (POMS) data (see Section 4.3.1) that was collected before and after the module from each subject in the EASE dataset. We also create automated mood disturbance predictors and compare the performance of automated mood models with self-reported moods for engagement prediction.

Automated mood prediction methods from facial videos have huge potential to make a significant impact, not only in trauma recovery but across a wider range of domains. Recently, Katsimerou et al. [120] conducted one of the first studies on automated mood prediction from visual input and noted that “mood estimation from recognized emotions is still in its infancy and requires separate attention.” In the affective computing literature, emotion and mood are often used interchangeably. However, as noted by Ekkekakis et al. [78], these are distinct characteristics of one's affective state. Emotion is characterized by a set of inter-related sub-events concerned with a particular object, person or thing. Mood, however, typically lasts longer than emotions and is generic rather than about a specific object or thing. Mood is heavily influenced by external factors such as environment, physiology, and current emotions. Mood is dynamic and can last minutes, hours or even days.

Moods are emotions over sustained time [80, 125, 79].

[Figure 4.1 diagram: facial feature extraction feeds an LSTM mood predictor for the POMS sub-scales (Tension, Anger, Confusion, Depression, Fatigue, Vigor) and the TMD score, compared against POMS self-reports; the mood estimate then pre-conditions per-context LSTM engagement predictors (engaged/disengaged) via a context switch.]

Figure 4.1: Automated Mood Prediction for Mood-Aware Context Sensitive Engagement Prediction: Trauma patients are often reluctant to express themselves openly and suffer from mood changes that last an extended period, which in turn affects their cognitive abilities. We propose automated mood-aware contextual engagement prediction for trauma subjects. The top part of the figure shows the mood prediction pipeline aimed at predicting the mood of trauma patients from facial videos. The mood estimates are then used to pre-condition learning for context-sensitive engagement prediction models. Temporal deep learning methods are used to learn long-term dependencies to estimate mood and its interplay with contextual engagement.

In this work, we take advantage of recent advances in computer vision and deep learning and show that facial video data captured over sustained periods can reliably predict the mood of a person with sufficient accuracy to use it in engagement prediction. We build on top of our contextual engagement predictors from facial videos (refer to Chapter 3) and develop a framework for mood-aware contextual engagement prediction.

We show the interplay of subject mood and engagement using a very short version of POMS to extend our LSTM engagement model. An overview of the employed system is shown in Figure 4.1. The contributions of this work are as follows:

1. Exploring the relationship of a subject's mood as an initialization parameter for engagement estimation. The user mood state is retrieved from a very short form of the Profile of Mood States (POMS) questionnaire asked before the beginning of the task, explained in Section 4.3. Our experiments demonstrate that contextual engagement models show improvement in engagement prediction performance by combining AUs with POMS data.

2. We explore automated prediction of mood disturbance based on automatically computed AUs and LSTMs. We develop mood prediction models in the domain of trauma recovery. To the best of our knowledge, this is the first work of its kind.

3. We show that contextual engagement models can be enhanced by incorporating automated mood predictions for trauma recovery subjects.

4. We build automated models for mood prediction at sub-scale and total mood disturbance levels, demonstrating the importance of sub-scale modeling for mood prediction.

The assessment of mood is an important indicator for the evaluation of intervention effects. For example, the standard POMS-SF consists of a detailed questionnaire of 37 questions, which some subjects find intrusive and burdening. Subjects suffering from trauma are often distracted, lack focus, or at times are incapable of providing such detailed feedback [77]. Hence, there is a need to develop automated non-intrusive methods for mood prediction. As shown in Figure 4.2, some aspect of mood is significantly changed by each module of Relaxation and Triggers, so if mood is to be used during treatment it would require multiple measurements, which further increases the need for automated mood estimation. Our mood-aware engagement predictor uses the total mood disturbance score, and our analysis compares both mood sub-scale predictors and an overall mood disturbance predictor for engagement prediction. Our experiments show that the mood-aware engagement predictor using our novel visual analysis approach performs significantly better than, or on par with, using self-reports.
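One plausible way to feed a mood estimate into the contextual engagement models — an illustrative assumption, not necessarily the exact conditioning mechanism developed later in this chapter — is to append the (roughly normalized) pre-module TMD score, whether self-reported or predicted from video, to every time step's AU feature vector before training the per-context LSTM:

import numpy as np

def add_mood_channel(au_segments, tmd_scores, tmd_scale=100.0):
    # au_segments: (num_segments, seq_len, num_aus); tmd_scores: one TMD value per segment.
    tmd = np.asarray(tmd_scores, dtype=float) / tmd_scale               # crude normalization (assumed)
    mood = np.repeat(tmd[:, None, None], au_segments.shape[1], axis=1)  # broadcast TMD over time
    return np.concatenate([au_segments, mood], axis=-1)                 # (num_segments, seq_len, num_aus + 1)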

4.2 Related Work

Our work is related to methods and techniques explored in multiple communities such as psychology, affective computing, deep learning, web-based interventions, and trauma recovery. Further, our work draws inspiration from and builds on top of existing research from these areas, which is reviewed in this section.

Mood Assessment and Measures: Affect, expressions, emotions, and mood are related, yet conceptually distinct in terms of the phenomena they represent [80]. Kleinke et al. [125] and Adelmann et al. [3] have demonstrated the effects of facial expression on the mood states of subjects through extensive psychological studies. Studies in psychology (see Ekkekakis et al. [78] for a detailed survey) have also led to the development of various mood assessment measures such as the Profile of Mood States and the Positive and Negative Affect Schedule (PANAS). In the domain of sports physiology, Wang et al. [223] presented a measurement of mood states from physiological signals. Sano et al. [185] presented a system for automatic stress and mood assessment from daily behaviors and sleeping patterns. More recently, Katsimerou et al. [120] proposed a novel framework for predicting mood, as perceived by other humans, from the emotional expressions of a person. The key differences between the approach of Katsimerou et al. and ours are that we predict mood from user self-reports rather than external annotators and, further, we demonstrate the interplay of mood on engagement tasks for web-based trauma recovery.

Engagement Prediction: Engagement prediction from facial video data followed by user inputs (either self-reports or external annotation) has been studied by researchers in the domains of student learning [157, 226, 95] and human-robot interaction [184, 44]. In the domain of student learning, methods typically involve extracting facial features followed by machine learning algorithms to predict student engagement.

In the case of human-robot interaction, the works of Castellano et al. [43] and Salam et al. [184] have shown that engagement prediction is contextual and task dependent. They have demonstrated that additional knowledge of user context (e.g. who the user is, where they are, with whom they are, the task at hand, etc.) can better predict the affective state of the user during Human-Robot Interactions (HRI). In the case of both student learning and HRI, it is assumed that subjects (students) are co-operative and in control of their emotions, which is not typically the case for trauma subjects.

Deep Learning for Affect Detection: Significant advances in deep learning have led to the development of various affect detection methods based on deep learning. More specifically, researchers have applied deep learning techniques to problems such as continuous emotion detection [199], facial expression analysis [221], facial action unit detection [52] and others. Deep learning methods have used video data, sensor data or multi-modal data [228]. As noted earlier, moods are diffused over longer durations than emotions, and hence there is a need to employ methods that can integrate long-term temporal information into the learning framework. Such problems are often modeled as sequence learning problems [99, 97]. To address sequence learning, various methods such as Recurrent Neural Networks, Gated Recurrent Units, and Long Short-Term Memory were proposed and employed in a wide range of problems [100]. In this work, we explore LSTMs (refer to Section 3.3.3), a specialized form of recurrent neural networks, to model long-term mood prediction and mood-aware engagement prediction.

4.3 Experiments

We now describe the mood data from the EASE dataset and the AU computation procedure to extract the intermediate feature representation, followed by the methodology for sequence learning using LSTMs and the various LSTM mood/engagement models. Lastly, we elaborate on the data used for training and testing.

4.3.1 Profile of Mood States (POMS)

The Profile of Mood States-Short Form (POMS-SF) [196, 77] is used to measure reactive changes in the mood of a person. It is a list of 37 items related to depression, vigor, tension, anger, fatigue, and confusion. In the EASE data, the TMD score was calculated from a 24-item questionnaire instead of 37 to reduce the cognitive load on trauma subjects. Participants rate items for how they feel at the moment on a 5-point scale, ranging from 1 (not at all) to 5 (extremely). Each question corresponds to a specific sub-scale of mood, e.g., tension: negative sentiment (5 questions), depression: negative sentiment (6 questions), anger: negative sentiment (5 questions), fatigue: negative sentiment (2 questions), confusion: negative sentiment (2 questions) and vigor: positive sentiment (4 questions). The final TMD (Total Mood Disturbance) level is computed as the difference of the sums of negative n(x) and positive p(x) sentiments:

\mathrm{TMD} = \sum_{x \in \text{negative sentiments}} n(x) \;-\; \sum_{x \in \text{positive sentiments}} p(x) \qquad (4.1)

This self-report is given to the subjects at baseline and immediately before and after each module. Each video consisted of mood state reports before and after the module (pre-POMS report and post-POMS report). Participants also provided self-reports about their engagement level approximately three times (start, middle, and end of the segment) during each session. Figure 4.2 shows the changes in mood at the TMD and sub-scale level in the EASE dataset. Greater TMD scores are indicative of subjects with more unstable mood profiles, and lower scores indicate stable mood profiles.

4.3.2 Facial action units extraction

As noted earlier in Section 2.5.1, the collected dataset consists of a large number of facial videos and engagement self-reports. While there are a number of software packages available for extracting facial landmark points and facial action units (e.g., [63, 114]), we use the recent work on OpenFace proposed by Baltrušaitis et al. [11]. It is an open-source tool which has shown state-of-the-art performance on multiple tasks such as head-pose estimation, eye-gaze estimation, and AU detection. For our work, we primarily focus on the facial action units shown in Figure 3.2.

The AUs extracted consist of both intensity-based and presence-based AUs. We consider all 20 AUs for the experiments in this chapter. The list of AUs used in this work is as follows: Inner Brow Raiser, Outer Brow Raiser, Brow Lowerer (intensity), Upper Lid Raiser, Cheek Raiser, Nose Wrinkler, Upper Lip Raiser, Lip Corner Puller (intensity), Dimpler, Lip Corner Depressor (intensity), Chin Raiser, Lip Stretcher, Lips Part, Jaw Drop, Brow Lowerer (presence), Lip Corner Puller (presence), Lip Corner Depressor (presence), Lip Tightener, Lip Suck, and Blink.
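For illustration, OpenFace writes its per-frame estimates to a CSV file from which the AU features can be loaded in a few lines of Python. The column-name conventions assumed here (intensity columns ending in _r, presence columns ending in _c) and the file name are assumptions for the sketch, not details prescribed in this work.

```python
# Hypothetical sketch: load per-frame AU features from an OpenFace output CSV.
import pandas as pd

def load_au_features(csv_path):
    df = pd.read_csv(csv_path)
    df.columns = [c.strip() for c in df.columns]  # guard against padded column names
    # Assumed convention: intensity AUs end in "_r" (0-5), presence AUs end in "_c" (0/1).
    intensity = [c for c in df.columns if c.startswith("AU") and c.endswith("_r")]
    presence = [c for c in df.columns if c.startswith("AU") and c.endswith("_c")]
    return df[intensity + presence].to_numpy()

# A 30-second segment at 30 fps yields roughly a 900 x 20 feature matrix.
features = load_au_features("segment_0001.csv")  # file name is illustrative
print(features.shape)
```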

[Figure 4.2 appears here: a bar chart of pre/post changes in TMD and each mood sub-scale (Vigor, Anger, Tension, Fatigue, Depression, Confusion) for the Relaxation and Triggers modules in Sessions 1 and 2.]

Figure 4.2: Changes in subject moods pre/post each module in each session. We tested each change for significance with a two-sided paired t-test. In the legend, a * means that the scale/sub-scale showed a statistically significant decrease at the .01 level. Because of the significant changes during each module, one cannot compute TMD once and reuse it across the treatment – we need automated mood estimation.

4.3.3 LSTM Engagement and Mood Models

In this work, we model four variants of engagement and mood prediction using the same basic LSTM cell, as follows (each model is a single-layer LSTM):

1. Engagement multi-class classifier

We optimize the LSTM to predict a discrete set of engagement levels from 1 (very disengaged) through 5 (very engaged) by minimizing the cross-entropy loss between the predicted and actual class after applying a softmax function. Each engagement level is thus treated as a separate and mutually exclusive output. The baseline of contextual engagement in Table 3.2 was recomputed to include the subjects that had POMS data available and therefore has fewer segments than the contextual engagement models in Chapter 3.

[Figure 4.3 appears here: a schematic of video segment selection from a trauma patient's facial video session, showing the 30-second segments used for visual POMS regression around the pre-POMS and post-POMS measurements and the 30-second segments preceding each engagement response E1, E2, E3.]

Figure 4.3: Video segment selection: The process of selecting video segments from a particular video session for training the POMS models and engagement models. Each video session includes a POMS measurement at the start and end of the module. For training the POMS regression models POMS-TMD and POMS-TMDS, we consider two 30-second segments from the pre-POMS and post-POMS measurements. Engagement self-reports are collected at approximately three time-points during the module. Training data for engagement prediction consists of the 30-second segment prior to each engagement self-report. From each of these segments (either for POMS or engagement), AUs are extracted from the video frames and used as the intermediate feature representation for training LSTMs.

2. Mood-aware engagement prediction: POMS-SR

We precondition the basic engagement multi-class LSTM with TMD scores obtained from self-reports. We max-normalize the TMD scores during training and propagate the normalized score along with the intermediate AU representations by adding it to the AU representation (see the code sketch following this list).

3. Mood Disturbance predictor: POMS-TMD

POMS-TMD is a single-output regressor. This LSTM model is optimized to predict continuous POMS-TMD scores in the range 0-24 by back-propagating an L2 loss with L2 regularization: the squared-error term penalizes large errors (outliers) heavily, while the regularization term adds a penalty on the norm of the weights to the loss.

4. Mood Sub-scale predictor: POMS-TMDS

The POMS sub-scale LSTM (POMS-TMDS) is a multi-output regressor. It is optimized to predict the six sub-scales of tension, vigor, depression, anger, confusion, and fatigue, each in the range 1-5, by minimizing the L2 loss (with L2 regularization) using an Adam optimizer with a learning rate of 0.1, trained over 15 epochs; see Figure 4.4 for the choice of epoch count. The estimated sub-scale values are then used to compute a TMD score estimate.
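A minimal sketch of the first two variants is shown below, using Keras purely as a stand-in; the dissertation does not prescribe a specific framework, and the hidden size and optimizer settings here are illustrative rather than the values used in our experiments. The mood-aware variant simply adds the max-normalized TMD score to every AU vector before the LSTM, as described for POMS-SR.

```python
# Illustrative sketch (not the exact experimental configuration): a single-layer
# LSTM engagement classifier over AU sequences, plus POMS-SR-style preconditioning.
import numpy as np
import tensorflow as tf

SEQ_LEN, N_AUS, N_CLASSES = 900, 20, 5   # 30 s at 30 fps, 20 AUs, engagement levels 1-5

def build_engagement_classifier(hidden_units=64):
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(SEQ_LEN, N_AUS)),
        tf.keras.layers.LSTM(hidden_units),                      # single-layer LSTM
        tf.keras.layers.Dense(N_CLASSES, activation="softmax"),  # softmax + cross-entropy
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    return model

def precondition_with_tmd(au_sequences, tmd_scores):
    """POMS-SR preconditioning: add the max-normalized TMD score to each AU vector."""
    tmd_norm = np.asarray(tmd_scores, dtype=np.float32) / np.max(tmd_scores)
    return au_sequences + tmd_norm[:, None, None]

# Example with random stand-in data (8 segments, labels shifted to classes 0-4).
aus = np.random.rand(8, SEQ_LEN, N_AUS).astype(np.float32)
labels = np.random.randint(0, N_CLASSES, size=8)
tmd = np.random.randint(1, 25, size=8)
model = build_engagement_classifier()
model.fit(precondition_with_tmd(aus, tmd), labels, epochs=1, batch_size=4, verbose=0)
```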

[Figure 4.4 appears here: training and testing L2 loss curves for the RX and TR modules as a function of the number of epochs.]

Figure 4.4: Convergence Plot: The figure shows the L2 loss when training the LSTM model for mood prediction as a function of epochs. It can be observed that it takes approximately 50-60 epochs for the training and testing losses to converge for both the RX and TR modules. After the convergence point, the model is prone to over-fitting.

4.3.4 Cross-Validation and Train-test pipeline

Our multi-class engagement model uses 20-fold cross-validation with 420 segments of 30 seconds each in the training set and 46 segments of 30 seconds each in the testing set for the Trigger (TR) module. Similarly, for the Relaxation (RX) module, the training set consists of 313 segments of 30 seconds each and the testing set of 35 segments of 30 seconds each. Each segment has a self-reported engagement score as ground truth on a scale of 1-5. This yields AUs from 378K frames for training the engagement-prediction LSTM for the TR module and AUs from 282K frames for training in the RX module. Mood-aware contextual engagement LSTMs were trained per context.

In order to test the generalizability of POMS-aware modeling, we use the more traditional Leave-One-Subject-Out (LOSO) methodology. The baseline LOSO model was built on the Engagement multi-class classifier described in Section 4.3.3. The LOSO mood-aware models used the POMS-SR technique to predict categorical engagement. Both the baseline and mood-aware LOSO methods had data from 90 subjects for TR and 92 subjects for RX.
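For reference, a leave-one-subject-out split can be expressed with scikit-learn's LeaveOneGroupOut by passing the subject ID as the group label. The sketch below is a generic illustration with synthetic data; the array names and the use of scikit-learn are assumptions, not part of the original pipeline.

```python
# Hypothetical sketch of the LOSO protocol: each fold holds out every segment
# belonging to one subject and trains on the segments of all remaining subjects.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

n_segments, n_features = 120, 900 * 20                # stand-in sizes
X = np.random.rand(n_segments, n_features)            # flattened AU sequences (placeholder)
y = np.random.randint(1, 6, size=n_segments)          # self-reported engagement 1-5
subjects = np.random.randint(0, 30, size=n_segments)  # subject ID for each segment

logo = LeaveOneGroupOut()
for train_idx, test_idx in logo.split(X, y, groups=subjects):
    held_out = np.unique(subjects[test_idx])
    assert len(held_out) == 1   # exactly one subject is held out per fold
    # ...train on X[train_idx], y[train_idx]; evaluate on the held-out subject...
```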

Our automated mood LSTMs, i.e., POMS-TMD and POMS-TMDS, used 10-fold cross-validation. Each fold had 600 segments in the training set and 67 segments in the test set. As shown in Figure 4.3, two 30-second segments were selected with the same POMS label to increase the effective number of training samples. Each segment was accompanied by a POMS self-report collected before/after the module, as shown in Figure 4.3. TMD scores from self-reports were used to condition the AUs and form the POMS-SR baseline. We formulate the problem of automated mood prediction as a sequence-based single-output (POMS-TMD) or multi-output (POMS-TMDS) regression and train an LSTM to predict a TMD score or the individual mood clusters.

Finally, we used the automated mood prediction pipeline to compute TMD scores from both the POMS-TMD and POMS-TMDS models. These automated TMD scores were used to pre-condition the AUs, and LSTMs were then trained for engagement prediction. The convergence plot for one fold of data for the POMS-TMD LSTM is shown in Figure 4.4. It took approximately 5-7 epochs for both the TR and RX contextual models to stabilize the test loss. After 60 epochs the model starts overfitting; therefore, we selected 15 epochs as a constant for all cross-validation model evaluations, irrespective of the context.

4.4 Evaluation Results

In this section, we discuss in detail the results obtained for engagement prediction across a variety of tasks and its task specificity. The results are summarized in Tables 4.1 and 4.2.

                   RX        TR
LSTM Baseline      0.9989    0.7653
POMS-aware LSTM    0.9493    0.6786

Table 4.1: The effect of POMS on contextual engagement prediction using Leave-One-Subject-Out (LOSO) validation. Augmenting AUs with POMS data shows a clear reduction in RMSE; the difference for TR is statistically significant with a two-sided paired t-test at p=.0007.

Since the engagement scores are ordinal, not categorical, for POMS-aware modeling with the LOSO methodology we report the root-mean-squared error (RMSE). The performance of the POMS-aware engagement predictions is summarized in Table 4.1. Even though the LSTM model was optimized for categorical correctness, we notice a significant improvement in performance by augmenting AUs with POMS data. The POMS-aware engagement model for TR (POMS-TR) showed a significant reduction in error (p=.0007) at the 99% confidence level. Due to the contextual nature of engagement prediction (as shown in Chapter 3), we do not create POMS-aware models for cross-contextual or combined (RX+TR) settings.

Model                                    Trigger             Relaxation
Engagement-baseline                      48.55 ± 17.7 %      42.04 ± 11.7 %
Engagement-Mood Aware (POMS-SR)          52.57 ± 17.2 %      43.88 ± 11.9 %
Engagement-Automated MA (POMS-TMD)       54.04 ± 18.14 %     46.14 ± 14.38 %
Engagement-Automated MA (POMS-TMDS)      55.54 ± 18.43 %     45.42 ± 12.99 %

Table 4.2: The first row shows engagement prediction results without mood conditioning. The second row represents mood pre-conditioning performed with POMS self-reports for engagement prediction. The third and fourth rows show engagement prediction accuracy with mood conditioning performed with POMS-TMD and POMS-TMDS estimates from temporal deep-learning-based mood prediction. We note that the performance of engagement prediction can be improved by pre-conditioning with POMS self-report data. Accuracy is further enhanced by pre-conditioning with visual POMS-TMD estimates, suggesting the effectiveness of automated mood prediction. The performance numbers are obtained using 20-fold cross-validation, with accuracy and standard deviation reported in the table.

Table 4.2 shows the results of the baseline and mood-aware engagement predictors in terms of prediction accuracy. We train separate contextual LSTMs for each contextual model (RX and TR), i.e., for engagement prediction, POMS-TMD, and POMS-TMDS. LSTMs trained with only the AUs from the engagement segments serve as the baseline for all predictions. We obtain 48.55 ± 17.7 % average prediction accuracy across 20 folds for the TR module and 42.04 ± 11.7 % average prediction accuracy for the RX module (margins of error correspond to standard deviations computed over the 20 folds). We noticed significant improvements in performance for the TR module in a two-sided paired t-test (p=.07), with an average prediction accuracy of 52.57 ± 17.21 % for POMS-SR, whereas RX accuracy increased to 43.88 ± 11.9 % but was not statistically different (p=.26), consistent with the findings in Table 4.1. Using the visual estimates from the POMS-TMD regressor increased engagement accuracy significantly for TR at the p=.05 level when compared to POMS-SR, highlighting the effect of using visual POMS rather than self-reports. On the other hand, for RX, although the visual predictors POMS-TMD and POMS-TMDS show improvements in accuracy from the baseline (46.14 ± 14.3 % and 45.42 ± 12.9 %), there is only statistically weak evidence of improvement (p=.19) over POMS-SR.
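For reference, the per-fold significance checks reported here can be run with SciPy's paired t-test; the snippet below is a generic illustration with made-up fold accuracies, not the values from Table 4.2.

```python
# Illustrative paired t-test over per-fold accuracies (numbers are made up).
from scipy import stats

baseline_acc   = [0.46, 0.51, 0.44, 0.49, 0.52]  # baseline accuracy per fold
mood_aware_acc = [0.50, 0.55, 0.47, 0.53, 0.55]  # mood-aware accuracy per fold

t_stat, p_two_sided = stats.ttest_rel(mood_aware_acc, baseline_acc)
print(f"t = {t_stat:.3f}, two-sided p = {p_two_sided:.4f}")
```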

In standalone evaluation, the POMS-TMD regressor obtained an average Mean Squared Error (MSE) of 0.03673 with a standard deviation of 0.0153 in mood prediction for the TR module, and an average MSE of 0.03633 with a standard deviation of 0.01227 for the RX module using the POMS-TMDS predictor. The average MSE and standard deviation for mood prediction were computed over 10-fold cross-validation. Evaluating the performance of the visual mood-aware regressors on a standalone basis is not strictly required, since the visual regressors surpass POMS-SR; nevertheless, for completeness, the POMS-TMD and POMS-TMDS results are shown in Figure 4.5 along with the actual scores from the self-reported POMS data. Moreover, the statistically insignificant results for RX could be attributed to the significant changes in mood states during the RX task vs. the TR task, analyzed in Figure 4.2. Such significant changes in mood states during a task strongly suggest the need to build continuous predictors of mood states from video in order to monitor users during an intervention.

4.5 Broader Impact

There are multiple implications of the proposed work. Mood and its relationship with other cognitive abilities have been widely studied in the psychology literature. Collecting questionnaires from subjects is a cumbersome process and, in many cases like trauma recovery, poses an additional burden on the subjects. Automated mood prediction using facial video data provides a scalable platform to study a wide range of behavioral tasks and opens up multiple opportunities in the domains of trauma recovery, elderly care, treatment of mood disorders, and other rehabilitation settings. In the domain of web-interventions, automated mood and engagement prediction provides a mechanism to build adaptive evidence-based treatment [235, 153].

[Figure 4.5 appears here: boxplots of actual (self-reported) and predicted POMS values for the Relaxation and Triggers modules, with TMD and each mood sub-scale on the x-axis.]

Figure 4.5: Automated Mood Predictors: This figure shows the POMS-TMD and POMS-TMDS values for the Relaxation as well as the Triggers modules. The subplots on the left show the mood-scores from subject self-reports and the subplots on the right show the predicted TMD and sub-scale values, with the TMD and sub-scale mood types on the x-axis. Each boxplot shows the mean (green line), the median (orange line), and the inter-quartile range, i.e., the upper (75th percentile) and lower (25th percentile) lines of the box. Notice that even though the predicted values do not cover the full dynamic range of mood-scores, the mean of the actual and predicted POMS is similar at the TMD and sub-scale levels.

Beyond psychological studies, automated mood prediction has applications in other areas as well. In human-robot interaction [184], mood assessment can lead to adapting robot behavior based on the subject's affective state. In the domain of multi-player games [234], one can start addressing questions such as: do a player's actions reveal a friendly or aggressive mood towards the other players, and can we use the player's actions to predict subsequent actions? In the domain of computer-aided instruction [226], understanding mood and engagement in an automated and scalable way can help devise better learning tools or personalized instruction mechanisms. In workplaces, understanding performance during job interviews, analyzing employee stress, and assessing the performance of call-center employees during long conversations can be significantly improved by automated mood and engagement prediction methods. In the domain of computational advertising [148], automated mood-aware engagement prediction methods can help create more sophisticated tools to better align the advertising needs of users and content creators. In recent years, a number of mobile phone applications such as Pacifica, Mindshift, Moods, Mymoodtracker, and others have paved the way for automated tracking and analysis of mood based on self-reported mood data. There is significant potential for methods based on facial analysis and temporal modeling to aid such automated mood-tracking mobile phone applications.

4.5.1 Robust Mood and Mood-Aware Engagement Models

In this work, we presented a method for automated detection of total mood disturbance and mood sub-scale prediction from facial action units and LSTMs. Mood is typically distinguished from emotion based on its duration (mood tends to last longer); hence, powerful sequence learning algorithms such as LSTMs were well suited for this task. Our experiments show that mood prediction using visual estimates of POMS-TMD and POMS-TMDS performs better than or at par with self-reported TMD. We used the detected mood to aid engagement prediction models. The mood-aware engagement models presented in this work (developed using only video data) outperformed the baseline engagement models. A natural extension of this work would be enhanced mood prediction models that incorporate contextual information. A detailed analysis of video data is warranted for understanding phenomena such as the relationship between mood and arousal, mood shifts, and person-specific temporal associations of facial expressions with mood. Mood prediction models should incorporate additional information such as situational context, scene semantics, and personal behaviors to gain an in-depth understanding of the relationship between facial expressions and the associated mood predictions.

The mood-aware engagement models presented in this work were developed by pre-conditioning the AUs with the POMS-TMD score to normalize the AU representation. Pre-conditioning with mood data was helpful in this work because mood self-reports were available at the start of the session. However, as we move towards developing automated mood prediction methods, there is a need to continuously condition and update AUs using information from mood classifiers running on the video data. Further, this conditioning should incorporate long-term mood shifts into the engagement prediction models for more reliable predictions. The pre-conditioning technique used in this work relied only on the pre-POMS for conditioning AUs. In actuality, subjects do not change their mood like an on/off switch; continuous mood prediction should be explored for web-based trauma recovery. The conditioning on pre-POMS could be improved by using the latest advances in reinforcement learning to add non-time-series meta-data for temporal sequential decision making [119, 219]. Post-POMS could be incorporated as a high-level reward for better vector representations. In deep reinforcement learning, state-of-the-art results have been achieved by directly maximizing cumulative reward using high-level spatio-temporal representations for end-to-end learning [113]. Finally, there is a need to develop a framework for conditioning a variety of affective/emotion features with mood data, not just AUs, to develop truly scalable and robust mood-aware affect prediction models.

Chapter 5

Automated Action Units vs. Expert Raters

5.1 Introduction

The exploration of new research avenues in the fields of computer vision and affective computing has leveraged tools and techniques from the field of crowdsourcing [151, 200]. The use of platforms like MTurk, CrowdFlower, etc., to obtain external annotations for large-scale data mining is now commonplace. Most affective computing applications rely on expert raters to obtain continuous labels of affective and cognitive states [41, 181, 38]. Simultaneously, recent research [108] has shown that it is feasible to deploy lightweight CNN-based architectures on real-time video streams for affect detection. The current gold standard for measuring user engagement is estimated from self-reports. Questionnaires are simple and inexpensive, but they distract the user from the task at hand or increase cognitive load, as noted in Chapter 1. Obtaining self-reports is problematic, especially for people suffering from mental disorders whose symptoms include, but are not limited to, a reduced ability to concentrate, trouble understanding, and frequent mood swings. Automated continuous prediction of engagement is therefore of utmost importance to numerous applications. In this chapter, we build on the prior machine-learning Automated AUs models developed in Chapter 3 and compare them to engagement models built on expert perceptions. We also build a novel model that combines the “best of both worlds” (see the reasoning in Section 3.6.2).

In recent years, psychologists have turned to web/mobile-based interventions as an effective way to provide treatment and therapy for a number of mental health challenges such as depression, mood management, anxiety disorders, trauma recovery, and others [8]. A number of commercial solutions, such as those offered by Pacifica and Ginger.io [2, 1], have been developed for stress/anxiety management and emotional well-being as an effective route to sustainable mental health. These solutions help patients monitor their progress towards recovery through interactive questionnaires about their well-being (mood, stress, sleep patterns, etc.) and by providing remote psychologist and peer-community support. Such self-support tools cannot be too generic for trauma recovery and need to autonomously adapt to the patient's needs based on their mental and physical state [235]. Although ample evidence exists for the clinical effectiveness of web/mobile interventions across a wide base of interventions, in the case of trauma subjects, who often experience emotional detachment and disengagement from mundane tasks, effective engagement with these interventions remains a significant concern [156].

As noted earlier in Section 3.6.2, incorporating dense labels from external observer annotations has the potential to outperform existing automated engagement estimation models (developed in prior work [66]). The ground truth for engagement prediction consists of either sparse labels obtained from self-reports or continuous annotations (observational estimations) from expert external human observers. Self-reports, questionnaires completed by subjects to report their level of engagement, provide accurate feedback about the subject's well-being. However, repeated questionnaires impose a cognitive burden on trauma subjects. An alternative approach is to obtain continuous annotations from trained psychology experts who use facial and behavioral cues to make judgments about a subject's engagement [91]. These judgments are standardized across observers through the use of measurement tools [89]. Training psychologists to make such judgments is an iterative, cumbersome, and labor-intensive process. Trauma subjects, depending on their condition, can either be very expressive or tend to suppress emotions, making the video annotation process tiring, difficult, and costly for raters. Furthermore, estimating inter-observer reliability is not always straightforward and can lead to significantly varying annotations (see Figure 5.1).

[Figure 5.1 appears here: two pipelines for estimating perceived engagement from an application user's face video, a human-rater annotation pipeline with a training phase and an automated pipeline that extracts upper- and lower-face action units.]

Figure 5.1: Designing effective applications involves understanding the engagement levels of the target population. Machine learning models that estimate user engagement have two choices: learn from external annotations of perceived engagement provided by expert raters or learn vision-based models from facial expression representations. The top pipeline in the figure represents the expert raters' annotation process. Raters undergo a training phase to formulate rules for the modality being measured, e.g., behaviors indicative of engagement/disengagement. Despite rigorous training efforts, the raters do not always agree on engagement levels. The bottom pipeline is an automated process using facial feature extraction in the form of Action Units (AUs). We can directly estimate engagement from the AUs or use the AUs to estimate the expert rater scores. This work compares expert raters to automated AUs for estimating user engagement and also proposes a new scalable approach to combine the best of both worlds.

The contributions of this chapter address these fundamental questions:

• How robust are annotations obtained from expert human observers?

• How do human observers fare compared to machine learned features for engagement prediction?

• How can we develop vision systems that can learn and use inter-observer variability in prediction?

5.2 Related Work

This work lies at the intersection of multiple emerging areas which we briefly review and compare in this section.

Visual Engagement Prediction: Predicting subject engagement from facial data has been an active area of research in the computer vision and affective computing literature in recent years. Monkaresi et al. [157] developed methods for detecting engagement when subjects were performing a structured writing activity. Hernandez et al. [106] developed a facial-feature and head-gesture-based method to measure the engagement of viewers while watching television. McDuff et al. [151] explored the role of emotions and engagement for media applications to understand the effectiveness of video advertising. Whitehill et al. [226] developed various student engagement methodologies by analyzing facial expressions with machine learning techniques. They developed binary engagement detectors to estimate high/low engagement levels in video segments (monitoring students) and found the performance of the system comparable to humans. In the recent past, Kamath et al. [118] proposed a crowdsourced discriminative learning approach/system for e-learning and estimated student engagement. Almost all of the aforementioned techniques have been confined to student learning or advertising. Most works also build a machine learning approach on a chosen set of features (either vision-based or from external annotations), and no comparison is made between perceived engagement and automated vision-based techniques. Unlike student learning, in trauma recovery applications subjects often do not have control over their emotions and suffer from varying degrees of disengagement, raising unique challenges.

Multi-Rater Annotations: Supervised learning methods typically use a label from a single annotator per training sample. When working with perceived emotions, it is common practice to get feedback from multiple annotators to avoid bias. However, this process often leads to noisy labels with varying degrees of inter-rater agreement. Raykar et al. [179] presented a probabilistic approach for learning with data from multiple annotators and proposed an algorithm that evaluates the labels obtained from different annotators and also gives an estimate of the hidden labels. Welinder et al. [225] presented an approach to model multiple non-expert annotators as a multidimensional entity with variables representing competence, expertise, and bias. Our work involved psychology research assistants who underwent rigorous training for this task (experts); hence, such a model would not be applicable. In bioinformatics, Valizadegan et al. [216] presented a consensus approach for learning with multiple annotators, wherein a separate model was trained from each annotator and the results were fused to obtain a consensus model. In such approaches, it is hard to estimate the source of variance [166]. In the domain of affective computing, continuous measurement of perceived emotions (annotations) was popularized by Gottman et al. [94] with the introduction of the “affect rating dial”. Raters were instructed to turn the knob clockwise or counter-clockwise to report their affective experience. More recently, a number of annotation tools have been developed to collect continuous ratings, as proposed by Brugman et al. [35] (ELAN), Kipp et al. [124] (ANVIL), Nagel et al. [165] (EmuJoy), and others. The dataset we used for this work consists of audio, video, and physiological signals (only the video signals are considered here). For the EASE dataset that we use in this work (details in Section 5.4), an annotation tool developed by Girard et al., called Continuous Affect Rating and Media Annotation (CARMA), was used. The tool is based on the work of Gottman et al. CARMA is a media annotation program that collects continuous ratings from observers with the flexibility of using data from multiple modalities, and hence was suitable for our data collection (see Figure 5.2 for details on the annotation process).

5.3 Sequential learning model using RNN: automated, expert raters

As demonstrated in Table 3.4, Recurrent Neural Networks (RNNs) perform best among the sequential algorithms evaluated for engagement prediction. RNNs are well suited for problems involving sequential information and in recent years have found relevance in a wide range of tasks, including machine translation, video classification, natural language processing, image captioning, and visual question answering. RNNs learn a representation from a sequence of input vectors [191, 93]. Engagement prediction from facial videos is inherently a sequence prediction task: there is a series of facial expressions (characterized by facial action units) that the system sees, and it has to make predictions about the level of engagement. Hence, in this section we modify, adapt, and build on the prior Automated AUs models (from Chapter 3) to develop algorithms for engagement prediction with RNNs.

An RNN is a neural network that maintains a hidden state h and operates on a variable-length sequence x = (x_1, \ldots, x_T). At each time step t, the hidden state h_t of the RNN is updated by:

h_t = \theta\,\phi(h_{t-1}) + \theta_x x_t \qquad (5.1a)

y_t = \theta_y\,\phi(h_t) \qquad (5.1b)

where φ is the non-linear activation function and y is the target output unit. The activation function φ may be as simple as an element-wise logistic sigmoid function or as complex as a long short-term memory (LSTM) unit [86]. In this work, the input sequence x is composed of either facial action units x_{AU} or expert ratings x_{HA}. The target output y for a given video segment consists of the self-reports obtained from trauma recovery subjects.

During back-propagation through the recurrent units, the derivative at each node depends on all of the nodes processed earlier. To compute \partial h_t / \partial h_k, a series of multiplications from k = 1 to k = t - 1 is required. Assume that \dot{\phi} is bounded by \alpha; then \left\| \partial h_t / \partial h_k \right\| < \alpha^{t-k}.

\frac{\partial E}{\partial \theta} = \sum_{t=1}^{S} \frac{\partial E_t}{\partial \theta} \qquad (5.2)

\frac{\partial E_t}{\partial \theta} = \sum_{k=1}^{t} \frac{\partial E_t}{\partial y_t} \frac{\partial y_t}{\partial h_t} \frac{\partial h_t}{\partial h_k} \frac{\partial h_k}{\partial \theta} \qquad (5.3)

\frac{\partial h_t}{\partial h_k} = \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}} = \prod_{i=k+1}^{t} \theta^{T} \, \mathrm{diag}\!\left[\dot{\phi}(h_{i-1})\right] \qquad (5.4)

where E_t is the loss at the t-th time step. A solution to this vanishing/exploding gradient problem is to use gated structures; the gates can control the back-propagation flow between nodes [86, 215, 93].
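To make Equations 5.1a-5.1b concrete, a small NumPy forward pass of this plain recurrent cell is sketched below; the dimensions and the choice of tanh for φ are illustrative assumptions.

```python
# Toy forward pass of the recurrence in Eqs. 5.1a-5.1b (illustrative dimensions).
import numpy as np

d_in, d_hidden, d_out, T = 14, 32, 1, 30    # e.g., 14 AUs per step, 30 time steps
rng = np.random.default_rng(0)
theta   = rng.normal(scale=0.1, size=(d_hidden, d_hidden))  # recurrent weights
theta_x = rng.normal(scale=0.1, size=(d_hidden, d_in))      # input weights
theta_y = rng.normal(scale=0.1, size=(d_out, d_hidden))     # output weights
phi = np.tanh                                               # non-linear activation

x = rng.normal(size=(T, d_in))   # input sequence, e.g., AU vectors x_AU
h = np.zeros(d_hidden)
for t in range(T):
    h = theta @ phi(h) + theta_x @ x[t]   # Eq. 5.1a
    y = theta_y @ phi(h)                  # Eq. 5.1b
print(y)  # output after the final time step
```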

5.4 EASE Subset & Annotations

For this work, we use a subset of the original EASE dataset for which expert rater annotations are available (the Triggers module). Expert annotation by psychologists is a multi-step process [91]. In the first phase, graduate students/post-doctoral fellows with a background in clinical psychology undergo training to understand the labeling process. As the goal of the process is to annotate perceived engagement by external observers, annotators were asked to label "How engaged does the subject appear?". Annotators also formulated guidelines on how to perceive engagement. The guidelines stated that if the subject showed expressions like “brow furrow”, “squinting focus”, “fast reading”, “head scanning”, etc., these were signs that the subject was engaged. Similarly, signs like “gaze avoidance”, “fidgety movement”, “dropping eyes”, and “closing eyes when tasks do not require” were signs of the subject being disengaged. These guidelines were created by experienced clinical psychologists who work with trauma subjects. Following this familiarization process, annotators rated engagement levels between -100 (very disengaged) and 100 (very engaged). The annotations were obtained by post-analysis of video sequences (rather than live annotation) due to ease of collection.

Figure 5.2: CARMA tool: The figure illustrates the tool used for obtaining expert annotations from trained psychologists on the EASE Dataset [66]. Expert raters are presented with a video feed of a trauma subject (shown above) undertaking a particular task. Annotators provide a rating between -100 (very disengaged) and +100 (very engaged) to record the perceived engagement levels of the trauma subject. The ratings are recorded as continuous annotations sampled on a per-second basis. Note: the face of the subject is blurred due to IRB restrictions.

We ensured that not all annotators rated the same set of videos, to avoid bias in the annotators' perceived engagement. The annotations were obtained every second. When labeling videos, the audio was turned off, and raters were instructed to label engagement based on appearance alone. As can be inferred, this is a cumbersome, costly, and painstakingly slow process due to the challenges associated with recruiting expert annotators, training them, and labeling a large corpus of data. In this work, we use 6K+ annotations from a total of 183K+ frames (see Table 5.1).

Dataset Details                 No. of Instances
Subjects                        54
Self-Reports                    204
Videos                          65
Frames                          183.6K
Annotations                     6120

Distribution of Self-Reports
Very Engaged - 5                51
Engaged - 4                     94
Neutral - 3                     50
Disengaged - 2                  5
Very Disengaged - 1             4

Table 5.1: Data distribution of the EASE subset: User face videos and external annotations were available from 54 trauma subjects. There were a total of 64 videos from Sessions 1 and 2 combined. In all our experiments we use the 30 seconds of video prior to each engagement response, making a total of 183.6K video frames at 30 fps. Expert raters annotated the videos on a per-second basis, resulting in the 6120 annotations used in our work. Engagement responses were also obtained from 204 self-reports on a scale of 1-5, from 1 being “Very Disengaged” to 5 being “Very Engaged”. The bottom part of the table shows the distribution of responses based on self-reported user engagement.

5.5 Machine Learning Models

We now describe the methodology adopted to develop models trained on data derived from expert raters and from machines (facial action units). Training data from the expert raters was obtained from the engagement ratings described in the previous section; the data is obtained at 1 rating per second. The machine-derived data for comparison was obtained by extracting facial action units from the video frames of the EASE subset. We use the recent work on OpenFace [11] proposed by Baltrušaitis et al. to extract facial action units. In our prior work on contextual engagement [66] and mood-aware engagement [67], the AUs extracted from OpenFace consisted of both intensity-based and presence-based AUs, making a total of 20 feature dimensions. For all automated models in this work, we focus only on the intensity-based AUs to reduce the effect of combining categorical and numeric features. This reduces our input feature dimensions from 20 to 14. Intensity-based AUs are generated by OpenFace on a 0 to 5 point scale (not present to present with maximum intensity). The list of AUs used in this work is as follows: Inner Brow Raiser, Outer Brow Raiser, Brow Lowerer, Upper Lid Raiser, Cheek Raiser, Nose Wrinkler, Upper Lip Raiser, Lip Corner Puller, Dimpler, Lip Corner Depressor, Chin Raiser, Lip Stretcher, Lips Part, and Jaw Drop. AUs are extracted at 30 fps.

[Figure 5.3 appears here: three stacked plots for a single subject, the 14 intensity-based AUs (AU01-AU26) over time, the perceived engagement from three expert raters with their average and the self-report markers, and the inter-rater variance over time.]

Figure 5.3: The figure shows the automated AUs, rater annotations, and subject self-reports for a single subject from the EASE dataset. The top plot shows the 14 intensity-based action units extracted from OpenFace; the y-axis is the intensity of each AU. The middle plot shows the perceived engagement from 3 expert raters and their average; the vertical dotted lines mark the user self-reports. Since the self-reports are on a scale of 1-5, which differs from the expert rater annotation range of -100 to +100, we cannot plot them in the same manner. All three self-reports submitted by the subject in this plot were “5”, and the video segment is from a highly engaged subject. The variance of the three raters is shown in the bottom plot. The x-axis in all plots is time in seconds. These plots are obtained from a 7.3-minute video segment. Notice how the inter-rater variance changes radically.

We represent both the human annotations x_{HA} and the action units x_{AU} obtained from facial data as a series of inputs, and the associated subject engagement self-reports y as target outputs. Hence, we model it as a sequence learning problem and develop a recurrent neural network based learning framework. We conduct a number of experiments (described below) to gain deeper insight into the relationship between models learned from machine-extracted features and from expert human ratings. In order to have a fair comparison of all four models mentioned below, we use only the 30-second segments prior to each self-reported engagement for feature extraction.

Raters Average: As noted earlier, three ratings were obtained for each video. We compute the average over the three ratings and train a sequential RNN model as a regressor over the self-reported engagement levels to learn temporal dependencies. Following this process, we obtain a model learned purely from expert annotations.

Raters Average + Variance: We extend the Raters Average model by augmenting it with the variance computed at each instant, along with the average over the annotations. We then learn a multi-input (average & variance), single-output RNN model to predict the self-reported engagement. We term this approach Raters Average + Variance.

Automated AUs: We use the facial action units extracted from the video frames of trauma subjects as multi-dimensional inputs and train a sequential RNN model as a regressor over the self-reported engagement levels. The model is trained to integrate long-term temporal variations in facial expressions via the facial action units. We term this approach Automated AUs.

Facial Average and Variance Estimations (FAVE): We extend the Automated AUs approach into a two-step approach for engagement self-report prediction. In the first step, we use the multi-dimensional action units to predict the continuous ratings obtained from the expert annotations, more specifically the ratings' average and variance; we train RNN models to learn temporal dependencies and jointly predict the expert raters' average and variance. In the second step, we use these “machine-estimated” human annotations (the average and variance ratings obtained from the first step) and train a regressor over the engagement self-reports for each subject.
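A schematic sketch of this two-step pipeline is given below. It uses Keras with simplified shapes as a stand-in for the actual implementation; the layer sizes, the joint 2 x 30 output of the first step, and the use of SimpleRNN cells are assumptions made only to show the flow from AUs to estimated rater statistics to the self-report prediction.

```python
# Illustrative two-step FAVE-style pipeline (shapes and layer sizes are assumptions).
import numpy as np
import tensorflow as tf

SEQ_FRAMES, N_AUS, SEQ_SECONDS = 900, 14, 30  # 30 s of AUs at 30 fps; 30 per-second ratings

# Step 1: from the AU sequence, jointly regress the raters' per-second average and variance.
au_to_rater_stats = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(SEQ_FRAMES, N_AUS)),
    tf.keras.layers.SimpleRNN(32),
    tf.keras.layers.Dense(2 * SEQ_SECONDS),   # 30 averages and 30 variances
])
au_to_rater_stats.compile(optimizer="adam", loss="mse")

# Step 2: from the estimated per-second (average, variance) pairs, regress the self-report.
stats_to_self_report = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(SEQ_SECONDS, 2)),
    tf.keras.layers.SimpleRNN(16),
    tf.keras.layers.Dense(1),
])
stats_to_self_report.compile(optimizer="adam", loss="mse")

# Example wiring with random stand-in data (both steps would be trained in practice).
aus = np.random.rand(4, SEQ_FRAMES, N_AUS).astype(np.float32)
est = au_to_rater_stats.predict(aus, verbose=0).reshape(4, SEQ_SECONDS, 2)
pred_self_report = stats_to_self_report.predict(est, verbose=0)
print(pred_self_report.shape)   # (4, 1): one engagement estimate per segment
```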

[Figure 5.4 appears here: block diagrams of the four models, RA (raters' average, 30 x 1 input), RAV (raters' average & variance, 30 x 2 input), AUs (facial action units, 900 x 14 input), and the two-step FAVE model, each trained to predict the engagement self-reports.]

Figure 5.4: Machine Learning Models: The four machine learning models are illustrated along with the input and output dimensionality of each. The inputs to all models are obtained from 30-second video sequences at 30 fps for each self-report (204 in total). The two models at the top, RA and RAV, are trained with expert annotations using the raters' average and the raters' average & variance, respectively; the expert annotations are on a per-second basis. The AUs model was trained with automated intensity-based AUs extracted from OpenFace at the frame level. The FAVE model was a two-step model that learned to estimate the expert annotators' average and variance from the automated AUs and used the estimated average and variance for the final predictions. The final prediction target for all four models is the user self-report.

5.6 Model Training/Tuning

Deep learning models like RNNs have several hyper-parameters that can impact performance. We analyze three basic hyper-parameters, namely batch size, number of epochs, and learning rate. The selection of batch size is especially important because of the EASE subset engagement self-report distribution shown in Table 5.1: a small batch size may learn subject-specific bias during training, and a large batch size might cause overfitting on the entire dataset. Similarly, if the learning rate is too small the learning algorithm may get stuck in a local minimum, and if it is too high the algorithm may bypass the local minima and never converge. In order to find an effective batch size and learning rate, we first split the 204 self-reports randomly to create a training set comprising 90% of the total samples (184), with the remaining samples in the validation set. We perform a grid search on all four learning models over batch sizes of 20, 40, 60, and 80 and learning rates of 1, 0.1, 0.01, 0.001, and 0.0001, creating convergence plots of the training and validation L2 losses over 1500 epochs. For the three models Raters Average, Raters Average + Variance, and Automated AUs described in Section 5.5, we found that the algorithms converged for a batch size of 40 samples and a learning rate of 0.1, with the training and validation losses stabilizing after approximately 5-7 epochs. Since our hypothesis concerns comparing expert raters to machine-generated features, we keep the model parameters constant for all four models. We select 15 epochs for all models; at 15 epochs the L2 validation loss was at its minimum and stable. The constant model parameters enable us to compare the effects of the feature inputs directly rather than study algorithm effects.
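The grid search itself is straightforward; a hypothetical sketch of the loop over batch sizes and learning rates is shown below, with a toy model and synthetic data standing in for whichever of the four models is being tuned (the epoch count is truncated here, whereas the text above uses 1500 epochs).

```python
# Hypothetical grid search over batch size and learning rate (synthetic stand-in data).
import itertools
import numpy as np
import tensorflow as tf

def build_model(learning_rate):
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(30, 2)),   # e.g., 30 per-second (avg, var) inputs
        tf.keras.layers.SimpleRNN(16),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate), loss="mse")
    return model

X_train, y_train = np.random.rand(64, 30, 2), np.random.rand(64, 1)
X_val, y_val = np.random.rand(16, 30, 2), np.random.rand(16, 1)

results = {}
for batch_size, lr in itertools.product([20, 40, 60, 80], [1, 0.1, 0.01, 0.001, 0.0001]):
    model = build_model(lr)
    hist = model.fit(X_train, y_train, validation_data=(X_val, y_val),
                     batch_size=batch_size, epochs=5, verbose=0)   # 1500 epochs in the text
    results[(batch_size, lr)] = min(hist.history["val_loss"])

print("best (batch size, learning rate):", min(results, key=results.get))
```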

[Figure 5.5 appears here: training and validation L2 loss curves over 100 epochs for training batch sizes of 1000, 1500, and 2000.]

Figure 5.5: Model tuning for FAVE: The plot shows the training and validation loss for 100 epochs and training batch sizes of 1000, 1500, and 2000. The L2 loss stabilized at a learning rate of 0.001. Notice that the lowest training loss for all three batch sizes is approximately the same, while the lowest validation loss is obtained for a batch size of 1000. Therefore, for our FAVE model, we selected 100 epochs and reported results for a batch size of 1000.

As mentioned in Section 5.5, the FAVE model is a two-step process. The first step of this model had a total of 6120 samples, i.e., the number of expert-rated annotations listed in Table 5.1. The convergence plot for this model with batch sizes of 1000, 1500, and 2000 and a fixed learning rate of 0.001 is shown in Figure 5.5. Notice that even though the training and validation errors do not converge for batch sizes of 1500 and 2000, the L2 loss converges for a batch size of 1000 around 15 epochs. As an initial test, we tried to jointly estimate the raw average and variance of the expert raters using automated AU data extracted from 30 frames, i.e., 1 second. However, we noticed that the model learned the bias of the ground-truth labels. This finding was not surprising, as the engagement estimation for self-reports was an unbounded regression problem and the range of the ground truth was large (-100 to +100). As a next step, we normalized the raters' average by the maximum value of the expert rater annotations, such that the estimates lie in the range of -1 to +1 and the variance in the range of 0 to 1. Once all the model parameters are fine-tuned and fixed, we use the Leave-One-Subject-Out methodology to evaluate the results from all four models.
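The normalization described above amounts to dividing by the relevant maxima; a short illustration with hypothetical arrays:

```python
# Hypothetical per-second rater statistics, normalized as described above.
import numpy as np

MAX_RATING = 100.0                                        # raters annotate in [-100, +100]
rater_avg = np.random.uniform(-100, 100, size=(204, 30))  # per-second rater averages
rater_var = np.random.uniform(0, 600, size=(204, 30))     # per-second rater variances

avg_norm = rater_avg / MAX_RATING          # now in [-1, +1]
var_norm = rater_var / rater_var.max()     # now in [0, 1]
```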

5.7 Results

Because we have very limited training/testing data, we used the Leave-One-Subject-Out (LOSO) cross-validation methodology for computing results and report the mean squared error (MSE) along with the standard deviation (SD) between the predicted engagement levels and the ground-truth self-reported engagement levels. That is, for each of the four different machine learning models, we train 54 different regressors, using each to predict the results on the left-out subject's samples. The results are summarized in Table 5.2.

The Raters Average model used the average annotations obtained from 3 raters to predict self-reports. With an MSE of 1.1499, it was predicting, just not all that well: prediction using average expert rater data was the worst-performing model in terms of MSE and SD.

Perceived observer annotations vary significantly from each other depending on how annotators interpret the suggested guidelines for annotations and facial expressions. This variation was illustrated back in Figure 5.3, which shows the raw annotations, average, variance, and facial action units for a sample video from the EASE dataset. In the case of the Raters Average + Variance model, the system was trained using the average expert annotations and the corresponding annotation variance to predict self-reported engagement levels. The model trained using rater mean and variance performs significantly better than Raters Average, probably because it can use the inter-rater variance and the associated uncertainty to weight the prediction in its sequential representation. It can naturally learn to place greater emphasis on samples where the raters agree.

For the Automated AUs model, 14-dimensional inputs per frame over 30 seconds of data provide a high-dimensional input from which the RNN learns to directly predict the engagement self-reports. We note that prediction using the machine-extracted features performs statistically significantly better (p=0.02) at predicting engagement levels than the expert-annotator models; thus, the costly and labor-intensive annotation process may be unnecessary. We also note the reduction in the SD of the results, suggesting more stable predictions.

Finally, noting these improvements in engagement prediction performance from machine-extracted features, and also the improvement from using mean and variance, we explored combining the two ideas, expecting improved performance. We developed a two-step process: first estimate the mean and variance of the annotations from the action units, then use these annotation predictions (average and variance) as inputs to predict engagement self-reports with the FAVE model. We note the major reduction in MSE and SD over both the Raters Average + Variance and Automated AUs models. Moreover, statistical evidence (p=0.025) from a paired one-tailed t-test shows that the combined model is better than the basic Automated AUs model. Why does this happen, and what are the implications of these results? First, the total number of learned parameters is smaller for the two-stage model; hence there is a lower risk of overfitting. Second, there is the possibility that machines are more consistent at mapping the same set of facial expressions to a particular engagement level. This consistency propagates into the average & variance of the estimated annotations, which in turn leads to more stable engagement predictions. With each associated engagement level, the model is able to predict the uncertainty (variance) in the corresponding engagement level. Hence, a two-step process such as the one investigated for FAVE is ideal for scaling annotations to a large number of video segments without having to annotate each video with expert annotators. Furthermore, it lays a foundation to explore active learning techniques [220, 68] to aid in making the annotation process more consistent by developing annotator-specific adaptive active learning [137].

Machine Learning Model        Average MSE    Standard Deviation
Raters Average                1.1499         1.6610
Raters Average + Variance     0.8240         1.3583
Automated AUs                 0.8131         1.2994
FAVE                          0.7554         1.2976

Table 5.2: This table summarizes the results obtained from the experiments described in Section 5.5 using the Leave-One-Subject-Out (LOSO) methodology. The first column represents the inputs to the temporal sequence learning model for user engagement estimation. The second and third columns show the average mean squared error and standard deviation, respectively. Rows 1 and 2 are models created from expert rater annotations, and rows 3 and 4 from automated action units. Automated AUs shows a significant reduction in error from Raters Average and Raters Average + Variance using a paired two-tailed t-test (p=0.02).

5.8 Discussion

As noted earlier, collecting annotated data for trauma recovery applications is extremely expensive. The data collection process adds a significant cognitive burden on subjects, and the rater recruitment process is very cumbersome. The annotation process involves finding a large population of subject-matter expert annotators, training them, and ensuring the quality of the data. Furthermore, due to the need for a specialized setup, the costs, and the associated IRB processes, such data collection can be done only at approved labs. Hence, there is a need to develop systems that will learn, scale, and maximize usability from limited annotated data. While we show these ideas in the context of a trauma recovery application, they should apply to a wide range of vision-based affective computing.

In this work, we used only the AUs extracted from the 30 seconds prior to each self-report, and likewise expert ratings from the same 30 seconds, for a fair comparison. We noticed performance saturation beyond 30-second video segments with RNNs, and hence there is a need to explore more sophisticated temporal deep learning methods that can learn over longer time durations. Hierarchical and multi-scale temporal representations like the ones proposed by Chung et al. [54] are promising next steps to develop more accurate, flexible, and scalable engagement estimators. Beyond representations, there is a need to develop better annotation tools that reduce the cognitive burden on annotators by taking into account rater reaction time and rater variance correlated with action units [193].

We presented an engagement prediction system that can learn from expert-generated engagement annotations and machine-generated facial data. Our results show that automated systems built on extracted facial action units perform better than models built on expert-annotated data. We captured the non-agreement or confusion of raters and used it to build a more accurate engagement estimator.

Further, we enhanced the estimator with an associated prediction uncertainty by incorporating the predicted variance for the given video segment. For a fair comparison in training, all our models used the same temporal window of data. However, we have significantly more expert-labeled data; the power of the FAVE model could be advanced even further by using more training data than the 30-second segments selected in this work. The increase in training data has strong potential to create a better engagement estimator.

Surprisingly, our results also revealed that combining expert raters' annotations with the automated action units in a two-step process has the potential to estimate user engagement better. A possible reason for the enhanced performance of FAVE could be the reduction in input feature dimensions: the Automated AUs model had an input feature dimension of 900 for each input sample, while the FAVE model had 30 features for a single sample, a drastic reduction from Automated AUs. Fusion techniques like FAVE, which estimate expert annotations from automated AUs, can help scale to different datasets where only self-reports are available, decreasing the cost of training raters and annotation time [74]. Scheirer et al. [187] proposed a system for perceptual annotations to leverage the abilities of human subjects to build better machine learning systems. This work could be extended to instead incorporate rater confusion and non-agreement in the loss functions of the learning system to build more accurate engagement estimators.

Chapter 6

Self-Efficacy: Problem Formulation and Computation

“People who regard themselves as highly efficacious act, think, and feel differently from those who perceive

themselves as inefficacious. They produce their own future, rather than simply foretell it [14]” - Albert Bandura

6.1 Introduction

Psychologist Albert Bandura defined self-efficacy as the belief in one’s own capabilities to achieve intended

results [19]. Perceived self-efficacy can play a major role in how one approaches goals, tasks and challenges

[12]. People’s beliefs in their coping capabilities affect the amount of stress and depression they experience in

difficult situations, as well as their level of motivation. Perceived self-efficacy to exercise control over stressors

plays a central role in anxiety arousal and performance. The most effective way of creating a strong sense

of efficacy is through mastery experiences [18]. After people become convinced they have what it takes to

succeed, they persevere in the face of adversity and quickly rebound from setbacks [12].

The current gold standard of perceived self-efficacy assessment is via questionnaires [21]. Judgments of perceived self-efficacy are generally measured along three basic scales: magnitude, strength, and generality [19]. Self-efficacy magnitude measures the difficulty level an individual feels is required to perform a certain task (e.g., easy, moderate, and hard). Self-efficacy strength refers to the amount of conviction an individual has about performing successfully at diverse levels of difficulty. Generality of self-efficacy is the degree to which the expectation is generalized across situations. The work of Bandura further states that psychological procedures, in different forms, alter the level and strength of self-efficacy [12].

Based on the theory of SCT mentioned in Section 1.3, Benight et al. did seminal work in the domain of trauma recovery and defined psychometric measures of Coping Self-Efficacy for trauma (CSEt). The work of Benight et al. enabled the creation of a self-help website that targets mastery through the foundations of SCT [26]. CSE beliefs refer to an individual's beliefs about one's ability to cope with external stressors, challenges, and uncertainty in stressful situations [32]. Because such beliefs determine the power a person believes he or she has to affect situations, they strongly influence both the power a person actually has to face challenges competently and the choices a person is most likely to make. CSEt beliefs have been shown to be one of the strongest predictors of psychological and physical outcomes among trauma survivors. Effective self-regulation requires steadfast self-efficacy perceptions, continual goal engagement, and optimal arousal. To recover from a trauma, an individual must come to terms with the experience, including effective self-regulation of arousal in order to process the memories of the event. Engaging with a self-help web-intervention that requires confronting what has happened challenges the individual's capacity to manage the emotional and physical arousal that accompanies this process [56]. Several studies conducted via the “My Trauma Recovery” platform suggest that such self-help websites can improve the mental health of individuals following traumatic exposure [203, 27].

Bandura, in his triadic reciprocal causation model [15, 13], provided extensive evidence suggesting that management of physical and affective reactivity is an important source of self-efficacy perceptions. As noted earlier (in Section 2.3), the EASE study consisted of multiple modules, including relaxation and triggers. The relaxation module comprises several videos that teach breathing techniques, progressive muscle relaxation, and positive-imagery visualization to manage tension and anxiety related to trauma. The relaxation module provides users with skills for managing physiological arousal, which, according to SCT, should promote higher CSEt. Similarly, the triggers module serves as a brief exposure therapy where users think about their traumatic experiences and learn ways to manage upsetting experiences like strong negative reactions to certain memories, sounds, smells, and sights. Continued practice of the relaxation and triggers techniques reduces trauma symptoms. With changes in mastery through training and general experience, efficacy beliefs either strengthen or weaken [12].

In the following sections, we present measures for coping self-efficacy in the EASE dataset, review observations from related work, demonstrate the need for automated measurement of CSEt and formulate directions of research to develop automated measurement of self-efficacy for the application of trauma recovery.

6.2 EASE Coping Self-Efficacy Measures

Self-Efficacy, as explained in Section 1.3, is measured from infrequent questionnaires for different beliefs. In the EASE dataset, we collected three different CSE measures, namely general coping self-efficacy (CSEg), coping self-efficacy for trauma (CSEt), and Chesney's coping self-efficacy (CSEc). Of the three aforementioned CSE measures, CSEt is the most frequent measure; it is taken before and after each module for all 3 sessions, while CSEg and CSEc are both taken before the start of Session 1 and at the end of Session 3. For trauma recovery, CSEt emerged as a focal mediator [32]. The CSEt questionnaire used in EASE was derived from a 20-item list, among which 9 items demonstrated measurement equivalence across varied trauma samples, indicating the generality of post-traumatic CSE [32]. The questions in the CSEt psychometric measure are as follows: For each sentence described below, please rate how capable you are to deal with thoughts or feelings that occur (or may occur) as the result of the traumatic event(s) that currently bother(s) you the most. Please rate each situation as you currently believe. “How capable am I to...”:

1. Deal with my emotions (anger, sadness, depression, anxiety) since I experienced my trauma.

2. Get my life back to normal.

3. Not “lose it” emotionally.

4. Manage distressing dreams or images about the traumatic experience.

5. Not be critical of myself about what happened.

6. Be optimistic since the traumatic experience.

7. Be supportive to other people since the traumatic experience.

8. Control thoughts of the traumatic experience happening to me again.

9. Get help from others about what happened.

The absolute value of CSEt ranges from a minimum of 1 to a maximum of 7. The choices presented to subjects were: 1 = very incapable; 2 = incapable; 3 = somewhat incapable; 4 = neither incapable nor capable; 5 = somewhat capable; 6 = capable; and 7 = very capable. For designing our machine learning model, we use CSEt measures as the dependent attributes and independent features of engagement and arousal (refer Sections 6.4 and 6.5).
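For concreteness, the short sketch below (in Python) computes a CSEt score and the pre/post change used later as ∆CSEt, under the assumption that the score is the mean of the nine 1-7 item responses; the item values and function name are hypothetical and not taken from the EASE codebase.

import numpy as np

def cset_score(item_responses):
    """Average of the nine CSEt items, each rated 1 (very incapable) to 7 (very capable)."""
    responses = np.asarray(item_responses, dtype=float)
    assert responses.shape == (9,) and responses.min() >= 1 and responses.max() <= 7
    return responses.mean()

# Hypothetical pre- and post-module questionnaire responses for one subject.
pre_items  = [3, 4, 2, 3, 4, 3, 5, 2, 4]
post_items = [4, 5, 3, 4, 4, 4, 5, 3, 5]

pre_cset   = cset_score(pre_items)
post_cset  = cset_score(post_items)
delta_cset = post_cset - pre_cset   # positive values indicate improved coping self-efficacy
print(pre_cset, post_cset, delta_cset)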

6.3 Effectiveness of CSE in estimating PTSD symptom severity

Machine-learning models that estimate effects of mental health interventions currently rely on either user self-reports or measurements of user physiology. Recent advancements in the field of AI and mental health are targeting the personalization of digital interactions and emulating face-to-face psychotherapy sessions.

Digital biomarkers computed from behavioral patterns and physiology are also being explored to predict digital therapy outcomes. One such powerful psychological predictor is "Self-Efficacy", which is known to have a widespread impact on user cognitive states and behavior. Self-efficacy is an important measure of the effectiveness of treatment, with strong predictive power for many types of activities including health, stress, well-being, education, etc. Trauma psychologists have conducted multiple controlled tests to identify diverse sets of potential contributors to posttraumatic recovery. In these different multivariate analyses, Benight et al. [26] found that CSEt, explained in Section 1.3, emerged as a focal mediator of posttraumatic recovery. This finding of CSEt being critical to trauma recovery makes it generalizable to mental trauma treatments. In this section, we briefly discuss our findings on CSEt as an estimator of changes in trauma symptom severity, by evaluating a machine learning algorithm that estimates recovery outcomes.

Corresponding research in mental health revealed that Galvanic Skin Response (GSR), or skin conductance (SC), is an ideal way to monitor the autonomic nervous system [207, 217]. It is also considered a good biomarker of psychological stress [186] and/or cognitive load. Monitoring and modeling SC responses can thereby aid in non-intrusive measurement of trauma severity and reduce reliance on self-reports. Automatic prediction of trauma severity will allow web-based trauma recovery treatments to automatically monitor subjects' symptom severity as the treatment progresses and adapt interventions to their needs. Our aim is not to replace clinicians, but to estimate PTSD symptom-severity changes while patients are interacting with a trauma-recovery-focused website, for the potential adaptation of its content. Since the goal of adapting the trauma-recovery website is to quantify the change in PTSD symptom severity as a result of the treatment, we define our target as the difference between PTSD symptom severity scores computed from PCL-5 questionnaires (refer Appendix A.3), which will be referred to as ∆PCL scores. We focus our analysis on the first 2 sessions of the treatment, in which the triggers (TR) and relaxation (RX) modules were assigned, as shown in Figure 6.1.

[Figure 6.1 schematic: Session 1 and Session 2, each with two randomly assigned modules (Module 1, Module 2), CSE-T and PCL-5 questionnaires administered around the modules, and concurrent physiological measurement.]

Figure 6.1: Self-Reports collected in EASE: In EASE, subjects interacted with the website for multiple sessions while performing self-regulation exercises as shown in the figure above. Additional data such as skin-conductance, respiration, and ECG signals were also recorded using body sensors.

In our recent work [145], we explored the feasibility of replacing CSEt questionnaires with skin conductance responses, as well as the fusion of both modalities, in order to estimate changes in trauma severity.

We evaluated our models on the EASE multimodal dataset and created PTSD symptom severity change estimators at both the total and the cluster level. Due to the agreement among psychologists regarding the use of CSEt as a valid predictor of PTSD, PTSD symptom severity models trained with CSEt questionnaires are set as the baseline. We demonstrated that modeling the PTSD symptom severity change at the total level with self-reports can be statistically significantly improved by the combination of physiology and self-reports, or by skin conductance measurements alone (shown in Table 6.1). Our experiments showed that, when extracting skin conductance features from the triggers modules, our novel multimodal approach estimated PTSD symptom cluster severity changes significantly better than self-reports or skin conductance alone for the avoidance, negative alterations in cognition & mood, and alterations in arousal & reactivity symptoms, while it performed statistically similarly for the intrusion symptom (shown in Table 6.2).

Modules                    TR+RX                  TR                     RX
Input Data     CSEt       SC      CSEt+SC        SC      CSEt+SC        SC      CSEt+SC
Mean           294.9      135.1   133.0          179.9   138.7          116.2   149.4
Median         137.3      46.4    41.1           50.1    51.9           48.1    31.2
Std            466.1      231.7   221.7          294.4   206.9          199.4   260.6
BL p-value     --         <0.01   <0.01          0.05    <0.01          <0.01   <0.01
MOD p-value    --             0.79                   0.10                   0.10

Table 6.1: The table above shows the MSE comparison of trauma-severity models at a global level trained with monomodal and multimodal data. The BL p-value compares the baseline with the SC or CSEt+SC models, while the MOD p-value compares the SC with the CSEt+SC models.

The evaluation of trauma-severity models at the global level, shown in Table 6.1, supports the feasibility of replacing CSEt questionnaires with skin conductance, since the SC models significantly outperform the baseline. Similarly, the results also show that our novel multimodal model, CSEt+SC, significantly outperforms the baseline. Therefore, both SC and the fusion of CSEt & SC provide significant improvements over CSEt questionnaires alone for estimating global ∆PCL. However, the performances of the SC and CSEt+SC trauma-severity models are similar, with no statistically significant differences when predicting global ∆PCL scores. In other words, either CSEt+SC or SC alone can be used to model changes in mental trauma severity at a global level, irrespective of treatment modules.

Moreover, the evaluation of monomodal and multimodal trauma-severity models when predicting symptom cluster-wise ∆PCL scores, with results in Table 6.2, reveals two common trends among symptoms: the first involves modeling changes in the intrusion symptom, while the other involves modeling changes in severity for the avoidance, negative alterations in cognition & mood, and alterations in arousal & reactivity symptoms.

Symptom: Intrusion
Modules                    TR+RX                  TR                     RX
Input Data     CSEt       SC      CSEt+SC        SC      CSEt+SC        SC      CSEt+SC
Mean           35.4       17.8    19.6           30.1    21.4           19.3    21.3
Median         7.6        5.9     5.9            9.8     9.9            7.9     5.7
Std            72.4       33.2    33.1           44.7    31.0           30.3    36.9
BL p-value     --         0.07    0.09           0.61    0.12           0.10    0.13
MOD p-value    --             0.18                   0.11                   0.69

Symptom: Avoidance
Modules                    TR+RX                  TR                     RX
Input Data     CSEt       SC      CSEt+SC        SC      CSEt+SC        SC      CSEt+SC
Mean           15.1       6.7     5.5            16.1    4.6            10.3    8.5
Median         6.0        3.4     2.3            5.5     1.9            4.3     2.4
Std            25.4       9.9     9.1            32.5    7.8            15.1    15.1
BL p-value     --         <0.01   <0.01          0.80    <0.01          0.18    <0.01
MOD p-value    --             <0.01                  <0.01                  0.41

Symptom: Negative Alterations in Cognition & Mood
Modules                    TR+RX                  TR                     RX
Input Data     CSEt       SC      CSEt+SC        SC      CSEt+SC        SC      CSEt+SC
Mean           55.2       23.6    22.3           40.3    23.2           31.5    23.9
Median         12.8       11.6    7.3            14.9    9.7            14.5    10.2
Std            87.7       38.1    38.0           61.3    35.7           40.6    35.5
BL p-value     --         <0.01   <0.01          0.21    <0.01          0.03    <0.01
MOD p-value    --             0.30                   <0.01                  0.10

Symptom: Alterations in Arousal & Reactivity
Modules                    TR+RX                  TR                     RX
Input Data     CSEt       SC      CSEt+SC        SC      CSEt+SC        SC      CSEt+SC
Mean           30.6       13.5    12.1           23.3    12.5           14.7    13.6
Median         10.0       4.9     5.0            7.1     5.0            5.3     5.1
Std            45.3       25.9    20.6           34.9    21.1           27.6    23.2
BL p-value     --         <0.01   <0.01          0.25    <0.01          <0.01   <0.01
MOD p-value    --             0.24                   <0.01                  0.61

Table 6.2: The table above shows the MSE comparison of trauma-severity models at a symptom level trained with monomodal and multimodal data. The BL p-value compares the baseline with the SC or CSEt+SC models, while the MOD p-value compares the SC with the CSEt+SC models.

For estimating changes in severity of the intrusion symptom, we notice that, although there is a reduction in MSE from CSEt when using either CSEt+SC or SC alone, neither SC nor our multimodal model CSEt+SC statistically outperforms the baseline, i.e. the CSEt models. Furthermore, the results allow us to assert that the SC and multimodal models perform statistically similarly for the intrusion symptom. Hence, replacing CSEt for estimating changes in intrusion is a debatable choice between CSEt+SC and SC alone, and is thereby contingent on the treatment module being modeled. The estimation of changes in mental trauma severity for the avoidance, negative alterations in cognition & mood, and alterations in arousal & reactivity symptoms shows that both SC and CSEt+SC significantly outperform the baseline for the TR+RX treatment modules. Thus, for these 3 symptoms, we can state that it is feasible to replace CSEt questionnaires with skin conductance to predict changes in trauma severity.

The statistical analysis of p-values when comparing the performances of SC and our novel multimodal model CSEt+SC indicates that trauma-severity models trained with multimodal data significantly outperform the models trained with SC data only for the TR treatment modules. Thus, we can highlight that the multimodal approach significantly outperforms the use of either SC or CSEt individually when predicting trauma severity changes for the avoidance, negative alterations in cognition & mood, and alterations in arousal & reactivity symptoms. The results highlight the importance of using self-assessments, especially CSEt, in a multimodal machine-learning system to predict continuous changes in PTSD symptom severity. The results also emphasize the need for designing automated models to estimate psychological constructs like CSEt, which can aid in estimating outcomes directly. Moreover, in this analysis, SC comprised the only sensor-based measure. Other potential measures (e.g. heart rate, audio, facial expression, etc.) could also be used to complement automated CSEt models and therefore provide a stronger rationale for the choice of "Self-Efficacy" in adapting self-help websites and boost its performance.

6.4 Observations from related work

Prior to designing machine learning models, we review related research and activities undertaken by other EASE researchers in order to establish grounds for our machine learning framework. Shoji et al. [197, 28] demonstrated the relationship of CSEt with engagement and mood via psychological studies on subsets of the EASE data. They made the following key observations: a) changes in trauma symptom severity can be predicted from changes in CSEt; b) changes in CSEt can predict changes in Heart Rate Variability (HRV) and engagement. Research from multiple domains, ranging from student learning to affective computing, has suggested a relationship between HRV and engagement [167, 227]. These observations lead us to explore the relationship between changes in engagement and arousal and CSEt.

As mentioned in Section 1.3, the web-based intervention is based on an SCT model for PTSD recovery.

Benight et al. [28] tested the assumption about the effectiveness of the trauma recovery website by analyzing the aggravation or reduction in PTSD symptoms reported by the subjects before and after the experiment. They noticed that changes in self-belief of coping capabilities, i.e. changes in CSEt, lead to changes in trauma symptoms. They hypothesized that changes in self-efficacy before and after each module during the experimental phase of relaxation and triggers are correlated with changes in overall PTSD symptoms. They defined the null hypothesis as Ho: there is no statistically significant relationship between changes in PCL (a measure of PTSD symptom severity) and changes in CSEt after each session. They then measured the correlation between PCL and CSEt to test for a significant relationship. An early subset of the EASE data was used, i.e. 55 trauma survivors who had completed two training sessions and responded to the PCL questionnaire at the end of Session3. The correlation results are obtained by analyzing two values: p, the statistical significance, and β, which quantitatively measures the relation between the two variables in the null hypothesis.

Positive change, ∆CSEt, during the second session working through the triggers module was a significant predictor of overall reduction in the PCL (β = −.376, p = .02). This effect was found only when the participant completed the relaxation module first and triggers second during Session1 (refer Figure 6.1) and triggers first and relaxation second in Session2 [28]. It suggests that the web-intervention showed significant improvement in subjects that exercised relaxation prior to working on triggers. Hence, there is a possibility that disengaging from the mental state of trauma through a relaxation regime could be effective in increasing the CSE level of an individual before educating them about triggers, instead of starting with triggers in Session1.

These observations raised the question of the effect of ordering on overall CSE levels, i.e. relaxation-triggers versus triggers-relaxation for Session1. There is a possibility that the negative or null effect of the web-intervention, as observed by ∆CSEt upon starting with triggers in Session1, could be attributed to the discomfort of learning about the pressure points of trauma and external stressors, which might elicit arousal prior to engaging with the treatment. Hence, we proposed another hypothesis Ho: changes in distressed mood, measured using POMS (the Profile of Mood States on a 5-point scale) before and after each module, are not related to ∆CSEt. Results showed that when participants were assigned to the triggers module immediately in Session1, the increase in distressed mood (∆POMS) was a significant predictor of the change in CSE both during that session (β = −.522, p = .00) and from Session1 to Session3 (β = −.295, p = .04), i.e. before and after the experiment. Moreover, changes in distressed mood during Session2 were also found to predict changes in CSEt, where greater distress during triggers was strongly correlated with ∆CSEt regardless of whether the participant faced triggers first (β = −.323, p = .054) or second (β = −.274, p = .046) [28]. This implies that changes in participants' perceived distress are also indicative of changes in self-efficacy, and the triggers module caused more distress irrespective of the order of the modules presented to the subjects. Hence, it leads us to the conclusion that it may be possible to detect changes in mood from face video analysis and create a learning model that can predict ∆CSEt.
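As an illustration of this kind of analysis, the sketch below regresses one change score on another using scipy.stats.linregress; the arrays are made-up placeholders, and note that the β values reported above are standardized regression coefficients from the original study, whereas linregress returns a raw slope, so a z-scored variant is shown as well.

import numpy as np
from scipy import stats

# Hypothetical per-subject change scores; in the actual analysis these would
# come from the EASE questionnaires (post minus pre values).
delta_cset = np.array([0.8, -0.3, 1.2, 0.1, -0.6, 0.9, 0.4, -0.1, 1.5, 0.2])
delta_pcl  = np.array([-6.0, 2.0, -9.0, -1.0, 4.0, -7.0, -3.0, 1.0, -11.0, -2.0])

# Simple linear regression: does a change in coping self-efficacy predict
# a change in PTSD symptom severity?
result = stats.linregress(delta_cset, delta_pcl)
print(f"slope={result.slope:.3f}, r={result.rvalue:.3f}, p={result.pvalue:.4f}")

# A standardized coefficient (comparable in spirit to the reported beta)
# can be obtained by z-scoring both variables first.
z = lambda v: (v - v.mean()) / v.std(ddof=1)
beta = stats.linregress(z(delta_cset), z(delta_pcl)).slope
print(f"standardized beta={beta:.3f}")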

6.4.1 Engagement and Self-Efficacy

Perceived self-efficacy contributes to cognitive development and functioning not just in student learning [139, 17] but also in trauma recovery [39]. We hypothesize that the changes in self-efficacy after going through the trauma recovery modules of relaxation and triggers depend on the engagement levels of the participants.

Subjects who experience debilitating symptoms post-trauma and are reluctant to seek trauma treatment(s) are generally disengaged from the recovery process and drop out before the completion of the experiment, i.e. before Session3.

As shown in Figure 6.2, a repeated-measures ANOVA indicated support for the hypothesis that CSE would be a significant predictor of engagement with a web intervention for trauma for the Triggers module (F(1,23) = 5.24, p = .03). However, no such effect was evident for the Relaxation module (F(1,17) = .11, p = .75) [29]. Thus, greater trauma-related CSE at baseline was related to increased engagement for the Triggers module, but not for the Relaxation module.

Figure 6.2: CSE as a predictor of self-reported engagement: The figure above shows the self-reported engagement at 3 time points during the 15-minute Triggers session and CSE at baseline as an independent variable.

The aforementioned analysis on a subset of EASE data [29] confirms that changes in CSEt are predictive of user engagement during the triggers module. However, the study did not test whether changes in engagement during the web-intervention could predict changes in CSEt. Moreover, the subset of data used in this analysis is relatively small, i.e. 17 subjects for Relaxation and 23 for Triggers, compared to the one used for the contextual engagement models in Chapter 3. Since the subjects are free to work on any section of the module and continue with a module as long as they like, the responses to engagement self-reports vary in length. In order to deal with the variable feature size and fixed data size (ground truth labels), one could consider exploring different weights for each response based on context and participant involvement. Future work in building self-efficacy predictors can also take into account all available data from self-reported engagement as well as behavioral annotations from psychologists.

6.4.2 Heart Rate Variability and Self-Efficacy

We empirically examine the effect of arousal on changes in self-efficacy. Since we have three measures of arousal, namely ECG, GSR and RR (refer Section 2.5.2), we first consider the effect of HRV (Heart Rate Variability) on changes in CSEt. HRV is the most rapid predictor of changes in arousal levels and can be derived from face videos [206, 112, 6].

Shoji et al. [65] split the EASE subjects into two sets: Group1 - people who took relaxation first in Session1 and triggers second (reversed in Session2), and Group2 - triggers first and relaxation second in Session1 (reversed in Session2), as shown in Figure 6.1. The results for Group2 are shown on the left and Group1 on the right in Figure 6.3. This experiment was conducted by trained clinical psychologists on 36 subjects [65]. The psychologists removed motion artifacts in the ECG signal manually and then computed HRV using the Kubios software [210].

The results display the order effect and show that subjects who took relaxation and then triggers in Session1 had more physiological control during Session2 and displayed greater changes in CSE.

These findings of changes in CSEt with respect to changes in HRV were observed as a function of module order. The trauma-recovery website had 6 modules, 2 of which (RX and TR) were considered for this analysis. It is, however, impractical to test all such module-order combinations and build a self-efficacy predictor with the data we currently have. The analysis of changes in CSEt as a predictor of ∆HRV suggests that HRV changes for both Relaxation and Triggers irrespective of module order. Another related study on GSR and HRV suggests that heart-rate and skin-conductance responses were less frequently consistent as trauma symptom severity increased [198]. For creating a learning-based self-efficacy model, in our recent work [145] we used the baseline physiology collected during the 1-minute quiet time (refer Section 2.3 for details) and normalized the physiology of the subjects during Triggers and Relaxation. This normalized arousal signal was then used as a feature input along with CSE, depending on the context of RX or TR, to estimate symptom severity.

Figure 6.3: Changes in CSE with HRV: The study evaluated the effect of module order on changes in heart rate variability (∆HRV) moderated by changes in coping self-efficacy (∆CSEt), using a mixed ANCOVA with ∆HRV as a dependent variable, module order as a between-subjects variable, and ∆CSEt as a covariate [65]. Participants with low ∆CSEt and mean ∆CSEt had significantly different ∆HRV between Sessions 1 and 2 when they completed the triggers module first in Session1. No such effect was found when the relaxation module was completed first in Session1. This finding suggests that working on the triggers module first in Session1 may result in greater parasympathetic dominance in Session2, particularly for those with low ∆CSEt and mean ∆CSEt.

Multiple techniques have been employed in the past to create multi-modal learning models in the affective computing domain. Decision-level, feature-level and hybrid fusion techniques have been explored to quantify the dynamics underlying nonverbal behavioral cues, such as vocal changes, facial expressions, and body gestures, and to automatically analyze human behavior during real-world interactions [176, 7, 4]. As noted earlier, various features of arousal can be derived from the physiological signals of ECG, GSR and Respiration. We can use HRV, SDNN (i.e. the standard deviation of RR intervals), breaths per minute, and Galvanic Skin Response (GSR) features in the form of phasic/tonic components, along with estimated engagement, to predict post-CSE. Both SCT and the observations from our preliminary experiments indicate that engagement and HRV can be good predictors for changes in CSEt.
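As a rough sketch of how such arousal features could be derived, the snippet below computes SDNN from R-peak times and normalizes a module-time signal against the 1-minute quiet-time baseline; the function names and the z-score normalization are our own illustrative choices, not the exact procedure used in [145].

import numpy as np

def sdnn_ms(rpeak_times_s):
    """SDNN: standard deviation of RR intervals (in milliseconds),
    computed from R-peak times given in seconds."""
    rr = np.diff(np.asarray(rpeak_times_s)) * 1000.0
    return rr.std(ddof=1)

def baseline_normalize(module_signal, quiet_signal):
    """Z-score a per-second arousal signal recorded during a module against the
    subject's 1-minute quiet-time baseline (an assumption; other normalizations,
    e.g. subtracting the baseline mean only, are possible)."""
    mu, sigma = quiet_signal.mean(), quiet_signal.std(ddof=1) + 1e-8
    return (module_signal - mu) / sigma

# Toy example: R-peaks roughly every 0.8 s and a synthetic per-second GSR trace.
rpeaks = np.cumsum(0.8 + 0.05 * np.random.randn(120))
print("SDNN (ms):", sdnn_ms(rpeaks))

quiet_gsr  = 2.0 + 0.1 * np.random.randn(60)    # 60 s of baseline, 1 Hz
module_gsr = 2.5 + 0.3 * np.random.randn(400)   # one module, 1 Hz
print(baseline_normalize(module_gsr, quiet_gsr)[:5])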

6.5 Defining CSE as a Machine Learning problem

Observations from related research in coping self-efficacy by other EASE researchers suggest that those with lower CSE may have lower physiological control and reported disengagement with the critical aspects of the EASE web-intervention system. Further, they may need additional support in the form of either a coach or a therapist. Owing to the uncertain nature of trauma recovery, which includes frequent mood swings, there is a need to look for self-efficacy values over shorter time periods than self-reports allow. Neither the absolute value of self-efficacy nor changes in self-efficacy are directly observable from visual inferences. While the absolute value of self-efficacy is hard to measure, there is a possibility of estimating changes in self-efficacy perceptions from video and physiological data. We conjecture that changes in the CSE of an individual might be derived from their behavior towards the web-intervention while they work on the Relaxation and Triggers modules. Hence, we seek to monitor, learn and track facial expressions, engagement, mood-aware engagement and arousal during the course of interaction by trauma subjects during these tasks.

The objective of this work is to develop machine learning algorithms that use both self-reported data (voluntary response) and physiology (involuntary response) in a non-intrusive setting, by calibrating physiological arousal and visual engagement from face videos leading towards a hidden state of self-efficacy.

An important observation from the related work by other EASE researchers is the different slopes of the low, mean and high ∆CSEt lines for both HRV and engagement. The three different slopes indicating changes in CSEt suggest that the same modules have different effects on trauma subjects. For example, subjects with a high initial CSEt value reported higher engagement for the Triggers module. Similarly, subjects with lower initial CSEt had less physiological control in Group2 (Triggers followed by Relaxation) than those with higher initial CSEt. Moreover, the effectiveness of using CSEt in predicting changes in trauma symptom severity was evident for the triggers module only.

6.6 CSE Prediction Pipeline

Building on work done so far, we now present a framework for self-efficacy prediction based on video feeds, physiological signals, learned contextual engagement models and subject self-reports. The overall goal is to develop tools to automate self-efficacy measurement through robust use of computer vision, signal processing and deep learning techniques.

Figure 6.4: CSE prediction system: The figure above shows the prediction system for pre- and post-CSEt. The subset of the EASE dataset used in this work comprised 54 subjects across 65 module videos. This work analyzes a multi-output CSEt estimation pipeline based on user engagement and physiological response to the Triggers module. The underlying goal of this pipeline is to determine whether engagement and arousal data are sufficient for estimating CSEt and to understand the accuracy tradeoff of using only one of those modalities.

The key questions that will drive the creation of the self-efficacy predictors are as follows:

1. How can we build models to predict the absolute value of CSEt and/or its change, ∆CSEt?

2. Can self-efficacy be predicted from videos or physiological data or both?

3. How do we extend the learning-based engagement models developed in Chapters 3, 4 and 5 to predict CSEt?

In Section 6.4, changes in arousal and engagement were shown to vary with changes in CSEt; hence, it may be possible to use engagement and arousal as input predictors to build a machine learning model for estimating CSEt. Modeling a machine to predict the absolute value of self-efficacy is neither essential nor desirable. Changes in self-efficacy, on the other hand, are interesting to model and can aid the creation of adaptive websites. Measuring changes in self-efficacy gives insight into changes in user beliefs during an intervention. We hypothesize that changes in self-efficacy can be predicted by monitoring changes in visual engagement and physiological arousal during the intervention. Continuous prediction of CSE changes through machine learning can reduce the cognitive load on the trauma subjects.

We also noticed that changes in arousal and engagement are different for subjects with low, mean and high ∆CSEt. Subjects with high initial CSEt might exhibit a ceiling effect and also more physiological control. We plan to model a self-efficacy predictor that can use the pre-CSEt value, engagement reports and changes in arousal to estimate post-CSEt. In doing so, we incorporate the effects of the initial absolute CSEt value on user response. It is also possible to create a multi-target learning model for CSEt that learns and estimates pre-CSEt and post-CSEt simultaneously. The advantage of a multi-target model is the reduction in effects caused by the correlation of pre-CSEt and post-CSEt. It will also ensure that we learn both the magnitude of and the change in CSEt.

6.7 EASE subset and Data distribution

In this section, we briefly overview the subset of the EASE dataset used in the computational model of CSEt. Although the subset used in this work is the same set of subjects used in Chapter 5, the video segments utilized in the experiments presented in Section 6.8 are larger. Section 5.1 used facial video segments of 30 seconds prior to engagement self-reports; in this work, however, we consider video and physiological data from the entire module video. Figure 6.5 shows the details of the scale and distribution of video data for the CSEt prediction pipeline. Figure 6.6 shows the distribution of observed ∆CSEt. The CSEt change distribution in Figure 6.6 is for 54 subjects across 65 Trigger modules (it includes data from both Session1 and Session2). The ∆CSEt of 0, i.e. the "no change" in coping self-efficacy category, is small among the samples, i.e. 14 out of 65 samples. There is a nearly balanced distribution of negative-change CSEt samples (23 out of 65) and positive-change CSEt samples (28 out of 65).

Figure 6.5: EASE subset details (Triggers):

    Number of Subjects               54
    Number of Videos                 65
    Total Video Length               27106 seconds
    Range of Videos                  153 to 909 seconds
    Average Video Length             417.015 seconds
    Standard Deviation of Videos     122.688 seconds
    Self-Reports                     pre-CSEt, post-CSEt
    Max Observed ∆CSEt               1.550
    Min Observed ∆CSEt               -1.8899
    Engagement Scale                 -100 to +100
    Physiological Signals            ECG, GSR, Respiration

Figure 6.6: Data distribution of CSEt: histogram of observed ∆CSEt across the 65 Trigger-module samples (x-axis: ∆CSE from -2.0 to 2.0).

We use behavioral annotations of engagement from 6120 expert annotators, as in the experiments of Section 5.1. The engagement annotations in the input features of the CSEt prediction pipeline are in the form of the FAVE, similar to the model developed in Section 5.5. We choose the FAVE model because it is the best-performing model developed in Section 5.5. As mentioned earlier in Table 2.1, the physiological signals were recorded at 256 samples per second and the facial videos at 30 frames per second. The expert annotators' engagement ratings were 1 per second on a scale of -100 to +100.
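Because the modalities arrive at different rates, they have to be brought to a common 1 Hz timeline before they can be stacked as model inputs; the sketch below shows one straightforward way to do that by averaging each signal over one-second windows (our own simplification, not necessarily the exact resampling used in the pipeline).

import numpy as np

def to_one_hz(signal, samples_per_second):
    """Average a uniformly sampled signal over non-overlapping 1-second windows."""
    n_seconds = len(signal) // samples_per_second
    trimmed = np.asarray(signal[: n_seconds * samples_per_second], dtype=float)
    return trimmed.reshape(n_seconds, samples_per_second).mean(axis=1)

# Toy module recording of 420 seconds.
gsr_256hz   = np.random.randn(420 * 256)              # physiological channel at 256 Hz
video_feats = np.random.randn(420 * 30)                # a per-frame facial feature at 30 fps
ratings_1hz = np.random.uniform(-100, 100, size=420)   # annotator ratings, already 1 Hz

gsr_1hz   = to_one_hz(gsr_256hz, 256)
video_1hz = to_one_hz(video_feats, 30)

# Truncate all channels to the shortest length and stack as a (time, features) sequence.
T = min(len(gsr_1hz), len(video_1hz), len(ratings_1hz))
sequence = np.stack([gsr_1hz[:T], video_1hz[:T], ratings_1hz[:T]], axis=1)
print(sequence.shape)  # (420, 3)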

6.8 Experiments

In this section, we describe the self-efficacy estimation models trained with behavioral annotations of engagement and/or physiological response. As described in earlier sections of this chapter, we represent the user response to the coping self-efficacy trauma assessment prior to the treatment module as pre-CSEt and after the module as post-CSEt. The average of rater annotations xHAavg and the variance xHAvar, obtained from expert human ratings, are a series of inputs to the self-efficacy estimation model. Inputs from physiological arousal are obtained from three signals: the electrocardiograph xecg, galvanic skin response xgsr and respiration xresp. We model self-efficacy estimation as a sequence learning problem and develop a recurrent neural network based learning framework. The target outputs of the model are ypre-cse and ypost-cse, as shown in Figure 6.4. We conduct a number of experiments (described below) to gain deeper insight into the computational construct of self-efficacy and the relationship of engagement and physiological arousal to ∆CSEt.

Engagement as a CSE estimator (E): We extend the FAVE model by incorporating broader sequential data from the triggers module. Since the longest video segment in the EASE subset used for the experiments in this work is 909 seconds (refer Figure 6.5), the RNN model created here comprises 909 RNN units. We learn a multi-output, dual-feature-input RNN model to predict self-reported coping self-efficacy assessments. We represent this model in Section 6.9 as E.

Arousal as a CSE estimator (A): The extraction of features from the physiological signals is described below:

xecg – From the raw electrocardiograph signal collected at 256 Hz, we extract the instantaneous heart rate in beats per minute using the biosppy package [42]. This extraction of instantaneous heart rate ensures that the sequence length of the extracted features is the same as in the engagement model described above. Our prior work [198] utilizes the same extraction technique for xecg. The biosppy package extracts the R-peak location indices from the filtered ECG signal and averages the heart rate over one-second sliding windows, as shown in Figure 6.7.

xgsr – The skin conductance response is broken down into its phasic and tonic components [145] using a specialized python package for processing electrodermal activity. The preprocessing functions in biosppy involve low-pass filtering, down-sampling, cutting, smoothing, and artifact correction. We employ the continuous decomposition of the skin conductance response signal for the feature extraction [123]. The extracted response signal is then averaged over 256 samples to obtain one sample per second. The obtained signal retains the original signal shape and structure, as shown in Figure 6.7.

xresp – The respiration signal is filtered and the indices of respiration zero crossings are computed to obtain the instantaneous respiration rate using the biosppy package [42]. The instantaneous respiration signal generates a breaths-per-second value that is included as a sequential input to the self-efficacy estimation algorithm.
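A minimal sketch of this kind of per-second feature extraction with biosppy is shown below; the synthetic input traces are only stand-ins for the recorded signals, and the GSR step here is a simple per-second average rather than the phasic/tonic decomposition described above.

import numpy as np
from biosppy.signals import ecg, resp

FS = 256          # sampling rate of the raw physiological signals (Hz)
DURATION_S = 420  # length of one toy module recording in seconds

def per_second(values, timestamps, duration_s):
    """Average event-based values (e.g. instantaneous heart rate) into 1-second bins."""
    out = np.zeros(duration_s)
    for t in range(duration_s):
        mask = (timestamps >= t) & (timestamps < t + 1)
        if mask.any():
            out[t] = values[mask].mean()
        elif t > 0:
            out[t] = out[t - 1]  # carry the last value forward for empty bins
    return out

# Synthetic stand-ins: a spiky ~72 bpm ECG-like trace, a 0.25 Hz breathing-like
# sine, and a noisy conductance trace. Replace with the recorded arrays.
t = np.arange(DURATION_S * FS) / FS
raw_ecg  = np.sin(2 * np.pi * 0.6 * t) ** 64 + 0.01 * np.random.randn(t.size)
raw_resp = np.sin(2 * np.pi * 0.25 * t)
raw_gsr  = 2.0 + 0.2 * np.abs(np.random.randn(t.size))

ecg_out  = ecg.ecg(signal=raw_ecg, sampling_rate=FS, show=False)
x_ecg    = per_second(ecg_out["heart_rate"], ecg_out["heart_rate_ts"], DURATION_S)

resp_out = resp.resp(signal=raw_resp, sampling_rate=FS, show=False)
x_resp   = per_second(resp_out["resp_rate"], resp_out["resp_rate_ts"], DURATION_S)

# Simple per-second averaging of GSR; the dissertation instead applies a continuous
# phasic/tonic decomposition from a dedicated EDA package before averaging.
x_gsr = raw_gsr.reshape(DURATION_S, FS).mean(axis=1)

print(x_ecg.shape, x_gsr.shape, x_resp.shape)  # each (420,), one value per second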

Figure 6.7: Preprocessing of signals: The figure above shows the raw signals for Electrocardiogram, Respiration and Galvanic Skin Response for a single subject in the EASE dataset. The filtered signals are extracted by pre-processing the raw signals. The filtered signals are then averaged over a 256-sample window to obtain the averaged signal. The y-axis in all plots is the amplitude and the x-axis is time in seconds. Note that the ECG signal has been fine-scaled to show the signal character. The averaged signals for Respiration and GSR retain the shape and structure of the raw signal.

Engagement & Arousal as CSE estimators (E+A): We concatenate the engagement and arousal inputs into a combined model represented by E+A. For a fair comparison of the models E, A and E+A, the number of RNN units in each model and the associated hyper-parameters are kept constant. Any difference in accuracy and observed mean-squared error is therefore due to the variation in input modalities. The feature sequences from both Engagement and Arousal are post-padded with zeros to constitute a sequence length of 909 time steps for each of the 65 samples. Since the E+A model has 5 input features, xHAavg, xHAvar, xecg, xgsr and xresp, the input feature dimensions become 5×909 for each sample.
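A compact sketch of such a multi-output sequence model is given below using Keras; the GRU size and the use of a masking layer are illustrative assumptions (the batch size of 10 and learning rate of 0.1 follow the values reported in Section 6.9), and the random arrays merely stand in for the real padded features and CSEt targets.

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN, N_FEATURES = 909, 5   # longest module (seconds); xHAavg, xHAvar, xecg, xgsr, xresp

# Per-sample sequences of shape (length_i, 5); lengths vary between 153 and 909 seconds.
sequences = [np.random.randn(np.random.randint(153, 910), N_FEATURES) for _ in range(65)]
X = pad_sequences(sequences, maxlen=MAX_LEN, dtype="float32", padding="post")  # (65, 909, 5)
y_pre  = np.random.uniform(1, 7, size=(65, 1))   # pre-CSEt targets (placeholders)
y_post = np.random.uniform(1, 7, size=(65, 1))   # post-CSEt targets (placeholders)

# Shared recurrent encoder with two regression heads (pre-CSEt and post-CSEt).
inputs   = keras.Input(shape=(MAX_LEN, N_FEATURES))
masked   = layers.Masking(mask_value=0.0)(inputs)   # ignore the zero padding
encoded  = layers.GRU(32)(masked)
out_pre  = layers.Dense(1, name="pre_cse")(encoded)
out_post = layers.Dense(1, name="post_cse")(encoded)

model = keras.Model(inputs, [out_pre, out_post])
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.1), loss="mse")
model.fit(X, {"pre_cse": y_pre, "post_cse": y_post}, batch_size=10, epochs=2, verbose=0)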

6.9 Results & Discussion

In this section, we analyze all three models, namely E, A and E+A, mentioned in Section 6.8 for ypre-cse and ypost-cse estimation. We optimized the RNN over different hyper-parameters, such as learning rate and batch size, for model tuning. Figure 6.8 shows the convergence plot with the training and testing L2-norm losses for the E, A and E+A models over 1000 epochs. We found that the model loss decreases for a batch size of 10 and a learning rate of 0.1.

Figure 6.8: Convergence plot for CSEt estimation: The losses from all three models, Engagement, Arousal and Engagement+Arousal, are shown in the plot. The loss from all models is high and non-ideal. Notice that the lowest training loss for the Engagement and Arousal models is approximately the same. The Engagement+Arousal model had the largest number of parameters and its training loss did not decrease after approximately 650 epochs.

CSE Estimation Model     Average MSE     Standard Deviation
Engagement               23.21915817     5.079739737
Arousal                  23.69791718     8.338533503
Engagement+Arousal       22.41273623     6.304645855

Table 6.3: The table above shows the results from 10-fold cross-validation for Engagement, Arousal and Engagement+Arousal models for CSEt estimation. The Engagement+Arousal model has the lowest MSE amongst the three models. None of the model comparisons are statistically significant.

Since our hypothesis was to compare machine-generated engagement features to user physiology, we keep the model parameters constant for all three models. We select 600 epochs for all models, where the L2-norm test loss was minimum and stable. The results from 10-fold cross-validation of the self-efficacy estimation model are shown in Table 6.3. No statistical evidence was found for a preference among the input features for any of the models in these preliminary results. However, our visual model of engagement was comparable to the arousal model, which comprised the three physiological signals of ECG, Respiration and GSR. This finding supports our hypothesis that visual estimates of engagement can aid in non-intrusive self-efficacy estimation.
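The 10-fold evaluation can be reproduced in spirit with scikit-learn's KFold, as in the short sketch below; `build_model` stands in for a factory that constructs a fresh Keras-style regressor for a single target (e.g. post-CSEt), and the returned mean and standard deviation of per-fold MSE mirror the quantities reported in Table 6.3.

import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

def cross_validated_mse(X, y, build_model, n_splits=10, epochs=600, batch_size=10):
    """Mean and standard deviation of per-fold MSE over a k-fold split.

    X is the padded (N, 909, features) array and y the single regression target."""
    fold_mse = []
    for train_idx, test_idx in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X):
        model = build_model()                       # fresh model per fold
        model.fit(X[train_idx], y[train_idx], epochs=epochs,
                  batch_size=batch_size, verbose=0)
        preds = np.ravel(model.predict(X[test_idx], verbose=0))
        fold_mse.append(mean_squared_error(np.ravel(y[test_idx]), preds))
    return float(np.mean(fold_mse)), float(np.std(fold_mse))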

The non-significant results and the high losses indicate possible overfitting of the deep learning model, given the limited number of samples in our dataset. Simpler models, such as the Support Vector Regression employed in our recent work [145], could be explored for further analysis of the coping self-efficacy estimation models. Further evaluation could also include mood-aware models of visual engagement and their comparison to the engagement models for self-efficacy estimation.
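For reference, a simpler baseline along these lines could summarize each variable-length sequence with a few statistics and fit scikit-learn's SVR, as sketched below; the summary features and kernel settings are our own choices for illustration, not the configuration used in [145].

import numpy as np
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def summarize(sequence):
    """Collapse a (time, features) sequence into simple per-feature statistics."""
    return np.concatenate([sequence.mean(axis=0), sequence.std(axis=0),
                           sequence.min(axis=0), sequence.max(axis=0)])

# X_seq: list of (length_i, 5) arrays as before; y: post-CSEt values (placeholders here).
X_seq = [np.random.randn(np.random.randint(153, 910), 5) for _ in range(65)]
y     = np.random.uniform(1, 7, size=65)

X = np.stack([summarize(s) for s in X_seq])
svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0, epsilon=0.1))
svr.fit(X, y)
print(svr.predict(X[:3]))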

6.10 Broader Impact

Our research on self-efficacy predictors is targeted at Post Traumatic Stress Disorder (PTSD) recovery and a specific type of self-efficacy, i.e. Coping Self-Efficacy for Trauma (CSEt). However, objective measurement of self-efficacy is applicable to a broader list of tasks in various fields, such as academic performance of students in education, weight-loss interventions or diet control in health, recovery treatments from cancer, behavioral parent training, fitness control, developing socially assistive robots, employment training (e.g. public speaking, preparation for job interviews), etc. Similar vision-based models could be developed for other psychological measures of self-efficacy and integrated in existing self-help/self-regulated applications. Moreover, automated models that estimate coping self-efficacy magnitude and changes can aid in timely detection of self-regulation shifts and propose adequate coping mechanisms. The Self Regulation Shift Theory (SRST), which is an extension of SCT, demonstrated that CSE perceptions serve as an important bifurcation factor as the recovery process unfolds, bringing the system back to a sense of personal wellness or equilibrium [31]. Thus, continuous monitoring of CSE can advance the understanding of non-linear shifts in human behavior. Another possibility is to use more contextual information, such as who the user is, what they are doing, what their trauma levels are, their daily habits, demographic information, etc.; the more we know, the better we can predict the affective state of the user. The self-efficacy estimation model proposed and developed in this thesis can also be scaled and deployed on mobile devices. Recent pre-trained models built for affect detection that used variants of AlexNet, VGGnet, MobileNet, etc. [108] can also be explored and possibly fine-tuned to include vision-based self-efficacy estimation.

Chapter 7

Conclusion and Insights

Reaching trauma populations to provide evidence-based, proactive and cost-effective intervention continues to be a challenging problem and a major healthcare need of the day. Towards this goal, we proposed a scalable and person-centered approach to monitor changes in a user's mental state by employing a series of signal processing, facial analysis, machine learning and social cognitive theory based techniques on sensory and video data collected for a trauma recovery application.

To facilitate the development of scalable techniques, we collected the large-scale Engagement, Arousal and Self-Efficacy (EASE) dataset from 110 subjects, with 8M+ video frames, large amounts of sensory data, behavioral annotations and subject self-reports. This is a first-of-its-kind dataset that can be used to study a wide range of affective measures such as engagement, arousal, mood, self-efficacy and others. In Chapter 3, we showed that engagement in a given situation can be predicted through subtle facial movements and fine-grained actions such as changes in facial expressions. We formulated engagement prediction from video for trauma recovery as a sequential learning problem and proposed a prediction model that incorporates context and temporal variations of input features. We also demonstrated that engagement was context-sensitive. Further, we showed that a subject's current mood might have an effect on engagement levels with the task at hand and empirically verified it. We presented mood prediction as a learning problem and used the predicted mood scores to aid our contextual engagement pipeline. In Chapter 5, we introduced a novel approach that exploits the inherent confusion and disagreement in raters' annotations to build a scalable engagement estimation model that learns to appropriately weigh subjective behavioral cues. We showed that actively modeling the uncertainty, either explicitly from expert raters or from automated estimation with facial action units, significantly improves prediction over prediction from just the average engagement ratings.

Further, we laid the foundation for developing machine learning models for self-efficacy prediction by evaluating the effect of CSEt on symptom severity and exploring the relationship of engagement and arousal to self-efficacy.

Overall, the implemented system used ECG, galvanic skin conductance and respiration rate to measure physiological arousal. Although direct physiological measurements offer multiple advantages, for widespread and non-intrusive adoption we proposed methods using vision-based signals. These input modalities were used to estimate psychometric measures such as arousal, engagement and mood in order to predict changes in self-efficacy.

Finally, we demonstrated that changes in self-efficacy are directly related to changes in PTSD symptom severity, a relationship that trauma recovery applications can use directly and that can aid in building an autonomously adapting, task-dependent website for trauma recovery. The overall system architecture is shown in Figure 7.1.

Self-support websites for trauma recovery need to adapt to the patient's needs, empowering individuals by combining sensing and machine learning to improve treatment. Prior work in adapting such websites focused primarily on enhancing user engagement by adjusting static web elements. The framework for machine learning based psychometric estimation and affect recognition proposed in this work provides key elements for developing person-centric, adaptive and task-specific websites for trauma recovery. As immediate future work, we would like to integrate and deploy these models in the broader trauma recovery application suite, to test the models in the "wild" and evaluate them with regard to improvements in dropout rates, in order to promote engagement and positive adaptation. Finally, this work provides initial foundations to operationalize self-regulation within sessions of the web-intervention system for predicting successful outcomes.

Self-efficacy is an important predictor for a wide range of skills, but to date changes in it have been non-observable. In this work we showed that various psychometric measures can be estimated from raw signals, demonstrated their relationships, and showed that changes in self-efficacy can be inferred from measurements of facial expression and physiological data. This work provides a first step in understanding person-specific web-based interventions for traumatic stress. The ubiquity of mobile phones with wide-ranging health sensors and camera applications provides an appropriate platform for widespread deployment and adoption of such technology. Further, the methods employed in this work can be applied to a wide range of areas beyond trauma recovery, such as gaming, education, career development, self-learning and others.

Figure 7.1: System Diagram for Trauma Symptom Severity Prediction: This effort used real-time multi-sensory data to develop personalized adaptive treatment for mental trauma recovery. During the course of this work we made several important observations. Firstly, engagement is highly contextual and depends strongly on the task at hand. We found that it was possible to model long-term psychometric measures such as mood, which had a direct effect on engagement prediction. We further found that modeling uncertainty, either explicitly from expert raters or from automated estimation with AUs, significantly improves prediction over prediction from just the average engagement ratings. Finally, we developed a framework for estimating subject self-efficacy from videos. During treatment, the system will combine sensory measures and machine-learned models with questionnaire data to select, then adapt, a learned model and use that to affect the treatment. This research makes first steps towards an adaptive person-centered system that provides continuous estimations of trauma-related measures during web-intervention.

Appendices

Appendix A

Psychometric Tests & Questionnaires

A.1 Profile of Mood States-Short Form: EASE version

Please describe how you feel right now by indicating the most appropriate choice after each of the sentences below:

Each item is rated on the following scale: Not at all, Little Bit, Moderately, Quite a bit, Extremely. The items are:

Tense, Angry, Worn out, Unhappy, Lively, Confused, Peeved, Sad, Active, On Edge, Grouchy, Blue, Energetic, Hopeless, Uneasy, Restless, Unable to Concentrate, Fatigued, Annoyed, Discouraged, Resentful, Nervous, Miserable, Cheerful.

A.2 Coping Self-Efficacy for Trauma

For each sentence described below, please rate how capable you are to deal with thoughts or feelings that occur (or may occur) as the result of the traumatic event(s) that currently bother(s) you the most. Please rate each situation as you currently believe, according to the following scale: 1 = very incapable; 2 = incapable; 3 = somewhat incapable; 4 = neither incapable nor capable; 5 = somewhat capable; 6 = capable; 7 = very capable. "How capable am I to...":

• Deal with my emotions (anger, sadness, depression, anxiety) since I experienced my trauma.

• Get my life back to normal.

• Not “lose it” emotionally.

• Manage distressing dreams or images about the traumatic experience.

• Not be critical of myself about what happened.

• Be optimistic since the traumatic experience.

• Be supportive to other people since the traumatic experience.

• Control thoughts of the traumatic experience happening to me again.

• Get help from others about what happened.

A.3 PTSD Checklist (PCL-5 Questionnaire)

Below is a list of problems that people sometimes have in response to a very stressful experience. Please rate each problem carefully to indicate how much you have been bothered by that problem in the past month, according to the following scale: 0 = not at all, 1 = a little bit, 2 = moderately, 3 = quite a bit, 4 = extremely. "In the past month, how much were you bothered by...":

• Repeated, disturbing, and unwanted memories of the stressful experience?

• Repeated, disturbing dreams of the stressful experience?

• Suddenly feeling or acting as if the stressful experience were actually happening again (as if you were actually back there reliving it)?

• Feeling very upset when something reminded you of the stressful experience?

• Having strong physical reactions when something reminded you of the stressful experience (for example, heart pounding, trouble breathing, sweating)?

• Avoiding memories, thoughts, or feelings related to the stressful experience?

• Avoiding external reminders of the stressful experience (for example, people, places, conversations, activities, objects, or situations)?

• Trouble remembering important parts of the stressful experience?

• Having strong negative beliefs about yourself, other people, or the world (for example, having thoughts such as: I am bad, there is something seriously wrong with me, no one can be trusted, the world is completely dangerous)?

• Blaming yourself or someone else for the stressful experience or what happened after it?

• Having strong negative feelings such as fear, horror, anger, guilt, or shame?

• Loss of interest in activities that you used to enjoy?

• Feeling distant or cut off from other people?

• Trouble experiencing positive feelings (for example, being unable to feel happiness or have loving feelings for people close to you)?

• Irritable behavior, angry outbursts, or acting aggressively?

• Taking too many risks or doing things that could cause you harm?

• Being "superalert" or watchful or on guard?

• Feeling jumpy or easily startled?

• Having difficulty concentrating?

• Trouble falling or staying asleep?

Appendix B

Institutional Review Board Approval

The research conducted in this thesis involves human subjects. As part of the EASE project, the research on human subjects was approved by the University of Colorado Colorado Springs Institutional Review Board (IRB) under protocol number 15-007 for data collection and 18-113 for analysis of data. The associated certifications were completed on February 15th, 2014 for a basic course, with a refresher course on October 14th, 2016. Certifications of basic courses on Human, Social and Behavioral research from the Collaborative Institutional Training Initiative (CITI) program are attached in Sections B.1 and B.2.

B.1 Certificate: Conduct of Research for Engineers

Completion Date 15-Feb-2014 Expiration Date N/A Record ID 12375060

This is to certify that:

Svati Dhamija

Has completed the following CITI Program course:

Responsible Conduct of Research for Engineers (Curriculum Group) Responsible Conduct of Research for Engineers (Course Learner Group) 1 - Basic Course (Stage)

Under requirements set by:

University of Colorado Colorado Springs

Verify at www.citiprogram.org/verify/?w2570df20-3577-4e32-90bd-1c9615cb9097-12375060

B.2 Certificate: Human, Social and Behavioral Research

Completion Date 14-Oct-2016 Expiration Date 14-Oct-2019 Record ID 21103015

This is to certify that:

Svati Dhamija

Has completed the following CITI Program course:

Human Research (Curriculum Group) Social and Behavioral Research (Course Learner Group) 2 - Refresher Course (Stage)

Under requirements set by:

University of Colorado Colorado Springs

Verify at www.citiprogram.org/verify/?w375aeb1d-7847-4c16-ad17-cfd8fc2c7676-21103015

Bibliography

[1] Pacifica labs. https://www.thinkpacifica.com/. Last accessed 12/1/17.

[2] Ginger.io. http://ginger.io. Last accessed 12/1/17.

[3] Pamela K Adelmann and Robert B Zajonc. Facial efference and the experience of emotion. Annual

review of psychology, 40(1):249–280, 1989.

[4] Hussein Al Osman and Tiago H Falk. Multimodal affect recognition: Current approaches and challenges.

In Emotion and Attention Recognition Based on Biological Signals and Images. InTech, 2017.

[5] Simon L Albrecht and Andrew Marty. Personality, self-efficacy and job resources and their associations

with employee engagement, affective commitment and turnover intentions. The International Journal

of Human Resource Management, pages 1–25, 2017.

[6] Ahmed Alqaraawi, Ahmad Alwosheel, and Amr Alasaad. Heart rate variability estimation in photo-

plethysmography signals using bayesian learning approach. Healthcare Technology Letters, 2016.

[7] Mohamed R Amer, Timothy Shields, Behjat Siddiquie, Amir Tamrakar, Ajay Divakaran, and Sek Chai.

Deep multimodal fusion: A hybrid approach. International Journal of Computer Vision, pages 1–17,

2017.

[8] Emily Anthes. Pocket psychiatry: mobile mental-health apps have exploded onto the market, but few

have been thoroughly tested. Nature, 532(7597):20–24, 2016.

[9] American Psychiatric Association et al. Diagnostic and statistical manual of mental disorders (DSM-5®). American Psychiatric Pub, 2013.

[10] Ryan SJd Baker, Sidney K D’Mello, Ma Mercedes T Rodrigo, and Arthur C Graesser. Better to be

frustrated than bored: The incidence, persistence, and impact of learners' cognitive–affective states

during interactions with three different computer-based learning environments. International Journal of

Human-Computer Studies, 68(4):223–241, 2010.

[11] Tadas Baltrušaitis, Peter Robinson, and Louis-Philippe Morency. Openface: an open source facial

behavior analysis toolkit. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV),

pages 1–10. IEEE, 2016.

[12] Albert Bandura. Self-efficacy: toward a unifying theory of behavioral change. Psychological review,

84(2):191, 1977.

[13] Albert Bandura. The self system in reciprocal determinism. American psychologist, 33(4):344, 1978.

[14] Albert Bandura. Recycling misconceptions of perceived self-efficacy. Cognitive therapy and research,

8(3):231–255, 1984.

[15] Albert Bandura. Model of causality in social learning theory. In Cognition and psychotherapy, pages

81–99. Springer, 1985.

[16] Albert Bandura. Social foundations of thought and action: A social cognitive theory. Prentice-Hall, Inc,

1986.

[17] Albert Bandura. Perceived self-efficacy in cognitive development and functioning. Educational

psychologist, 28(2):117–148, 1993.

[18] Albert Bandura. Self-efficacy. Wiley Online Library, 1994.

[19] Albert Bandura. Self-efficacy: The exercise of control. Macmillan, 1997.

[20] Albert Bandura. Health promotion by social cognitive means. Health education & behavior, 31(2):143–

164, 2004.

[21] Albert Bandura. Guide for constructing self-efficacy scales. Self-efficacy beliefs of adolescents,

5(307-337), 2006.

[22] Albert Bandura. Self-efficacy, corsini encyclopedia of psychology, 2010.

[23] Biplab Banerjee and Vittorio Murino. Efficient pooling of image based cnn features for action recog-

nition in videos. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International

Conference on, pages 2637–2641. IEEE, 2017.

[24] Adam Beckford. Applications of emotion sensing machines. Exp Psychol, 25(1):49–59.

[25] Sean Bell, Paul Upchurch, Noah Snavely, and Kavita Bala. Material recognition in the wild with the

materials in context database. In Proceedings of the IEEE conference on computer vision and pattern

recognition, pages 3479–3487, 2015.

[26] C Benight and A Bandura. Social cognitive theory of posttraumatic recovery: the role of perceived

self-efficacy. Behaviour Research and Therapy,Elsevier, 2004.

[27] C. Benight, J Ruzek, and E Waldrep. Internet interventions for traumatic stress: A review and

theoretically based example. J. of Trauma Stress, 21(6):513–520, 2008.

[28] Charles Benight, Kotaro Shoji, CarolynYeager, Austin Mullings, Svati Dhamija, and Terry Boult.

Changes self-appraisal and mood utilizing a web-based recovery system on posttraumatic stress symp-

toms: A laboratory experiment. ISTSS, 2016.

[29] Charles Benight, Kotaro Shoji, CarolynYeager, Austin Mullings, Svati Dhamija, and Terry Boult. The

importance of self-appraisals of coping capability in predicting engagement in a web intervention for

trauma. ISRII, 2016.

[30] Charles C Benight, Michael H Antoni, Kristin Kilbourn, Gail Ironson, Mahendra A Kumar, Mary Ann

Fletcher, Laura Redwine, Andrew Baum, and Neil Schneiderman. Coping self-efficacy buffers psy-

chological and physiological disturbances in hiv-infected men following a natural disaster. Health

Psychology, 16(3):248, 1997.

[31] Charles C Benight, Aaron Harwell, and Kotaro Shoji. Self-regulation shift theory: A dynamic personal

agency approach to recovery capital and methodological suggestions. Frontiers in psychology, 9, 2018.

[32] Charles C Benight, Kotaro Shoji, Lori E James, Edward E Waldrep, Douglas L Delahanty, and Roman

Cieslak. Trauma coping self-efficacy: A context-specific self-efficacy measure for traumatic stress.

Psychological trauma: theory, research, practice, and policy, 7(6):591, 2015.

[33] Gary G Bennett and Russell E Glasgow. The delivery of public health interventions via the internet:

actualizing their potential. Annual review of public health, 30:273–292, 2009.

[34] Nigel Bosch, Yuxuan Chen, and Sidney D'Mello. It's written on your face: detecting affective

states from facial expressions while learning computer programming. In International Conference on

Intelligent Tutoring Systems, pages 39–44. Springer, 2014.

[35] Hennie Brugman, Albert Russel, and Xd Nijmegen. Annotating multi-media/multi-modal resources

with elan. In LREC, 2004.

[36] EL Bunge, B Dickter, MK Jones, G Alie, A Spear, and R Perales. Behavioral intervention technologies

and psychotherapy with youth: A review of the literature. Current Psychiatry Reviews, 12(1):14–28,

2016.

[37] Virginia H Burney. Applications of social cognitive theory to gifted education. Roeper Review,

30(2):130–139, 2008.

[38] Carlos Busso, Srinivas Parthasarathy, Alec Burmania, Mohammed AbdelWahab, Najmeh Sadoughi,

and Emily Mower Provost. Msp-improv: An acted corpus of dyadic interactions to study emotion

perception. IEEE Transactions on Affective Computing, 8(1):67–80, 2017.

[39] Victoria Bustamante, Thomas A Mellman, Daniella David, and Ana I Fins. Cognitive functioning and

the early development of ptsd. Journal of traumatic stress, 14(4):791–797, 2001.

[40] Rafael A Calvo, Karthik Dinakar, Rosalind Picard, and Pattie Maes. Computing in mental health. In

Proceedings of the 2016 CHI Conference Extended Abstracts on Human Factors in Computing Systems,

pages 3438–3445. ACM, 2016.

[41] Elizabeth Camilleri, Georgios N Yannakakis, and Antonios Liapis. Towards general models of player

affect. In Affective Computing and Intelligent Interaction (ACII), 2017 International Conference on,

2017.

[42] Carlos Carreiras, Ana Priscila Alves, André Lourenço, Filipe Canento, Hugo Silva, Ana Fred, et al.

BioSPPy: Biosignal processing in Python, 2015–. [Online].

[43] Ginevra Castellano, Iolanda Leite, André Pereira, Carlos Martinho, Ana Paiva, and Peter W McOwan.

Detecting engagement in hri: An exploration of social and task-based context. In Privacy, Security, Risk

and Trust (PASSAT), 2012 International Conference on and 2012 International Confernece on Social

Computing (SocialCom), pages 421–428. IEEE, 2012.

[44] Ginevra Castellano, André Pereira, Iolanda Leite, Ana Paiva, and Peter W McOwan. Detecting user

engagement with a robot companion using task and social interaction-based features. In ICMI, pages

119–126. ACM, 2009.

[45] Oya Celiktutan, Efstratios Skordos, and Hatice Gunes. Multimodal human-human-robot interactions

(mhhri) dataset for studying personality and engagement. IEEE Transactions on Affective Computing,

2017.

[46] Wei-Yi Chang, Shih-Huan Hsu, and Jen-Hsien Chien. Fatauva-net: An integrated deep learning

framework for facial attribute recognition, action unit detection, and valence-arousal estimation. In

Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on, pages

1963–1971. IEEE, 2017.

[47] Mounir Chennaoui, Clément Bougard, Catherine Drogou, Christophe Langrume, Christian Miller,

Danielle Gomez-Merino, and Frédéric Vergnoux. Stress biomarkers, mood states, and sleep during

a major competition: “success” and “failure” athlete's profile of high-level swimmers. Frontiers in

physiology, 7, 2016.

[48] Anoop Cherian, Piotr Koniusz, and Stephen Gould. Higher-order pooling of cnn features via kernel

linearization for action recognition. In Applications of Computer Vision (WACV), 2017 IEEE Winter

Conference on, pages 130–138. IEEE, 2017.

[49] Katie E Cherry, Loren D Marks, Rachel Adamek, and Bethany A Lyon. Traumatic stress and long-term

recovery. 2015.

[50] Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of

neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.

[51] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger

Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical

machine translation. arXiv preprint arXiv:1406.1078, 2014.

[52] W Chu, F De la Torre, and J Cohn. Learning spatial and temporal cues for multi-label facial action unit

detection. Automatic Face and Gesture Conference, 2017.

[53] Wen-Sheng Chu, Fernando De la Torre, and Jeffrey F Cohn. Selective transfer machine for personalized

facial expression analysis. IEEE transactions on pattern analysis and machine intelligence, 39(3):529–

545, 2017.

[54] Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. Hierarchical multiscale recurrent neural networks.

arXiv preprint arXiv:1609.01704, 2016.

[55] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of

gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.

[56] Roman Cieslak, Charles C Benight, Anna Rogala, Ewelina Smoktunowicz, Martyna Kowalska,

Katarzyna Zukowska, Carolyn Yeager, and Aleksandra Luszczynska. Effects of internet-based self-

efficacy intervention on secondary traumatic stress and secondary posttraumatic growth among health

and human services professionals exposed to indirect trauma. Frontiers in Psychology, 7, 2016.

[57] Carleton Coffrin, Linda Corrin, Paula de Barba, and Gregor Kennedy. Visualizing patterns of student

engagement and performance in moocs. In Proceedings of the fourth international conference on

learning analytics and knowledge, pages 83–92. ACM, 2014.

[58] Anat Cohen. Analysis of student activity in web-supported courses as a tool for predicting dropout.

Educational Technology Research and Development, pages 1–20, 2017.

[59] J Cohn. Foundations of human computing: facial expression and emotion. ICMI, pages 233–238, 2006.

[60] J Cohn, T Kruez, I Matthews, Y Yang, M Nguyen, M Padilla, F Zhou, and F De la Torre. Detecting

depression from facial actions and vocal prosody. Affective Computing and Intelligent Interaction and

Workshops, IEEE, 2009.

[61] Deborah R Compeau and Christopher A Higgins. Application of social cognitive theory to training for

computer skills. Information systems research, 6(2):118–143, 1995.

[62] M Cox, J Nuevo-Chiquero, JM Saragih, and S Lucey. Csiro face analysis sdk. Brisbane, Australia,

2013.

[63] Fernando De la Torre, Wen-Sheng Chu, Xuehan Xiong, Francisco Vicente, Xiaoyu Ding, and Jeffrey

Cohn. Intraface. In Automatic Face and Gesture Recognition (FG), 2015 11th IEEE International

Conference and Workshops on, volume 1, pages 1–8. IEEE, 2015.

[64] Lyne Desrosiers, Micheline Saint-Jean, and Lise Laporte. Model of engagement and dropout for

adolescents with borderline personality disorder. Santé mentale au Québec, 41(1):267–290, 2017.

[65] Amanda Devane, Kotaro Shoji, Terrance Boult, and Charles C. Benight. The role of coping self-efficacy

on parasympathetic response to an online intervention. ISTSS, 2016.

[66] Svati Dhamija and Terrance Boult. Exploring contextual engagement for trauma recovery. CVPR

Workshop on Deep Affective Learning and Context Modelling, 2017.

[67] Svati Dhamija and Terrance E Boult. Automated mood-aware engagement prediction. In Affective

Computing and Intelligent Interaction (ACII), 2017 Seventh International Conference on, pages 1–8.

IEEE, 2017.

[68] Roberto Di Salvo, Concetto Spampinato, and Daniela Giordano. Generating reliable video annotations

by exploiting the crowd. In Applications of Computer Vision (WACV), 2016 IEEE Winter Conference

on, pages 1–8. IEEE, 2016.

[69] Anja Dieckmann and Matthias Unfried. Writ large on your face: Observing emotions using automatic

facial analysis. GfK Marketing Intelligence Review, 6(1):52–58, 2014.

[70] Richard A Dienstbier. Arousal and physiological toughness: implications for mental and physical

health. Psychological review, 96(1):84, 1989.

[71] Alexiei Dingli and Andreas Giordimaina. Webcam-based detection of emotional states. The Visual

Computer, 33(4):459–469, 2017.

[72] S D’Mello, S Craig, J Sullins, and A Graesser. Predicting affective states expressed through an

emote-aloud procedure from autotutor's mixed-initiative dialogue. International Journal of Artificial

Intelligence in Education, 16(1):2–28, 2006.

[73] Sidney D’Mello, Ed Dieterle, and Angela Duckworth. Advanced, analytic, automated (aaa) measurement

of engagement during learning. Educational Psychologist, 52(2):104–123, 2017.

[74] Sidney K D’Mello. On the influence of an iterative affect annotation approach on inter-observer and

self-observer reliability. IEEE Transactions on Affective Computing, 7(2):136–149, 2016.

[75] Sidney S D’Mello, Patrick Chipman, and Art Graesser. Posture as a predictor of learner’s affective

engagement. In Proceedings of the Cognitive Science Society, volume 29, 2007.

[76] Tom Dunne, Lisa Bishop, Susan Avery, and Stephen Darcy. A review of effective youth engagement

strategies for mental health and substance use interventions. Journal of Adolescent Health, 2017.

[77] Robert R Edwards and Jennifer Haythornthwaite. Mood swings: variability in the use of the profile of

mood states. Journal of pain and symptom management, 28(6):534, 2004.

[78] Panteleimon Ekkekakis. The measurement of affect, mood, and emotion: A guide for health-behavioral

research. Cambridge University Press, 2013.

[79] P Ekman and AJ Fridlund. Assessment of facial behavior in affective disorders. Depression and

expressive behavior, pages 37–56, 1987.

[80] Paul Ekman and Richard J Davidson. The nature of emotion: Fundamental questions. Oxford

University Press, 1994.

[81] C. Fabian Benitez-Quiroz, Ramprakash Srinivasan, and Aleix M. Martinez. Emotionet: An accurate,

real-time algorithm for the automatic annotation of a million facial expressions in the wild. In The IEEE

Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.

[82] Mary Ellen Foster, Andre Gaschler, and Manuel Giuliani. Automatically classifying user engagement

for dynamic multi-party human–robot interaction. International Journal of Social Robotics, pages 1–16,

2017.

[83] Andrew C Gallagher and Tsuhan Chen. Estimating age, gender, and identity using first name priors. In

Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE,

2008.

[84] Autumn M Gallegos, Nicholas A Streltzov, and Tracy Stecker. Improving treatment engagement for

returning operation enduring freedom and operation iraqi freedom veterans with posttraumatic stress

disorder, depression, and suicidal ideation. The Journal of nervous and mental disease, 204(5):339–343,

2016.

[85] Chuanji Gao, Douglas H Wedell, Jongwan Kim, Christine E Weber, and Svetlana V Shinkareva.

Modelling audiovisual integration of affect from videos and music. Cognition and Emotion, pages 1–14,

2017.

[86] Felix A Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to forget: Continual prediction with

lstm. In 9th International Conference on Artificial Neural Networks: ICANN ’99. IET, 1999.

[87] Michail N Giannakos, Letizia Jaccheri, and John Krogstie. How video usage styles affect student

engagement? implications for video-based learning environments. In State-of-the-Art and Future

Directions of Smart Learning, pages 157–163. Springer, 2016.

[88] Elise L Gibbs, Andrea E Kass, Dawn M Eichen, Ellen E Fitzsimmons-Craft, Mickey Trockel, Denise E

Wilfley, and C Barr Taylor. Attention-deficit/hyperactivity disorder–specific stimulant misuse, mood,

anxiety, and stress in college-age women at high risk for or with eating disorders. Journal of American

College Health, 64(4):300–308, 2016.

[89] Jeffrey Girard. Carma: Software for continuous affect rating and media annotation. Journal of Open

Research Software, 2(1), 2014.

[90] Jeffrey M Girard. Carma: Software for continuous affect rating and media annotation. Journal of Open

Research Software, 2(1):e5, 2014.

[91] Jeffrey M Girard and Jeffrey F Cohn. A primer on observational measurement. Assessment, 23(4):404–

413, 2016.

[92] Karen Glanz, Barbara K Rimer, and Kasisomayajula Viswanath. Health behavior and health education:

theory, research, and practice. John Wiley & Sons, 2008.

[93] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.

[94] John M Gottman and Robert W Levenson. A valid procedure for obtaining self-report of affect in

marital interaction. Journal of consulting and clinical psychology, 53(2):151, 1985.

[95] J Grafsgaard, J Wiggins, K Boyer, E Wiebe, and J Lester. Automatically recognizing facial expression:

Predicting engagement and frustration. Educational Data Mining, 2013.

[96] Joseph Grafsgaard, Joseph B Wiggins, Kristy Elizabeth Boyer, Eric N Wiebe, and James Lester.

Automatically recognizing facial expression: Predicting engagement and frustration. In Educational

Data Mining 2013, 2013.

[97] A Graves, A Mohamed, and G Hinton. Speech recognition with deep recurrent neural networks.

ICASSP, 2013.

[98] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850,

2013.

[99] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent

neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference

on, pages 6645–6649. IEEE, 2013.

[100] Klaus Greff, Rupesh K Srivastava, Jan Koutník, Bas R Steunebrink, and Jürgen Schmidhuber. Lstm: A

search space odyssey. IEEE transactions on neural networks and learning systems, 2016.

[101] Philip J Guo, Juho Kim, and Rob Rubin. How video production affects student engagement: An

empirical study of mooc videos. In Proceedings of the first ACM conference on Learning@ scale

conference, pages 41–50. ACM, 2014.

[102] Daniel Hadar. Implicit media tagging and affect prediction from video of spontaneous facial expressions,

recorded with depth camera. arXiv preprint arXiv:1701.05248, 2017.

[103] Patricia A Hageman, Carol H Pullen, Melody Hertzog, Bunny Pozehl, Christine Eisenhauer, and Linda S

Boeckner. Web-based interventions alone or supplemented with peer-led support or professional email

counseling for weight loss and weight maintenance in women from rural communities: Results of a

clinical trial. Journal of obesity, 2017, 2017.

[104] Moira Haller, Ursula S Myers, Aaron McKnight, Abigail C Angkaw, and Sonya B Norman. Predicting

engagement in psychotherapy, pharmacotherapy, or both psychotherapy and pharmacotherapy among

returning veterans seeking ptsd treatment. 2016.

[105] Nik Wahidah Hashim, Mitch Wilkes, Ronald Salomon, Jared Meggs, and Daniel J France. Evaluation

of voice acoustics as predictors of clinical depression scores. Journal of Voice, 2016.

[106] Javier Hernandez, Zicheng Liu, Geoff Hulten, Dave DeBarr, Kyle Krum, and Zhengyou Zhang.

Measuring the engagement level of tv viewers. In Automatic Face and Gesture Recognition (FG), 2013

10th IEEE International Conference and Workshops on, pages 1–7. IEEE, 2013.

[107] Guillaume Heusch, André Anjos, and Sébastien Marcel. A reproducible study on remote heart rate

measurement. arXiv preprint arXiv:1709.00962, 2017.

[108] Charlie Hewitt and Hatice Gunes. Cnn-based facial affect analysis on mobile devices. arXiv preprint

arXiv:1807.08775, 2018.

[109] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–

1780, 1997.

[110] Nathan O Hodas, Ryan Butner, et al. How a user's personality influences content engagement in social

media. In International Conference on Social Informatics, pages 481–493. Springer, 2016.

[111] Thomas R Insel. Assessing the economic costs of serious mental illness, 2008.

[112] Luca Iozzia, Luca Cerina, and Luca Mainardi. Relationships between heart-rate variability and pulse-

rate variability obtained from video-ppg signal using zca. Physiological Measurement, 37(11):1934,

2016.

[113] Ashesh Jain, Amir R Zamir, Silvio Savarese, and Ashutosh Saxena. Structural-rnn: Deep learning

on spatio-temporal graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern

Recognition, pages 5308–5317, 2016.

[114] László A Jeni, Jeffrey F Cohn, and Takeo Kanade. Dense 3d face alignment from 2d videos in real-time.

In Automatic Face and Gesture Recognition (FG), 2015 11th IEEE International Conference and

Workshops on, volume 1, pages 1–8. IEEE, 2015.

[115] László A Jeni, András Lőrincz, Zoltán Szabó, Jeffrey F Cohn, and Takeo Kanade. Spatio-temporal

event classification using time-series kernel based structured sparsity. In European Conference on

Computer Vision, pages 135–150. Springer, 2014.

[116] Rajiv Jhangiani, Hammond Tarry, and Charles Stangor. Principles of Social Psychology - 1st

International Edition. Creative Commons Attribution 4.0 International License, 2014.

[117] Rafal Jozefowicz, Wojciech Zaremba, and Ilya Sutskever. An empirical exploration of recurrent network

architectures. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15),

pages 2342–2350, 2015.

[118] Aditya Kamath, Aradhya Biswas, and Vineeth Balasubramanian. A crowdsourced approach to student

engagement recognition in e-learning environments. In Applications of Computer Vision (WACV), 2016

IEEE Winter Conference on, pages 1–9. IEEE, 2016.

[119] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions.

In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137,

2015.

[120] Christina Katsimerou, Ingrid Heynderickx, and Judith A Redi. Predicting mood from punctual emotion

annotations on videos. IEEE Transactions on Affective Computing, 6(2):179–192, 2015.

[121] Heysem Kaya, Furkan Gürpınar, and Albert Ali Salah. Video-based emotion recognition in the wild

using deep transfer learning and score fusion. Image and Vision Computing, 2017.

[122] Nidhi N Khatri, Zankhana H Shah, and Samip A Patel. Facial expression recognition: A survey.

International Journal of Computer Science and Information Technologies (IJCSIT), 5(1):149–152,

2014.

[123] Kyung Hwan Kim, Seok Won Bang, and Sang Ryong Kim. Emotion recognition system using short-term

monitoring of physiological signals. Medical and biological engineering and computing, 42(3):419–427,

2004.

[124] Michael Kipp. Anvil-a generic annotation tool for multimodal dialogue. In Seventh European Confer-

ence on Speech Communication and Technology, 2001.

[125] Chris L Kleinke, Thomas R Peterson, and Thomas R Rutledge. Effects of self-generated facial

expressions on mood. Journal of Personality and Social Psychology, 74(1):272, 1998.

[126] David Klotz, Johannes Wienke, Julia Peltason, Britta Wrede, Sebastian Wrede, Vasil Khalidov, and

Jean-Marc Odobez. Engagement-based multi-party dialog with a humanoid robot. In Proceedings of

the SIGDIAL 2011 Conference, pages 341–343. Association for Computational Linguistics, 2011.

[127] Sander Koelstra, Christian Muhl, Mohammad Soleymani, Jong-Seok Lee, Ashkan Yazdani, Touradj

Ebrahimi, Thierry Pun, Anton Nijholt, and Ioannis Patras. Deap: A database for emotion analysis;

using physiological signals. IEEE Transactions on Affective Computing, 3(1):18–31, 2012.

[128] Jean Kossaifi, Georgios Tzimiropoulos, Sinisa Todorovic, and Maja Pantic. Afew-va database for

valence and arousal estimation in-the-wild. Image and Vision Computing, 2017.

[129] George D Kuh. The national survey of student engagement: Conceptual and empirical foundations.

New directions for institutional research, 2009(141):5–20, 2009.

[130] F De la Torre and J Cohn. Facial expression analysis. Visual Analysis of Humans - Springer, pages

377–409, 2011.

[131] Robert LaRose and Matthew S Eastin. A social cognitive theory of internet uses and gratifications:

Toward a new model of media attendance. Journal of Broadcasting & Electronic Media, 48(3):358–377,

2004.

[132] Robert W Lent, Steven D Brown, and Gail Hackett. Toward a unifying social cognitive theory of career

and academic interest, choice, and performance. Journal of vocational behavior, 45(1):79–122, 1994.

[133] Michael E Levin, Steven C Hayes, Jacqueline Pistorello, and John R Seeley. Web-based self-help for

preventing mental health problems in universities: Comparing acceptance and commitment training to

mental health education. Journal of clinical psychology, 72(3):207–225, 2016.

[134] Haoxiang Li, Jonathan Brandt, Zhe Lin, Xiaohui Shen, and Gang Hua. A multi-level contextual model

for person recognition in photo albums. In Proceedings of the IEEE Conference on Computer Vision

and Pattern Recognition, pages 1297–1305, 2016.

[135] Wei Li, Farnaz Abtahi, and Zhigang Zhu. Action unit detection with region adaptation, multi-labeling

learning and optimal temporal fusing. arXiv preprint arXiv:1704.03067, 2017.

[136] X Li, J Chen, G Zhao, and M Pietikäinen. Remote heart rate measurement from face videos under

realistic situations. Computer Vision and Pattern Recognition, 2014.

[137] Xin Li and Yuhong Guo. Adaptive active learning for image classification. In Proceedings of the IEEE

Conference on Computer Vision and Pattern Recognition, pages 859–866, 2013.

[138] Tze Wei Liew and Tan Su-Mae. The effects of positive and negative mood on cognition and motivation

in multimedia learning environment. Journal of Educational Technology & Society, 19(2):104, 2016.

[139] Elizabeth A Linnenbrink and Paul R Pintrich. The role of self-efficacy beliefs in student engagement

and learning in the classroom. Reading & Writing Quarterly, 19(2):119–137, 2003.

[140] Joda Lloyd, Frank W Bond, and Paul E Flaxman. Work-related self-efficacy as a moderator of the

impact of a worksite stress management training intervention: Intrinsic work motivation as a higher

order condition of effect. Journal of occupational health psychology, 22(1):115, 2017.

[141] Miguel L Bote-Lorenzo and Eduardo Gómez-Sánchez. Predicting the decrease of engagement

indicators in a mooc. 2017.

[142] Aleksandra Luszczynska and R Schwarzer. Social cognitive theory. Predicting health behaviour,

2:127–169, 2005.

[143] Ludo Maat and Maja Pantic. Gaze-x: Adaptive, affective, multimodal interface for single-user office

scenarios. In Artificial Intelligence for Human Computing, pages 251–271. Springer, 2007.

[144] D Macea, K Gajos, Y Daglia Calil, and F Fregni. The efficacy of web-based cognitive behavioral

interventions for chronic pain: a systematic review and meta-analysis. J. of Pain, 11(10):917–929, 2010.

[145] Adria Mallol-Ragolta, Svati Dhamija, and Terrance E Boult. A multimodal approach for predict-

ing changes in ptsd symptom severity. In Proceedings of the 2018 on International Conference on

Multimodal Interaction, pages 324–333. ACM, 2018.

[146] S. U. Marks and R. Gersten. Engagement and disengagement between special and general educators: An

application of miles and huberman’s cross-case analysis. Learning Disability Quarterly, 21(1):32–56,

1998.

[147] René Mauer, Helle Neergaard, and Anne Kirketerp Linstad. Self-efficacy: Conditioning the en-

trepreneurial mindset. In Revisiting the Entrepreneurial Mind, pages 293–317. Springer, 2017.

[148] Daniel McDuff. New methods for measuring advertising efficacy. Digital Advertising: Theory and

Research, 2017.

[149] Daniel McDuff, Rana El Kaliouby, Jeffrey F Cohn, and Rosalind W Picard. Predicting ad liking

and purchase intent: Large-scale analysis of facial responses to ads. IEEE Transactions on Affective

Computing, 6(3):223–235, 2015.

[150] Daniel McDuff, Rana El Kaliouby, and Rosalind W Picard. Crowdsourcing facial responses to online

videos. In Affective Computing and Intelligent Interaction (ACII), 2015 International Conference on,

pages 512–518. IEEE, 2015.

[151] Daniel Jonathan McDuff. Crowdsourcing affective responses for predicting media effectiveness. PhD

thesis, Massachusetts Institute of Technology, 2014.

[152] Scott McIntosh, Tye Johnson, Andrew F Wall, Alexander V Prokhorov, Karen Sue Calabro, Duncan

Ververs, Vanessa Assibey-Mensah, and Deborah J Ossip. Recruitment of community college students

into a web-assisted tobacco intervention study. JMIR research protocols, 6(5), 2017.

[153] Varun S Mehta, Malvika Parakh, and Debasruti Ghosh. Web based interventions in psychiatry: An

overview. International Journal of Mental Health & Psychiatry, 2015, 2016.

[154] Antoine Miech, Ivan Laptev, and Josef Sivic. Learnable pooling with context gating for video classifica-

tion. arXiv preprint arXiv:1706.06905, 2017.

[155] Mark J Millan, Jean-Michel Rivet, and Alain Gobert. The frontal cortex as a network hub controlling

mood and cognition: Probing its neurochemical substrates for improved therapy of psychiatric and

neurological disorders. Journal of Psychopharmacology, 30(11):1099–1128, 2016.

[156] David C Mohr, Michelle Nicole Burns, Stephen M Schueller, Gregory Clarke, and Michael Klinkman.

Behavioral intervention technologies: evidence review and recommendations for future research in

mental health. General hospital psychiatry, 35(4):332–338, 2013.

[157] H Monkaresi, P Bosch, R Calvo, and S D’Mello. Automated detection of engagement using video-based

estimation of facial expressions and heart rate. IEEE Trans. on Affective Computing, 2017.

[158] Hamed Monkaresi, Nigel Bosch, Rafael Calvo, and Sidney D’Mello. Automated detection of engage-

ment using video-based estimation of facial expressions and heart rate.

[159] Hamed Monkaresi, Nigel Bosch, Rafael A Calvo, and Sidney K D’Mello. Automated detection of

engagement using video-based estimation of facial expressions and heart rate. IEEE Transactions on

Affective Computing, 8(1):15–28, 2017.

[160] Robert R Morris, Stephen M Schueller, and Rosalind W Picard. Efficacy of a web-based, crowdsourced

peer-to-peer cognitive reappraisal platform for depression: Randomized controlled trial. Journal of

medical Internet research, 17(3), 2015.

[161] Cecily Morrison and Gavin Doherty. Analyzing engagement in a web-based intervention platform

through visualizing log-data. Journal of medical Internet research, 16(11), 2014.

[162] Wenxuan Mou, Hatice Gunes, and Ioannis Patras. Automatic recognition of emotions and membership

in group videos. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

Workshops, June 2016.

[163] Barbara M Murphy, Rosemary O Higgins, Lyndel Shand, Karen Page, Elizabeth Holloway, Michael R

Le Grande, and Alun C Jackson. Improving health professionals' self-efficacy to support cardiac

patients' emotional recovery: the 'cardiac blues project'. European Journal of Cardiovascular Nursing,

16(2):143–149, 2017.

[164] Elizabeth Murray. Web-based interventions for behavior change and self-management: potential,

pitfalls, and progress. Medicine 2.0, 1(2):e3, 2012.

[165] Frederik Nagel, Reinhard Kopiez, Oliver Grewe, and Eckart Altenmüller. Emujoy: Software for

continuous measurement of perceived emotions in music. Behavior Research Methods, 39(2):283–290,

2007.

[166] Quang Nguyen. Efficient learning with soft label information and multiple annotators. PhD thesis,

University of Pittsburgh, 2014.

[167] Mihalis A Nicolaou, Hatice Gunes, and Maja Pantic. Continuous prediction of spontaneous affect from

multiple cues and modalities in valence-arousal space. IEEE Transactions on Affective Computing,

2(2):92–105, 2011.

[168] Jaclyn Ocumpaugh, Nigel Bosch, Sidney K D'Mello, Ryan S Baker, and Valerie Shute. Using video to

automatically detect learner affect in computer-enabled classrooms. 2010.

[169] H O’Brien and E Toms. The development and evaluation of a survey to measure user engagement.

Journal of the American Society for Information Science and Technology, 61(1):50–69, 2010.

[170] Sarah Oetken, Katharina D Pauly, Ruben C Gur, Frank Schneider, Ute Habel, and Anna Pohl. Don't

worry, be happy-neural correlates of the influence of musically induced mood on self-evaluation.

Neuropsychologia, 100:26–34, 2017.

[171] Carol Opdebeeck, Fiona E Matthews, Yu-Tzu Wu, Robert T Woods, Carol Brayne, and Linda Clare.

Cognitive reserve as a moderator of the negative association between mood and cognition: evidence

from a population-representative cohort. Psychological Medicine, pages 1–11, 2017.

[172] Rachel Pemberton and Matthew D Fuller Tyszkiewicz. Factors contributing to depressive mood states

in everyday life: a systematic review. Journal of affective disorders, 200:103–110, 2016.

[173] Donald W Pfaff. Brain arousal and information theory. Harvard University Press, 2006.

[174] Rosalind W. Picard, Elias Vyzas, and Jennifer Healey. Toward machine emotional intelligence: Analysis

of affective physiological state. IEEE transactions on pattern analysis and machine intelligence,

23(10):1175–1191, 2001.

[175] Rosalind W. Picard, Elias Vyzas, and Jennifer Healey. Toward machine emotional intelligence: Analysis

of affective physiological state. IEEE transactions on pattern analysis and machine intelligence,

23(10):1175–1191, 2001.

[176] Soujanya Poria, Erik Cambria, Rajiv Bajpai, and Amir Hussain. A review of affective computing: From

unimodal analysis to multimodal fusion. Information Fusion, 37:98–125, 2017.

[177] Andrew Pulver and Siwei Lyu. Lstm with working memory. arXiv preprint arXiv:1605.01988, 2016.

[178] O.V. Ramana Murthy and Roland Goecke. Ordered trajectories for large scale human action recognition.

In The IEEE International Conference on Computer Vision (ICCV) Workshops, December 2013.

[179] Vikas C Raykar, Shipeng Yu, Linda H Zhao, Gerardo Hermosillo Valadez, Charles Florin, Luca Bogoni,

and Linda Moy. Learning from crowds. Journal of Machine Learning Research, 11(Apr):1297–1322,

2010.

[180] Fabien Ringeval, Björn Schuller, Michel Valstar, Shashank Jaiswal, Erik Marchi, Denis Lalanne, Roddy

Cowie, and Maja Pantic. Av+ec 2015: The first affect recognition challenge bridging across audio,

video, and physiological data. In Proceedings of the 5th International Workshop on Audio/Visual

Emotion Challenge, pages 3–8. ACM, 2015.

[181] Fabien Ringeval, Andreas Sonderegger, Juergen Sauer, and Denis Lalanne. Introducing the recola

multimodal corpus of remote collaborative and affective interactions. In Automatic Face and Gesture

Recognition (FG), 2013 10th IEEE International Conference and Workshops on, pages 1–8. IEEE,

2013.

[182] Dayane Ferreira Rodrigues, Andressa Silva, João Paulo Pereira Rosa, Francieli Silva Ruiz, Amaury Wag-

ner Veríssimo, Ciro Winckler, Edilson Alves da Rocha, Andrew Parsons, Sergio Tufik, and Marco Túlio

de Mello. Profiles of mood states, depression, sleep quality, sleepiness, and anxiety of the paralympic

athletics team: A longitudinal study. Apunts. Medicina de l’Esport, 2017.

[183] Mary AM Rogers, Kelsey Lemmen, Rachel Kramer, Jason Mann, and Vineet Chopra. Internet-delivered

health interventions that work: systematic review of meta-analyses and evaluation of website availability.

Journal of medical Internet research, 19(3), 2017.

[184] Hanan Salam, Oya Celiktutan, Isabelle Hupont, Hatice Gunes, and Mohamed Chetouani. Fully

automatic analysis of engagement and its relationship to personality in human-robot interactions. IEEE

Access, 5:705–721, 2017.

[185] Akane Sano, Amy Z Yu, Andrew W McHill, Andrew JK Phillips, Sara Taylor, Natasha Jaques,

Elizabeth B Klerman, and Rosalind W Picard. Prediction of happy-sad mood from daily behaviors

and previous sleep history. In 2015 37th Annual International Conference of the IEEE Engineering in

Medicine and Biology Society (EMBC), pages 6796–6799. IEEE, 2015.

[186] Akane Sano, Sara Taylor, Andrew W McHill, Andrew JK Phillips, Laura K Barger, Elizabeth Klerman,

and Rosalind Picard. Identifying objective physiological markers and modifiable behaviors for self-

reported stress and mental health status using wearable sensors and mobile phones: Observational study.

Journal of medical Internet research, 20(6):e210, 2018.

[187] Walter J Scheirer, Samuel E Anthony, Ken Nakayama, and David D Cox. Perceptual annotation:

Measuring human vision to improve computer vision. IEEE transactions on pattern analysis and

machine intelligence, 36(8):1679–1686, 2014.

[188] Walter J Scheirer, Neeraj Kumar, Karl Ricanek, Peter N Belhumeur, and Terrance E Boult. Fusing

with context: a bayesian approach to combining descriptive attributes. In Biometrics (IJCB), 2011

International Joint Conference on, pages 1–8. IEEE, 2011.

[189] S Scherer, G Lucas, J Gratch, A Rizzo, and L P Morency. Self-reported symptoms of depression

and ptsd are associated with reduced vowel space in screening interview. IEEE Trans. on Affective

Computing, 2016.

[190] Jürgen Schmidhuber. A local learning algorithm for dynamic feedforward and recurrent networks.

Connection Science, 1(4):403–412, 1989.

[191] Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural networks, 61:85–117,

2015.

[192] Mark Scholten, Saskia Kelders, and Lisette van Gemert-Pijnen. How persuasive is a virtual coach in

terms of promoting web-based intervention user engagement?

[193] Julius Schöning, Patrick Faion, Gunther Heidemann, and Ulf Krumnack. Providing video annotations

in multimedia containers for visualization and research. In Applications of Computer Vision (WACV),

2017 IEEE Winter Conference on, pages 650–659. IEEE, 2017.

[194] Stephen M Schueller, Ricardo F Muñoz, and David C Mohr. Realizing the potential of behavioral

intervention technologies. Current Directions in Psychological Science, 22(6):478–483, 2013.

[195] Scott E Seibert, Leisa D Sargent, Maria L Kraimer, and Kohyar Kiazad. Linking developmental

experiences to leader effectiveness and promotability: The mediating role of leadership self-efficacy

and mentor network. Personnel Psychology, 70(2):357–397, 2017.

[196] S Shacham. A shortened version of the profile of mood states. Journal of Personality Assessment,

47(3):305–306, 1983.

[197] Kotaro Shoji, Charles Benight, Austin Mullings, Carolyn Yeager, Svati Dhamija, and Terry Boult.

Measuring engagement into the web-intervention by the quality of voice. ISRII, 2016.

[198] Kotaro Shoji, Amanda Devane, Svati Dhamija, Terry Boult, and Charles Benight. Recurrence of heart

rate and skin conductance during a web-based intervention for trauma survivors. ISTSS, 2017.

[199] Mohammad Soleymani, Sadjad Asghari-Esfeden, Yun Fu, and Maja Pantic. Analysis of eeg signals

and facial expressions for continuous emotion detection. IEEE Transactions on Affective Computing,

7(1):17–28, 2016.

[200] Mohammad Soleymani and Martha Larson. Crowdsourcing for affective annotation of video: Devel-

opment of a viewer-reported boredom corpus. In Workshop on Crowdsourcing for Search Evaluation,

SIGIR 2010., 2010.

[201] Mohammad Soleymani, Jeroen Lichtenauer, Thierry Pun, and Maja Pantic. A multimodal database for

affect recognition and implicit tagging. IEEE Transactions on Affective Computing, 3(1):42–55, 2012.

[202] S Steinmetz, C Benight, S Bishop, and L James. My disaster recovery: a pilot randomized controlled

trial of an internet intervention. Anxiety Stress Coping, 25(5):593–600, 2012.

[203] Sarah E Steinmetz, Charles C Benight, Sheryl L Bishop, and Lori E James. My disaster recovery: a

pilot randomized controlled trial of an internet intervention. Anxiety, Stress & Coping, 25(5):593–600,

2012.

[204] Giota Stratou and Louis-Philippe Morency. Multisense-context-aware nonverbal behavior analysis

framework: A psychological distress use case. IEEE Transactions on Affective Computing.

[205] Victor Strecher, Jennifer McClure, Gwen Alexander, Bibhas Chakraborty, Vijay Nair, Janine Konkel,

Sarah Greene, Mick Couper, Carola Carlier, Cheryl Wiese, et al. The role of engagement in a tailored

web-based smoking cessation program: randomized controlled trial. Journal of medical Internet

research, 10(5):e36, 2008.

[206] Yu Sun, Sijung Hu, Vicente Azorin-Peris, Roy Kalawsky, and Stephen Greenwald. Noncontact

imaging photoplethysmography to effectively access pulse rate variability. Journal of biomedical optics,

18(6):061205–061205, 2013.

[207] Kaloyan S Tanev, Scott P Orr, Edward F Pace-Schott, Michael Griffin, Roger K Pitman, and Patricia A

Resick. Positive association between nightmares and heart rate response to loud tones: relationship to

parasympathetic dysfunction in ptsd nightmares. The Journal of nervous and mental disease, 205(4):308,

2017.

[208] Chuangao Tang, Wenming Zheng, Jingwei Yan, Qiang Li, Yang Li, Tong Zhang, and Zhen Cui. View-

independent facial action unit detection. In Automatic Face & Gesture Recognition (FG 2017), 2017

12th IEEE International Conference on, pages 878–882. IEEE, 2017.

[209] Kevin Tang, Manohar Paluri, Li Fei-Fei, Rob Fergus, and Lubomir Bourdev. Improving image

classification with location context. In Proceedings of the IEEE International Conference on Computer

Vision, pages 1008–1016, 2015.

[210] Mika P Tarvainen, Juha-Pekka Niskanen, Jukka A Lipponen, Perttu O Ranta-Aho, and Pasi A Kar-

jalainen. Kubios hrv–heart rate variability analysis software. Computer methods and programs in

biomedicine, 113(1):210–220, 2014.

[211] Thales Teixeira, Michel Wedel, and Rik Pieters. Emotion-induced engagement in internet video

advertisements. Journal of Marketing Research, 49(2):144–159, 2012.

[212] P Daphne Tsatsoulis, Paige Kordas, Michael Marshall, David Forsyth, and Agata Rozga. The static

multimodal dyadic behavior dataset for engagement prediction. In Computer Vision–ECCV 2016

Workshops, pages 386–399. Springer, 2016.

[213] Hungwei Tseng and Yu-Chun Kuo. Growth mindsets and flexible thinking in first-time online students'

self-efficacy in learning. In Society for Information Technology & Teacher Education International

Conference, pages 299–302. Association for the Advancement of Computing in Education (AACE),

2017.

[214] Sergey Tulyakov, Xavier Alameda-Pineda, Elisa Ricci, Lijun Yin, Jeffrey F Cohn, and Nicu Sebe.

Self-adaptive matrix completion for heart rate estimation from face videos under realistic conditions. In

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2396–2404,

2016.

[215] Sepehr Valipour, Mennatullah Siam, Martin Jagersand, and Nilanjan Ray. Recurrent fully convolutional

networks for video segmentation. In Applications of Computer Vision (WACV), 2017 IEEE Winter

Conference on, pages 29–36. IEEE, 2017.

[216] Hamed Valizadegan, Quang Nguyen, and Milos Hauskrecht. Learning classification models from

multiple experts. Journal of biomedical informatics, 46(6):1125–1135, 2013.

[217] Geert JM van Boxtel, Pierre JM Cluitmans, Roy JEM Raymann, Martin Ouwerkerk, Ad JM Denissen,

Marian KJ Dekker, and Margriet M Sitskoorn. Heart rate variability, sleep, and the early detection of

post-traumatic stress disorder. In Sleep and Combat-Related Post Traumatic Stress Disorder, pages

253–263. Springer, 2018.

[218] Sidni Alanna Vaughn. Understanding developmental and risk-status effects on visual engagement:

Evaluation of infant play behavior. PhD thesis, Georgia Institute of Technology, 2017.

[219] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image

caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition,

pages 3156–3164, 2015.

[220] Carl Vondrick and Deva Ramanan. Video annotation and tracking with active learning. In Advances in

Neural Information Processing Systems, pages 28–36, 2011.

[221] R Walecki, O Rudovic, V Pavlovic, B Schuller, and M Pantic. Deep structured learning for facial

expression intensity estimation. CVPR, 2017.

[222] Robert Walecki, Vladimir Pavlovic, Björn Schuller, Maja Pantic, et al. Deep structured learning for

facial action unit intensity estimation. arXiv preprint arXiv:1704.04481, 2017.

[223] Jing Wang, Pei Lei, Kun Wang, Lijuan Mao, and Xinyu Chai. Mood states recognition of rowing

athletes based on multi-physiological signals using pso-svm. E-Health Telecommunication Systems and

Networks, 2014, 2014.

[224] Xiaoyang Wang and Qiang Ji. A hierarchical context model for event recognition in surveillance video.

In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.

[225] Peter Welinder, Steve Branson, Pietro Perona, and Serge J Belongie. The multidimensional wisdom of

crowds. In Advances in neural information processing systems, pages 2424–2432, 2010.

[226] J. Whitehill, Z. Serpell, Y.-C. Lin, A. Foster, and J. R. Movellan. Faces of engagement: Automatic

recognition of student engagement from facial expressions. IEEE Trans. on Affective Computing,

5(3):86–98, 2014.

[227] Jacob Whitehill, Zewelanji Serpell, Yi-Ching Lin, Aysha Foster, and Javier R Movellan. The faces of

engagement: Automatic recognition of student engagement from facial expressions. IEEE Transactions

on Affective Computing, 5(1):86–98, 2014.

[228] Martin Wöllmer, Moritz Kaiser, Florian Eyben, Björn Schuller, and Gerhard Rigoll. Lstm-modeling of

continuous emotions in an audiovisual affect recognition framework. Image and Vision Computing,

31(2):153–163, 2013.

[229] Zuxuan Wu, Yanwei Fu, Yu-Gang Jiang, and Leonid Sigal. Harnessing object and scene semantics

for large-scale video understanding. In Proceedings of the IEEE Conference on Computer Vision and

Pattern Recognition, pages 3112–3121, 2016.

[230] Zhongwen Xu, Yi Yang, and Alex G Hauptmann. A discriminative cnn video representation for event

detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages

1798–1807, 2015.

[231] Hao Yang and Hui Zhang. Efficient 3d room shape recovery from a single panorama. In Proceedings of

the IEEE Conference on Computer Vision and Pattern Recognition, pages 5422–5430, 2016.

[232] Jimei Yang, Brian Price, Scott Cohen, and Ming-Hsuan Yang. Context driven scene parsing with

attention to rare classes. In Proceedings of the IEEE Conference on Computer Vision and Pattern

Recognition, pages 3294–3301, 2014.

[233] Shuang Yang, Ognjen Rudovic, Vladimir Pavlovic, and Maja Pantic. Personalized modeling of facial

action unit intensity. In International Symposium on Visual Computing, pages 269–281. Springer, 2014.

[234] Georgios N Yannakakis and Ana Paiva. Emotion in games. Handbook on affective computing, pages

459–471, 2014.

[235] C Yeager. Understanding engagement with a trauma recovery web intervention using the health action

process approach framework. PhD Thesis, 2016.

[236] Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga,

and George Toderici. Beyond short snippets: Deep networks for video classification. In Proceedings of

the IEEE conference on computer vision and pattern recognition, pages 4694–4702, 2015.

[237] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization. arXiv

preprint arXiv:1409.2329, 2014.

[238] Gloria Zen, Lorenzo Porzi, Enver Sangineto, Elisa Ricci, and Nicu Sebe. Learning personalized

models for facial expression analysis and gesture recognition. IEEE Transactions on Multimedia,

18(4):775–788, 2016.

[239] Zheng Zhang, Jeff M. Girard, Yue Wu, Xing Zhang, Peng Liu, Umur Ciftci, Shaun Canavan, Michael

Reale, Andy Horowitz, Huiyuan Yang, Jeffrey F. Cohn, Qiang Ji, and Lijun Yin. Multimodal sponta-

neous emotion corpus for human behavior analysis. In The IEEE Conference on Computer Vision and

Pattern Recognition (CVPR), June 2016.

[240] Rui Zhao, Wanli Ouyang, Hongsheng Li, and Xiaogang Wang. Saliency detection by multi-context

deep learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,

pages 1265–1274, 2015.

[241] Yuqian Zhou and Bertram E Shi. Action unit selective feature maps in deep networks for facial

expression recognition. In Neural Networks (IJCNN), 2017 International Joint Conference on, pages

2031–2038. IEEE, 2017.
