Predicting Latent Narrative Mood Using Audio and Physiologic Data

Tuka AlHanai and Mohammad Mahdi Ghassemi*
Massachusetts Institute of Technology, Cambridge MA 02139, USA
[email protected], [email protected]

*Both authors contributed equally to this work. Tuka AlHanai would like to acknowledge the Al-Nokhba Scholarship and the Abu Dhabi Education Council for their support. Mohammad Ghassemi would like to acknowledge the Salerno Foundation, the NIH Neuroimaging Training Grant (NTP-T32 EB 001680), and the Advanced Neuroimaging Training Grant (AMNTPT90 DA 22759). The authors would also like to acknowledge Hao Shen for aiding with data labeling, and Kushal Vora for arranging access to the Simbands. Copyright © 2017, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Inferring the latent emotive content of a narrative requires consideration of para-linguistic cues (e.g. pitch), linguistic content (e.g. vocabulary), and the physiological state of the narrator (e.g. heart-rate). In this study we utilized a combination of auditory, text, and physiological signals to predict the mood (happy or sad) of 31 narrations from subjects engaged in personal story-telling.

We extracted 386 audio and 222 physiological features (using the Samsung Simband) from the data. A subset of 4 audio, 1 text, and 5 physiologic features was identified using Sequential Forward Selection (SFS) for inclusion in a Neural Network (NN). These features included subject movement, cardiovascular activity, energy in speech, probability of voicing, and linguistic sentiment (i.e. negative or positive). We explored the effects of introducing our selected features at various layers of the NN and found that the location of these features in the network topology had a significant impact on model performance.

To ensure the real-time utility of the model, classification was performed over 5 second intervals. We evaluated our model's performance using leave-one-subject-out cross-validation and compared the performance to 20 baseline models and a NN with all features included in the input layer.

Index Terms: emotion, neural networks, physiology, feature selection, acoustics.
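Sequential Forward Selection, named in the abstract, greedily grows a feature subset by repeatedly adding whichever remaining feature most improves a validation score. The sketch below is a generic Python illustration of that loop; the scoring function, data, and subset size are placeholders rather than the study's actual pipeline (which selected 10 features across the three modalities).

```python
import numpy as np

def sequential_forward_selection(X, y, score_fn, n_select):
    """Greedy SFS: at each step add the single remaining feature whose
    inclusion gives the best score (e.g. cross-validated accuracy)."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(n_select):
        best_score, best_j = max(
            (score_fn(X[:, selected + [j]], y), j) for j in remaining)
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

# Toy usage with a placeholder scorer (absolute correlation of the mean of
# the candidate columns with the label); a real scorer would train and
# validate a classifier instead.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 12)), rng.integers(0, 2, size=100)
score = lambda Xs, y: abs(np.corrcoef(Xs.mean(axis=1), y)[0, 1])
print(sequential_forward_selection(X, y, score, n_select=3))
```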
Introduction

Human communication depends on a delicate interplay between the emotional intent of the speaker and the linguistic content of their message. While linguistic content is delivered in words, emotional intent is often communicated through additional modalities including facial expressions, spoken intonation, and body gestures. Importantly, the same message can take on a plurality of meanings, depending on the emotional intent of the speaker. The phrase "Thanks a lot" may communicate gratitude, or anger, depending on the tonality, pitch, and intonation of the spoken delivery.

Given its importance for communication, the consequences of misreading emotional intent can be severe, particularly in high-stakes social situations such as salary negotiations or job interviews. For those afflicted by chronic social disabilities such as Asperger's syndrome, the inability to read subtle emotional cues can lead to a variety of negative consequences, from social isolation to depression (Müller, Schuler, and Yates 2008; Cameron and Robinson 2010). Machine-aided assessments of historic and real-time interactions may help facilitate more effective communication for such individuals by allowing for long-term social coaching and in-the-moment interventions.

In this paper, we present the first steps toward the realization of such a system. We present a novel multi-modal dataset containing audio, physiologic, and text transcriptions from 31 narrative conversations. As far as we know, this is the first experimental set-up to include individuals engaged in natural dialogue with the particular combination of signals we collected and processed: para-linguistic cues from audio, linguistic features from text transcriptions (average positive/negative sentiment score), Electrocardiogram (ECG), Photoplethysmogram (PPG), accelerometer, gyroscope, bio-impedance, electric tissue impedance, Galvanic Skin Response (GSR), and skin temperature.

The emotional content of communication exists at multiple levels of resolution. For instance, the overall nature of a story could be positive but it may still contain sad moments. Hence, we present two analyses in this paper. In the first analysis, we train a Neural Network (NN) to classify the overall emotional nature of the subject's historic narration. In the second analysis, we train a NN to classify emotional content in real-time. We also show how the optimization of network topology, and the placement of features within the topology, improves classification performance.

Literature Review

Cognitive scientists indicate that emotional states are strongly associated with quantifiable physical correlates including the movement of facial muscles, vocal acoustics, peripheral nervous system activity, and language use (Barrett, Lindquist, and Gendron 2007). The detection of a latent emotional state by a machine agent is strongly influenced by the number of physical correlates available (e.g. audio alone, versus audio and visual), the context in which the correlates are observed (e.g. enhanced heart rate during fear versus excitement) (Barrett, Mesquita, and Gendron 2011), and the social context in which the conversations take place (e.g. a sports stadium versus a meeting room) (Rilliard et al. 2009).

Importantly, the interactions between these physical correlates also have unique associations with the latent emotional states. For instance, a combination of high heart rate and voicing is associated with excitement, although neither is independently discriminative. It follows that emotive estimation is aided by (1) access to multiple data modalities, and (2) the utilization of techniques which can account for the complex interactions between those modalities.

Existing data in the domain of emotion detection have been collected using a variety of tasks including the presentation of images, video clips, and music. Data has also been collected through problem-solving tasks, facial expression exercises, acting, and scripted speech (Douglas-Cowie et al. 2003; Calvo and D'Mello 2010). The highly controlled nature of these studies enhances the ability of investigators to identify features which relate to emotive state at the cost of practical utility. That is, there is relatively little work on emotive description of spontaneous human interactions in a natural setting (Koolagudi and Rao 2012).

With respect to data analysis, existing studies have applied techniques ranging from regression and Analysis of Variance (ANOVA) to Support Vector Machines (SVM), Clustering, and Hidden Markov Modeling (HMM). With the recent interest in 'deep' learning, Neural Networks are also increasingly utilized in emotion detection for speech (Stuhlsatz et al. 2011; Li et al. 2013; Han, Yu, and Tashev 2014), audio/video (Kahou et al. 2013; Wöllmer et al. 2010), audio/video/text (Wöllmer et al. 2013), and physiologic data (Haag et al. 2004; Wagner, Kim, and André 2005; Walter et al. 2011; Katsis et al. 2008). Many of the surveyed studies, however, utilize only a single modality of data (text, audio, video, or physiologic) for the inference task. While these studies succeed in enhancing knowledge, a real-world implementation will require the consideration of multiple data modalities, with techniques that account for multiple levels of interaction between the modalities in real-time, just as humans do (D'Mello and Kory 2012).
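One way to act on this point, and on the Introduction's observation that the placement of features within the network topology matters, is to let different feature groups enter the network at different depths. The sketch below (PyTorch, with hypothetical feature dimensions and layer sizes) shows the basic pattern: one feature group feeds the input layer while a second group is concatenated at a hidden layer; the paper's actual topology and feature assignments are determined empirically.

```python
import torch
import torch.nn as nn

class MidFusionNet(nn.Module):
    """Happy/sad classifier in which one feature group enters at the input
    layer and a second group is injected one layer deeper."""

    def __init__(self, n_early=16, n_late=4, hidden=32):
        super().__init__()
        self.early = nn.Sequential(nn.Linear(n_early, hidden), nn.ReLU())
        # Late features skip the first layer and are concatenated here.
        self.fused = nn.Sequential(nn.Linear(hidden + n_late, hidden), nn.ReLU())
        self.head = nn.Linear(hidden, 2)  # two classes: happy, sad

    def forward(self, x_early, x_late):
        h = self.early(x_early)
        h = self.fused(torch.cat([h, x_late], dim=1))
        return self.head(h)

# Toy forward pass over a batch of eight segments.
model = MidFusionNet()
logits = model(torch.randn(8, 16), torch.randn(8, 4))
print(logits.shape)  # torch.Size([8, 2])
```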
Data Collection

The average age of the 10 participating individuals was 23. Four participants identified as male, and six participants identified as female. All ten individuals listed English as their primary language.

Experimental Venue and Approach

The experimental venue was a 200 square foot, temperature- and light-controlled conference room on the MIT campus. Upon arrival at the experimental venue, participants were outfitted with a Samsung Simband, a wearable device which collects high-resolution physiological measures. Audio data was recorded on an Apple iPhone 5S.

Next, participants were provided with the following experimental prompt: "In whatever order you prefer, tell us at least one happy story, and at least one sad story." If subjects asked for additional clarification about story content, they were informed that there was "no correct answer", and were encouraged to tell any story they subjectively found to be happy or sad. A summary of the collected data is presented under the Data heading in Table 1.

Data               Quantity
Subjects           10 (4 male, 6 female)
Narratives         31 (15 happy, 16 sad)
Average Duration   2.2 mins
Total Duration     67 mins
5 Sec Segments     804

Features           Total (selected)
Physiologic        222 (5)
Audio              386 (4)
Text               2 (1)

Table 1: Under the Data heading we display information on the total number of subjects, narratives, and samples used in the study. Under the Features heading we provide information on the total number of features collected in each of the modalities (physiologic, audio, and text), and the proportion of the features selected for inclusion in our model.

Time-Series Segmentation

Collected data was time-aligned and segmented using 5 second non-overlapping windows. Our window size was selected such that the minimum number of spoken words observed within any given segment was two or more. This criterion was necessary to evaluate the effects of transcribed text features. Smaller window sizes
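To make the windowing step above concrete, the following minimal Python sketch cuts a word-aligned transcript into non-overlapping 5 second segments and keeps only segments containing at least two transcribed words. The data structures and names are illustrative, not taken from the authors' code.

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the narration
    end: float

def segment_narration(duration_s, words, win_s=5.0, min_words=2):
    """Split a narration into non-overlapping win_s-second windows and keep
    only windows containing at least min_words transcribed words."""
    segments = []
    t = 0.0
    while t < duration_s:
        t_end = min(t + win_s, duration_s)
        in_window = [w.text for w in words if t <= w.start < t_end]
        if len(in_window) >= min_words:
            segments.append({"start": t, "end": t_end, "words": in_window})
        t += win_s
    return segments

# Toy usage: a 12-second narration with four transcribed words.
words = [Word("it", 0.4, 0.6), Word("was", 0.7, 0.9),
         Word("a", 6.1, 6.2), Word("great", 6.3, 6.7)]
print(segment_narration(12.0, words))
```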
