
Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-16)

Driver Frustration Detection from Audio and Video in the Wild

Irman Abdić,1,2 Lex Fridman,1 Daniel McDuff,1 Erik Marchi,2 Bryan Reimer,1 Björn Schuller3
1Massachusetts Institute of Technology (MIT), USA
2Technische Universität München (TUM), Germany
3Imperial College London (ICL), UK

Abstract

We present a method for detecting driver frustration from both video and audio streams captured during the driver's interaction with an in-vehicle voice-based navigation system. The video is of the driver's face when the machine is speaking, and the audio is of the driver's voice when he or she is speaking. We analyze a dataset of 20 drivers that contains 596 audio epochs (audio clips, with duration from 1 sec to 15 sec) and 615 video epochs (video clips, with duration from 1 sec to 45 sec). The dataset is balanced across 2 age groups, 2 vehicle systems, and both genders. The model was subject-independently trained and tested using 4-fold cross-validation. We achieve an accuracy of 77.4 % for detecting frustration from a single audio epoch and 81.2 % for detecting frustration from a single video epoch. We then treat the video and audio epochs as a sequence of interactions and use decision fusion to characterize the trade-off between decision time and classification accuracy, which improves the prediction accuracy to 88.5 % after 9 epochs.

[Figure 1: Representative video snapshots from voice navigation interface interaction for two subjects. Subject (a), "Class 1: Satisfied with Voice-Based Interaction," self-reported as not frustrated with the interaction; subject (b), "Class 2: Frustrated with Voice-Based Interaction," self-reported as frustrated. In this paper, we refer to subjects in the former category as "satisfied" and the latter category as "frustrated." As seen in the images, the "satisfied" interaction is relatively emotionless, and the "frustrated" interaction is full of affective facial actions.]

1 Introduction

The question of how to design an interface in order to maximize driver safety has been extensively studied over the past two decades [Stevens et al., 2002]. Numerous publications seek to aid designers in the creation of in-vehicle interfaces that limit the demands placed upon the driver [NHTSA, 2013]. As such, these efforts aim to improve the likelihood that drivers can multi-task safely. Evaluation questions usually take the form of "Is HCI system A better than HCI system B, and why?". Rarely do applied evaluations of vehicle systems consider the emotional state of the driver as a component of demand that is quantified during system prove-out, despite numerous studies showing the importance of affect and emotion in hedonics and aesthetics for improving user experience [Mahlke, 2005].

The work in this paper is motivated by a vision for an adaptive system that is able to detect the emotional response of the driver and adapt in order to aid driving performance. The critical component of this vision is the detection of emotion in the interaction of the human driver with the driver vehicle interface (DVI) system. We focus on the specific affective state of "frustration" as self-reported by the driver in response to voice-based navigation tasks (both entry and cancellation of the route) completed while underway. We then propose a method for detecting frustration from the video of the driver's face when he or she is listening to system feedback and the audio of the driver's voice when he or she is speaking to the system.

We consider the binary classification problem of a "frustrated" driver versus a "satisfied" driver, annotated based on a self-reported answer to the following question: "To what extent did you feel frustrated using the car voice navigation interface?" The answers were on a scale of 1 to 10 and naturally clustered into two partitions as discussed in §3.2. Representative examples from a "satisfied" and a "frustrated" driver are shown in Fig. 1. As the question suggests, these affective categories refer not to the general emotional state of the driver but to their opinion of the interaction with an in-vehicle technology. It is interesting to note that smiling was a common response for a "frustrated" driver. This is consistent with previous research that found smiles can appear in situations when people are genuinely frustrated [Hoque et al., 2012]. The reason for these smiles may be that the voice-based interaction was "lost in translation" and this was in part entertaining. Without contextual understanding, an observer of a short clip might label the emotional state as momentarily happy. However, over the context of an entire interaction, the obvious label becomes one of "frustrated". Thus, detecting driver frustration is challenging because it is expressed through a complex combination of actions and reactions observed in facial expressions and qualities of one's speech.
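To make the evaluation scheme concrete, the following is a minimal sketch of subject-independent 4-fold cross-validation combined with decision fusion over a subject's first few epochs. It is an illustration only, not the implementation used in this work: the choice of scikit-learn's GroupKFold and SVC, the 0/1 label encoding, and majority voting as the fusion rule are assumptions made here for concreteness (the fusion rule is not specified at this point in the text).

    # Sketch: subject-independent folds with per-epoch decisions
    # fused by majority vote (assumed fusion rule, for illustration).
    import numpy as np
    from sklearn.model_selection import GroupKFold
    from sklearn.svm import SVC

    def fused_accuracy(X, y, subjects, n_epochs_fused=9, n_folds=4):
        # X: epoch-level feature matrix; y: 0/1 labels (satisfied/frustrated),
        # constant per subject; subjects: subject id for each epoch row.
        correct, total = 0, 0
        for train, test in GroupKFold(n_folds).split(X, y, groups=subjects):
            clf = SVC().fit(X[train], y[train])           # per-epoch classifier
            for s in np.unique(subjects[test]):
                idx = test[subjects[test] == s][:n_epochs_fused]
                votes = clf.predict(X[idx])               # one decision per epoch
                fused = int(votes.mean() >= 0.5)          # majority-vote fusion
                correct += int(fused == y[idx[0]])
                total += 1
        return correct / total

Sweeping n_epochs_fused from 1 upward traces the accuracy-versus-decision-time trade-off described in the abstract; GroupKFold keeps each subject's epochs entirely in either the training or the test partition, which is what makes the evaluation subject-independent.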
2 Related Work

Affective computing, or the detection and consideration of human affective states to improve HCI, was introduced two decades ago [Picard, 1997]. Context-sensitive intelligent systems have increasingly become a part of our lives in, and outside of, the driving context [Pantic et al., 2005]. And while the detection of emotion from audio and video has been extensively studied [Zeng et al., 2009], it has not received much attention in the context of driving, where research has focused to a large extent on the characterization and detection of distraction and drowsiness. Our work takes steps toward bridging the gap between affective computing research and applied driving research for DVI evaluation and real-time advanced driver assistance systems (ADAS) development.

The first automated system for detecting frustration via multiple signals was proposed by [Fernandez and Picard, 1998]. Most of the subsequent studies over the past decade have examined affect and emotion in HCI with an aim to reduce the user's frustration while interacting with the computer. In many cases, "violent and abusive" behavior toward computers has been reported [Kappas and Krämer, 2011]. Affective computing is relevant to HCI in a number of ways. Four broad areas of interest are: (1) reducing user frustration; (2) enabling comfortable communication of user emotion; (3) developing infrastructure and applications to handle affective information; and (4) building tools that help develop social-emotional skills [Picard, 1999]. It has been emphasized that for the successful design of future HCI systems, "emotional design" has to explore the interplay of cognition and emotion, rather than dismissing cognition entirely [Hassenzahl, 2004].

The face is one of the richest channels for communicating information about one's internal state. The facial action coding system (FACS) [Ekman and Friesen, 1977] is the most widely used and comprehensive taxonomy of facial behavior. Automated software provides a consistent and scalable method of coding FACS. The facial actions, and combinations of actions, have associations with different emotional states and levels of emotional valence. For example, lip depressing (AU15, frowning) is typically associated with negative valence and states such as sadness or fear.

The audio stream is a rich source of information, and the literature shows its importance in the domain of in-car affect recognition [Eyben et al., 2010]. In fact, for the recognition of driver states like anger, irritation, or nervousness, the audio stream is particularly valuable [Grimm et al., 2007]. This is not surprising considering how strongly anger is correlated with simple speech features like volume (or energy, respectively) and pitch.
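As a brief illustration of the kind of speech features just mentioned, the sketch below computes epoch-level energy and pitch statistics. The use of librosa, the sampling rate, and the pitch range are assumptions for the example; this is not the acoustic feature set used in the present work.

    # Sketch: energy and pitch statistics for one audio epoch
    # (librosa and the parameter choices are illustrative assumptions).
    import numpy as np
    import librosa

    def epoch_audio_features(wav_path):
        y, sr = librosa.load(wav_path, sr=16000)
        rms = librosa.feature.rms(y=y)[0]              # frame-level energy
        f0 = librosa.yin(y, fmin=50, fmax=400, sr=sr)  # frame-level pitch (Hz)
        return np.array([rms.mean(), rms.std(), f0.mean(), f0.std()])

A vector of this kind could feed a per-epoch classifier such as the SVM sketched earlier.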
The task of detecting drivers' frustration has been researched in the past [Boril et al., 2010]. Boril et al. exploited the audio stream of the drivers' speech and discriminated "neutral" and "negative" emotions with 81.3 % accuracy (measured as an Equal Accuracy Rate, EAR) across 68 subjects. This work used SVMs to discriminate between the classes. The ground truth came from one annotation sequence, and a "humored" state was presented as one of the 5 "neutral" (non-negative) emotions. This partitioning of emotion contradicts our finding that smiling and humor are often part of the response of a frustrated subject. An external annotator may therefore be tempted to label a smiling subject as not frustrated when, in fact, the smile may be one of the strongest indicators of frustration, especially in the driving context.

Contributions. We extend this prior work by (1) leveraging audiovisual data collected under real driving conditions, (2) using self-reported ratings of frustration for data annotation, (3) fusing audio and video as complementary data sources, and (4) fusing audio and video streams across time in order to characterize the trade-off between decision time and classification accuracy. We believe that this work is the first to address the task of detecting self-reported frustration under real driving conditions.

3 Dataset Collection and Analysis

3.1 Data Collection

The dataset used for frustration detection was collected as part of a study for multi-modal assessment of the on-road demand of voice and manual phone calling and voice navigation entry across two embedded vehicle systems [Mehler et al., 2015]. Participants drove one of two standard production vehicles: a 2013 Chevrolet Equinox (Chevy) equipped with the MyLink system and a 2013 Volvo XC60 (Volvo) equipped with the Sensus system.

The full study dataset is composed of 80 subjects that fully met the selection criteria detailed in [Mehler et al., 2015], equally balanced across the two vehicles by gender (male, female) and four age groups (18–24, 25–39, 40–54, 55 and older). In the original study, each subject had to accomplish three tasks: (1) entering an address into the navigation system ...