
2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII)

Formulating Emotion Perception as a Probabilistic Model with Application to Categorical Emotion Classification

Reza Lotfian, The University of Texas at Dallas, Email: [email protected]
Carlos Busso, The University of Texas at Dallas, Email: [email protected]

978-1-5386-0563-9/17/$31.00 © 2017 IEEE

Abstract: Automatic recognition of emotions is an important part of affect-sensitive human-computer interaction (HCI). Expressive behaviors tend to be ambiguous, with blended emotions, during natural spontaneous conversations. Therefore, evaluators disagree on the perceived emotion, assigning multiple emotional classes to the same stimulus (e.g., sadness, anger, surprise). These observations have clear implications for emotion classification, where assigning a single descriptor per stimulus oversimplifies the intrinsic subjectivity in emotion perception. This study proposes a new formulation, where the emotional perception of a stimulus is a multidimensional Gaussian random variable with an unobserved distribution. Each dimension corresponds to an emotion characterized by a numerical scale. The covariance matrix of this distribution captures the intrinsic dependencies between different emotional categories. The process where an evaluator judges the stimulus is equivalent to sampling a point from this distribution and reporting the class with the highest value. The proposed approach recursively estimates this multivariate distribution using numerical methods. The mean of the Gaussian distribution is used as a soft label to train a deep neural network (DNN). Our experimental results show that the proposed training method leads to improvements in F-score over training with (1) hard labels based on majority vote, and (2) the soft-label framework proposed by other studies.

1. Introduction

Recognizing categorical emotions such as happiness, sadness or anger from speech can have many practical applications [1], [2], [3]. Previous studies on categorical emotion recognition rely on the assumptions that (1) each speech segment is often assigned to one emotional class, and (2) samples belonging to the same class share similar acoustic features. However, the boundaries between emotion classes during human interaction are ambiguous, and people rarely express extreme emotions. Samples labeled with a given emotional class may have different acoustic characteristics (e.g., shades of happiness). These observations have direct implications for machine learning frameworks for emotion recognition, where conventional approaches do not generalize when evaluated in realistic scenarios [4].

The emotional annotation process of spontaneous databases often includes perceptual evaluations, where several evaluators are asked to assign an emotional class to each sample. Then, a consensus label, such as the majority vote, is calculated and assigned as ground truth. This approach aims to identify popular trends, where outlier assessments are ignored. While annotators may report conflicting categories, all of them might be right, given the differences in perception and the ambiguity in the expressed emotions. Emotions are driven by the appraisal of a given situation [5]. Depending on the evaluator's perspective, multiple answers may be appropriate, especially if many related emotional categories are available (e.g., surprise, anger, fear, disgust). We hypothesize that individual evaluations provide richer emotional descriptions than simple consensus labels, and that they should be effectively leveraged while training classifiers [6], [7], [8]. The formulation of the classifier should account for the relationship between emotional classes, prioritizing the separation of unrelated categories (e.g., anger versus happiness) over related emotions (e.g., anger versus disgust) [9].

This study proposes an innovative formulation to leverage individual evaluations for emotion recognition, as opposed to consensus labels, considering the intrinsic relationship between emotional categories. We formulate emotion perception as a probabilistic model, where each speech segment has a non-observable multivariate Gaussian distribution. Each dimension is associated with the intensity measure of one emotion category (i.e., the number of dimensions equals the number of emotions). Emotional perception is equivalent to sampling a point according to this multivariate distribution. After listening to the stimulus, the evaluator draws a point from the distribution, reporting the emotional category that has the largest intensity measure. Each individual evaluation corresponds to a realization drawn from this distribution, and the task is to approximate this multivariate Gaussian distribution using numerical methods. We demonstrate that this formulation can be used in emotion recognition by (1) training the classifiers using the expected value of each emotional dimension, derived from the random Gaussian distribution of each individual sample (i.e., soft labels), and (2) modifying the loss function to weigh errors between emotional classes according to their relationships. We refer to this method as soft label from the expected intensity of emotion (SL-EIE). The proposed method achieves better performance than classifiers trained with hard labels derived from majority vote and with the soft labels proposed by Fayek et al. [7].

2. Related work

Unlike other speech processing problems such as speaker identification or automatic speech recognition (ASR), defining ground-truth labels for speech emotion recognition is not straightforward.
Emotions observed during human interactions are ambiguous, and this ambiguity is not necessarily captured by the target emotional descriptions or the annotation protocol (e.g., forcing an evaluator to choose a class from a predefined list [10]). The standard approach is to collect multiple annotations from experts or naïve evaluators, aiming to reach a consensus label. Given the subjective nature of perceptual evaluations, the resulting ground-truth labels are noisy [9], [11].

When the labels are inherently noisy, studies have proposed using ensembles [12] and soft labels [13], describing the probability or intensity associated with each class. In speech emotion recognition, studies have proposed different variants of this approach, leveraging the individual evaluations provided for each speech sample (i.e., not just the consensus label). Mower et al. [14], [15] suggested emotional profiles to describe the confidence that an emotional label is assigned to an utterance. Audhkhasi and Narayanan [8] proposed fuzzy logic to deal with ambiguous emotional utterances by generalizing class sets with partial membership, rather than single class membership. The study suggested that more information can be extracted from the individual annotations assigned to a given speech sample than from a consensus label. Other studies have also explored the use of individual annotations to create relative labels of categorical emotions for preference learning [6]. The study derived a probabilistic relevance score from individual annotations, indicating the intensity of a given emotional class (e.g., intensity of happiness in a speech segment). The approach was successfully used to train emotional rankers.

Fayek et al. [7] used individual evaluations to create ensembles of deep neural networks (DNNs). Each DNN is trained with the annotations from one annotator, providing the final score after fusing the ensembles. Alternatively, the study trained a single DNN with soft labels reflecting the proportion of evaluators selecting each class. Their experimental results show the benefit of considering individual evaluations in building emotion classifiers. This study is related to our paper, so we use this approach as one of the baselines. Our proposed method [...]

Figure 1. Illustration of the approach to formulate emotion perception as a probabilistic process. Unobservable distributions of the intensity of happiness and neutral for two sentences, evaluated by four evaluators.

[...] sample. The evaluator selects the class with the higher intensity according to his/her judgement. Cases where two annotators provide conflicting emotions for a speech segment may indicate that they perceive emotions with different intensity levels. Therefore, the selected class does not imply complete disagreement. We propose to estimate a soft label from the expected intensity of emotion (SL-EIE), which we estimate from the individual evaluations.

We formulate the problem by modeling perceptual evaluation as a random vector, where each dimension corresponds to one emotional class. The multivariate distribution of this process is hidden, and the task is to estimate it from the individual evaluations. Figure 1 illustrates the key ideas when we consider only two emotions (happiness and neutral) for two speech segments (red and blue). The x-axis corresponds to the perceived intensity for neutral, and the y-axis corresponds to the perceived intensity for happiness. Each segment is evaluated by four annotators. This process is formulated as drawing four points from each of these distributions (red points for sentence 1, blue squares for sentence 2). Based on the values for x and y, the evaluator chooses neutral if x > y or happiness if y > x. The resulting labels for these eight evaluations are denoted with H for happiness and N for neutral. Notice that the line y = x defines the boundary between the emotions. Majority vote will suggest that one sample is happy and the other is neutral. Instead, we aim to estimate
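
The two-emotion illustration of Figure 1 can be mimicked with a short simulation. The sketch below is only an illustration of the perception model described in the text (draw one intensity vector per evaluator, report the class with the largest intensity); it does not implement the estimation procedure of SL-EIE. The means and covariances are made-up values, and the function names (simulate_evaluations, proportion_soft_label, majority_vote) are hypothetical helpers introduced here. For comparison, it also computes the two label types discussed above: the majority-vote hard label and a soft label given by the proportion of evaluators selecting each class.

```python
import numpy as np

# Axis order follows Figure 1: x = neutral intensity, y = happiness intensity.
EMOTIONS = ("neutral", "happiness")


def simulate_evaluations(mean, cov, n_evaluators, rng):
    """Draw one intensity vector per evaluator; each reports the largest dimension."""
    points = rng.multivariate_normal(mean, cov, size=n_evaluators)
    return points.argmax(axis=1)  # class index chosen by each evaluator


def proportion_soft_label(labels, n_classes):
    """Soft label as the fraction of evaluators selecting each class."""
    counts = np.bincount(labels, minlength=n_classes)
    return counts / counts.sum()


def majority_vote(labels, n_classes):
    """Hard label; ties are broken in favor of the lowest class index."""
    return int(np.bincount(labels, minlength=n_classes).argmax())


# Assumed (illustrative) hidden distributions for two sentences.
sentences = {
    "sentence 1": (np.array([0.45, 0.55]), np.array([[0.05, 0.01], [0.01, 0.05]])),
    "sentence 2": (np.array([0.60, 0.40]), np.array([[0.05, 0.01], [0.01, 0.05]])),
}

rng = np.random.default_rng(0)
results = {}
for name, (mean, cov) in sentences.items():
    labels = simulate_evaluations(mean, cov, n_evaluators=4, rng=rng)
    results[name] = {
        "labels": labels,
        "soft_label": proportion_soft_label(labels, len(EMOTIONS)),
        "hard_label": EMOTIONS[majority_vote(labels, len(EMOTIONS))],
    }
    print(name, [EMOTIONS[i][0].upper() for i in labels], results[name]["soft_label"],
          results[name]["hard_label"])
```

With only four evaluators per sentence, both the hard label and the proportion-based soft label are coarse summaries of the hidden distribution, whose mean is the quantity the proposed formulation aims to recover.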