
Multimodal Emotion Recognition using Deep Learning Architectures

Hiranmayi Ranganathan, Shayok Chakraborty and Sethuraman Panchanathan
Center for Cognitive Ubiquitous Computing (CUbiC), Arizona State University
{hiranmayi.ranganathan,shayok.chakraborty,panch}@asu.edu

Abstract

Emotion analysis and recognition has become an interesting topic of research among the affective computing research community. In this paper, we first present the emoFBVP database of multimodal (face, body gesture, voice and physiological signals) recordings of actors enacting various expressions of emotions. The database consists of audio and video sequences of actors displaying three different intensities of expressions of 23 different emotions, along with facial feature tracking, skeletal tracking and the corresponding physiological data. Next, we describe four deep belief network (DBN) models and show that these models generate robust multimodal features for emotion classification in an unsupervised manner. Our experimental results show that the DBN models perform better than the state of the art methods for emotion recognition. Finally, we propose convolutional deep belief network (CDBN) models that learn salient multimodal features of expressions of emotions. Our CDBN models give better recognition accuracies when recognizing low intensity or subtle expressions of emotions when compared to state of the art methods.

1. Introduction

In recent years, there has been a growing interest in the development of technology to recognize an individual's emotional state. There is also an increase in the use of multimodal data (facial expressions, body expressions, vocal expressions and physiological signals) to build such technologies. Each of these modalities has very distinct statistical properties, and fusing these modalities helps us learn useful representations of the data. Emotion recognition is a process that uses low level signal cues to predict high level emotion labels. The literature has shown various techniques for generating robust multimodal features [1]-[4] for emotion recognition tasks. The high dimensionality of the data and the non-linear interactions across the modalities, along with the fact that the way an emotion is expressed varies across people, complicate the process of generating emotion specific features [5],[6]. Deep architectures and learning techniques have been shown to overcome these limitations by capturing complex non-linear feature interactions in multimodal data [7].

In this paper, as our first contribution, we present the emoFBVP database of multimodal recordings - facial expressions, body gestures, vocal expressions and physiological signals - of actors enacting various expressions of emotions. The database consists of audio and video sequences of actors enacting 23 different emotions in three varying intensities of expression, along with facial feature tracking, skeletal tracking and the corresponding physiological data. This is one of the first emotion databases that has recordings of varying intensities of expressions of emotions in multiple modalities recorded simultaneously. We strongly believe that the affective computing community will greatly benefit from the large collection of modalities recorded. Our second contribution investigates the use of deep learning architectures - DBNs and CDBNs - for multimodal emotion recognition. We describe four deep belief network (DBN) models and show that they generate robust multimodal features for emotion classification in an unsupervised manner. This is done to validate the use of our emoFBVP database for multimodal emotion recognition studies. The DBN models used are extensions of the models proposed in [7] for audio-visual emotion classification. Finally, we propose convolutional deep belief network (CDBN) models that learn salient multimodal features of low intensity expressions of emotions.

2. Related Work

Previous research has shown that deep architectures effectively generate robust features by exploiting the complex non-linear interactions in the data [8]. Deep architectures and learning techniques are very popular in the speech and language processing community [9]-[11]. Ngiam et al. [12] report impressive results on audio-visual speech classification. They use sparse Restricted Boltzmann Machines (RBMs) for cross-modal learning, shared representation learning and multimodal fusion on the CUAVE and AVLetters datasets. Srivastava et al. [13] applied multimodal deep belief networks to learn joint representations that outperformed SVMs. They used multimodal deep Boltzmann machines to learn a generative model of images and text for image retrieval tasks. Kahou et al. used an ensemble of deep learning models to perform emotion recognition from video clips [14]. This was the winning submission to the Emotion Recognition in the Wild Challenge [15]. Deep learning has also been applied in many visual recognition studies [16]-[20]. Our research is motivated by the above recent approaches in multimodal deep learning. In this paper, we focus on applying deep architectures for multimodal emotion recognition using face, body, voice and physiological signal modalities. We apply extensions of known DBN models for multimodal emotion recognition using the emoFBVP database and investigate recognition accuracies to validate the utility of the database for emotion recognition tasks. To the best of our knowledge, the use of DBNs for multimodal emotion recognition on data comprising all the above mentioned modalities (facial expressions, body gestures, vocal expressions and physiological signals) has not been explored by the affective computing research community.

Recent developments in deep learning exploit the use of single layer building blocks called Restricted Boltzmann Machines (RBMs) [21] to build DBNs in an unsupervised manner. DBNs are constructed by greedy layer-wise training of stacked RBMs to learn hierarchical representations from the multimodal data [22]. RBMs are undirected graphical models that use binary latent variables to represent the input. Like [7], we also use Gaussian RBMs for training the first layer of the network. The visible units of the first layer are real-valued. The deeper layers are trained using Bernoulli-Bernoulli RBMs, whose visible and hidden units are binary-valued. The joint probability distribution for a Gaussian RBM with visible units v and hidden units h is given as:

    P(v, h) = \frac{1}{Z} \exp(-E(v, h))    (1)

The corresponding energy function, with q \in \mathbb{R}^{D} and r \in \mathbb{R}^{K} as the biases of the visible and hidden units and W \in \mathbb{R}^{D \times K} as the weights between the visible and hidden units, is given as:

    E(v, h) = \frac{1}{2\sigma^2} \sum_i v_i^2 - \frac{1}{\sigma^2} \left( \sum_i q_i v_i + \sum_j r_j h_j + \sum_{i,j} v_i W_{ij} h_j \right)    (2)

These parameters are learned using a technique called contrastive divergence, explained in [23]. Here, \sigma is a hyperparameter and Z is a normalization constant. The conditional probability distributions of the Gaussian RBM are as follows:

    P(h_j = 1 \mid v) = \mathrm{sigmoid}\left( \frac{1}{\sigma^2} \left( \sum_i W_{ij} v_i + r_j \right) \right)    (3)

    P(v_i \mid h) = \mathcal{N}\left( v_i ;\; \sum_j W_{ij} h_j + q_i,\; \sigma^2 \right)    (4)

We include a regularization penalty as in [16], given as:

    \lambda \sum_{j=1}^{k} \left( p - \frac{1}{m} \sum_{l=1}^{m} E\left[ h_j^{(l)} \mid v^{(l)} \right] \right)^2    (5)

Here, E[\cdot] is the conditional expectation given the data, \lambda is a regularization parameter, and p is a constant that specifies the target activation of the hidden units.
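To make the preceding equations concrete, the sketch below implements a Gaussian-Bernoulli RBM trained with one-step contrastive divergence (CD-1) and a sparsity term that nudges the mean hidden activation toward the target p of Eq. (5). This is an illustrative sketch only, not the authors' implementation; the layer sizes, learning rate and sparsity constants are assumptions.

```python
# Minimal sketch (assumed hyperparameters): Gaussian-Bernoulli RBM with CD-1
# and a simple sparsity nudge on the hidden biases, following Eqs. (1)-(5).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GaussianRBM:
    def __init__(self, n_visible, n_hidden, sigma=1.0, lr=1e-3,
                 sparsity_target=0.05, sparsity_cost=6.0, seed=0):
        rng = np.random.RandomState(seed)
        self.W = 0.01 * rng.randn(n_visible, n_hidden)  # weights W (D x K)
        self.q = np.zeros(n_visible)                    # visible biases q
        self.r = np.zeros(n_hidden)                     # hidden biases r
        self.sigma = sigma
        self.lr = lr
        self.p = sparsity_target                        # target activation p in Eq. (5)
        self.lam = sparsity_cost                        # regularization weight lambda

    def hidden_probs(self, v):
        # Eq. (3): P(h_j = 1 | v) = sigmoid((1/sigma^2)(W^T v + r))
        return sigmoid((v @ self.W + self.r) / self.sigma ** 2)

    def sample_visible(self, h, rng):
        # Eq. (4): visible units are Gaussian with mean W h + q
        mean = h @ self.W.T + self.q
        return mean + self.sigma * rng.randn(*mean.shape)

    def cd1_update(self, v0, rng):
        # Positive phase
        ph0 = self.hidden_probs(v0)
        h0 = (rng.rand(*ph0.shape) < ph0).astype(float)
        # Negative phase (one Gibbs step)
        v1 = self.sample_visible(h0, rng)
        ph1 = self.hidden_probs(v1)
        m = v0.shape[0]
        dW = (v0.T @ ph0 - v1.T @ ph1) / m
        dq = (v0 - v1).mean(axis=0)
        dr = (ph0 - ph1).mean(axis=0)
        # Sparsity term in the spirit of Eq. (5): push mean activation toward p
        dr += self.lam * (self.p - ph0.mean(axis=0))
        self.W += self.lr * dW
        self.q += self.lr * dq
        self.r += self.lr * dr

    def transform(self, v):
        return self.hidden_probs(v)  # features passed to the next layer / classifier

# Usage on random stand-in data (real inputs would be normalized modality features)
rng = np.random.RandomState(0)
X = rng.randn(256, 180)             # e.g. 180-dimensional vocal feature vectors
rbm = GaussianRBM(n_visible=180, n_hidden=400)
for epoch in range(10):
    for batch in np.array_split(X, 8):
        rbm.cd1_update(batch, rng)
features = rbm.transform(X)         # first-layer representation
```

A Bernoulli-Bernoulli RBM for the deeper layers differs only in sampling the visible units from a sigmoid instead of a Gaussian; greedily stacking such trained layers yields the DBNs used in the rest of the paper.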
Convolutional deep belief networks (CDBNs) [24] are similar to DBNs and can be trained in a greedy layer-wise fashion. Lee et al. [24] used CDBNs to show good performance in many visual recognition tasks. Convolutional restricted Boltzmann machines (CRBMs) [24]-[26] are the building blocks of CDBNs. In a CRBM [22], the weights between the hidden units and the visible units are shared among all locations in the hidden layer. The CRBM consists of two layers: an input (visible) layer V and a hidden layer H. The hidden units are binary-valued, and the visible units are binary-valued or real-valued. Please refer to Lee et al. [24] for the expressions for the energy function and the conditional and joint probabilities. In this paper, we use CRBMs with probabilistic max-pooling as building blocks for convolutional deep belief networks. For training the CRBMs, we use contrastive divergence [23] to approximate the gradient of the log-likelihood effectively. As in [16], we add a sparsity penalty term as well. Post training, we stack the CRBMs to form a CDBN.

The rest of the paper is organized as follows. Section 3 describes the emoFBVP database and its salient properties. The descriptions of the experimental setup for deep learning, the feature extraction techniques and the baseline models are in Section 4. Section 5 introduces our DemoDBN models and investigates their usage for multimodal emotion recognition in an unsupervised context. Section 6 describes our CDBN models and investigates their usability for recognizing subtle or low intensity expressions of emotions. Finally, we share our conclusions and future work in Section 7.

3. emoFBVP Database

To study human emotional experience and expression in more detail and to develop benchmark methods for automatic emotion recognition, researchers are in need of rich sets of data. We have recorded responses of actors to affective emotion labels using four different modalities - facial expressions, body expressions, vocal expressions and physiological signals. Along with the multimodal recordings, we provide facial feature tracking and skeletal tracking data. The recordings of all the data are rated through an evaluation form completed by the actors immediately after each excerpt of acting emotions. The recordings of this database are synchronized to enable the study of simultaneous emotional responses using all the modalities.

Ten participants (who are professional actors) were involved in data capture, and every participant displayed 23 different emotions. Recordings of each emotion were done six times: three in a standing position and three in a seated position, when the body gestures and facial expressions were tracked and recorded along with vocal expressions, physiological data and activity respectively. Therefore, the database provides six examples of each of the 23 emotions in varying intensities of expression. The two sessions of recordings (standing and seated) are independent of each other. This makes it possible to use our database for unimodal (using only face, body, physiological signals or activity), bimodal (face and voice, body and voice, etc.) and multimodal emotion recognition studies. Our database provides information about the affective communication skills of every participant. It also provides evaluation details about the confidence of expression of emotion, the intensity of expression of emotion and the level of ease of expression of emotion using facial expressions, body gestures and vocal expressions. Our database, therefore, provides both valuable expression data and metadata that will contribute to the ongoing development of emotion recognition algorithms.

3.1. Properties of emoFBVP Database

The emoFBVP database consists of responses of participants to affectively stimulating emotion labels. Different modalities of measurement require different equipment. We set up apparatus to record face videos, facial feature tracking, body gesture videos, skeletal tracking, vocal expressions and physiological signals simultaneously. The sensor equipment used to facilitate the recordings of the modalities includes the Microsoft Kinect sensor, the Zephyr BioHarness and wrist-worn accelerometers.

Recruitment: Participants were recruited after a city-wide call for people who have completed basic coursework in acting/non-verbal communication. They were requested to provide their formal consent to participate by signing a consent form that gave a detailed description of the purpose and data capture procedure of the study.

Participant information and assessment: Participants were asked to provide their age range, gender and ethnic background. They answered questions to help assess their affective communication skills. In particular, each participant rated his/her overall skill in expressing emotions, their affective communication skill using facial expressions, body gestures and vocal expressions, and how emotionally expressive they were (on a scale of 1 to 5, where 5 was very effective). This information is provided in the database as part of the metadata.

Data capture: Participants were instructed to perform/express emotions using facial expressions, body gestures and vocal expressions. Their acted responses (using the face, body, voice and physiological modalities) were recorded for 23 emotion labels, including Happy, Sad, Angry, Disgust, Fear, Surprise, Interest, Agreement, Disagreement, Neutral, Triumphant, Defeat, Concentration and Content, during two sessions. During the first session, participants expressed each of the 23 emotions in three varying intensities of expression in a standing position; their body gestures were recorded and their skeletal representation was tracked. During the second session, participants expressed each of the 23 emotions in three varying intensities of expression in a seated position; their facial expressions were recorded and facial features were tracked. Physiological signals, vocal expressions and acceleration were recorded continuously during both sessions. After recording their responses to each emotion label, participants filled out an evaluation form. Here, they provided details about their level of confidence in acting/expressing emotions in each modality. They were asked to rate their confidence while expressing each emotion, their intensity while expressing each emotion and their ease of expressing each emotion using facial expressions, body gestures and vocal expressions on a scale of 1 (low) to 5 (high). Participants were given about 2 minutes between expressing different emotions. They used this time to complete the evaluation form and think about their responses to the next emotion label. Further, participants were requested to share their comments and feedback about the data capture process. This information is also made available in the database as part of the metadata.

Ground truth: The database was labeled by three evaluators to help improve the authenticity of expression of emotions. This is available as metadata along with the database. The database comprises 1380 samples of audio sequences, video sequences of face and body, and physiological data corresponding to various expressions of emotions. The salient properties of the database are summarized below.

Face and Voice: We obtained face video sequences and facial tracking data from the Microsoft Kinect for Windows sensor. All video sequences were recorded at the rate of 30 fps with a video resolution of 640 x 480 pixels. The sequences are of variable length, lasting between 600 and 2000 frames. We used the Brekel Kinect Pro Face software to record 3D face tracking data obtained from the Kinect sensor. The face tracking data consists of 3D head position and rotation information, 3D coordinates for 11 animation units and 3D coordinates for 11 shape units for each frame of the video. Figure 1 shows a snapshot of a subject portraying the emotion "Surprise" along with a 3D face-mesh corresponding to the emotion. The animation and shape units tracked are shown as yellow dots over the subject's face, and an indicator shows their presence or absence. The voice data was recorded using the Microsoft Kinect for Windows sensor. The Kinect sensor includes a four-element linear microphone array that captures audio data at 24-bit resolution. This allows accuracy across a wide dynamic range of voice data. The sensor enables high quality audio capture, with focus on audio coming from a particular direction through beamforming. The audio sequences are provided in standard .wav format and are synchronized with the face and body sequences.

Figure 1. Snapshot of a subject portraying the emotion "Surprise". Left: 3D face-mesh corresponding to the emotion. Middle: Brekel Kinect Pro Face tracking animation and shape units shown as yellow dots over the subject's face. Right: Tracking indicator showing presence or absence of animation units at each instant.

Body: We obtained body expression video sequences and skeletal tracking data from the Microsoft Kinect for Windows sensor. All video sequences were recorded at the rate of 30 fps with a video resolution of 640 x 480 pixels. The sequences are of variable length and are synchronized with the face and audio data. We used the Brekel Kinect Pro Body software to record the skeletal tracking data. The skeletal tracking data provides 3D coordinates of twenty joints of the user's body along with 3D coordinates for hand, foot and head rotations for each frame of the video sequence.

Physiological data: We obtained physiological data using the Zephyr BioHarness. This provides measurements of the user's Heart Rate, ECG R-R interval, Breathing Rate, Posture, Activity level and Peak Acceleration. This data is available in .csv format and is synchronized using time stamps with the face, body and voice data.

Availability of Database: The emoFBVP database is freely available for download to the academic research community, and is easily accessible through a web interface (http://emofbvp.org/).
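Because the .csv physiological stream and the 30 fps video streams are synchronized through time stamps, pairing them per video frame is essentially a resampling exercise. The sketch below is purely illustrative: the file name, column labels and nearest-neighbour alignment are assumptions, not the database's documented layout.

```python
# Hypothetical sketch of aligning the synchronized modality streams by timestamp.
# File names and column names are placeholders, not the actual emoFBVP layout.
import numpy as np
import pandas as pd

FPS = 30.0  # face/body video frame rate reported for the database

def frame_times(n_frames, fps=FPS):
    """Timestamps (in seconds) of the video frames of one recording."""
    return np.arange(n_frames) / fps

def align_physio_to_frames(physio_csv, n_frames):
    """Resample BioHarness measurements onto the video frame timeline.

    Assumes the CSV has a 'timestamp' column in seconds plus one column per
    signal (heart rate, breathing rate, ...); nearest-neighbour matching is
    one simple way to align the lower-rate physiological stream.
    """
    physio = pd.read_csv(physio_csv)
    physio["timestamp"] = physio["timestamp"].astype(float)
    physio = physio.sort_values("timestamp")
    frames = pd.DataFrame({"timestamp": frame_times(n_frames)})
    return pd.merge_asof(frames, physio, on="timestamp", direction="nearest")

# Example (with a made-up file name):
# aligned = align_physio_to_frames("subject01_surprise_level1.csv", n_frames=900)
# 'aligned' then has one row per video frame, ready to be paired with the
# facial / skeletal tracking features for that frame.
```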
4. Experimental Setup

Here, we first explain the need for developing and training DBNs on the emoFBVP database for multimodal emotion recognition. Sub-section 4.1 describes the extracted emotion-specific features and the baseline models used for comparison.

One of the best ways to validate the authenticity of a new emotion database is to apply known methods of feature extraction and investigate the performance of state of the art models on the collected data. In order to show the usefulness of our emoFBVP database, we apply extensions of known deep learning techniques for feature learning and investigate the performance accuracies for emotion recognition in unimodal, bimodal and multimodal scenarios. Our database also provides facial feature tracking and skeletal tracking data. We investigate the advantages of adding these features to the deep models using feature selection methods. Kim et al. [7] developed a suite of deep belief network models that showed improvements in emotion classification performance over baselines that do not use deep learning. They performed rigorous experiments to show that deep learning techniques can be used for multimodal emotion recognition. We build extensions of these models, train them on our multimodal data, perform similar experiments and investigate the usefulness of the emoFBVP database for emotion recognition.

4.1. Multimodal Features and Baseline Models

The emoFBVP database allows the study of unimodal, bimodal and multimodal emotion recognition. In this paper, we consider primary expressions of emotions - Angry, Happy, Sad, Disgust, Fear, Surprise and Neutral - for multimodal emotion recognition. Using information from the metadata available along with the database, we split the data into three sets: (1) Ideal data (all three evaluators agree on the emotion label), (2) Non-ideal data (a majority of the evaluators agree on the emotion label) and (3) Combined data (a combination of ideal and non-ideal data).

The emoFBVP database gives us facial feature tracking data and skeletal tracking data for all expressions of emotions. We compute prosodic and spectral features such as pitch, energy and mel-frequency filter banks from the voice data of the database [27]. We compute the mean, variance, lower and upper quantiles and quantile range of the audio features (prosodic and spectral features from vocal expressions), video features (facial tracking features from facial expressions and skeletal tracking features from body gestures) and physiological signals. These features are used to assess the utility of adding feature extraction techniques to the features learnt through deep learning while performing multimodal emotion recognition. All of these features are normalized to avoid person dependency [28]. We use 180 features extracted from vocal expressions, 540 features extracted from facial expressions, 540 features extracted from body gestures and 120 features extracted from physiological signal data, giving a total of 1380 features.

We need baseline models for comparison and validation of the results obtained using deep learning techniques. For this, these models should not use features generated via deep learning. We follow [7] and use two SVM models with radial basis function (RBF) kernels. The SVMs are trained using a one-versus-all approach. The first SVM baseline model uses Information Gain (IG) for supervised feature selection and the second SVM baseline model uses Principal Feature Analysis (PFA) for unsupervised feature selection. We apply IG to each emotion class; this gives us seven sets of emotion specific features. The baseline models are optimized using leave-one-person-out cross validation. The parameters of the baseline models are: the number of selected features using IG and PFA (chosen over {60, 120, 180} for each of the ideal, non-ideal and combined data types), the value of the box constraint for the soft margin in the SVM (C = 1) and the scaling factor in the RBF kernels (sigma = 8).
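The sketch below illustrates one way to assemble such a baseline pipeline: statistical functionals per modality, per-person normalization, IG-style feature selection and a one-versus-all RBF SVM. It is not the authors' code; scikit-learn's mutual_info_classif stands in for Information Gain, the data is random, and gamma = 1/(2*sigma^2) is one common way to map the paper's sigma = 8 onto scikit-learn's RBF parameterization.

```python
# Illustrative baseline pipeline (assumed stand-ins, not the authors' implementation).
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def functionals(frames):
    """Mean, variance, lower/upper quantiles and quantile range of a (T x d) sequence."""
    q25, q75 = np.percentile(frames, [25, 75], axis=0)
    return np.concatenate([frames.mean(0), frames.var(0), q25, q75, q75 - q25])

def normalize_per_person(X, person_ids):
    """Z-score each person's features to reduce person dependency."""
    X = X.copy()
    for p in np.unique(person_ids):
        rows = person_ids == p
        mu, sd = X[rows].mean(0), X[rows].std(0) + 1e-8
        X[rows] = (X[rows] - mu) / sd
    return X

# Stand-in data: 1380-dimensional multimodal feature vectors, 7 emotion classes.
# In practice X would be built by applying `functionals` to each modality's frame
# sequence and concatenating the results.
rng = np.random.RandomState(0)
X = rng.randn(280, 1380)
y = rng.randint(0, 7, size=280)
persons = rng.randint(0, 10, size=280)

X = normalize_per_person(X, persons)
baseline = make_pipeline(
    SelectKBest(mutual_info_classif, k=120),               # {60, 120, 180} in the paper
    OneVsRestClassifier(SVC(kernel="rbf", C=1.0, gamma=1.0 / (2 * 8 ** 2))),
)
baseline.fit(X, y)
print("training accuracy:", baseline.score(X, y))
```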
5. DemoDBN Models

In this section, we introduce our DemoDBN models and study their usage for multimodal emotion recognition in an unsupervised manner. Sub-section 5.1 gives classification accuracies for emotion classification using our DemoDBN models on the emoFBVP database in bimodal and multimodal scenarios. In order to show generalizability, we also deploy our unimodal and multimodal DBN models to perform emotion recognition on popular and standard emotion databases.

An illustration of the proposed DemoDBN models is given in Figure 2. We extend the DBN models proposed in [7] to include video data from body gestures and physiological signal data of expressions of emotions. DemoFBVP (Figure 2(a)) is a basic two-layer DBN that learns features from vocal expressions, facial expressions, body gestures and physiological signals individually in the first hidden layer. All of these features are concatenated and fed as input to the second hidden layer. f+DemoFBVP (Figure 2(b)) is a two-layer DBN that uses supervised feature selection with IG prior to DBN pre-training. DemoFBVP+f (Figure 2(c)) is also a two-layer DBN, with supervised feature selection using IG applied after DBN pre-training. 3DemoFBVP (Figure 2(d)) is a three-layer DBN that stacks another RBM layer over the second layer. In summary, the four DemoDBN models are:

1. DemoFBVP: basic two-layer DBN model.
2. f+DemoFBVP: two-layer DBN with feature selection prior to the training of DemoFBVP.
3. DemoFBVP+f: two-layer DBN with feature selection added post training of DemoFBVP.
4. 3DemoFBVP: three-layer DBN model.

Figure 2. Proposed multimodal DemoDBN models: (a) DemoFBVP, (b) f+DemoFBVP, (c) DemoFBVP+f, and (d) 3DemoFBVP.

The hyperparameters are selected using cross validation over the training data for the ideal, non-ideal and combined data types. We use leave-one-speaker-out cross validation for selecting the sparseness parameters. The number of hidden nodes of the two-layer DBN and the weight regularization parameters are kept constant. The sparsity parameters of the bias for vocal expressions, facial expressions, body gestures and physiological signal data are selected as {0.1, 0.2}, {0.02, 0.1}, {0.02, 0.1} and {0.2, 0.3} respectively. The λ sparsity parameter for all multimodal features is fixed at 6. The number of hidden units in the first layer of the DemoDBN models (DemoFBVP, DemoFBVP+f, 3DemoFBVP) is 2000, which results from concatenating 700 units from facial expressions, 700 units from body gestures, 400 units from vocal expressions and 200 units from physiological signal features. We fix the number of second layer hidden units for these DemoDBN models at 200. For f+DemoFBVP, we perform feature selection using IG and select 100 features from vocal expressions, 200 features each from facial expressions and body gestures and 80 features from physiological signals. We then pre-train the RBM layer with 100 nodes for vocal expressions, 150 nodes each for facial video and body gestures and 50 nodes for physiological signal features. Here, the sparsity parameters are chosen over {0.1, 0.6} for all the RBMs and the λ sparsity parameter is kept at 6. The features learned by the RBMs are concatenated and used to pre-train the first layer of the f+DemoFBVP model.

We propose DemoFV (face and voice), DemoBV (body and voice) and DemoFBV (face, body and voice) DBN models similar to the DemoFBVP models, along with their variants: f+DemoFV, DemoFV+f, 3DemoFV, f+DemoBV, DemoBV+f, 3DemoBV, f+DemoFBV, DemoFBV+f and 3DemoFBV. We use the same SVM models as the baseline models to classify the output of all the DemoDBN models.
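The sketch below mirrors the structure of the DemoFBVP pipeline just described: one first-layer RBM per modality, concatenation of the learned codes into 2000 units, a 200-unit second layer, and an SVM on top. It is a structural stand-in only: scikit-learn's BernoulliRBM replaces the Gaussian RBM with sparsity penalty used in the paper (which scikit-learn does not provide), and the inputs are random placeholders scaled to [0, 1].

```python
# Structural sketch of a DemoFBVP-style two-layer multimodal DBN (stand-in components).
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.svm import SVC

rng = np.random.RandomState(0)
n = 280
modalities = {             # feature dim -> first-layer hidden units (from Section 5)
    "face":   (540, 700),
    "body":   (540, 700),
    "voice":  (180, 400),
    "physio": (120, 200),
}
X = {name: rng.rand(n, dim) for name, (dim, _) in modalities.items()}
y = rng.randint(0, 7, size=n)

# Layer 1: train one RBM per modality and collect its hidden representation
codes = []
for name, (dim, n_hidden) in modalities.items():
    rbm = BernoulliRBM(n_components=n_hidden, learning_rate=0.01,
                       n_iter=10, random_state=0)
    codes.append(rbm.fit_transform(X[name]))
joint = np.hstack(codes)                       # 700 + 700 + 400 + 200 = 2000 units

# Layer 2: a 200-unit RBM over the concatenated first-layer features
rbm2 = BernoulliRBM(n_components=200, learning_rate=0.01, n_iter=10, random_state=0)
deep_features = rbm2.fit_transform(joint)

# Classify the unsupervised features with the same kind of SVM used for the baselines
clf = SVC(kernel="rbf", C=1.0).fit(deep_features, y)
print("training accuracy:", clf.score(deep_features, y))
```

Under this layout, the 3DemoFBVP variant would stack one more RBM on deep_features, and the f+/+f variants would insert IG feature selection before or after the pre-training step.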

5.1. Results for DemoDBN Models

Tables 1-4 give classification accuracies for emotion classification using the DemoDBN models on the emoFBVP database. Our results demonstrate that the proposed DBN models generate robust multimodal features for emotion classification and that we can successfully apply DemoDBN models for emotion recognition using the facial expression, body gesture, vocal expression and physiological signal modalities. These results also validate the multimodal data collected in our emoFBVP database and show that the database can be used for unimodal, bimodal and multimodal emotion recognition studies.

We find that the three-layer DemoDBN models achieve the maximum classification accuracies for ideal and non-ideal data. For the combined data type, the DemoDBN+f models achieve the maximum accuracies. The baseline models using IG always perform better than the baseline models using PFA for the ideal, non-ideal and combined data types.

Results for DemoFV models: Ideal data: The DemoFV models achieve classification accuracies ranging from 82.98% to 86.56%. The difference between the performance accuracies of the baseline model using IG and the baseline model using PFA is 3.99%. The difference between the classification accuracies of 3DemoFV and the baseline model using PFA is 4.23%. Non-ideal data: Here, the baseline models using IG and the baseline models using PFA show similar performance. The 3DemoFV model shows a small improvement in performance over the DemoFV+f (0.59%) and f+DemoFV (1.2%) models. Combined data: The DemoFV+f model achieves a maximum accuracy of 78.32% for combined data.

Results for DemoBV models: Ideal data: The DemoBV models achieve classification accuracies ranging from 80.78% to 84.99%. The difference between the performance accuracies of the baseline models using IG and the baseline models using PFA is 3.97%. The difference between the classification accuracies of 3DemoBV and the baseline model using PFA is 4.74%. Non-ideal data: The baseline models using IG and the baseline models using PFA show similar performance. The 3DemoBV model shows a small improvement in performance over the DemoBV+f (0.75%) and f+DemoBV (1.22%) models. Combined data: The DemoBV+f model achieves a maximum accuracy of 76.32% for combined data.

Results for DemoFBV models: Ideal data: The DemoFBV models achieve classification accuracies ranging from 83.10% to 86.68%. The difference between the performance accuracies of the baseline model using IG and the baseline model using PFA is 3.99%. The difference between the classification accuracies of 3DemoFBV and the baseline model using PFA is 4.25%. Non-ideal data: The two baseline models show similar performance. The 3DemoFBV model shows a small improvement in performance over the DemoFBV+f (0.61%) and f+DemoFBV (1.2%) models. Combined data: The DemoFBV+f model achieves a maximum accuracy of 78.45% for combined data.

Results for DemoFBVP models: Ideal data: The DemoFBVP models achieve classification accuracies ranging from 86.20% to 90.10%. The difference between the performance accuracies of the two baseline models is 4.08%. The difference between the classification accuracies of 3DemoFBVP and the PFA baseline model is 4.77%. Non-ideal data: The two baseline models show similar performance. The 3DemoFBVP model gives a small improvement in performance over the DemoFBVP+f (1.27%) and f+DemoFBVP (1.89%) models. Combined data: The DemoFBVP+f model achieves a maximum accuracy of 83.10% for combined data.

Our results show that we can successfully employ DemoDBN models for the task of multimodal emotion recognition. The proposed DemoDBN models successfully retain the complex non-linear feature relationships that exist between the different modalities for the ideal, non-ideal and combined data types (as shown by the performance accuracies achieved). Our results highlight the importance of feature learning using deep architectures over unsupervised feature selection for bimodal and multimodal emotion classification using the emoFBVP database of facial expressions, body gestures, vocal expressions and physiological signals. With this study, we validate the use of the emoFBVP database for emotion recognition studies and believe that the affective computing community will benefit from the collection of modalities recorded.

Table 1. Classification accuracy (%) for DemoFV models
Data Type  | Baseline IG | Baseline PFA | DemoFV | DemoFV+f | f+DemoFV | 3DemoFV
Ideal      | 86.32       | 82.33        | 82.98  | 84.92    | 84.56    | 86.56
Non-Ideal  | 64.78       | 64.95        | 65.52  | 65.82    | 65.21    | 66.41
Combined   | 75.64       | 75.83        | 77.25  | 78.32    | 77.78    | 77.62

Table 2. Classification accuracy (%) for DemoBV models
Data Type  | Baseline IG | Baseline PFA | DemoBV | DemoBV+f | f+DemoBV | 3DemoBV
Ideal      | 84.22       | 80.25        | 80.78  | 82.88    | 82.46    | 84.99
Non-Ideal  | 62.66       | 62.86        | 63.67  | 63.89    | 63.42    | 64.64
Combined   | 73.64       | 73.83        | 75.25  | 76.32    | 75.78    | 75.62

Table 3. Classification accuracy (%) for DemoFBV models
Data Type  | Baseline IG | Baseline PFA | DemoFBV | DemoFBV+f | f+DemoFBV | 3DemoFBV
Ideal      | 86.42       | 82.43        | 83.10   | 84.99     | 84.68     | 86.68
Non-Ideal  | 64.89       | 65.75        | 68.66   | 68.93     | 68.34     | 69.54
Combined   | 75.77       | 75.90        | 77.38   | 78.45     | 77.89     | 77.78

Table 4. Classification accuracy (%) for DemoFBVP models
Data Type  | Baseline IG | Baseline PFA | DemoFBVP | DemoFBVP+f | f+DemoFBVP | 3DemoFBVP
Ideal      | 89.41       | 85.33        | 86.20    | 87.82      | 87.52      | 90.10
Non-Ideal  | 68.89       | 68.71        | 71.14    | 71.84      | 71.22      | 73.11
Combined   | 79.82       | 79.90        | 82.28    | 83.10      | 82.54      | 82.40

5.2. Results on Standard Emotion Corpora

We compare our models to the SVM baselines explained in the earlier sections for each modality. Tables 5-8 give emotion recognition accuracies when using the unimodal DBN models (facial, vocal and physiological expressions of emotions) and the multimodal DBN model (multimodal expressions of emotions). To depict generalizability, we use the Cohn Kanade, Mind Reading, DEAP and MAHNOB-HCI databases to evaluate the respective performances. These databases are very popular and are standard datasets used by the affective computing research community for emotion recognition. We observe that our deep models perform better than the SVM baselines in both unimodal and multimodal scenarios.

Table 5. Emotion recognition using facial expressions
Database    | SVM Baseline | DemoF  | 3DemoF
Cohn Kanade | 95.4 %       | 95.9 % | 96.3 %

Table 6. Emotion recognition using vocal expressions
Database     | SVM Baseline | DemoV  | 3DemoV
Mind Reading | 90.62 %      | 92.1 % | 92.87 %

Table 7. Emotion recognition using physiological data
Database | SVM Baseline | DemoP  | 3DemoP
DEAP     | 78.6 %       | 78.8 % | 79.2 %

Table 8. Emotion recognition using multimodal data
Database   | SVM Baseline | DemoFBVP | 3DemoFBVP
MAHNOB-HCI | 52.4 %       | 53.1 %   | 54.8 %

6. Convolutional Deep Belief Network (CDBN) Models

In this section, we describe our multimodal CDBN models and investigate their usability for recognizing subtle or low intensity expressions of emotions. Convolutional RBMs are an extension of regular RBMs [24]. They are inspired by convolutional neural networks and rely on convolution and weight sharing. When convolutional RBMs are stacked together, they form convolutional deep belief networks [22]. Convolutional DBNs are solely generative models that are trained in a greedy layer-wise manner. Here, the input is fed into the network and the features learned by the last layer are fed to a Support Vector Machine (SVM). In CRBMs, the network's visible layer is a matrix instead of a vector. This enables the network to exploit the spatial proximity of the pixels, leading to more robust feature learning when compared to regular RBMs.
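To make the CRBM idea concrete, the sketch below shows how a single convolutional layer with shared weights turns an image (a matrix-valued visible layer) into pooled feature maps that are flattened and classified with an SVM. It is a structural illustration only: the filters are random placeholders rather than weights learned with contrastive divergence, plain max-pooling stands in for the probabilistic max-pooling of [24], and the image size, filter count and number of classes are assumptions.

```python
# Structural sketch of one convolutional RBM layer used as a feature extractor
# for an SVM (stand-in filters; no learning shown here).
import numpy as np
from scipy.signal import correlate2d
from sklearn.svm import SVC

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def crbm_features(image, filters, biases, pool=2):
    """Hidden-unit probabilities of one CRBM layer, followed by max-pooling."""
    maps = []
    for W_k, b_k in zip(filters, biases):
        h = sigmoid(correlate2d(image, W_k, mode="valid") + b_k)  # shared weights
        H, Wd = h.shape
        h = h[: H - H % pool, : Wd - Wd % pool]                   # crop to pooling grid
        pooled = h.reshape(h.shape[0] // pool, pool,
                           h.shape[1] // pool, pool).max(axis=(1, 3))
        maps.append(pooled.ravel())
    return np.concatenate(maps)

# Stand-in data: 48x48 face/body region crops, 7 emotion classes
rng = np.random.RandomState(0)
images = rng.rand(100, 48, 48)
labels = rng.randint(0, 7, size=100)
filters = [rng.randn(9, 9) * 0.1 for _ in range(24)]   # 24 shared 9x9 filters
biases = np.zeros(len(filters))

X = np.stack([crbm_features(img, filters, biases) for img in images])
clf = SVC(kernel="rbf", C=1.0).fit(X, labels)           # classify the CDBN features
print("training accuracy:", clf.score(X, labels))
```

In the actual models the filters would be learned with contrastive divergence and the sparsity penalty of Section 2, and further CRBM layers would be stacked on the pooled maps before the SVM.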
6.1. Results for CDBN Models

We used primary expressions of emotion of the lowest intensity from the emoFBVP, Cohn-Kanade, Mind Reading, DEAP and MAHNOB-HCI databases. We applied the CDemoFBVP model (a model very similar to DemoFBVP but formed by stacking convolutional RBMs) to learn the multimodal deep features. We also extracted regions of interest (ROI) in the face images (around the eyes, eyebrows and mouth area) and body images (head, hands and legs) and fed them to the deep CDemoFBVP+ROI model. Tables 9-13 show percentage emotion recognition accuracies on various emotion corpora. Tables 9, 10 and 13 compare the performances of the DBN, CDBN and CDBN+ROI models with the SVM baselines. Tables 11 and 12 compare the performances of the DBN and CDBN models with the SVM baseline models on voice and physiological signal data (there is no ROI in voice and physiological data). Again, to depict generalizability, we show results on standard emotion datasets. We notice that our CDBN+ROI models outperform our CDBN models, which in turn perform better than the DBN models and the SVM baselines.

Table 9. Emotion recognition using the emoFBVP database
SVM Baseline | DemoFBVP | CDemoFBVP | CDemoFBVP+ROI
75.67        | 76.54    | 81.41     | 83.18

Table 10. Emotion recognition using the Cohn Kanade database
SVM Baseline | DemoF | CDemoF | CDemoF+ROI
95.4         | 95.9  | 96.8   | 97.3

Table 11. Emotion recognition using the Mind Reading database
SVM Baseline | DemoV | CDemoV
90.62        | 92.1  | 93.4

Table 12. Emotion recognition using the DEAP database
SVM Baseline | DemoP | CDemoP
78.6         | 78.8  | 79.5

Table 13. Emotion recognition using the MAHNOB-HCI database
SVM Baseline | DemoFBVP | CDemoFBVP | CDemoFBVP+ROI
52.4         | 53.1     | 57.9      | 58.5

7. Conclusions and Future Work

We made three major contributions in this work. First, we presented the emoFBVP database of multimodal recordings of actors enacting various expressions of emotions. This is one of the first emotion datasets that has recordings of varying intensities of expressions of emotions in multiple modalities recorded simultaneously. We strongly believe that the affective computing community will greatly benefit from the large collection of modalities recorded. Next, we described four deep belief network (DemoDBN) models and showed that these models generate robust multimodal features for emotion classification in an unsupervised manner. Our experimental results showed that our DemoDBN models perform better than the state of the art methods for emotion recognition on popular emotion corpora. This validated the use of our emoFBVP database for multimodal emotion recognition studies. Thirdly, we showed that convolutional deep belief network (CDBN) models, along with region of interest extraction, learn salient multimodal features for the recognition of low intensity or subtle expressions of emotions.

One of our main goals for the future is to build a real-time multimodal emotion recognition system using deep architectures. In a real-time scenario, data from one or more modalities may be absent. We plan to develop models that will continue to perform and successfully recognize emotions even when one or more modalities are absent. Our preliminary experiments revealed that the first deep layer learns to identify edges and simple shapes, the second layer identifies more complex shapes and objects (like eyes, nose, mouth, etc.) and the third layer learns which shapes and objects can be used to define a facial expression. Analyzing the multimodal features learned using the deep models to better understand emotions is an interesting direction for future research.

References

[1] C. Busso, Z. Deng, S. Yildirim, M. Bulut, C. Lee, A. Kazemzadeh, S. Lee, U. Neumann and S. Narayanan. Analysis of emotion recognition using facial expressions, speech and multimodal information. ACM International Conference on Multimodal Interfaces, 2004.
[2] M. Pantic, G. Caridakis, E. Andre, J. Kim, K. Karpouzis and S. Kollias. Multimodal emotion recognition from low-level cues. Emotion Oriented Systems, 2011.
[3] T. Vogt and E. Andre. Comparing feature sets for acted and spontaneous speech in view of automatic emotion recognition. IEEE International Conference on Multimedia and Expo, 2005.
[4] M. Wimmer, B. Schuller, D. Arsic, G. Rigoll and B. Radig. Low level fusion of audio and video feature for multi-modal emotion recognition. International Conference on Computer Vision Theory and Applications, 2008.
[5] G. Taylor, G. Hinton and S. Roweis. Modeling human motion using binary latent variables. Advances in Neural Information Processing Systems, 2007.
[6] C. Anagnostopoulos, T. Iliou and I. Giannoukos. Features and classifiers for emotion recognition from speech: a survey from 2000 to 2011. Artificial Intelligence Review, 2012.
[7] Y. Kim, H. Lee and E. M. Provost. Deep learning for robust feature generation in audiovisual emotion recognition. Acoustics, Speech and Signal Processing (ICASSP), 2013.
[8] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2009.
[9] N. Morgan. Deep and wide: Multiple layers in automatic speech recognition. IEEE Transactions on Audio, Speech and Language Processing, 2012.
[10] G. Sivaram and H. Hermansky. Sparse multilayer perceptron for phoneme recognition. IEEE Transactions on Audio, Speech and Language Processing, 2012.
[11] A. Mohamed, G. Dahl and G. Hinton. Acoustic modeling using deep belief networks. IEEE Transactions on Audio, Speech and Language Processing, 2012.
[12] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee and A. Ng. Multimodal deep learning. International Conference on Machine Learning, 2011.
[13] N. Srivastava and R. R. Salakhutdinov. Multimodal learning with deep Boltzmann machines. Advances in Neural Information Processing Systems, 2012.
[14] S. E. Kahou, C. Pal, X. Bouthillier, P. Froumenty, C. Gulcehre, R. Memisevic, P. Vincent, A. Courville, Y. Bengio, R. C. Ferrari and M. Mirza. Combining modality specific deep neural networks for emotion recognition in video. Proceedings of the 15th ACM International Conference on Multimodal Interaction, 2013.
[15] A. Dhall, R. Goecke, J. Joshi, M. Wagner and T. Gedeon. Emotion recognition in the wild challenge 2013. Proceedings of the 15th ACM International Conference on Multimodal Interaction, 2013.
[16] H. Lee, C. Ekanadham and A. Ng. Sparse deep belief net model for visual area V2. Advances in Neural Information Processing Systems, 2008.
[17] Y. Tang and C. Eliasmith. Deep networks for robust visual recognition. International Conference on Machine Learning, 2010.
[18] K. Sohn, D. Jung, H. Lee and A. Hero. Efficient learning of sparse, distributed, convolutional feature representations for object recognition. IEEE International Conference on Computer Vision, 2011.
[19] H. Lee, R. Grosse, R. Ranganath and A. Ng. Unsupervised learning of hierarchical representations with convolutional deep belief networks. Communications of the ACM, 2011.
[20] A. Krizhevsky, I. Sutskever and G. Hinton. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 2012.
[21] P. Smolensky. Information processing in dynamical systems: Foundations of harmony theory. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, 1986.
[22] G. Hinton, S. Osindero and Y. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 2006.
[23] G. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 2002.
[24] H. Lee, R. Grosse, R. Ranganath and A. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. International Conference on Machine Learning, 2009.
[25] G. Desjardins and Y. Bengio. Empirical evaluation of convolutional RBMs for vision. Technical report, Université de Montréal, 2008.
[26] M. Norouzi, M. Ranjbar and G. Mori. Stacks of convolutional restricted Boltzmann machines for shift-invariant feature learning. IEEE Conference on Computer Vision and Pattern Recognition, 2009.
[27] C. Busso, S. Lee and S. Narayanan. Using neutral speech models for emotional speech analysis. Proceedings of Interspeech, 2007.
[28] E. Mower, M. Mataric and S. Narayanan. A framework for automatic human emotion classification using emotion profiles. IEEE Transactions on Audio, Speech and Language Processing, 2011.