A Review on Speech Emotion Recognition Using Deep Learning and Attention Mechanism
Eva Lieskovská *, Maroš Jakubec, Roman Jarina and Michal Chmulík
Faculty of Electrical Engineering and Information Technology, University of Žilina, Univerzitná 8215/1, 010 26 Žilina, Slovakia; [email protected] (M.J.); [email protected] (R.J.); [email protected] (M.C.)
* Correspondence: [email protected]

Abstract: Emotions are an integral part of human interactions and are significant factors in determining user satisfaction or customer opinion. Speech emotion recognition (SER) modules therefore play an important role in the development of human–computer interaction (HCI) applications. A tremendous number of SER systems have been developed over the last decades. Attention-based deep neural networks (DNNs) have been shown to be suitable tools for mining information that is unevenly distributed in time in multimedia content. The attention mechanism has recently been incorporated into DNN architectures to also emphasise emotionally salient information. This paper provides a review of recent developments in SER and examines the impact of various attention mechanisms on SER performance. An overall comparison of system accuracies is performed on the widely used IEMOCAP benchmark database.

Keywords: speech emotion recognition; deep learning; attention mechanism; recurrent neural network; long short-term memory

Electronics 2021, 10, 1163. https://doi.org/10.3390/electronics10101163

1. Introduction

The aim of human–computer interaction (HCI) is not only to create a more effective and natural communication interface between people and computers; its focus also lies on aesthetic design, a pleasant user experience, support for human development, online learning improvement, etc. Since emotions form an integral part of human interactions, they have naturally become an important aspect of the development of HCI-based applications. Emotions can be technologically captured and assessed in a variety of ways, such as facial expressions, physiological signals, or speech. With the intention of creating more natural and intuitive communication between humans and computers, emotions conveyed through these signals should be correctly detected and appropriately processed. Throughout the last two decades of research focused on automatic emotion recognition, many machine learning techniques have been developed and constantly improved.

Emotion recognition is used in a wide variety of applications. Anger detection can serve as a quality measure for voice portals [1] or call centres, allowing the provided services to be adapted to the emotional state of customers. In civil aviation, monitoring the stress of aircraft pilots can help reduce the rate of possible aircraft accidents. Many researchers who seek to enhance players' experiences with video games and to keep them motivated have been incorporating emotion recognition modules into their products. Hossain et al.
[2] used multimodal emotion recognition to improve the quality of a cloud-based gaming experience through emotion-aware screen effects. The aim is to increase players' engagement by adjusting the game in accordance with their emotions. In the area of mental health care, a psychiatric counselling service with a chatbot is suggested in [3]. The basic concept consists of the analysis of input text, voice, and visual clues in order to assess the subject's psychiatric disorder and inform about diagnosis and treatment. Another suggested application of emotion recognition is a conversational chatbot, where speech emotion identification can play a role in better conversation [4]. A real-time SER application should find an optimal trade-off between low computing power, fast processing times, and a high degree of accuracy.

In this review, we focus on works dealing with the processing of acoustic clues from speech to recognise the speaker's emotions. The task of speech emotion recognition (SER) is traditionally divided into two main parts: feature extraction and classification, as depicted in Figure 1. During the feature extraction stage, a speech signal is converted into numerical values using various front-end signal processing techniques. The extracted feature vectors have a compact form and ideally should capture the essential information from the signal. In the back-end, an appropriate classifier is selected according to the task to be performed.

Figure 1. Block scheme of a general speech emotion recognition system.

Examples of widely used acoustic features are mel-frequency cepstral coefficients (MFCCs), linear prediction cepstral coefficients (LPCCs), short-time energy, fundamental frequency (F0), formants [5,6], etc.
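As a rough, non-authoritative illustration of this front-end stage, the Python sketch below (assuming the librosa library; the file path, feature selection, and mean/std pooling are placeholders rather than details taken from the cited works) extracts frame-level MFCCs and F0 and pools them into a compact utterance-level vector of the kind a back-end classifier would consume.

```python
import numpy as np
import librosa

def extract_utterance_features(wav_path, sr=16000, n_mfcc=13):
    """Front-end sketch: turn a speech file into a fixed-length feature vector."""
    y, sr = librosa.load(wav_path, sr=sr)

    # Frame-level MFCCs (shape: n_mfcc x n_frames)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

    # Fundamental frequency (F0) estimated with the pYIN algorithm
    f0, voiced_flag, _ = librosa.pyin(y, fmin=50.0, fmax=400.0, sr=sr)
    f0 = np.nan_to_num(f0)  # unvoiced frames -> 0

    # Simple utterance-level statistics (mean/std pooling over frames)
    feats = np.concatenate([
        mfcc.mean(axis=1), mfcc.std(axis=1),
        [f0.mean(), f0.std()],
    ])
    return feats  # compact representation passed to the back-end classifier
```

Vectors of this kind are what the traditional back-end classifiers discussed next are typically trained on; end-to-end neural systems instead learn such a representation directly from the signal.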
Traditional classification techniques include probabilistic models such as the Gaussian mixture model (GMM) [6–8], the hidden Markov model (HMM) [9], and the support vector machine (SVM) [10–12]. Over the years of research, various artificial neural network architectures have also been utilised, from the simplest multilayer perceptron (MLP) [8] through the extreme learning machine (ELM) [13] and convolutional neural networks (CNNs) [14,15], to deep architectures such as residual neural networks (ResNets) [16] and recurrent neural networks (RNNs) [17,18]. In particular, long short-term memory (LSTM) and gated recurrent unit (GRU)-based neural networks (NNs), as state-of-the-art solutions in time-sequence modelling, have been widely utilised in speech signal modelling. In addition, various end-to-end architectures have been proposed to jointly learn both feature extraction and classification [15,19,20].

Besides LSTM and GRU networks, the introduction of the attention mechanism (AM) in deep learning may be considered another milestone in sequential data processing. The purpose of AM is, as with human visual attention, to select relevant information and filter out irrelevant information. The attention mechanism, first introduced for a machine translation task [21], has become an essential component of neural architectures. Incorporating AM into encoder–decoder-based neural architectures significantly boosted the performance of machine translation even for long sequences [21,22].
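To make the role of attention in SER concrete, the following minimal PyTorch sketch (an illustrative example under assumed layer sizes and feature dimensions, not a reconstruction of any particular system surveyed here) pools the frame-level outputs of a bidirectional LSTM with soft attention weights, so that frames scored as more emotionally salient contribute more to the utterance-level representation passed to the classifier.

```python
import torch
import torch.nn as nn

class AttentiveLSTMClassifier(nn.Module):
    """BLSTM encoder with soft attention pooling for utterance-level SER."""

    def __init__(self, n_feats=40, hidden=128, n_emotions=4):
        super().__init__()
        self.encoder = nn.LSTM(n_feats, hidden, batch_first=True,
                               bidirectional=True)
        self.score = nn.Linear(2 * hidden, 1)      # frame-level attention score
        self.classifier = nn.Linear(2 * hidden, n_emotions)

    def forward(self, x):                          # x: (batch, frames, n_feats)
        h, _ = self.encoder(x)                     # h: (batch, frames, 2*hidden)
        alpha = torch.softmax(self.score(h), dim=1)  # attention weights over frames
        utt = (alpha * h).sum(dim=1)               # weighted sum = utterance vector
        return self.classifier(utt)                # emotion class logits

# Example: a batch of 8 utterances, 300 frames of 40-dimensional features each
logits = AttentiveLSTMClassifier()(torch.randn(8, 300, 40))
```

Replacing the attention pooling with plain mean pooling over frames recovers an ordinary BLSTM baseline, a common point of comparison when quantifying the benefit of AM.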