EmoNet: A Transfer Learning Framework for Multi-Corpus Speech Emotion Recognition

Maurice Gerczuk, Shahin Amiriparian, Member, IEEE, Sandra Ottl, and Björn W. Schuller, Fellow, IEEE

Abstract—In this manuscript, the topic of multi-corpus Speech Emotion Recognition (SER) is approached from a deep transfer learning perspective. A large corpus of emotional speech data, EMOSET, is assembled from a number of existing SER corpora. In total, EMOSET contains 84 181 audio recordings from 26 SER corpora with a total duration of over 65 hours. The corpus is then utilised to create a novel framework for multi-corpus speech emotion recognition, namely EMONET. A combination of a deep ResNet architecture and residual adapters is transferred from the field of multi-domain visual recognition to multi-corpus SER on EMOSET. Compared against two suitable baselines and more traditional training and transfer settings for the ResNet, the residual adapter approach enables parameter-efficient training of a multi-domain SER model on all 26 corpora. A shared model with only 3.5 times the number of parameters of a model trained on a single database leads to increased performance for 21 of the 26 corpora in EMOSET. Measured by McNemar's test, these improvements are further significant for ten datasets at p < 0.05, while only two corpora see significant decreases across the residual adapter transfer experiments. Finally, we make our EMONET framework publicly available for users and developers at https://github.com/EIHW/EmoNet. EMONET provides an extensive command line interface which is comprehensively documented and can be used in a variety of multi-corpus transfer learning settings.

Index Terms—deep learning, transfer learning, multi-domain learning, multi-corpus, cross-corpus, speech emotion recognition, computational paralinguistics, computer audition, audio processing

1 INTRODUCTION

With recent advancements in the field of machine learning and the widespread availability of computationally powerful consumer devices, such as smartphones, many people are now already interacting with artificial intelligence (AI) technology on a daily basis. A prominent example are voice controlled personal assistants, such as Alexa and Siri [1], [2], which are becoming increasingly popular. These products are a first step towards conversational AI, which integrates a variety of research fields, such as automatic speech recognition (ASR), natural language understanding (NLU), and context modelling [3]. While their current functionality demonstrates their capabilities for task-oriented ASR, they are still a long way off from being able to converse freely with humans [4], and they disregard other important aspects of interpersonal communication, such as the expression and understanding of emotions.

In contrast to an early view of emotions as a disruption of organised rational thought that is to be controlled by the individual [5]–[7], today, emotional intelligence is accepted as a central and guiding function of cognitive ability that meets standards for an intelligence [8]–[11]. In this context, it has been argued that designing conversational AIs with both the ability to comprehend and to express emotions will improve the quality and effectiveness of human-machine interaction [12], [13]. Emotion recognition capabilities can furthermore serve a purpose for the integration of machine learning and AI technologies into health-care, where an affective computing approach can support patient in-home monitoring [14], help with early detection of psychiatric diseases [15], or be applied to support diagnosis and treatment in military healthcare [16]. In this regard, SER is furthermore highly related to automatic speech-based detection of clinical depression [17]–[20], where emotional content can further improve the performance of recognition approaches [21]. Here, monitoring of patients using only audio signals could be seen as less intrusive than video-based or physiological methods.

While SER can serve a wide range of purposes from an application standpoint, there are a couple of fundamental characteristics of the field that make it a hard task up to this day [22]. Collecting and annotating sufficiently large amounts of data that are suitable for the target application is time consuming [22]. Emotional speech recordings can, for example, be obtained in a laboratory setting by recording professional actors, or in the wild [23], e.g., from movie scenes or online videos [24]. Moreover, after suitable material has been collected, annotation is not straightforward and has to consider an additional modelling step in the different ways human emotions can be classified and interpreted [22]. Here, choosing between a categorical and a dimensional approach and defining how to represent the temporal dynamics of an emotion are two examples of important steps that have to be taken to arrive at a fitting emotion model [24], [25]. The actual process of annotation then has to find ways to deal with the inherent subjectivity of human emotion, e.g., by employing multiple annotators, measuring inter-rater agreement, and removing data samples containing emotion portrayals that were too ambiguous [22]. For these reasons, unlike in many fields that have seen leaps in the state of the art due to the paradigm of deep learning, truly large-scale – i.e., more than one million samples – SER corpora do not exist up to this day. Rather, SER research has brought forth a large number of small-scale corpora which showcase a high degree of diversity in their nature.

The sparsity of large SER datasets on the one hand, and the availability of a large number of smaller datasets on the other, are key motivating factors for our work, which applies state-of-the-art paradigms of both transfer and deep learning to the task of multi-corpus SER. For this, we develop a multi-corpus framework called EMONET based on residual adapters [26], [27]. We evaluate this framework on a large collection of existing SER corpora assembled and organised as EMOSET. Further, all experiments are compared to suitable baseline SER systems. Finally, we release the source code of our framework together with a comprehensive command line interface on GitHub (https://github.com/EIHW/EmoNet), to be freely used by any researcher interested in multi-domain audio analysis.

2 RELATED WORKS

Three general research areas are of interest for the topic of this manuscript. First, the field of SER deals with the general problem of recognising human emotions from speech recordings, which is the task that should be solved on a multi-corpus basis by the presented approaches. Second, the field of deep learning, where large amounts of data are used to train neural network architectures that are able to learn high-level feature representations from raw input data, gives the direction for the choice of machine learning models. Here, the DEEP SPECTRUM feature extraction system is of special importance, as it will serve as one of the baselines against which the transfer learning experiments are compared. Finally, the field of domain adaptation and multi-domain training investigates ways of efficiently transferring knowledge between domains, which is important for the multi-corpus SER problem.

2.1 Speech Emotion Recognition

SER describes the task of automatically detecting emotional subtext from spoken language and is a research field that has increasingly been of interest for more than two decades [29]. Traditionally, to develop a computational emotion recognition model from speech data, the following steps should be considered. Initially, unlike for many other current machine learning tasks, e.g., visual object recognition or Automatic Speech Recognition (ASR), no single definitive method of representing human emotion is applicable to every scenario [22]. Furthermore, a decision has to be made about the temporal granularity of annotating and detecting emotions from continuous speech signals [22]. For the first aspect, two main approaches are used in the literature [30]. In the categorical approach, emotions can be grouped into discrete classes [31], mostly with respect to the Ekman 'Big Six' [32], [33], including anger, disgust, fear, happiness, sadness, and surprise. On the other hand, emotions can be analysed from a continuous and dimensional perspective. Here, an emotion is annotated on multiple axes, each describing a different aspect. The most often used dimensions are arousal, valence, and dominance. Arousal describes the intensity or activation of an emotion, while valence represents the intrinsic positivity or negativity of a feeling [34]. Finally, emotions can also be placed on a continuous dominance-submissiveness scale which, for example, allows discrimination between anger and fear (both low valence, high arousal emotions) [35]. Translation between categories and dimensions is possible, e.g., using a mapping, such as the circumplex model of affect [36].

For the actual automatic computational recognition of emotions from audio speech signals, many traditional approaches rely on the extraction of either brute-forced or handcrafted sets of features computed from frame-level low-level descriptors (LLDs), such as energy or spectral information, by applying a range of statistical functionals, such as mean, standard deviation, and extreme values, over a defined longer segment, e.g., a speaker turn or utterance [37], [38]. More recently, the field is being influenced by the trend of deep learning, and methods of directly learning models for emotion recognition from raw speech data are being investigated [22], [39]. These works are touched upon in Section 2.2.

In the computational and machine learning sphere, previous works have shown the suitability of cross-corpus training for SER [40], [41]. Aspects that were researched include how to effectively select data from multiple corpora [40], [41] or the effects of adding unlabelled data for training [42]. Schuller et al. [43] have investigated various strategies to cope with inter-corpus variance by evaluating multiple normalisation strategies, such as per-speaker and per-corpus feature normalisation, for cross-corpus testing on six databases, while Kaya et al. [44] applied cascaded normalisation on standard acoustic feature sets. Many of the above approaches utilise hand-crafted or brute-forced feature sets computed from LLDs of the audio content. The approach taken in this manuscript differs from the previous research into cross-corpus SER in that the focus here lies on harnessing the power of deep learning when applied to raw input features.

2.2 Deep Learning Based SER

For computer audition, deep learning has made an impact – especially in the ASR domain, where large datasets of recorded speech and corresponding transcriptions could be utilised to train accurate models [45], [46]. In comparison to image recognition, where training samples can be collected in a large-scale fashion from the internet and annotation is relatively straightforward, many areas of audio recognition do not have this advantage. In an attempt to improve the situation for audio recognition, AudioSet [47], which is a large ontology of collected YouTube clips, can serve as a basis for training deep, general purpose network architectures. These networks can later be used as feature extractors and transfer models for various recognition tasks. Hershey et al. investigated the viability of adapting standard CNN architectures as used for ImageNet classification, such as VGG [48], Inception [49], and ResNet [50], [51], for large-scale audio recognition [52]. Unlike in image recognition, where the raw pixels of (resized) images can serve as input to the Deep Neural Networks (DNNs), a suitable data representation for audio content has to be chosen manually. Additionally, when the machine learning models cannot handle variable size inputs, e.g., CNNs with fully connected layers, audio has to be chunked into segments of fixed duration.


Fig. 1: Overview of the DEEP SPECTRUM feature extraction system. Audio segments are first converted to a suitable image representation in the form of mel-spectrogram plots. Afterwards, they are forwarded through an ImageNet pre-trained deep Convolutional Neural Network (CNN) model. The activations on a specific layer can then be used as high-level feature representations by downstream classifiers, such as linear Support Vector Machines (SVMs). This figure is adapted from [28].

Tzirakis et al. [53] took a slightly different approach by introducing a Convolutional Recurrent Neural Network (CRNN) architecture that transforms the audio signal to a suitable representation using one-dimensional convolutional layers. This time-continuous representation is then processed by a Recurrent Neural Network (RNN) which makes predictions for the learning task at hand. The whole model can be trained in an end-to-end fashion from raw audio signals [54], [55].

Mel-spectrograms are often used as input features for the DNNs [56]–[59]. As not all time-frequency regions of speech samples contain high emotional saliency, methods have been investigated that learn to focus on the most important parts of a given input. RNNs, for example, have the inherent capability to deal with sequences of variable lengths. When combined with a self-attention mechanism, emotionally informative time-segments of an input can be highlighted [60]. Mirsamadi et al. used an attention RNN for SER on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database, while Gorrostieta et al. applied a similar model with low-level spectral features as input for the ComParE self-assessed affect [61] sub-challenge [62]. More recently, combining CNN feature extractors with attention-based RNNs has been shown to be a highly competitive approach to SER [63], [64]. Taking inspiration from the combination of Fully Convolutional Networks (FCNs) and two-dimensional attention pooling for visual online object tracking [65], Neumann and Vu evaluate an attentive CNN for SER on different input features and find log Mel-spectrograms to work best. Their work highlights another advantage an attention pooling method has compared to a more traditional CNN architecture with fully connected layers: sequences of variable lengths can be handled by the same model without having to adapt parts of the architecture. This is especially important for SER, where utterances often differ in duration, and analysing only a short segment of a long utterance often leads to worse results. While the works outlined above demonstrate the efficacy of deep learning based methods for SER on various databases, hand-crafted features still play an important role in the field [66].

2.3 Deep Spectrum Feature Extraction

First introduced for the task of snore sound classification [67], DEEP SPECTRUM (https://github.com/DeepSpectrum) is a deep feature extraction toolkit for audio based on image classification CNNs. In the system, knowledge obtained by CNNs for the task of large-scale object recognition on the ImageNet [68] database is transferred to the audio domain by applying the learnt feature extractors to spectrogram representations. Specifically, an audio sample is first converted to a suitable 2-dimensional time-frequency format – most often a (Mel-)spectrogram or a chromagram – by means of Fourier transformation and application of various filter and scaling operations. Afterwards, an RGB image representation conforming with the input specifications of an ImageNet pre-trained CNN, e.g., AlexNet [69], VGG16 [48], or ResNets [50], is created by plotting and resizing the spectrogram, mapping power to a certain pre-defined colour scale. This image is then forwarded through the CNN and the learnt filter operations are applied to the spectrogram. A specific layer of the extractor network can finally be chosen to serve as a deep feature descriptor, i.e., the activations of this layer are flattened into a (large) feature vector which can then be used as input for various machine learning models. The whole process is illustrated in Figure 1.

The features extracted by the DEEP SPECTRUM system have been shown to provide state-of-the-art results for a range of audio analysis tasks in preceding research [28], [39], [70]–[75]. The results imply that convolutional filters trained on natural image classification can extract useful features for audio analysis tasks if applied to spectral representations of the content. This motivates using DEEP SPECTRUM as one of the baseline systems to evaluate on EMOSET.

2.4 Domain Adaptation and Multi-Domain Learning

As part of the larger field of transfer learning, domain adaptation aims to transfer knowledge learnt on a fully labelled source domain to a target domain where labels are either unavailable or sparse by mitigating the negative impact of domain shift [76]. Specifically, it deals with the problem that machine learning models trained on large-scale datasets are sensitive to dataset bias [77]–[79]. As deep learning approaches can be seen as the state of the art in many machine learning research areas, current domain adaptation research focuses on learning to map deep feature representations into a shared space [76]. One of the most prominent examples can be found in the work of Ganin et al. [80], where a shared feature extractor base serves as input for both a source domain label predictor and a domain discriminator. Similarly, Tzeng et al. [76] apply the concept of adversarial learning to unsupervised domain adaptation by first pre-training a source domain feature encoder in a supervised manner.
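To make the pipeline of Figure 1 concrete, the following is a minimal Python sketch of a DEEP SPECTRUM-style feature extractor, not the toolkit's actual implementation: the helper name `deep_spectrum_features`, the viridis colour map, and the "fc2" read-out layer are illustrative assumptions.

```python
import numpy as np
import librosa
import matplotlib.cm as cm
import tensorflow as tf

def deep_spectrum_features(wav_path, layer="fc2", sr=16000, n_mels=128):
    # Load audio and compute a log-scaled mel spectrogram.
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)

    # Map the spectrogram to an RGB image with a fixed colour scale and
    # resize it to the 224x224 input expected by ImageNet CNNs.
    norm = (log_mel - log_mel.min()) / (log_mel.max() - log_mel.min() + 1e-8)
    rgb = cm.viridis(norm)[..., :3]                      # drop the alpha channel
    img = tf.image.resize(rgb, (224, 224)).numpy() * 255.0
    img = tf.keras.applications.vgg16.preprocess_input(img[np.newaxis])

    # Forward through an ImageNet pre-trained VGG16 and read out one layer.
    # (Building the model per call is wasteful; a real system would cache it.)
    base = tf.keras.applications.VGG16(weights="imagenet", include_top=True)
    extractor = tf.keras.Model(base.input, base.get_layer(layer).output)
    return extractor.predict(img, verbose=0).flatten()   # e.g. 4096-d for "fc2"
```

In the actual system, such (e.g., 4 096-dimensional) vectors are then passed to a downstream classifier such as a linear SVM.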


Fig. 2: Boxplot of sample durations for each EMOSET corpus. Boxes show the inner-quartile range (IQR) and the whiskers extend to a maximum of 1.5 × IQR measured from the lower and higher quartiles. Black dots are considered as outliers. For readability, the scale of the y-axis is logarithmic from 10^1 upwards.

In the context of SER, domain adaptation has been investigated for the purposes of increasing performance on domains where labelled data is sparse or performance is hindered by domain shifts. Ramakrishnan et al. [81] used supervised domain adaptation to adapt SVM classifiers from two different source domains (IEMOCAP and SEMAINE) to RECOLA. Hassan et al. [82] further improved on the best challenge submissions for the FAU Aibo Emotion Corpus (FAU Aibo) by mitigating the domain shift caused by the two different recording sites of samples in the database via importance weighting for SVMs.

While most of the work done for domain adaptation focuses on transferring models and feature representations from a single source domain to a new target domain, the setting of multi-corpus SER as found in this manuscript is different. Due to the unavailability of truly large-scale databases in the field, an approach that is able to learn from multiple domains at once and in the process increases performance for individual tasks is more desirable. Following work in the area of domain adaptation training, the model of residual adapters was first introduced in [26] for the task of multi-domain learning in the visual domain. The approach tries to find a middle ground between using large pre-trained networks as feature extractors and finetuning the networks on a new task. While using pre-trained networks as feature extractors might have performance drawbacks, finetuning the whole network is very parameter inefficient and can easily lead to catastrophic forgetting of the model's pre-training. Instead, in their approach, not a single universal multi-domain model is trained, but families of networks that are highly related to one another, sharing a large portion of their parameters while still containing task specialised modules. In their work on the Visual Decathlon [27] challenge, the authors experimented with different placements of and configurations for the adapter modules. They found that for the best performance, adapters should be used throughout the whole deep network in a parallel configuration. In general, this method reaches or even surpasses traditional finetuning strategies while only requiring about 10 % of task specific parameters. Due to the highly promising results on different domains of visual recognition and the capabilities of CNNs for various speech recognition tasks – as touched upon in Section 2.3 and Section 2.2 – the model is adapted for multi-corpus SER in this manuscript.

3 EMOSET – COLLECTION OF SPEECH EMOTION RECOGNITION CORPORA

For this manuscript, a large collection of 26 emotional speech corpora has been assembled and organised. This collection, from hereon EMOSET, contains published SER databases that have been used for research and experiments in the field. Additionally, some unpublished speech databases of the Chair of Embedded Intelligence for Health-care and Wellbeing (University of Augsburg) are used to further augment the training data in order to improve the generalisation capabilities and minimise the overfitting problems of deep learning models. The individual datasets had to be structured for training.

TABLE 1: Statistics of the EMOSET databases in terms of number of samples (#), classes (C.) and speakers (Sp.), spoken language(s), mean sample duration in seconds (mean dur), and total duration in hours (total d).

| name | # | # C. | # Sp. | Language | mean dur [s] | total d [h] |
| --- | --- | --- | --- | --- | --- | --- |
| Airplane Behaviour Corpus (ABC) [83] | 405 | 5 | 8 | German | 10.6 ± 4.7 | 1.19 |
| Anger Detection (AD) (cf. Section 3.1) | 660 | 2 | 9 | German | 10.5 ± 2.2 | 1.93 |
| Burmese Emotional Speech (BES) [84] | 414 | 6 | 6 | Burmese | 1.3 ± 0.5 | 0.15 |
| CASIA [85] | 1 200 | 6 | 4 | Mandarin | 1.9 ± 0.6 | 0.64 |
| Chinese Vocal Emotions (CVE) [86] | 874 | 7 | 4 | Mandarin | 1.7 ± 0.4 | 0.40 |
| Database of Elicited Mood in Speech (DEMoS) [87] | 9 365 | 7 | 68 | Italian | 2.9 ± 1.3 | 7.40 |
| Danish Emotional Speech (DES) [88] | 419 | 5 | 4 | Danish | 4.0 ± 5.7 | 0.47 |
| EA-ACT [89] | 2 280 | 7 | 39 | English, French, German | 2.8 ± 1.7 | 1.80 |
| EA-BMW [89] | 1 424 | 3 | 10 | German | 1.9 ± 0.5 | 0.76 |
| EA-WSJ [89] | 520 | 2 | 10 | English | 4.7 ± 1.5 | 0.69 |
| Berlin Database of Emotional Speech (EMO-DB) [90] | 494 | 7 | 10 | German | 2.8 ± 1.0 | 0.38 |
| Emotion in the Wild 2014 (EmotiW 2014) [91] | 961 | 7 | – | – | 2.4 ± 1.0 | 0.65 |
| eNTERFACE'05 Audio-Visual Emotion Database (eNTERFACE) [92] | 1 277 | 6 | 42 | English | 2.8 ± 0.9 | 1.00 |
| EU-Emotion Voice Database (EU-EV) [93] | 3 148 | 21 | 54 | English, Hebrew, Swedish | 1.9 ± 0.8 | 1.70 |
| EmoFilm [94] | 1 115 | 5 | 207 | English, Italian, Spanish | 2.1 ± 1.0 | 0.64 |
| FAU Aibo [95] | 17 074 | 5 | 51 | German | 1.7 ± 0.7 | 8.25 |
| Geneva Multimodal Emotion Portrayal (GEMEP) [96], [97] | 1 260 | 17 | 10 | French | 2.4 ± 1.0 | 0.85 |
| Geneva Vocal Emotion Expression Stimulus Set (GVEESS) (Full Set) [98] | 1 344 | 13 | 12 | Made-up | 2.6 ± 1.1 | 0.96 |
| IEMOCAP (4 classes) [99] | 5 531 | 4 | 10 | English | 4.6 ± 3.2 | 7.00 |
| Multimodal EmotionLines Dataset (MELD) [100] | 13 707 | 7 | 6+ | English | 3.2 ± 4.0 | 12.10 |
| Mandarin Emotional Speech (MES) [84] | 360 | 6 | 6 | Mandarin | 1.8 ± 1.0 | 0.18 |
| PPMMK (cf. Section 3.2) | 3 154 | 4 | 36 | German | 2.5 ± 1.3 | 2.16 |
| Speech in Minimal Invasive Surgery (SIMIS) [22], [101] | 9 299 | 5 | 10 | German | 2.3 ± 1.7 | 5.80 |
| SmartKom Multimodal Corpus (SmartKom) [102] | 3 823 | 7 | 79 | German | 6.7 ± 7.0 | 7.10 |
| Speech Under Simulated and Actual Stress (SUSAS) [103] | 3 593 | 4 | 7 | English | 1.0 ± 0.4 | 1.00 |
| TurkishEmo (cf. Section 3.3) | 484 | 4 | 11 | Turkish | 2.3 ± 0.5 | 0.31 |
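For reference, per-corpus statistics of the kind reported in Table 1 can be derived from a simple metadata listing; the sketch below assumes a hypothetical pandas DataFrame with one row per audio file and the made-up columns 'corpus' and 'duration_s'.

```python
import pandas as pd

def corpus_statistics(meta: pd.DataFrame) -> pd.DataFrame:
    """meta: one row per audio file with columns 'corpus' and 'duration_s'."""
    grouped = meta.groupby("corpus")["duration_s"]
    return pd.DataFrame({
        "#": grouped.size(),                       # number of samples
        "mean dur [s]": grouped.mean().round(1),   # mean sample duration
        "std dur [s]": grouped.std().round(1),     # spread of durations
        "total d [h]": (grouped.sum() / 3600.0).round(2),
    })
```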

Here, speaker-independent training, development, and test partitions had to be constructed manually, in case they did not already exist. An overview of each corpus can be found in Table 1. For published corpora, descriptions can be found in the referenced papers, whereas unpublished databases are described below. These corpora all include categorical emotion labels. Overall, there are 84 161 samples with a total duration of 65.6 hours, and the mean duration of all audio recordings is 2.81 seconds.

3.1 Anger Detection (AD)

Anger Detection (AD) is a corpus of angry and neutral speech recorded in the setting of phone calls. The corpus is split over 9 calls, each containing speech segments for both classes. In total, there are 660 samples with an average duration of 10.5 seconds. Two calls are separated from the training data, one for the development partition and one for the held-out test set.

3.2 PPMMK-EMO

PPMMK-EMO is a database of German emotional speech recorded at the University of Passau covering the four basic classes angry, happy, neutral, and sad. It has a total of 3 154 samples averaging 2.5 seconds in length, recorded from 36 speakers. For the test set, 8 speakers' recordings are set aside, while 20 % of the training data is randomly sampled to form the development partition.

3.3 Turkish Emotion

This Turkish SER corpus contains 484 samples of emotional speech recorded from 11 subjects (7 female and 4 male) covering the basic classes anger, joy, neutrality, and sadness. The samples have an average duration of 2.3 seconds. Two female and two male speakers' samples are set aside for the development and test partitions.

3.4 Exploratory Data Analysis

While a number of statistics for each EMOSET corpus can be found in Table 1, further data exploration can lead to higher-level insights into the nature and composition of EMOSET. How much data is provided by each individual database, both in terms of number of samples and total duration of audio content, is highly variable. Here, a few datasets stand out by including a comparatively large amount of data. DEMoS, FAU Aibo, IEMOCAP, MELD, and SIMIS all contain more than 5 000 samples. When looking at these corpora from the perspective of their total duration, the variability of sample lengths shows its effect. While SIMIS includes more samples than IEMOCAP (almost twice as many), they are a lot shorter on average, leading to a smaller combined corpus duration. A similar picture is found for FAU Aibo and MELD, where the latter contains very long samples while the former's speech recordings are shorter on average.

In this context, it is further of interest to take a look at the distribution of the length of audio recordings in the whole EMOSET database. Figure 3 shows a histogram of all sample durations from 0 to 10 seconds – there are a few much longer samples, but they are excluded to keep the histogram readable. From the plot, it can be seen that most of the samples are between 1 and 5 seconds long, i.e., when choosing a window size for emotion classification later on, 5 seconds should be enough to capture a sufficiently large portion of the input samples in their entirety.

Observations about the samples in each dataset regarding their duration can be made on a more granular level from Figure 2, which shows a boxplot for each of the 26 corpora. Here, most datasets exhibit relatively narrow duration ranges in the order of a few seconds. However, the variability of sample durations depends on the recording setting and nature of the databases, with acted and scripted emotion portrayals having a very narrow inner-quartile range while natural or induced emotional speech samples vary more heavily in their length. Thus, it is interesting to look at the databases which stick out from the pack. First of all, ABC has samples that are 10 seconds long on average but shows a large inner-quartile range and minimum durations of under 2 seconds. The variability here can be explained by the nature of the dataset. ABC contains unscripted induced emotional speech recorded in the simulated setting of an airplane. Emotional speech segments were annotated and extracted by experts after the recording, leading to no bounds on sample lengths. Similarly for AD, the speaker turns of angry and neutral speech during phone conversations are generally longer than for other datasets, at 10 seconds. This seems reasonable considering that in phone calls, reliance on the actual content of speech is greater to convey one's intentions than in face-to-face conversations, where additional information is transported through gestures and facial expressions. DES can be seen as a further exception here. While the corpus contains only acted and scripted emotion portrayals, the fact that there are different types of spoken content – short utterances, sentences, and fluent passages of text – leads to a very large range of sample durations. The popular benchmark dataset IEMOCAP also shows high variability in this context, for another reason. In the database, there are both scripted and improvised sessions of conversations between professional actors. While sample durations for scripted portrayals can be kept quite consistent in length, speaker turns in natural conversations do not conform to any restrictions. MELD, which consists of scenes from the popular TV series 'Friends', sticks out by containing very long samples – some well over a minute long – while finally SmartKom shows the highest variability in duration, from only a small fraction of a second to minute-long recordings captured from the test subjects during interaction with a simulated human-machine interface.

For the implemented deep learning methods, sample duration variability can be a problem. The DEEP SPECTRUM model needs fixed-length inputs for extracting feature representations from its fully connected layer; resizing the generated spectrograms might lead to a loss in information and make it harder to learn emotionally salient information across databases. For this reason, the approach works with fixed windows of 1 second. While the 2D attention model used in the ResNet-based approach enables variable size input spectrograms, duration differences might lead to problems here as well – and similarly for the fixed window in DEEP SPECTRUM. A very long sample which has been annotated with only one emotional category as a whole might not exhibit this emotion uniformly across its entirety. This can introduce noise into the learning process which might limit the learning capabilities of the investigated models.

Fig. 3: Histogram of sample durations in EMOSET. Most of the samples are between 1 to 5 seconds in length.

4 BASELINES

Two baseline systems using unadapted DEEP SPECTRUM and eGeMAPS [104] features for emotion recognition are included to compare the performance of the transfer learning models trained on EMOSET. Additionally, for the experiments with a residual adapter CNN model, a ResNet is trained from scratch on each of the tasks.

4.1 eGeMAPS

The first baseline system uses the extended version of the Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) extracted with the help of openSMILE. Afterwards, zero mean and unit standardisation is applied to the feature sets. Here, the normalisation parameters are learnt on the training partitions and then fixed to later be applied to the validation and test splits. A linear SVM is then trained for the individual classification tasks. Its complexity parameter is optimised on the validation partitions of each EMOSET corpus on a logarithmic scale from 1 to 10^-9. With optimised parameters, the SVM is fit again on the combined training and development splits and then evaluated on the test partition of each dataset.

4.2 Deep Spectrum

The second baseline is constructed from the DEEP SPECTRUM system. Features are extracted from Mel-spectrogram plots (128 Mel-filters) using VGG16 pre-trained on ImageNet as feature extractor (see Section 5.1 for more details on spectrogram creation). No windowing of the input audio samples is applied, i.e., Mel-spectrogram plots are created from the full duration of each sample. Afterwards, each plot image is resized to 224 × 224 to conform with the training images of the ImageNet database.

As with the eGeMAPS features, a linear SVM classifier is used with the DEEP SPECTRUM features, and its parameters are optimised as in Section 4.1.

4.3 ResNet from Scratch

The performance of the transfer experiments with the residual adapters model will be compared against an identically structured ResNet which has been trained from scratch on each of EMOSET's corpora individually. The description of the model's architecture can therefore be found in Section 5.2. The model is optimised in batches of 64 by Stochastic Gradient Descent (SGD) with a momentum of 0.9 and a learning rate decay of 10^-6 applied to the weights of all layers. Further, class-weighted cross entropy is used as loss to counteract the class imbalance in many datasets, and an l2 regularisation loss term is added with factor 10^-6. As batch normalisation (BN) is used throughout the network, the training starts with a large initial learning rate of 0.1 which is exponentially decayed by a factor of 10 when no improvement in validation Unweighted Average Recall (UAR) occurs for 50 epochs. This learning rate decay step happens twice, and afterwards the model is trained until no UAR improvement is seen for another 50 epochs.

5 DEEP LEARNING ARCHITECTURES

For the transfer learning experiments, two deep learning architectures are considered. An ImageNet pre-trained VGG16 network as used in the DEEP SPECTRUM system is finetuned on EMOSET, and the approach of parallel residual adapters as described in Section 2.4 and previously used for multi-domain visual recognition is adapted and evaluated on the multi-corpus SER problem posed by EMOSET.

5.1 Preprocessing

For the transfer learning experiments, the input audio content of the SER data is transformed and pre-processed. Mel-spectrograms are derived from the spectrogram audio representation. Typically, for speech recognition tasks, a window size of around 25 ms is chosen [22], [52] for the fast Fourier transform (FFT). For efficient computation, it is best to choose a power of two (in number of samples) as window size, i.e., an FFT window with 512 samples leads to a close 32 ms for the EMOSET databases that have a fixed (resampled) sampling rate of 16 000 Hz. The windows are further shifted with a hop size of 256 samples along the input audio signal, leading to a window overlap of 50 %. Mel-spectrograms apply dimensionality reduction to the log-magnitude spectrum with the help of a mel-filter bank. In the transfer learning experiments with residual adapter models, 64 mel-filters are chosen, as it has been found that a larger number of filters often leads to worse results when used as input for a CNN [105], [106]. The spectrograms do not have to be resized, as the usage of a two-dimensional attention module, which is attached to the fully convolutional feature extractor base, enables handling of variable size input. Nevertheless, an upper bound of 5 s is set on the length of the audio segments from which spectrograms are extracted. This is well above the average duration of an EMOSET sample, but some datasets contain samples which are longer than 30 s. In the case of such a sample, a random 5 s chunk is chosen for feature extraction. As the extraction is done on the fly and on a GPU, this serves as a form of augmentation and should help against overfitting. Shorter samples are converted to spectrograms as they are, and zero-padding is only performed on a per-batch basis to the maximum sample length within this batch.

5.2 Parallel Residual Adapters

The second deep learning model investigated for this manuscript is built upon the concept of residual adapters, as described in Section 2.4. A residual CNN based on the popular Wide ResNet [107] model with the parallel adapter configuration as used in [26] is trained on EMOSET.

5.2.1 Architecture

All ResNet models trained on EMOSET contain three submodules (stacks) of residual blocks chained in sequence. Before those submodules, however, an additional convolutional block is applied. This block consists of a convolutional layer with 32 filters of size 3 × 3, which are convolved with the input using a stride of 1. Additionally, BN is applied to the outputs of the convolution. Afterwards, the output is fed through the submodules. The number of filters in the convolutional layers within these submodules doubles for each consecutive one: the first one uses 64, the second 128, and the third one 256 filters of shape 3 × 3. Each of the modules contains two residual blocks, where the blocks are structured as follows. In the very first block, a convolutional layer with a stride of 2 is applied to the input, followed by BN and a Rectified Linear Unit (ReLU) activation. A second convolution, this time with a stride of 1, is placed directly afterwards, its output again batch normalised. As the number of filters in the block's first convolution increases compared to the input received from the very first convolution in the network (from 32 to 64), a shortcut connection is needed to add the residual (the input to the block) back to the output received after the two convolutions. This is done by first applying average pooling with patches of 2 × 2 and stride 2 to the block input and concatenating this pooled residual with zeroes along the filter dimension. The resulting residual is then added to the output of the second convolution in the block and a ReLU activation is applied. The second block differs only in that it uses a stride of 1 for both of its convolutional layers and further has no need for the shortcut connection, as the number of filters does not increase. The other two submodules work the same way, but the numbers of filters are increased to 128 and 256, respectively. As for the first module, only the first block in each of the other two modules has convolutions with stride 2 and needs a shortcut connection. A final BN and ReLU activation are placed after the last block of the third module.

Instead of the standard global pooling and fully connected layers applied on top of the convolutional feature extractor base, a 2D self attention layer as proposed in [105], [106] is utilised. This layer projects the variable size spatio-temporal output of the CNN to a fixed length weighted sum representation. By doing so, the weights are learnt based on the emotional content that can be found in a specific time-frequency region of the spectrogram.
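The residual block just described – two 3 × 3 convolutions with BN, and an average-pooled, zero-padded shortcut whenever the filter count doubles – can be sketched as follows in TensorFlow/Keras. This is an illustrative reading of Section 5.2.1, not the released EMONET code.

```python
import tensorflow as tf

def residual_block(x, filters, downsample):
    """First block of a stack uses stride-2 convolutions and a pooled,
    zero-padded shortcut; later blocks call this with downsample=False."""
    stride = 2 if downsample else 1
    y = tf.keras.layers.Conv2D(filters, 3, strides=stride, padding="same", use_bias=False)(x)
    y = tf.keras.layers.BatchNormalization()(y)
    y = tf.keras.layers.ReLU()(y)
    y = tf.keras.layers.Conv2D(filters, 3, strides=1, padding="same", use_bias=False)(y)
    y = tf.keras.layers.BatchNormalization()(y)

    shortcut = x
    if downsample:
        # Average-pool the block input and zero-pad its channel dimension so
        # that the residual matches the doubled number of filters.
        shortcut = tf.keras.layers.AveragePooling2D(pool_size=2, strides=2, padding="same")(x)
        pad = filters - x.shape[-1]
        shortcut = tf.keras.layers.Lambda(
            lambda t: tf.pad(t, [[0, 0], [0, 0], [0, 0], [0, pad]]))(shortcut)
    return tf.keras.layers.ReLU()(y + shortcut)

# Example usage inside a functional model:
#   h = residual_block(stem_output, 64, downsample=True)
#   h = residual_block(h, 64, downsample=False)
```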

Fig. 4: Sample Mel spectrogram images created from speech recordings of IEMOCAP for each of its four base emotion categories. From left to right: angry, happy, neutral, and sad.


Fig. 5: Architecture of the base ResNet model used in the experiments for multi-corpus SER. Three convolutional stacks extract features from the generated mel-spectrogram input. A 2D attention module is then applied to reduce the variable length output of the convolutional base to a single feature vector for further processing by a Multilayer Perceptron (MLP) classifier head. nf specifies the number of filters (#f ) of all convolutions inside a specific residual stack.
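The mel-spectrogram inputs referenced in the caption are generated with the settings of Section 5.1; a minimal librosa sketch is given below. The helper name and the policy of randomly cropping every over-long sample to the 5 s cap (the text only explicitly mentions chunking samples above 30 s) are assumptions.

```python
import numpy as np
import librosa

def emoset_mel_spectrogram(y, sr=16000, max_seconds=5.0, rng=np.random):
    """Log-mel spectrogram with a 512-sample FFT window (~32 ms at 16 kHz),
    a hop of 256 samples (50 % overlap) and 64 mel filters."""
    max_len = int(max_seconds * sr)
    if len(y) > max_len:                       # randomly crop over-long samples
        start = rng.randint(0, len(y) - max_len)
        y = y[start:start + max_len]
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=512, hop_length=256, n_mels=64)
    return librosa.power_to_db(mel)            # shape (64, n_frames), variable length
```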

The computation of the weighted sum works as follows (adapted from [105]): The output of the convolutional ResNet feature extractor base is three dimensional with size $N_f \times N_t \times N_c$, where $N_f$ is the number of frequency bins at the output – for an input with 64 mel-filters, 8 bins remain after the pooling operations of the chosen architecture have been applied. $N_t$ is the number of time-steps in the input spectrogram, again downsampled by 8, while $N_c$ is the number of channels (256) of the last convolutional layer in the ResNet. The first two dimensions of this representation are now flattened such that a sequence of $N_c$-dimensional vectors $x_i$ with length $N = N_f \cdot N_t$ remains:

$\mathcal{A} = \{x_1, \dots, x_N\}, \quad x_i \in \mathbb{R}^{C}, \; C = 256.$   (1)

Each of these vectors is then transformed into an intermediate learnt representation by a fully connected layer with 256 units and hyperbolic tangent activation. Afterwards, an importance score is calculated by the inner product of the layer output and a learnable vector $u$, again of size 256:

$e_i = u^{\top} \tanh(W x_i + b).$   (2)

The attention weight for each individual vector is then computed as a smoothed softmax over all $e_i$:

$\alpha_i = \frac{\exp(\lambda e_i)}{\sum_{k=1}^{N} \exp(\lambda e_k)}.$   (3)

The factor $\lambda$ defines a smoothing of the attention weights. If $\lambda = 0$, then each vector gets an equal weight of $1/N$, while with $\lambda = 1$, no smoothing of the computed attention weights is performed. The first case can be considered as global average pooling of the ResNet output. A value of $\lambda = 0.3$ has been found to work well for SER on the IEMOCAP database by Zhao et al. in [105], [106], and is adopted here as well. Finally, the output of the module computes the attention weighted sum of the flattened input vectors $x_1, \dots, x_N$ as:

$c = \sum_{i=1}^{N} \alpha_i x_i.$   (4)

Similar to the weighted sum of fixed size vectors, $c$ has a fixed size of $C = 256$. Compared to standard global average pooling, the attention layer introduces trainable parameters that aim to learn the importance of a specific time-frequency region's features to the training task at hand. Further, this module allows for training with variable length audio sequences. A stack of fully connected layer, BN, and dropout is finally put on top of the attention module before the softmax classification layer. The attention layer can contain corpus specific parameters or be shared across datasets, while the fully connected stack always serves as a domain specific classifier, i.e., it is not shared between databases, just like the adapter modules. Figure 5 gives a visual overview of the whole model.

For transfer learning, residual adapters for each target database are used throughout the whole depth of the network and are applied in parallel to the shared convolutions.
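As a compact reference, Eqs. (1)–(4) can be written out in a few lines of NumPy; this is a plain sketch of the computation, not the trainable Keras layer used in EMONET.

```python
import numpy as np

def attention_pool(features, W, b, u, lam=0.3):
    """features: CNN output of shape (Nf, Nt, Nc); W: (Nc, Nc); b, u: (Nc,)."""
    nf, nt, nc = features.shape
    x = features.reshape(nf * nt, nc)          # Eq. (1): flatten to N vectors
    e = np.tanh(x @ W + b) @ u                 # Eq. (2): importance scores
    e = lam * e                                # smoothing factor lambda
    w = np.exp(e - e.max())                    # max-subtraction for stability
    alpha = w / w.sum()                        # Eq. (3): smoothed softmax
    return alpha @ x                           # Eq. (4): weighted sum, shape (Nc,)
```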

Fig. 6: Depiction of a residual adapter module. The adapter is a task specific small convolution (1 × 1) that is applied in parallel to all convolutions of the shared base model. The outputs of both convolutions are then combined by their element-wise summation. Additionally, the subsequent BN, which is not shared between corpora, is shown in the figure.

The adapter modules are convolutional layers which all contain small 1 × 1 kernels. They share the rest of their settings (strides and number of filters) with the corresponding convolution. The output of each adapter is then added to that of the parallel convolution and fed through the BN layer (cf. Figure 6). Further, all BN layers contain parameters which are trained for each database. Depending on the number of neurons in the classification softmax layer, the described architecture has around 3 million parameters in total, of which 300 000 are domain specific.

5.2.2 Training Procedure

The experiments can be divided into two parts. For the first set of experiments, the transfer capability of a ResNet with adapter modules which is trained on a single main task is evaluated for all tasks in EMOSET. All parameters of the base model are trained on either DEMoS, IEMOCAP, FAU Aibo, or GEMEP. Subsequently, the pre-trained model is transferred by freezing all parameters apart from the task specific classifier head and adapters. Afterwards, the remaining parameters are trained on the new task. The second set of experiments takes multiple corpora into account for training the base model. Here, a model is constructed which includes adapter modules and classifier heads for each of the training tasks while sharing the rest of the parameters. After the training process is finished, the shared parameters are frozen and transferred to a new task from EMOSET, while adapters and classifier are reinitialised.

For both kinds of experiments, the following hyperparameters were chosen for the training procedure. The models are optimised via SGD with a momentum of 0.9 in batches of 64 examples. Further, as BN is used throughout the model, the initial learning rate is set to a high value of 0.1 and exponentially decayed by a small factor of 10^-6. Finally, the learning rate is reduced in three steps from 0.1 to 0.01 and ultimately to 0.001. For the single-task transfer experiments and training the baseline models, training is halted and continued with a smaller learning rate if no increase in validation UAR has occurred for 50 epochs of training. While these parameters are applied for both adapter tuning and training from scratch on each of the corpora, full finetuning and tuning only the classifier head start training with a reduced learning rate of 0.01.

When training on multiple datasets at once, a round-robin approach is applied, sampling one batch of each dataset. One round is further considered as one step when defining the learning rate schedule. After a fixed number of 2 500 round-robin steps, the learning rate is set to the according value. After the second step, training continues for an additional 2 500 steps and then stops. After the shared model training, the adapters and classifier heads are further finetuned for each task in the same way as described above for the single-task transfer experiment. This additional finetuning step should help individual models which did not reach their global optimum at the very end of the shared model training procedure.

6 CROSS-CORPUS STRATEGIES

For training the proposed deep learning models on the EMOSET corpus, two general directions are followed. First, a (mostly) shared model can be trained jointly on all corpora in a multi-domain/task fashion. Second, the individual emotion classification corpora can be aggregated into a single large corpus by mapping the different available emotion categories into a shared label space.

6.1 Shared Model Multi-Domain Training

For training a shared model with multiple classifier heads on the EMOSET databases simultaneously, batches of samples from the individual datasets are sampled sequentially in a round-robin fashion, only updating parameters specific to this dataset or belonging to the shared model. In addition to the classifier heads, corpus specific residual adapter modules are introduced in parallel to the shared network's convolutions. In practice, this means that a training batch belonging to a specific EMOSET domain is used to update all of the shared model's parameters in addition to the respective domain specific adapters and classifier head.

6.2 Aggregated Corpus Training via Arousal-Valence Mapping

The second way of training in the multi-corpus SER setting considers ways of aggregating the individual corpora into a larger shared database. As all databases deal with emotion classification, a suitable approach is to map the annotated emotion categories into a shared label space for training. A fitting mapping is to use a dimensional approach and transform the multi-class problems into binary and three-way arousal/valence classification. As described in Section 2.1, a method for this mapping is to use a model such as the circumplex of affect [36]. In Table 2, such a mapping has been performed for the emotions included over all of EMOSET's databases. In order to prevent the model from learning the class distributions of the EMOSET corpora, random subsampling is performed on a per dataset basis. For each corpus, the number of samples for each of the mapped classes, i.e., the two arousal and three valence levels, is made equal to the respective sample count of the minority class. This is applied on training, validation, and testing partitions of each dataset separately.

TABLE 2: Mapping of classes contained inside the EMOSET databases onto six classes of arousal and valence combinations. Categories are mapped as eliciting either low or high arousal. For valence, negative, neutral, and positive emotions are categorised.
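A hedged TensorFlow/Keras sketch of the parallel adapter layout from Figure 6 and Section 6.1 is given below; passing the corpus as a plain Python index and the class name are illustrative choices, not the EMONET implementation.

```python
import tensorflow as tf

class ParallelAdapterConv(tf.keras.layers.Layer):
    """A shared 3x3 convolution with one 1x1 adapter and one BN layer per domain."""
    def __init__(self, filters, stride, num_domains, **kwargs):
        super().__init__(**kwargs)
        self.shared = tf.keras.layers.Conv2D(filters, 3, strides=stride,
                                             padding="same", use_bias=False)
        self.adapters = [tf.keras.layers.Conv2D(filters, 1, strides=stride,
                                                padding="same", use_bias=False)
                         for _ in range(num_domains)]
        self.bns = [tf.keras.layers.BatchNormalization() for _ in range(num_domains)]

    def call(self, x, domain, training=False):
        # domain: Python int selecting the corpus-specific parameters.
        # Shared and domain-specific paths are summed element-wise, then
        # passed through the domain-specific batch normalisation.
        y = self.shared(x) + self.adapters[domain](x)
        return self.bns[domain](y, training=training)
```

During round-robin training, each batch would call such layers with its corpus index, so that gradients flow into the shared convolutions as well as that corpus' adapters, BN layers, and classifier head.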

low arousal, negative valence: contempt, disappointment, disgust, frustration, guilt, hurt, impatience, irritation, jealousy, sadness, shame, unfriendliness, worry
low arousal, neutral valence: boredom, confusion, neutral, pondering, rest, sneakiness
low arousal, positive valence: admiration, kindness, pride, relief, tenderness
high arousal, negative valence: aggressiveness, anger, anxiety, despair, fear, helplessness, high-stress, scream
high arousal, neutral valence: emphatic, interest, intoxication, medium-stress, nervousness, surprise
high arousal, positive valence: amusement, cheerfulness, elation, excitement, happiness, joking, joy, pleasure, positive
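The aggregation and per-corpus balancing described in Section 6.2 and Table 2 might look as follows in Python; the `AV_MAP` entries shown are only a small illustrative excerpt of Table 2, and all names are hypothetical.

```python
import random
from collections import defaultdict

# Illustrative excerpt of the category -> (arousal, valence) mapping of Table 2.
AV_MAP = {"anger": ("high", "negative"), "happiness": ("high", "positive"),
          "boredom": ("low", "neutral"), "sadness": ("low", "negative")}

def aggregate_and_balance(samples, target="arousal", seed=0):
    """samples: list of (path, corpus, category). Returns (path, mapped_label)
    pairs where every mapped class within a corpus is randomly subsampled to
    the size of that corpus' minority class."""
    rng = random.Random(seed)
    by_corpus = defaultdict(lambda: defaultdict(list))
    idx = 0 if target == "arousal" else 1
    for path, corpus, category in samples:
        if category in AV_MAP:                 # unmapped categories are dropped
            by_corpus[corpus][AV_MAP[category][idx]].append(path)
    balanced = []
    for corpus, classes in by_corpus.items():
        n_min = min(len(paths) for paths in classes.values())
        for label, paths in classes.items():
            balanced += [(p, label) for p in rng.sample(paths, n_min)]
    return balanced
```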

Training with the aggregated corpus approach can then proceed as in standard single dataset network training, considering samples from all datasets as input and learning to predict the arousal or valence categories of the joint mapped label space. Finally, a combination of the multi-domain and the aggregated corpus approach can be taken for training a single network on both arousal and valence mapped corpora, i.e., the domains are represented by the respective arousal and valence aggregated corpora.

7 EVALUATION

The results for all baselines and each dataset in EMOSET are briefly analysed in Section 7.1. Afterwards, the results of the transfer learning experiments with the residual adapters are discussed in Section 7.2.

7.1 Baseline Results

Three baseline systems are tested for their efficacy on each corpus of EMOSET separately: i) the eGeMAPS and SVM combination (cf. Section 7.1.1), ii) the DEEP SPECTRUM feature extraction with a linear SVM (cf. Section 7.1.2), and iii) the ResNet architecture trained from scratch on the mel-spectrograms of each dataset (cf. Section 7.1.3).

The results of these methods can be found in Table 3. From a high level view of these results, it can be seen that – among those baselines – there is no overall best approach to solving the SER classification task for every dataset. Rather, depending on the corpus, the best achieved result can be found with any of the models. Furthermore, results for some corpora fluctuate quite heavily, which becomes especially apparent by looking at the two presented types of results of the ResNet trained from scratch on each corpus. These two points are discussed in the following sections where appropriate.

7.1.1 eGeMAPS

As an expert-designed, handcrafted acoustic feature set specifically intended for paralinguistic speech analysis tasks, eGeMAPS is a competitive baseline for many of the included corpora in EMOSET. While it achieves the best results on the test partition of 12 databases, the margins by which it does so differ. Large increases over the other two baselines can be found on ABC, CVE, and GEMEP, with a delta of around 10 percentage points compared to the second best approach. Especially GEMEP, with its 17 annotated classes, seems to benefit from using a small higher-level feature set in combination with an SVM classifier. Slight increases over the other two baselines occur for EmotiW 2014, FAU Aibo, IEMOCAP, MELD, SIMIS, and SmartKom, while performance for the EA datasets (EA-ACT, EA-BMW, and EA-WSJ), SUSAS, and TurkishEmo is on par with the other baseline methods. Surprisingly, DEEP SPECTRUM and the ResNet baseline both achieve considerably better results particularly on the test partition of EMO-DB. This might be caused by the optimisation procedure of the baselines utilising SVMs as classifiers. The complexity parameter is optimised from a fixed logarithmic set of values based on the performance on the development partition, which might not always be the best value to choose for performing the fit on the combined training and development partitions for evaluation on test.

7.1.2 Deep Spectrum

With a few exceptions, combining DEEP SPECTRUM features with a linear SVM classifier leads to results comparable to the eGeMAPS baseline method. Negative examples can be found with ABC, DES, and GEMEP, where the DEEP SPECTRUM system falls behind both other baselines. On GVEESS, on the other hand, it achieves the best test set result of 27.9 % compared to the runner-up with 24.7 %. For the EA datasets, EU-EV, and EmoFilm, it matches performance with eGeMAPS. In the case of DEMoS, the system lags behind the ResNet trained from scratch considerably but improves on the eGeMAPS baseline. This might be explained by the larger size of the DEEP SPECTRUM features, which can provide more discriminative features for the SVM classifier given enough training samples. When looking at BES and MES – two datasets that only differ in the recorded subjects and their spoken language – it can be seen that the system has problems consistently handling small datasets with a small number of speakers: in the case of BES, DEEP SPECTRUM achieves the best test set result of all the investigated baselines, while it is considerably less performant on MES. The nature of these datasets also has an impact on the other approaches, but the effect is overall more pronounced here.

TABLE 3: Baseline results in UAR for the different approaches: a ResNet trained on each EMOSET database individually, an eGeMAPS + SVM, and a DEEP SPECTRUM + SVM baseline. For the ResNets trained from scratch, two different sets of results are presented: first, the evaluation results for the best training epoch (measured in development UAR); second, the results achieved at the very end of training. Further, for each dataset the chance level is given (in UAR).

All values in % UAR; devel/test pairs are given per approach.

| Dataset | Chance | ResNet (best) devel / test | ResNet (final) devel / test | eGeMAPS devel / test | DEEP SPECTRUM devel / test |
| --- | --- | --- | --- | --- | --- |
| ABC | 25.0 | 44.9 / 41.9 | 42.7 / 39.4 | 47.0 / 54.4 | 45.2 / 33.5 |
| AD | 50.0 | 86.1 / 71.4 | 82.4 / 72.9 | 88.5 / 78.7 | 82.5 / 71.9 |
| BES | 16.7 | 65.0 / 55.0 | 63.3 / 53.3 | 53.3 / 58.3 | 55.0 / 63.3 |
| CASIA | 16.7 | 36.3 / 18.7 | 29.3 / 23.3 | 33.7 / 24.7 | 26.3 / 31.3 |
| CVE | 14.3 | 35.6 / 30.3 | 30.9 / 34.2 | 58.9 / 60.4 | 48.5 / 52.2 |
| DEMoS | 14.3 | 89.0 / 73.8 | 88.9 / 73.6 | 38.4 / 43.2 | 53.0 / 46.9 |
| DES | 20.0 | 34.7 / 43.3 | 21.9 / 52.6 | 31.0 / 46.5 | 23.0 / 34.4 |
| EA-ACT | 14.3 | 32.9 / 35.7 | 12.9 / 50.0 | 22.9 / 57.1 | 24.3 / 57.1 |
| EA-BMW | 33.3 | 73.4 / 46.7 | 71.4 / 56.3 | 79.8 / 56.3 | 61.1 / 59.3 |
| EA-WSJ | 50.0 | 100.0 / 98.1 | 100.0 / 98.1 | 100.0 / 100.0 | 98.1 / 98.1 |
| EMO-DB | 14.3 | 68.8 / 59.2 | 63.5 / 61.6 | 71.9 / 48.4 | 52.5 / 61.3 |
| eNTERFACE | 16.7 | 71.5 / 81.0 | 71.1 / 82.4 | 44.8 / 49.1 | 50.0 / 46.2 |
| EU-EV | 5.6 | 12.7 / 10.4 | 6.3 / 11.1 | 7.8 / 10.9 | 8.1 / 11.0 |
| EmoFilm | 20.0 | 46.3 / 46.3 | 44.4 / 47.0 | 46.9 / 54.6 | 46.8 / 54.7 |
| EmotiW 2014 | 14.3 | 28.1 / 24.1 | 24.0 / 24.3 | 30.5 / 30.6 | 33.3 / 29.2 |
| FAU Aibo | 20.0 | 53.3 / 37.2 | 51.7 / 36.9 | 46.2 / 38.3 | 43.8 / 36.8 |
| GEMEP | 5.9 | 46.0 / 29.1 | 44.6 / 29.8 | 41.1 / 35.4 | 35.2 / 22.4 |
| GVEESS | 7.7 | 26.9 / 24.7 | 17.8 / 20.4 | 38.5 / 21.2 | 32.7 / 27.9 |
| IEMOCAP | 25.0 | 52.7 / 53.6 | 49.6 / 57.6 | 56.2 / 58.4 | 54.4 / 57.7 |
| MELD | 16.7 | 24.8 / 21.7 | 22.4 / 20.2 | 23.7 / 23.1 | 23.8 / 22.2 |
| MES | 16.7 | 66.7 / 75.0 | 56.7 / 70.0 | 61.7 / 61.7 | 38.3 / 43.3 |
| PPMMK | 25.0 | 70.1 / 40.6 | 69.5 / 39.3 | 48.5 / 40.4 | 58.6 / 40.8 |
| SIMIS | 20.0 | 40.7 / 30.5 | 38.7 / 24.7 | 38.5 / 31.7 | 33.3 / 29.5 |
| SmartKom | 14.3 | 20.2 / 20.8 | 11.3 / 23.3 | 25.2 / 28.6 | 25.7 / 26.2 |
| SUSAS | 25.0 | 64.2 / 56.5 | 54.2 / 59.4 | 56.6 / 56.5 | 54.6 / 59.3 |
| TurkishEmo | 25.0 | 62.5 / 55.7 | 56.8 / 58.0 | 64.8 / 56.8 | 54.5 / 53.4 |

7.1.3 ResNet from scratch to measure the model’s performance and generalisation capabilities during training. Furthermore, with the help of Results for the last baseline against which transfer learning the validation data learning rate adjustments are defined experiments with residual adapters are evaluated were ob- and the training is stopped early before performing the tained by training a ResNet with architecture and training final evaluation on the held-out test set. Thus, the samples settings as described in Section 5.2.2 on each individual task. in the validation partition have no immediate impact on A first set of development and test results restore the model the weights of the trained model which might limit the which achieved the best UAR on the development partition generalisation capabilities especially for datasets containing for evaluation on the held-out test set. For the second type of only a few number of samples or speakers. results , the development and test set UARs achieved at the As the validation and testing splits contain different very end of the training are reported. The reason for report- sets of speakers, cases exist where validation and testing ing both of those sets of values is motivated by the nature performance diverge heavily. For example, for DES, the of the training process and the databases. Training of the model achieves a test set UAR of 43.3 % against the overall networks, as described in Section 5.2.2 always starts with a best development performance of 34.7 %, but at the end high learning rate which is reduced in steps after validation of training, this discrepancy increases to 52.6 % on test, performance has not increased for a certain patience period. and only a mere 21.9 % on development which is near As many datasets contain only a few hundred samples, this chance-level. A very similar picture is observed for EA- can lead the model to explore into an area of the learning ACT, where development and test performance diverge to landscape which – by chance – results in very good results 12.9 % against 50.0 %. The performance on these datasets on the development partition that do not correspond to is further hindered by the fact that they only contain a similar results on the test partition very early in the training small number of training and validation samples. Also, a process. If by the end of training, better performance on the few of the datasets can be identified as being challenging development partition has not been achieved, this very early for the ResNet model in general, such as EU-EV, which is model is used for the final evaluation on the test partition. a corpus with a very large number of annotated emotion All results can be found alongside the other two base- classes in three different languages. In addition, the train- lines in Table 3. In the other two baseline approaches, the ing and evaluation setup chosen for EMOSET once more usage of a linear SVM has an advantage over the neural increases the difficulty by partitioning based on language. network approach, in that the combined training and vali- In the end, UARs of around 10 % are achieved by the dation data can be used to perform a final fit of the classifier model if only the corpus itself is used for training which after having found the optimal training parameters. In the is in line with the other baselines. For EmotiW 2014, the ResNet experiments, however, the validation data is used challenge lies in its multi-modal nature. The corpus addi- 12

As the validation and testing splits contain different sets of speakers, cases exist where validation and testing performance diverge heavily. For example, for DES, the model achieves a test set UAR of 43.3 % against the overall best development performance of 34.7 %, but at the end of training, this discrepancy increases to 52.6 % on test and a mere 21.9 % on development, which is near chance level. A very similar picture is observed for EA-ACT, where development and test performance diverge to 12.9 % against 50.0 %. The performance on these datasets is further hindered by the fact that they contain only a small number of training and validation samples. A few of the datasets can also be identified as challenging for the ResNet model in general, such as EU-EV, a corpus with a very large number of annotated emotion classes in three different languages. In addition, the training and evaluation setup chosen for EMOSET once more increases the difficulty by partitioning based on language. In the end, UARs of around 10 % are achieved by the model if only the corpus itself is used for training, which is in line with the other baselines. For EmotiW 2014, the challenge lies in its multi-modal nature: the corpus additionally contains visual content, which is immensely helpful for the identification and discrimination of emotions through the analysis of facial characteristics. As could already be seen in the baseline using eGeMAPS features and a linear SVM classifier, information extracted from the audio content alone does not lead to favourable results for this corpus. On both SIMIS and SmartKom, the ResNet achieves only very weak results which are slightly above chance level; here, DEEP SPECTRUM and eGeMAPS perform better. For eNTERFACE, on the other hand, this approach substantially outperforms the DEEP SPECTRUM system, reaching a test set UAR of over 80 %. Furthermore, for this dataset, the UARs at the best and final model checkpoints are consistent, i. e., the model converged to an optimum. Especially weak performance can be found on CVE, with the ResNet trained from scratch falling behind the other two baselines by more than 20 percentage points UAR compared to eGeMAPS and around 15 percentage points compared to DEEP SPECTRUM on both development and test partitions. Having only four speakers in total, this is another dataset where the danger of overfitting to the training data is especially high for a deep learning model trained on the raw audio content. Here, eGeMAPS and DEEP SPECTRUM have the advantage of utilising abstracted feature representations in the form of an expert-designed, hand-crafted audio descriptor set and high-level image representations learnt from the task of object recognition, respectively. The strongest ResNet result is achieved for DEMoS with a test set UAR of 73.8 %, which is around 30 percentage points above the other baseline methods. This can be explained by the partitioning of the dataset for EMOSET, which separates non-prototypical from prototypical emotion portrayals in a speaker dependent way. As the ResNet approach learns most directly from the raw input data, fine-grained, emotionally discriminative features for each speaker can be learnt from the low-level spectrogram representation, whereas the layers of abstraction added to the feature representation in the eGeMAPS and DEEP SPECTRUM baselines hide away this information.

7.2 Parallel Residual Adapters

In the case of the parallel residual adapter models trained on EMOSET, the experiments and their evaluation proceed in a slightly different way. In a first set of experiments, denoted as “single-task transfer”, four tasks out of EMOSET were chosen as base tasks for pre-training the deep learning architectures, while for “multi-task transfer”, all EMOSET corpora are used to pre-train a shared model. In both cases, a transfer experiment is afterwards run for each EMOSET task by training only the adapter modules on the individual task's data. Finally, the performance of the transfer learning approach is evaluated by comparing the achieved test set UAR to that of a model with the same architecture trained from scratch on the specific task, and to more traditional transfer approaches, i. e., full finetuning and classifier head tuning.
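To illustrate the parallel residual adapter idea transferred from multi-domain visual recognition [26], [27], the following sketch shows a convolutional block in which a 3 × 3 convolution is shared across corpora while each corpus owns a lightweight 1 × 1 adapter and its own batch normalisation; the actual EMONET blocks additionally contain residual connections and a 2D attention module, so this is a simplified illustration rather than the implemented architecture.

```python
import torch
import torch.nn as nn


class ParallelAdapterConv(nn.Module):
    """Shared 3x3 convolution with a parallel 1x1 adapter per corpus.

    A minimal sketch of the parallel residual adapter idea [26], [27];
    not the exact EMONET block.
    """

    def __init__(self, in_channels, out_channels, corpus_names, stride=1):
        super().__init__()
        self.shared = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                                stride=stride, padding=1, bias=False)
        # One lightweight 1x1 adapter and one batch norm per corpus.
        self.adapters = nn.ModuleDict({
            name: nn.Conv2d(in_channels, out_channels, kernel_size=1,
                            stride=stride, bias=False)
            for name in corpus_names})
        self.norms = nn.ModuleDict({
            name: nn.BatchNorm2d(out_channels) for name in corpus_names})

    def forward(self, x, corpus):
        # Shared filters plus the corpus-specific parallel adapter.
        out = self.shared(x) + self.adapters[corpus](x)
        return torch.relu(self.norms[corpus](out))


# The 1x1 adapters add roughly 1/9 of the parameters of the shared 3x3
# filters per corpus, which is how the multi-corpus model stays compact.
block = ParallelAdapterConv(64, 64, corpus_names=["EMO-DB", "IEMOCAP"])
features = block(torch.randn(4, 64, 32, 32), corpus="EMO-DB")
```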
7.2.1 Single-Task Transfer

The EMOSET corpora DEMoS, FAU Aibo, GEMEP, and IEMOCAP are chosen as base pre-training tasks based on a number of qualitative and quantitative characteristics: with DEMoS, IEMOCAP, and GEMEP, three acted SER corpora are included which contain either a large number of training samples (DEMoS and IEMOCAP) or a large number of annotated classes (GEMEP).

For brevity, we leave out the detailed results of these transfer experiments and only summarise our findings. When comparing against a ResNet that is trained on the target corpus from scratch, adapter tuning from a model trained on a single source corpus achieved mixed results. While for some, especially smaller, tasks, such as BES or the Speech Emotion Database of the Institute of Automation of the Chinese Academy of Sciences (CASIA), increases in UAR could be observed, in most cases, performance on both development and test partitions stayed roughly the same. A noteworthy negative example was eNTERFACE, for which the performance drops noticeably compared to training a full model on only the target data: a model trained exclusively on this dataset achieves a UAR of 81 %, while the best result across the transfer experiments was more than 5 percentage points below that, at 75 % UAR. On the other hand, SmartKom, a large but difficult corpus of natural emotional speech, seemed to benefit from pre-training and adapter transfer in all cases. Nevertheless, these results are encouraging when observed from another perspective: they show that the features learnt by the ResNet model from the Mel-spectrograms of one SER corpus can be used reasonably well for a large range of SER tasks by simply introducing a small number of additional parameters – the adapter modules.

In addition to comparisons against training models on each task from scratch, the residual adapter approach should be related to more traditional finetuning strategies. The pre-trained models can be taken as feature extractors, with only the last classification layers re-trained for each task. This method tunes an even smaller number of parameters for each corpus, further decreasing the risk of overfitting but also reducing learning capacity.

A comparison between adapter and classifier head tuning is made in Figure 7a. Apart from very few exceptions, adapter tuning beats the feature-extraction transfer approach in terms of UAR on both the development and test partitions. Notable outliers can be found with CASIA and CVE, hinting at possible overfitting with the adapter approach. Moreover, for datasets with a large number of classes, e. g., GVEESS or EU-EV, performance on test can vary greatly from run to run, leading to classifier head tuning sometimes outperforming adapter tuning.

7.2.2 Multi-Task Transfer

The second set of experiments using the residual adapters trains a shared base network for all EMOSET datasets, while only the adapter modules and final classification layers are specific to each dataset. In this respect, every task influences the weights of the shared feature extraction base while the model still contains task-specific parameters to account for inter-corpus variance. As sharing the 2D attention layer between different corpora was found to have only a minor impact on performance in the initial single-task transfer experiments (cf. Section 7.2.1), for multi-domain training, this module is also contained in the shared base model.
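The transfer settings compared in this section – full finetuning, adapter tuning, and classifier head tuning – differ only in which parameters are allowed to change. The sketch below illustrates this under the assumption that corpus-specific adapters, normalisation layers, and classifier heads carry the corpus name in their parameter names (a naming convention adopted purely for illustration, not necessarily that of EMONET):

```python
import torch.nn as nn


def select_trainable(model: nn.Module, strategy: str, corpus: str):
    """Mark which parameters are updated when transferring to `corpus`.

    Assumes corpus-specific modules (adapters, batch norms, classifier
    heads) carry the corpus name in their parameter names, e.g. because
    they live in nn.ModuleDict containers keyed by corpus name -- an
    illustrative convention only.
    """
    for name, parameter in model.named_parameters():
        if strategy == "full_finetuning":
            parameter.requires_grad = True                  # update everything
        elif strategy == "adapter_tuning":
            # corpus-specific adapters, norms, and the classifier head only
            parameter.requires_grad = corpus in name
        elif strategy == "head_tuning":
            # feature-extraction transfer: re-train the classifier head only
            parameter.requires_grad = corpus in name and "head" in name
        else:
            raise ValueError(f"unknown transfer strategy: {strategy}")
    return [p for p in model.parameters() if p.requires_grad]


# e.g. optimise only the adapters and head of a hypothetical multi-corpus model:
# optimiser = torch.optim.SGD(select_trainable(model, "adapter_tuning", "EMO-DB"), lr=0.01)
```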

[Fig. 7: two heatmaps of per-corpus development and test UARs; only the captions and axis information are reproduced here.]
(a) UARs achieved on the development and test partition of each EMOSET corpus when performing adapter transfer learning from a ResNet pre-trained on one of four main tasks (main_task: DEMoS, FAU Aibo, GEMEP, IEMOCAP). The colours represent increases and decreases from classifier head tuning.
(b) UARs achieved at the best epoch on the development and test partition of each EMOSET corpus when performing adapter transfer learning from a ResNet with a shared 2D attention module pre-trained in different multi-task settings on the EMOSET corpora (main_task: all, Arousal, Valence, A+V). The colours of the heatmap represent the increase or decrease, in percentage points, from the UARs achieved by a model trained solely on the target corpus.

TABLE 4: Comparison of achieved test set results in the transfer learning experiments with residual adapters utilising all of EMOSET. “All” denotes training a single multi-corpus model with classifier heads for every database in a round-robin fashion. McNemar's test is used to test for statistical difference (p < 0.05) of the proportion of errors made by the transfer models compared to the baseline. “+” denotes a statistically significant improvement, “−” a decrease. Note that a performance increase according to the McNemar test does not necessarily correspond to a higher UAR for very imbalanced datasets.

Test set UAR [%]
Dataset        chance   baseline   all      Arousal   Valence   A+V
ABC            25.0     41.9       38.4     37.5      34.7      35.8
AD             50.0     71.4       72.2     61.3−     68.2      72.2
BES            16.7     55.0       66.7+    60.0      71.7+     58.3
CASIA          16.7     18.7       32.0+    37.3+     26.3+     31.7+
CVE            14.3     30.3       47.6+    34.8+     35.8+     35.5+
DEMoS          14.3     73.8       73.9     69.7−     63.7−     68.7−
DES            20.0     43.3       54.0     46.0      55.6+     56.1+
EA-ACT         14.3     35.7       40.0     51.4      51.4+     32.9
EA-BMW         33.3     46.7       43.3     50.0      36.7      39.6
EA-WSJ         50.0     98.1       98.1     100.0     100.0     100.0
EMO-DB         14.3     59.2       72.6+    68.4+     64.8      61.1
eNTERFACE      16.7     81.0       70.0−    62.0−     59.5−     61.4−
EU-EV          5.6      10.4       11.1+    9.1       9.2       9.5−
EmoFilm        20.0     46.3       48.2     45.0      44.5      47.0
EmotiW 2014    14.3     24.1       29.4     24.4      24.2      17.9−
FAU Aibo       20.0     37.2       34.0−    37.4−     35.1−     34.7
GEMEP          5.9      29.1       29.1     29.2      27.3      26.8
GVEESS         7.7      24.7       18.1     23.9      24.7      21.3
IEMOCAP        25.0     53.6       52.1     54.4      51.4−     53.1−
MELD           16.7     21.7       20.2+    20.3+     19.4+     20.7+
MES            16.7     75.0       58.3−    63.3−     56.7−     56.7−
PPMMK          25.0     40.6       47.8+    41.6      44.7      43.4
SIMIS          20.0     30.5       28.0−    26.5      26.4      27.5−
SmartKom       14.3     20.8       21.7+    24.2+     29.3+     21.0
SUSAS          25.0     56.5       53.9−    59.6      42.4−     44.9−
TurkishEmo     25.0     55.7       70.5+    61.4      56.8      52.3

Here, the performance differences between models trained from scratch and models adapted from a pre-trained shared base are further evaluated by performing McNemar's test [108]. Statistically significant (p < 0.05) differences in the proportion of errors are marked in Table 4. It should be noted that, due to the highly unbalanced nature of some of the included databases, a performance increase as measured by the test does not necessarily correspond to a higher achieved test set UAR. In the following, whenever differences in the results are described as significant, they are so at p < 0.05.
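The significance markers in Table 4 can be derived from paired per-sample predictions as sketched below; we use the exact binomial form of McNemar's test [108] as implemented in statsmodels, which is an illustrative choice since the test variant is not restated in this section.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar


def mcnemar_marker(y_true, pred_baseline, pred_transfer, alpha=0.05):
    """Compare the paired per-sample errors of two models with McNemar's test.

    Only samples on which exactly one of the two models errs drive the test,
    which is why a significant result need not imply a higher UAR on
    imbalanced data.
    """
    y_true = np.asarray(y_true)
    err_base = np.asarray(pred_baseline) != y_true
    err_trans = np.asarray(pred_transfer) != y_true
    # 2x2 contingency table of (baseline correct/wrong) x (transfer correct/wrong).
    table = [[np.sum(~err_base & ~err_trans), np.sum(~err_base & err_trans)],
             [np.sum(err_base & ~err_trans), np.sum(err_base & err_trans)]]
    result = mcnemar(table, exact=True)  # exact binomial variant
    if result.pvalue >= alpha:
        return "not significant"
    return "+" if err_trans.sum() < err_base.sum() else "-"
```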
The results for the multi-task transfer experiments with a ResNet architecture are visualised in Figure 7b. Four different settings are analysed: first, training on all corpora in a multi-domain setting as described in Section 6.1; the other three settings are given by training on the aggregated arousal and valence mapped data, either together (A+V) or separately (A and V). Apart from a couple of outliers, model performance for all of EMOSET's corpora either increases or stays the same when using the adapter transfer approach, compared to training a full ResNet model for every dataset. Two negative examples are DES and EA-ACT, where performance on the development partition only ever slightly increases above chance level. Contrary to this, the performance on the test partition shows an increase for these two datasets that is significant in the case of the valence pre-trained model. As is already evident in the baseline results, this behaviour is most likely due to these corpora containing only a small number of samples and to DES's validation and test partitions each containing only one speaker, leading to diverging results. Again, CASIA and CVE seem to benefit from the transfer learning approach, with increases in both development and test set UARs over their near chance-level performance when training a full ResNet model on their corpus data alone. For the particularly popular baseline SER corpus EMO-DB, performance is increased on both the development and test partitions by around 10 percentage points; the same is true for the Turkish emotional speech database. Both of these increases are further statistically significant as measured by McNemar's test.

For the choice of training data, training on all of EMOSET in a round-robin fashion seems to lead to the best results on both development and test partitions. However, it is closely followed by training the model on the aggregated arousal data alone – suggesting that the features learnt from discriminating arousal across corpora can be effectively tuned for various SER tasks with the help of residual adapter modules. As arousal is generally easier to detect from audio recordings of speech than valence, which is more effectively conveyed and perceived through visual information such as gestures and facial expressions, training on the three-class valence problem might have been too difficult for the network, thus not leading to emotionally salient feature representations. Further, when combining the arousal and valence aggregated corpora for training, the individual strengths of pre-training on either corpus do not seem to be complementary.

As evident from Table 4, the multi-task transfer experiments lead to increased results for 21 of the 26 databases included in EMOSET. For ten corpora, some of these increases are further statistically significant. Only for two databases – eNTERFACE and MES – are results always significantly worse. For eNTERFACE, this can be explained by the strong performance achieved by a model trained on the corpus from scratch, which also beats all of the other considered baselines (cf. Table 3).
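The arousal and valence aggregation referred to above amounts to re-labelling each corpus before training; the category-to-dimension mapping in the sketch below is purely hypothetical, and the binary-arousal/three-class-valence split is likewise an assumption consistent only with the three-class valence problem mentioned in this section.

```python
# Hypothetical category-to-dimension mapping, for illustration only; the
# actual aggregation scheme used for EMOSET is defined with the corpus.
AROUSAL = {"anger": "high", "happiness": "high", "fear": "high",
           "sadness": "low", "boredom": "low", "neutral": "low"}
VALENCE = {"happiness": "positive", "neutral": "neutral", "anger": "negative",
           "sadness": "negative", "fear": "negative", "boredom": "negative"}


def aggregate(samples):
    """Map (path, categorical label) pairs to arousal and valence targets.

    "A" pre-trains on the arousal-mapped data, "V" on the valence-mapped
    data, and "A+V" uses both mapped corpora as pre-training tasks.
    """
    arousal = [(path, AROUSAL[label]) for path, label in samples]
    valence = [(path, VALENCE[label]) for path, label in samples]
    return arousal, valence


# Example with made-up file names:
corpus = [("audio/clip_001.wav", "happiness"), ("audio/clip_002.wav", "sadness")]
arousal_task, valence_task = aggregate(corpus)
```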
8 CONCLUSION AND OUTLOOK

In this manuscript, we presented a novel deep learning based, multi-corpus SER framework – EMONET – that makes use of state-of-the-art deep transfer learning approaches. We publicly release the framework, including a flexible command line interface, on GitHub, such that interested researchers can use it in their own work for a variety of multi-corpus speech and audio analysis tasks. EMONET was investigated and evaluated for the task of multi-corpus SER on a collection of 26 existing SER corpora. EMONET adapts the residual adapter approach for multi-domain visual recognition to the task of SER from Mel-spectrograms.

Needing only a small portion of additional parameters for each database, the approach allowed for effective multi-domain training on EMOSET, leading to favourable results when compared to training models from scratch or adapting only the classifier head to each corpus. When all of EMOSET is utilised for training, either in a multi-domain fashion or by aggregating the corpora by mapping the included categories to arousal and valence classes, test set performance increases could be achieved for 21 of the 26 corpora. For ten databases, these improvements were statistically significant, while there are only two datasets (eNTERFACE and MES) that seem to always be negatively affected by the approach. Compared to fully finetuning a pre-trained model (which requires around ten times the number of trainable parameters), the adapter approach often came out on top.

The results of utilising the residual adapter model for transfer and multi-domain learning in SER motivate further research and exploration. One limitation of the work presented herein is that the base architecture of the ResNet has not been extensively optimised for a speech recognition task. Here, different configurations and variations, e. g., of the number of filters and the depth and width of the network, should be evaluated. Moreover, for purposes of constraining computational and time requirements in favour of exploring a wider range of transfer learning settings, the model was kept quite small. Increasing the model size could further improve performance but would require adding a larger amount of training data. This training data could come from large-scale audio recognition databases that are not immediately related to SER, such as AudioSet [47] or the large-scale speaker recognition dataset VoxCeleb [109]. Having found an optimised model architecture for training, improvements could further be made by experimenting with the degree of influence each EMOSET corpus has on the shared model weights during training. This could, for example, be investigated by adjusting the probability of sampling a batch from a specific dataset compared to the default round-robin strategy utilised in this manuscript. For the different problem of multi-lingual large-scale ASR, residual adapters and probabilistic sampling have been explored in combination with an RNN architecture trained on Mel-spectrogram input [110].
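The difference between the default round-robin schedule and such a size-dependent probabilistic sampling scheme, as used for multilingual ASR in [110], is sketched below; the corpus sizes and the temperature value are made-up placeholders rather than EMOSET statistics.

```python
import random


def corpus_schedule(corpus_sizes, num_batches, strategy="round_robin",
                    temperature=0.5):
    """Decide from which corpus each training batch is drawn.

    Round-robin is the default strategy used in this manuscript; the
    probabilistic schedule is a sketch of the outlined extension, with the
    temperature flattening the corpus-size imbalance.
    """
    names = list(corpus_sizes)
    if strategy == "round_robin":
        return [names[i % len(names)] for i in range(num_batches)]
    if strategy == "probabilistic":
        weights = [corpus_sizes[name] ** temperature for name in names]
        return random.choices(names, weights=weights, k=num_batches)
    raise ValueError(f"unknown strategy: {strategy}")


# Placeholder corpus sizes, not the real EMOSET statistics:
sizes = {"corpus_a": 10000, "corpus_b": 500, "corpus_c": 2000}
print(corpus_schedule(sizes, num_batches=6))
print(corpus_schedule(sizes, num_batches=6, strategy="probabilistic"))
```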
As RNNs are a popular choice for SER [53], [57], evaluating the residual adapter approach with these networks in a multi-corpus training setting should be considered. Furthermore, both CNNs and RNNs could be modified with adapter modules and then trained simultaneously, combining their high-level feature representations; for single-corpus SER without adapter modules, this has been done in [105] with state-of-the-art results. Moreover, so far only SER corpora with categorical labels have been considered. Using a multi-domain learning model based on residual adapters, adding databases that are labelled with the dimensional approach and pose regression problems would be possible, further increasing the size of the training data. Finally, transferring knowledge between different domains of paralinguistic speech recognition, e. g., the detection of deception from speech, with the help of the adapter approach can be investigated.

ACKNOWLEDGMENTS

This research was partially supported by Deutsche Forschungsgemeinschaft (DFG) under grant agreement No. 421613952 (ParaStiChaD), and Zentrales Innovationsprogramm Mittelstand (ZIM) under grant agreement No. 16KN069455 (KIRun).

REFERENCES

[1] M. B. Hoy, "Alexa, siri, cortana, and more: an introduction to voice assistants," Medical reference services quarterly, vol. 37, no. 1, pp. 81–88, 2018.
[2] G. López, L. Quesada, and L. A. Guerrero, "Alexa vs. siri vs. cortana vs. google assistant: a comparison of speech-based natural user interfaces," in Proceedings of the International Conference on Applied Human Factors and Ergonomics. Springer, 2017, pp. 241–250.
[3] A. Ram, R. Prasad, C. Khatri, A. Venkatesh, R. Gabriel, Q. Liu, J. Nunn, B. Hedayatnia, M. Cheng, A. Nagar, E. King, K. Bland, A. Wartick, Y. Pan, H. Song, S. Jayadevan, G. Hwang, and A. Pettigrue, "Conversational ai: The science behind the alexa prize," 2018.
[4] H. J. Levesque, Common sense, the Turing test, and the quest for real AI. MIT Press, 2017.
[5] P. T. Young, "Emotion in man and animal; its nature and relation to attitude and motive." 1943.
[6] R. S. Woodworth, "Psychology," Henry Holt & Company, New York, 1940.
[7] P. T. Young, "Motivation of behavior." 1936.
[8] P. Salovey and J. D. Mayer, "Emotional intelligence," Imagination, cognition and personality, vol. 9, no. 3, pp. 185–211, 1990.
[9] J. E. Ciarrochi, J. Forgas, J. D. Mayer et al., Emotional intelligence in everyday life. Psychology Press/Erlbaum (UK) Taylor & Francis, 2006.
[10] J. D. Mayer, D. R. Caruso, and P. Salovey, "Emotional intelligence meets traditional standards for an intelligence," Intelligence, vol. 27, no. 4, pp. 267–298, 1999.
[11] J. D. Mayer, P. Salovey, D. R. Caruso, and G. Sitarenios, "Emotional intelligence as a standard intelligence." 2001.
[12] C. Becker, S. Kopp, and I. Wachsmuth, "Why emotions should be integrated into conversational agents," Conversational informatics: an engineering approach, pp. 49–68, 2007.
[13] G. Ball and J. Breese, "Emotion and personality in a conversational agent," Embodied conversational agents, pp. 189–219, 2000.
[14] L. Y. Mano, B. S. Faiçal, L. H. Nakamura, P. H. Gomes, G. L. Libralon, R. I. Meneguete, P. R. Geraldo Filho, G. T. Giancristofaro, G. Pessin, B. Krishnamachari et al., "Exploiting iot technologies for enhancing health smart homes through patient identification and emotion recognition," Computer Communications, vol. 89, pp. 178–190, 2016.

[15] D. Tacconi, O. Mayora, P. Lukowicz, B. Arnrich, C. Setz, G. Tröster, and C. Haring, "Activity and emotion recognition to support early diagnosis of psychiatric diseases," in Proceedings of the International Conference on Pervasive Computing Technologies for Healthcare. IEEE, 2008, pp. 100–102.
[16] S. Tokuno, G. Tsumatori, S. Shono, E. Takei, T. Yamamoto, G. Suzuki, S. Mituyoshi, and M. Shimura, "Usage of emotion recognition in military health care," in Proceedings of the Defense Science Research Conference and Expo (DSR). IEEE, 2011, pp. 1–5.
[17] N. Cummins, J. Epps, M. Breakspear, and R. Goecke, "An investigation of depressed speech detection: Features and normalization," in Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH), 2011.
[18] L.-S. A. Low, N. C. Maddage, M. Lech, L. B. Sheeber, and N. B. Allen, "Detection of clinical depression in adolescents' speech during family interactions," IEEE Transactions on Biomedical Engineering, vol. 58, no. 3, pp. 574–586, 2010.
[19] N. Cummins, S. Scherer, J. Krajewski, S. Schnieder, J. Epps, and T. F. Quatieri, "A review of depression and suicide risk assessment using speech analysis," Speech Communication, vol. 71, pp. 10–49, 2015.
[20] N. Cummins, J. Joshi, A. Dhall, V. Sethu, R. Goecke, and J. Epps, "Diagnosis of depression by behavioural signals: a multimodal approach," in Proceedings of the International Workshop on Audio/Visual Emotion Challenge. ACM, 2013, pp. 11–20.
[21] B. Stasak, J. Epps, N. Cummins, and R. Goecke, "An investigation of emotional speech in depression classification." in Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH), 2016, pp. 485–489.
[22] B. W. Schuller, "Speech emotion recognition: Two decades in a nutshell, benchmarks, and ongoing trends," Communications of the ACM, vol. 61, no. 5, pp. 90–99, 2018.
[23] A. R. Avila, Z. A. Momin, J. F. Santos, D. O'Shaughnessy, and T. H. Falk, "Feature pooling of modulation spectrum features for improved speech emotion recognition in the wild," IEEE Transactions on Affective Computing, 2018.
[24] S. G. Koolagudi and K. S. Rao, "Emotion recognition from speech: a review," International journal of speech technology, vol. 15, no. 2, pp. 99–117, 2012.
[25] A. Mollahosseini, B. Hasani, and M. H. Mahoor, "Affectnet: A database for facial expression, valence, and arousal computing in the wild," IEEE Transactions on Affective Computing, vol. 10, no. 1, pp. 18–31, 2017.
[26] S.-A. Rebuffi, H. Bilen, and A. Vedaldi, "Efficient parametrization of multi-domain deep neural networks," in Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2018, pp. 8119–8127.
[27] ——, "Learning multiple visual domains with residual adapters," in Advances in Neural Information Processing Systems, 2017, pp. 506–516.
[28] S. Amiriparian, M. Gerczuk, S. Ottl, N. Cummins, S. Pugachevskiy, and B. Schuller, "Bag-of-deep-features: Noise-robust deep feature representations for audio analysis," in Proceedings of the International Joint Conference on Neural Networks (IJCNN). Rio de Janeiro, Brazil: IEEE, July 2018, pp. 2419–2425.
[29] B. W. Schuller and A. M. Batliner, "Emotion, affect and personality in speech and language processing," 1988.
[30] H. Gunes and B. Schuller, "Categorical and dimensional affect analysis in continuous input: Current trends and future directions," Image and Vision Computing, vol. 31, no. 2, pp. 120–136, 2013.
[31] P. Ekman, "Basic emotions," Handbook of cognition and emotion, vol. 98, no. 45-60, p. 16, 1999.
[32] ——, "Expression and the nature of emotion," Approaches to emotion, vol. 3, no. 19, p. 344, 1984.
[33] P. E. Ekman and R. J. Davidson, The nature of emotion: Fundamental questions. Oxford University Press, 1994.
[34] N. H. Frijda et al., The emotions. Cambridge University Press, 1986.
[35] A. Mehrabian, Basic dimensions for a general psychological theory: Implications for personality, social, environmental, and developmental studies. Oelgeschlager, Gunn & Hain, Cambridge, MA, 1980, vol. 2.
[36] J. A. Russell, "A circumplex model of affect." Journal of personality and social psychology, vol. 39, no. 6, p. 1161, 1980.
[37] M. El Ayadi, M. S. Kamel, and F. Karray, "Survey on speech emotion recognition: Features, classification schemes, and databases," Pattern Recognition, vol. 44, no. 3, pp. 572–587, 2011.
[38] C.-N. Anagnostopoulos, T. Iliou, and I. Giannoukos, "Features and classifiers for emotion recognition from speech: a survey from 2000 to 2011," Artificial Intelligence Review, vol. 43, no. 2, pp. 155–177, 2015.
[39] S. Ottl, S. Amiriparian, M. Gerczuk, V. Karas, and B. Schuller, "Group-level speech emotion recognition utilising deep spectrum features," in Proceedings of the International Conference on Multimodal Interaction (ICMI), 2020, pp. 821–826.
[40] B. Schuller, Z. Zhang, F. Weninger, and G. Rigoll, "Selecting training data for cross-corpus speech emotion recognition: Prototypicality vs. generalization," in Proceedings of the Afeka-AVIOS Speech Processing Conference, Tel Aviv, Israel, 2011.
[41] ——, "Using multiple databases for training in emotion recognition: To unite or to vote?" in Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH), 2011.
[42] Z. Zhang, F. Weninger, M. Wöllmer, and B. Schuller, "Unsupervised learning in cross-corpus acoustic emotion recognition," in Proceedings of the Workshop on Automatic Speech Recognition & Understanding. IEEE, 2011, pp. 523–528.
[43] B. Schuller, B. Vlasenko, F. Eyben, M. Wöllmer, A. Stuhlsatz, A. Wendemuth, and G. Rigoll, "Cross-corpus acoustic emotion recognition: Variances and strategies," Transactions on Affective Computing, vol. 1, no. 2, pp. 119–131, 2010.
[44] H. Kaya and A. A. Karpov, "Efficient and effective strategies for cross-corpus acoustic emotion recognition," Neurocomputing, vol. 275, pp. 1028–1034, 2018.
[45] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates et al., "Deep speech: Scaling up end-to-end speech recognition," arXiv preprint arXiv:1412.5567, 2014.
[46] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen et al., "Deep speech 2: End-to-end speech recognition in english and mandarin," in Proceedings of the International Conference on Machine Learning, 2016, pp. 173–182.
[47] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, "Audio set: An ontology and human-labeled dataset for audio events," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 776–780.
[48] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[49] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2016, pp. 2818–2826.
[50] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2016, pp. 770–778.
[51] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, "Inception-v4, inception-resnet and the impact of residual connections on learning," in Proceedings of the Conference on Artificial Intelligence (AAAI), 2017.
[52] S. Hershey, S. Chaudhuri, D. P. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold et al., "Cnn architectures for large-scale audio classification," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 131–135.
[53] P. Tzirakis, J. Zhang, and B. W. Schuller, "End-to-end speech emotion recognition using deep neural networks," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5089–5093.
[54] S. Amiriparian, "Deep representation learning techniques for audio signal processing," Ph.D. dissertation, Technische Universität München, 2019.
[55] S. Amiriparian, A. Baird, S. Julka, A. Alcorn, S. Ottl, S. Petrović, E. Ainger, N. Cummins, and B. Schuller, "Recognition of echolalic autistic child vocalisations utilising convolutional recurrent neural networks," in Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH). Hyderabad, India: ISCA, September 2018, pp. 2334–2338.

[56] M. Neumann and N. T. Vu, "Attentive convolutional neural network based speech emotion recognition: A study on the impact of input features, signal length, and acted speech," arXiv preprint arXiv:1706.00612, 2017.
[57] S. Mirsamadi, E. Barsoum, and C. Zhang, "Automatic speech emotion recognition using recurrent neural networks with local attention," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 2227–2231.
[58] R. Girdhar and D. Ramanan, "Attentional pooling for action recognition," in Advances in Neural Information Processing Systems, 2017, pp. 34–45.
[59] Y. Zhang, J. Du, Z. Wang, J. Zhang, and Y. Tu, "Attention based fully convolutional network for speech emotion recognition," in Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE, 2018, pp. 1771–1775.
[60] W.-C. Lin and C. Busso, "An efficient temporal modeling approach for speech emotion recognition by mapping varied duration sentences into fixed number of chunks," Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH), pp. 2322–2326, 2020.
[61] B. W. Schuller, S. Steidl, A. Batliner, P. B. Marschik, H. Baumeister, F. Dong, S. Hantke, F. B. Pokorny, E.-M. Rathner, K. D. Bartl-Pokorny et al., "The interspeech 2018 computational paralinguistics challenge: Atypical & self-assessed affect, crying & heart beats." in Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH), 2018, pp. 122–126.
[62] C. Gorrostieta, R. Brutti, K. Taylor, A. Shapiro, J. Moran, A. Azarbayejani, and J. Kane, "Attention-based sequence classification for affect detection." in Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH), 2018, pp. 506–510.
[63] S. Latif, R. Rana, S. Khalifa, R. Jurdak, and B. W. Schuller, "Deep architecture enhancing robustness to noise, adversarial attacks, and cross-corpus setting for speech emotion recognition," arXiv preprint arXiv:2005.08453, 2020.
[64] T. Fujioka, T. Homma, and K. Nagamatsu, "Meta-learning for speech emotion recognition considering ambiguity of emotion labels," Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH), pp. 2332–2336, 2020.
[65] Q. Chu, W. Ouyang, H. Li, X. Wang, B. Liu, and N. Yu, "Online multi-object tracking using cnn-based single object tracker with spatial-temporal attention mechanism," in Proceedings of the International Conference on Computer Vision. IEEE, 2017, pp. 4836–4845.
[66] J. Wagner, D. Schiller, A. Seiderer, and E. André, "Deep learning in paralinguistic recognition tasks: Are hand-crafted features still relevant?" 2018.
[67] S. Amiriparian, M. Gerczuk, S. Ottl, N. Cummins, M. Freitag, S. Pugachevskiy, and B. Schuller, "Snore sound classification using image-based deep spectrum features," in Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH). Stockholm, Sweden: ISCA, August 2017, pp. 3512–3516.
[68] J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei, "Imagenet: A large-scale hierarchical image database," in Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR). Miami, FL: IEEE, 2009, pp. 248–255.
[69] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[70] N. Cummins, S. Amiriparian, G. Hagerer, A. Batliner, S. Steidl, and B. Schuller, "An image-based deep spectrum feature representation for the recognition of emotional speech," in Proceedings of the International Conference on Multimedia (MM). Mountain View, CA: ACM, October 2017, pp. 478–484.
[71] N. Cummins, S. Amiriparian, S. Ottl, M. Gerczuk, M. Schmitt, and B. Schuller, "Multimodal bag-of-words for cross domains sentiment analysis," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4954–4958.
[72] S. Amiriparian, N. Cummins, S. Ottl, M. Gerczuk, and B. Schuller, "Sentiment analysis using image-based deep spectrum features," in Proceedings of the Conference on Affective Computing and Intelligent Interaction (ACII), San Antonio, TX, 2017, pp. 26–29.
[73] S. Amiriparian, M. Gerczuk, E. Coutinho, A. Baird, S. Ottl, M. Milling, and B. Schuller, "Emotion and themes recognition in music utilising convolutional and recurrent neural networks," in MediaEval Benchmarking Initiative for Multimedia Evaluation, Sophia Antipolis, France, 2019.
[74] S. Amiriparian, N. Cummins, M. Gerczuk, S. Pugachevskiy, S. Ottl, and B. Schuller, "“are you playing a shooter again?!” deep representation learning for audio-based video game genre recognition," IEEE Transactions on Games, vol. 11, 2018.
[75] S. Amiriparian, M. Gerczuk, S. Ottl, L. Stappen, A. Baird, L. Koebe, and B. Schuller, "Towards cross-modal pre-training and learning tempo-spatial characteristics for audio recognition with convolutional and recurrent neural networks," EURASIP Journal on Audio, Speech, and Music Processing, vol. 2020, no. 1, pp. 1–11, 2020.
[76] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell, "Adversarial discriminative domain adaptation," in Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017, pp. 7167–7176.
[77] J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, and N. Lawrence, "Covariate shift and local learning by distribution matching," 2008.
[78] A. Gretton, A. Smola, J. Huang, M. Schmittfull, K. Borgwardt, and B. Schoelkopf, "Dataset shift in machine learning, chapter 8. covariate shift and local learning by distribution matching," 2009.
[79] M. Sugiyama, N. D. Lawrence, A. Schwaighofer et al., "Dataset shift in machine learning," 2017.
[80] Y. Ganin and V. Lempitsky, "Unsupervised domain adaptation by backpropagation," arXiv preprint arXiv:1409.7495, 2014.
[81] M. Abdelwahab and C. Busso, "Supervised domain adaptation for emotion recognition from speech," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5058–5062.
[82] A. Hassan, R. Damper, and M. Niranjan, "On acoustic emotion recognition: compensating for covariate shift," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 7, pp. 1458–1468, 2013.
[83] B. Schuller, D. Arsic, G. Rigoll, M. Wimmer, and B. Radig, "Audiovisual behavior modeling by combined feature spaces," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 2. IEEE, 2007, pp. II–733.
[84] T. L. Nwe, S. W. Foo, and L. C. De Silva, "Speech emotion recognition using hidden markov models," Speech communication, vol. 41, no. 4, pp. 603–623, 2003.
[85] Institute of Automation Chinese Academy of Sciences, "The selected speech emotion database of institute of automation chinese academy of sciences (casia)," 2010, http://www.chineseldc.org/resource info.php?rid=76.
[86] P. Liu and M. D. Pell, "Recognizing vocal emotions in mandarin chinese: A validated database of chinese vocal emotional stimuli," Behavior research methods, vol. 44, no. 4, pp. 1042–1051, 2012.
[87] E. Parada-Cabaleiro, G. Costantini, A. Batliner, M. Schmitt, and B. W. Schuller, "Demos: an italian emotional speech corpus," Language Resources and Evaluation, pp. 1–43, 2019.
[88] I. S. Engberg, A. V. Hansen, O. Andersen, and P. Dalsgaard, "Design, recording and verification of a danish emotional speech database," in Proceedings of the European Conference on Speech Communication and Technology, 1997.
[89] B. Schuller, "Automatische emotionserkennung aus sprachlicher und manueller interaktion," Ph.D. dissertation, Technische Universität München, 2006.
[90] F. Burkhardt, A. Paeschke, M. Rolfes, W. F. Sendlmeier, and B. Weiss, "A database of german emotional speech," in Proceedings of the European Conference on Speech Communication and Technology, 2005.
[91] A. Dhall, R. Goecke, J. Joshi, K. Sikka, and T. Gedeon, "Emotion recognition in the wild challenge 2014: Baseline, data and protocol," in Proceedings of the International Conference on Multimodal Interaction. ACM, 2014, pp. 461–466.
[92] O. Martin, I. Kotsia, B. Macq, and I. Pitas, "The enterface'05 audio-visual emotion database," in Proceedings of the International Conference on Data Engineering Workshops (ICDEW). IEEE, 2006, pp. 8–8.
[93] A. Lassalle, D. Pigat, H. O'Reilly, S. Berggen, S. Fridenson-Hayo, S. Tal, S. Elfström, A. Råde, O. Golan, S. Bölte, S. Baron-Cohen, and D. Lundqvist, "The eu-emotion voice database," Behavior Research Methods, vol. 51, no. 2, pp. 493–506, Apr 2019. [Online]. Available: https://doi.org/10.3758/s13428-018-1048-1

[94] E. Parada-Cabaleiro, G. Costantini, A. Batliner, A. Baird, and B. W. Schuller, "Categorical vs dimensional perception of italian emotional speech." in Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH), 2018, pp. 3638–3642.
[95] A. Batliner, S. Steidl, and E. Nöth, "Releasing a thoroughly annotated and processed spontaneous emotional database: the fau aibo emotion corpus," in Proceedings of a Satellite Workshop of the International Conference on Language Resources and Evaluation (LREC), vol. 2008, 2008, p. 28.
[96] T. Bänziger and K. R. Scherer, "Introducing the geneva multimodal emotion portrayal (gemep) corpus," Blueprint for affective computing: A sourcebook, pp. 271–294, 2010.
[97] T. Bänziger, M. Mortillaro, and K. R. Scherer, "Introducing the geneva multimodal expression corpus for experimental research on emotion perception." Emotion, vol. 12, no. 5, p. 1161, 2012.
[98] R. Banse and K. R. Scherer, "Acoustic profiles in vocal emotion expression." Journal of personality and social psychology, vol. 70, no. 3, p. 614, 1996.
[99] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, "Iemocap: Interactive emotional dyadic motion capture database," Language resources and evaluation, vol. 42, no. 4, p. 335, 2008.
[100] S. Poria, D. Hazarika, N. Majumder, G. Naik, E. Cambria, and R. Mihalcea, "Meld: A multimodal multi-party dataset for emotion recognition in conversations," arXiv preprint arXiv:1810.02508, 2018.
[101] B. Schuller, F. Eyben, S. Can, and H. Feussner, "Speech in minimal invasive surgery – towards an affective language resource of real-life medical operations," in Proceedings of the International Workshop on EMOTION (satellite of LREC): Corpora for Research on Emotion and Affect, Valletta, Malta, 01 2020, pp. 5–9.
[102] F. Schiel, S. Steininger, and U. Türk, "The smartkom multimodal corpus at bas." in Proceedings of the International Conference on Language Resources and Evaluation (LREC). Citeseer, 2002.
[103] J. H. Hansen and S. E. Bou-Ghazale, "Getting started with susas: A speech under simulated and actual stress database," in Proceedings of the European Conference on Speech Communication and Technology, 1997.
[104] F. Eyben, K. R. Scherer, B. W. Schuller, J. Sundberg, E. André, C. Busso, L. Y. Devillers, J. Epps, P. Laukka, S. S. Narayanan et al., "The geneva minimalistic acoustic parameter set (gemaps) for voice research and affective computing," IEEE Transactions on Affective Computing, vol. 7, no. 2, pp. 190–202, 2015.
[105] Z. Zhao, Z. Bao, Y. Zhao, Z. Zhang, N. Cummins, Z. Ren, and B. Schuller, "Exploring deep spectrum representations via attention-based recurrent and convolutional neural networks for speech emotion recognition," IEEE Access, vol. 7, pp. 97515–97525, 2019.
[106] Z. Zhao, Y. Zhao, Z. Bao, H. Wang, Z. Zhang, and C. Li, "Deep spectrum feature representations for speech emotion recognition," in Proceedings of the Joint Workshop on Affective Social Multimedia Computing and first Multi-Modal Affective Computing of Large-Scale Multimedia Data, 2018, pp. 27–33.
[107] S. Zagoruyko and N. Komodakis, "Wide residual networks," arXiv preprint arXiv:1605.07146, 2016.
[108] Q. McNemar, "Note on the sampling error of the difference between correlated proportions or percentages," Psychometrika, vol. 12, no. 2, pp. 153–157, 1947.
[109] A. Nagrani, J. S. Chung, and A. Zisserman, "Voxceleb: a large-scale speaker identification dataset," arXiv preprint arXiv:1706.08612, 2017.
[110] A. Kannan, A. Datta, T. N. Sainath, E. Weinstein, B. Ramabhadran, Y. Wu, A. Bapna, Z. Chen, and S. Lee, "Large-scale multilingual speech recognition with a streaming end-to-end model," arXiv preprint arXiv:1909.05330, 2019.