Gated Recurrent Unit (GRU) for Emotion Classification from Noisy Speech

Rajib Rana (Institute of Resilient Regions (IRR), University of Southern Queensland, Springfield, QLD 4300, Email: [email protected]), Julien Epps (School of Electrical Engineering and Telecommunications, University of New South Wales), Raja Jurdak (Distributed Sensing Systems, CSIRO-Data61), Xue Li (School of Information Technology and Electrical Engineering, University of Queensland), Roland Goecke (Information Technology & Engineering, University of Canberra), Margot Brereton (Computer Human Interaction, Queensland University of Technology), and Jeffrey Soar (School of Management and Enterprise, University of Southern Queensland)

Abstract—Despite the enormous interest in emotion classification from speech, the impact of noise on emotion classification is not well understood. This is important because, owing to the tremendous advancement of smartphone technology, the smartphone can be a powerful medium for speech emotion recognition in natural environments outside the laboratory, where background noise is likely to be present in the speech. We capitalize on the current breakthrough of Recurrent Neural Networks (RNNs) and seek to investigate their performance for emotion classification from noisy speech. We particularly focus on the recently proposed Gated Recurrent Unit (GRU), which is yet to be explored for emotion recognition from speech. Experiments conducted with speech compounded with eight different types of noise reveal that GRU incurs an 18.16% smaller run-time while performing quite comparably to the Long Short-Term Memory (LSTM), the most popular Recurrent Neural Network proposed to date. This result is promising for embedded platforms in general and will initiate further studies to utilize GRU to its full potential for emotion recognition on smartphones.

I. INTRODUCTION

Automatic speech emotion recognition has gained increasing interest in both research and commercial spaces due to its tremendous potential to determine affective states [1]. It can offer unprecedented opportunities for human-computer interaction systems; for example, recommendation systems can be designed to make more accurate affect-sensitive recommendations [2]. It will also be highly beneficial for the burgeoning health and wellbeing applications; for example, affect diaries can be built to keep people aware of any prolonged negative mood, which would potentially prevent the onset or relapse of affective disorders [3].

In recent years, Deep Learning models have revolutionized many fields, in particular automatic speech recognition and computer vision [4], [5]. These models have also achieved improved performance in emotion recognition compared to conventional algorithms; for example, [6], [7], [8], [9], [10], [11], [12] and [13] present some of the best reported results for affective speech analysis using deep learning models.

Out of the many forms of Deep Learning models, we consider Recurrent Neural Networks (RNNs), as they are targeted to model the long-range dependencies between successive observations. Speech emotion possesses temporal dependency, as it is unlikely to change rapidly between subsequent speech utterances [14]. Therefore, RNNs are most suitable for emotion recognition from speech. However, RNNs often suffer from the classical "vanishing" and "exploding" gradient problems, which result in failure to learn long-range dependencies. To avoid this, two variants of RNNs have been proposed that use a "gating" approach to circumvent these problems: (1) the Long Short-Term Memory (1997) [15] and (2) the Gated Recurrent Unit (2014) [16].

Smartphones are great platforms for speech emotion recognition, as people are close to their phones for extended periods of time. Research shows that almost 79% of people have their smartphones with them 22 hours a day (http://www.adweek.com/socialtimes/smartphones/480485). In addition, the processing capacity of modern smartphones is extraordinary. In the scope of this paper, we assume that speech samples are collected during phone conversations [17].

The key challenge of emotion recognition from phone conversations is background noise, as people can talk over the phone anywhere, including the cafeteria, park, office and many more places. The mainstream research in emotion recognition from speech has mainly focused on clean speech captured in controlled laboratory environments. Therefore, the impact of noise on emotion classification is not well understood [18].

Another major hurdle of running any Deep Learning model on a smartphone is computational complexity.
Despite the advancement in processing capacity, smartphones are still limited by battery power; therefore, a method with a small run-time is a must. Of LSTM and GRU, GRU has the relatively simpler architecture, so our focus is mainly on GRU. A number of studies have looked into the performance of LSTM (e.g., [19], [20], [21]) for emotion recognition from speech; however, the performance of GRU for this task is currently unexplored.

In this paper, we address the above two challenges. The contributions of this paper are as follows:

1) To the best of our knowledge, we for the first time analyze the accuracy and the run-time complexity of the Gated Recurrent Unit for emotion classification from speech.
2) We superimpose various real-life noises on clean speech and analyze the classification performance of the Gated Recurrent Unit under these noise exposures.
3) We use LSTM as the benchmark and conduct a detailed accuracy and run-time comparison of these two gated recurrent units for emotion classification from noisy speech.

The paper is organized as follows. In the next section, we provide background information on Recurrent Neural Networks, followed by a description of how GRU can be used for emotion classification from noisy speech in Section III. We then present the results and discussion in Section IV. This is followed by existing work in Section V; finally, we conclude in Section VI.

II. RECURRENT NEURAL NETWORKS

The conventional feed-forward neural network is trained on labeled data until it minimizes the prediction error. By contrast, the decision an RNN reaches at time step t-1 affects the decision it will reach at time step t. RNNs therefore have two input sources, the present and the recent past, which they use in combination when determining a response to a new input. An example RNN is shown in Figure 1 and the related symbols are defined in Table I.

Fig. 1: Recurrent Neural Networks.

TABLE I: Symbol Definitions

x_t ∈ R^d : the d-dimensional input vector
s_t ∈ R^p : the p-dimensional hidden unit
o_t ∈ R^d' : the d'-dimensional output
U ∈ R^(p×d) : maps the d-dimensional input to the p-dimensional hidden unit
V ∈ R^(d'×p) : maps the p-dimensional hidden unit to the d'-dimensional output
W ∈ R^(p×p) : maps the p-dimensional hidden unit to another hidden unit

A hidden state (s_t) is a non-linear transformation of a linear combination of the input (x_t) and the previous hidden state (s_{t-1}). The output (o_t) at time t depends only on the current hidden state s_t and is associated with a probability p_t determined through a "softmax" function. In neural networks, a softmax function is implemented in the final layer of a network used for classification: for a classification problem with K classes, it determines the probability of a probe being classified as each of the K classes.

s_t = tanh(W s_{t-1} + U x_t)
o_t = V s_t
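To make the recurrence concrete, the following minimal NumPy sketch implements one step of this RNN together with the softmax over K classes. It is our illustration rather than code from the paper; the weight shapes follow Table I.

```python
import numpy as np

def rnn_step(x_t, s_prev, U, W, V):
    """One step of the RNN in Fig. 1. Shapes: U is (p, d), W is (p, p), V is (K, p)."""
    s_t = np.tanh(W @ s_prev + U @ x_t)   # s_t = tanh(W s_{t-1} + U x_t)
    o_t = V @ s_t                         # o_t = V s_t
    p_t = np.exp(o_t - o_t.max())         # numerically stable softmax
    p_t /= p_t.sum()                      # class probabilities over the K classes
    return s_t, o_t, p_t
```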

Vanishing and exploding gradients: both of these terms were introduced by Bengio et al. [22] in 1994. The exploding gradient problem refers to a large increase in the norm of the gradient during training. The vanishing gradient problem refers to the opposite behavior, where long-term components go exponentially fast to norm 0, making it impossible for the model to learn correlations between temporally distant events. Any quantity multiplied frequently by an amount slightly greater than one can become immeasurably large (exploding); likewise, a quantity multiplied frequently by an amount slightly less than one shrinks towards zero (vanishing). Because the layers and time steps of deep neural networks relate to each other through multiplication, derivatives are susceptible to vanishing or exploding.

Encouragingly, there are a few ways to address these gradient problems. Proper initialization of the weight matrix (W) and regularization can reduce the impact of vanishing gradients. Another possible solution is to use the Rectified Linear Unit (ReLU) [23] instead of the conventional tanh or sigmoid activation functions. The ReLU derivative is either 0 or 1, so it is not as likely to cause a vanishing or exploding gradient. However, ReLU units can be fragile during training and can "die": a large gradient flowing through a ReLU neuron could cause the weights to update in such a way that the neuron never activates on any data point again, and as much as 40% of the neurons may never be activated across the entire training dataset if the learning rate is set too high. With a rigorously chosen learning rate, this can be avoided.
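The multiplicative effect described above can be reproduced with a toy computation (our illustration, not an experiment from the paper): backpropagating through T time steps multiplies the gradient by roughly the same factor T times, so it scales like |w|^T for a scalar recurrence.

```python
# Repeated multiplication by a factor slightly below or above one.
for w in (0.9, 1.1):
    grad = 1.0
    for _ in range(100):      # 100 time steps
        grad *= w
    print(f"w = {w}: gradient after 100 steps ~ {grad:.3e}")
# w = 0.9 -> ~2.66e-05 (vanishing); w = 1.1 -> ~1.38e+04 (exploding)
```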

An even more effective solution is to use the gated Recurrent Neural Network architectures: the Long Short-Term Memory (LSTM) or the Gated Recurrent Unit (GRU). LSTMs, first proposed by Hochreiter and Schmidhuber in 1997 [15], are among the most widely used Recurrent Neural Networks today. GRUs, first proposed in 2014 [16], are simplified versions of LSTMs. Both of these RNN architectures were purposely built to address the vanishing and exploding gradient problems and to efficiently learn long-range dependencies. To assist the understanding of GRU in the next section, we first describe LSTM.

Fig. 2: Gated Recurrent Neural Networks. (a) Long Short-Term Memory unit (i: input gate, f: forget gate, o: output gate). (b) Gated Recurrent Unit (z: update gate, r: reset gate).

A. Long Short-Term Memory Units (LSTMs)

The Long Short-Term Memory is a special kind of Recurrent Neural Network that eliminates the vanishing and exploding gradient shortcomings of plain RNNs. This makes LSTM suitable for learning from history to classify, process and predict time series when there are very long and unknown time lags between important events. An LSTM network is made up of LSTM blocks that have three gates, an input, an output and a forget gate, which help a block remember a value for an arbitrarily long time, forget the value when it is no longer important to remember, or output the value. LSTMs help preserve the error that can be backpropagated through time and layers. By maintaining a more constant error, they allow recurrent nets to continue to learn over many time steps, for example over 1000.

LSTMs contain information outside the normal flow of the recurrent network in a gated cell. Information can be stored in, written to, or read from a cell, much like data in a computer's memory. The cell makes decisions about what to store, and when to allow reads, writes and erasures, via gates that open and close. Unlike the digital storage on computers, however, these gates are analog, implemented with element-wise multiplication by sigmoids, which are all in the range [0, 1]. Analog has the advantage over digital of being differentiable, and therefore suitable for backpropagation.

These gates act on the signals they receive and, similar to the nodes of a neural network, they block or pass on information based on its strength and import, which they filter with their own sets of weights. These weights, like the weights that modulate the input and hidden states, are adjusted via the recurrent network's learning process. That is, the cells learn when to allow data to enter, leave or be deleted through the iterative process of making predictions, backpropagating error and adjusting weights via gradient descent.

LSTM memory cells give different roles to addition and multiplication in the transformation of the input. Instead of determining the subsequent cell state by multiplying the current state with the new input, they add the two, which helps them preserve a constant error when it must be backpropagated at depth. Different sets of weights filter the input for input, output and forgetting. The forget gate is represented as a linear identity function, because if the gate is open, the current state of the memory cell is simply multiplied by one, to propagate forward one more time step.

Although the purpose of LSTMs is to link distant occurrences to a final output, they need a forget gate. This can be justified with an example: while analyzing a text corpus, when the pointer comes to the end of a document, it could be the case that the next document has no correlation with the previous one. Therefore, the memory cell should be set to zero before the net ingests the first element of the next document.

An LSTM cell is shown in Fig. 2a. It is not limited to computing a weighted sum of the inputs and then applying a nonlinear function; rather, each j-th LSTM unit maintains a memory c_t^j at time t. The activation of the LSTM is

h_t^j = o_t^j tanh(c_t^j).

The output gate o_t^j modulates the amount of memory content exposure. With V_o as a diagonal matrix, it is calculated by

o_t^j = σ(W_o x_t + U_o h_{t-1} + V_o c_t)^j.

The memory c_t^j is updated by partially forgetting the existing memory and adding a new memory content c̃_t^j. The extent to which the existing memory is forgotten is controlled by a forget gate f_t^j:

f_t^j = σ(W_f x_t + U_f h_{t-1} + V_f c_{t-1})^j,

and the extent to which new memory content is added is controlled by an input gate i_t^j:

i_t^j = σ(W_i x_t + U_i h_{t-1} + V_i c_{t-1})^j.

Controlled by these gates, the existing memory is updated using the following equations:

c̃_t^j = tanh(W_c x_t + U_c h_{t-1})^j,
c_t^j = f_t^j c_{t-1}^j + i_t^j c̃_t^j.
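The equations above translate directly into code. The following NumPy sketch of one LSTM step is our illustration, assuming a parameter dictionary p that holds the weight matrices named in the equations, with the diagonal matrices V_i, V_f, V_o stored as vectors so their products become element-wise.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step following Section II-A."""
    i = sigmoid(p["Wi"] @ x_t + p["Ui"] @ h_prev + p["Vi"] * c_prev)  # input gate
    f = sigmoid(p["Wf"] @ x_t + p["Uf"] @ h_prev + p["Vf"] * c_prev)  # forget gate
    c_tilde = np.tanh(p["Wc"] @ x_t + p["Uc"] @ h_prev)               # new memory content
    c = f * c_prev + i * c_tilde                                      # memory update
    o = sigmoid(p["Wo"] @ x_t + p["Uo"] @ h_prev + p["Vo"] * c)       # output gate peeks at new memory
    h = o * np.tanh(c)                                                # activation
    return h, c
```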
B. Gated Recurrent Units (GRUs)

The Gated Recurrent Unit is a slightly more simplified variation of the LSTM. It combines the forget and input gates into a single "update gate" and has an additional "reset gate". The resulting model is simpler than standard LSTM models and is becoming increasingly popular.


Fig. 3: Experimental setting for emotion classification from noisy speech.

TABLE II: LSTM vs GRU

Controlled memory exposure. LSTM: the amount of memory seen by the other units of the network is controlled by the output gate. GRU: the whole memory is exposed to the network.

New memory computation. LSTM: no separate control over the amount of information flowing from the previous time step. GRU: controls the information flow from the previous activation.

Complexity vs performance. LSTM: with an additional gate, is likely to have higher complexity. GRU: has fewer parameters, thus may train comparatively faster or need less data to generalize.

A Gated Recurrent Unit, like an LSTM, modulates information inside the unit, however without having a separate memory cell (see Fig. 2b). The activation h_t^j of the GRU at time t is a linear interpolation between the previous activation h_{t-1}^j and the candidate activation h̃_t^j:

h_t^j = (1 - z_t^j) h_{t-1}^j + z_t^j h̃_t^j.

The update gate z_t^j decides how much the unit updates its activation:

z_t^j = σ(W_z x_t + U_z h_{t-1})^j.

The candidate activation h̃_t^j is computed similarly to the update gate:

h̃_t^j = tanh(W x_t + U(r_t ⊙ h_{t-1}))^j,

where r_t is a set of reset gates and ⊙ denotes element-wise multiplication. When the reset gate is off (r_t^j = 0), it allows the unit to forget the past. This is analogous to letting the unit act as if it were reading the first symbol of an input sequence. The reset gate is computed using the following formula:

r_t^j = σ(W_r x_t + U_r h_{t-1})^j.

The update gate z controls how much the past state should matter now. Units with short-term dependencies will have active reset gates r; units with long-term dependencies have active update gates z.

GRU and LSTM have many commonalities, yet there are some differences between the two, which we summarize in Table II.

III. GATED RECURRENT UNIT FOR EMOTION CLASSIFICATION FROM NOISY SPEECH

An emotion classification framework embodying a Gated Recurrent Unit is shown in Fig. 3. Our focus is on emotion classification from noisy speech, so we simulate noisy speech by superimposing various environmental noises on clean speech. Features are extracted from the noisy speech and fed to the GRU for emotion classification. We have used the same framework for LSTM to contrast its classification performance with that of GRU.

A. Description of Datasets

The Berlin emotional speech database [?] is used in the experiments for classifying discrete emotions. In this database, ten actors, five male and five female, each uttered ten sentences (5 short and 5 longer, typically between 1.5 and 4 seconds) in German to simulate seven different emotions: anger, boredom, disgust, anxiety/fear, happiness, neutral and sadness. Utterances scoring higher than an 80% emotion recognition rate in a subjective listening test are included in the database. We classify all seven emotions in this work. The numbers of speech files for these emotion categories in the Berlin database are: anger (127), boredom (81), disgust (46), fear (69), joy (71), neutral (79) and sadness (62). We use this prototypical database for the preliminary investigation of GRU's performance for emotion recognition.

In order to simulate noise-corrupted speech signals, the DEMAND (Diverse Environments Multi-channel Acoustic Noise Database) noise database [?] has been used in this paper. This database contains 18 types of noise, of which we have used noise from traffic, cafe, living room, park, washing, car, office and river environments. The recordings were captured at a sampling rate of 48 kHz and with a target length of 5 minutes (300 s). The actual audio capture time was slightly longer, allowing the removal of set-up noises and other artifacts by trimming. The recorded signals were not subject to any gain normalization; therefore, the original noise power in each environment was preserved.

B. GRU Implementation

We use the PyBrain toolbox [24] to implement the GRU. We evaluate the classification performance across three parameters: learning rate, number of cells and bias.
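For concreteness, the GRU recurrence of Section II-B can be written in a few lines of NumPy. This is an illustrative sketch of the computation a GRU layer performs internally, not the PyBrain API; the weight shapes are assumed so that the products are defined.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    """One GRU step following Section II-B; p holds the weight matrices."""
    z = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev)             # update gate
    r = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev)             # reset gate
    h_tilde = np.tanh(p["W"] @ x_t + p["U"] @ (r * h_prev))   # candidate activation
    return (1.0 - z) * h_prev + z * h_tilde                   # linear interpolation
```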

Fig. 4: Classification Performance - Impact of Noise, Bias and Learning Rate. (a) GRU - No Noise. (b) GRU - Traffic Noise. (c) LSTM - No Noise. (d) LSTM - Traffic Noise. Each panel plots validation error against learning rates from 1 down to 10^-9, with and without bias.

To use GRU and LSTM for classification, we use a softmax function in the output layer. The PyBrain package uses 75% of the data for training and 25% for validation. For accuracy, we report the validation error in the plots.

C. Feature Selection

For each speech segment, we choose 13 Mel-Frequency Cepstral Coefficients (MFCCs) as the feature vector; these have been used by many researchers (e.g., [25] and many more) for emotion classification from speech. We choose this small number of coefficients for dimensionality reduction of the feature space: large feature sets consume a significant amount of memory, along with computing and power resources, and they do not always contribute to improving the recognition rate.
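The front end of Fig. 3 (superimposing noise on clean speech, then extracting 13 MFCCs per frame) can be sketched as follows. The paper does not name its feature-extraction tooling or a mixing gain, so librosa, the file paths and the SNR parameter here are our assumptions; a gain of 1 corresponds to adding the noise at its original, unnormalized power.

```python
import librosa
import numpy as np

def noisy_mfcc(speech_path, noise_path, snr_db=10.0, sr=16000, n_mfcc=13):
    """Superimpose noise on clean speech and return a (frames, 13) MFCC sequence."""
    speech, _ = librosa.load(speech_path, sr=sr)
    noise, _ = librosa.load(noise_path, sr=sr)
    noise = np.resize(noise, speech.shape)             # tile or trim noise to match length
    ps, pn = np.mean(speech**2), np.mean(noise**2)
    gain = np.sqrt(ps / (pn * 10 ** (snr_db / 10.0)))  # scale noise to the target SNR
    noisy = speech + gain * noise
    return librosa.feature.mfcc(y=noisy, sr=sr, n_mfcc=n_mfcc).T
```

Each row of the returned matrix is one frame's 13-coefficient feature vector, forming the input sequence fed to the GRU.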

IV. RESULTS AND DISCUSSIONS

A. Bias and Learning Rate

We compare the classification performance of GRU with respect to bias and learning rate in Fig. 4. We reduce the learning rate gradually from 1 to 10^-9 in the presence and absence of the bias. We find that the learning rate has a very small impact and that GRU performs better without the bias terms. Similar performance has been observed for LSTM. These behaviors can be observed in the noisy conditions as well; we show results for the arbitrarily chosen traffic noise.

Setting learning rates for plain stochastic gradient descent in a neural network is usually a process of starting with an arbitrary initial value, such as 0.01, and then doing cross-validation to find an optimal value. Common values range over a few orders of magnitude, from 0.0001 up to 1. A small value is desired to avoid overshooting, but too small a value should be avoided so as not to get stuck in local minima or take too long to descend. For our experiments, a learning rate of 1 yields the best performance.

A bias value usually allows the activation function to be shifted to the left or right, which can be necessary for successful learning. In our case, however, adding a bias has a negative impact: it lowers the accuracy.

B. Bias and Number of Cells

We also verify the impact of the number of cells in the presence and absence of bias in Fig. 5. For both GRU and LSTM, the error is minimum when the number of cells is one and there is no bias term.

Fig. 5: Classification Performance - Impact of Noise, Bias and Number of Cells. (a) GRU - No Noise. (b) GRU - Washing Noise. (c) LSTM - No Noise. (d) LSTM - Washing Noise. Each panel plots validation error against the number of cells (1, 5, 10, 20, 30), with and without bias.

(c) LSTM - No Noise. (d) LSTM - Washing Noise. Fig. 5: Classification Performance - Impact of Noise, Bias and Number of Cells. no bias term. Similar behavior is observed in presence of noise. The comparison results between LSTM and GRU provide We have shown the results for an arbitrary chosen Washing some intuitive insights. For example, GRU performs better for noise. the Washing noise which can be very periodic and not usually Both GRU and LSTM layers consist of blocks which in turn continuous. Whereas, LSTM performs noticeably better than consist of cells. We anticipate that due to the smaller size of GRU in the case of River and Cafe, which are usually sources our dataset the increase in the number of cells does not have of continuous noise. We will conduct further studies in future a positive impact to explain the differences between GRU and LSTM. Using Fig. 6b we can understand the impact of noise (as a whole) on GRU’s performance where we compare the error at C. Accuracy: GRU versus LSTM no-noise with the error combined at all other noisy conditions. After analyzing the values of Number of Cells and Learning We notice that GRU is quite robust to noise. In fact, the error Rate and Bias, we now compare the performance of GRU and at noisy condition is smaller than the error at the no-noisy LSTM in various noisy conditions in Fig. 6. In the following, condition. It is not unnatural for Deep Learning models to we use the following values for the three parameters: Number perform better in the presence of noise as this helps avoid of cell = 1, Bias = False, Learning Rate = 1. overfitting, when the noise is not dominant. We observe similar Out of eight different noise cases, in three cases LSTM robustness for LSTM. and GRU performs the same (see Fig. 6a). While imputed with the Washing noise, GRU performs better than LSTM by D. Run-Time: GRU versus LSTM 1.75%. For four remaining noise imputations LSTM performs We compare the run-time performance of GRU and LSTM better than GRU. Amongst these, for Cafe and River noise on a 2 GHz Intel Core i7 Macbook professional with a 8 LSTM performs noticeably (4.6% and 6.4%, respectively) GB 1600 MHz DDR3 memory. For each noise type and for better than GRU, but for Traffic and Park noise, it only the no-noisy condition we determine the run-time five times. performs marginally better than GRU. To aggregate over the five times, we choose the median value 7

Fig. 6: GRU versus LSTM. (a) Validation error per noise type: traffic, cafe, living, park, washing, car, office and river. (b) Validation error at no-noise versus all noises combined.

We then average the median values across conditions and find that GRU incurs an 18.16% smaller run-time compared to LSTM.

Although this comparison was made on a desktop computer, it provides important insights into the time-complexity differences between GRU and LSTM. In our future study, we aim to deploy the GRU module on a smartphone and determine the run-time complexity there.

V. EXISTING WORK

The closest match to our work is that of Tang et al. [26], [27], where the authors use a Gated Recurrent Unit for question detection from speech. Our focus is instead on emotion detection from speech.

In this paper, we have used LSTM as the benchmark to evaluate the performance of GRU, as LSTM is the most popular Recurrent Neural Network implementing the gating mechanism. In particular, we justify the exploration of GRU for emotion classification from speech by the success of LSTM in emotion classification. For example, LSTM, as a gated recurrent neural network, has been shown to successfully exploit the long-range dependencies between successive observations and to offer good classification accuracy [19], [20]. It has also been shown in the literature that LSTM can be combined with other methods, like Multiple Kernel Learning (MKL) [19] and Convolutional Neural Networks (CNNs) [8], to achieve greater accuracy, and that LSTM [21] can outperform the widely used (e.g., [28], [29], [30]) Support Vector Machine (SVM) when there is enough training data.

Most of the studies described above consider emotion recognition from clean speech, but we are interested in emotion recognition from noisy speech. One paper, by Zhang et al. [31], performs extensive evaluations of LSTM for speech emotion recognition in the presence of non-stationary additive noise and convolutional noise. Whereas those are synthetic noises (additive Gaussian noise), we are more interested in speech emotion recognition in the presence of real-life background noise. Also, our focus is mainly on GRU.

VI. CONCLUSION

This paper investigates the feasibility of the Gated Recurrent Unit (GRU), a gated Recurrent Neural Network, for emotion classification from noisy speech. We create noisy speech by superimposing noises such as cafe, washing and river noise. The results show that GRU offers accuracy very comparable to the most widely used Long Short-Term Memory (LSTM), while incurring a shorter run-time: LSTM performs better than GRU by 6.4% in the best case, but GRU incurs an 18.16% smaller run-time than LSTM. Interestingly, for washing noise GRU incurs a 1.75% smaller error than LSTM. This accuracy-versus-time-complexity trade-off of GRU is highly advantageous for embedded platforms in general, and in our future studies we aim to investigate the performance of GRU for real-time emotion recognition on smartphones.

REFERENCES

[1] R. A. Calix, L. Javadpour, and G. M. Knapp, "Detection of affective states from text and speech for real-time human-computer interaction," Human Factors: The Journal of the Human Factors and Ergonomics Society, vol. 54, no. 4, pp. 530-545, 2012.
[2] Y. Shi, M. Larson, and A. Hanjalic, "Mining mood-specific movie similarity with matrix factorization for context-aware recommendation," in Proceedings of the Workshop on Context-Aware Movie Recommendation. ACM, 2010, pp. 34-40.
[3] G. E. Simon, "Social and economic burden of mood disorders," Biological Psychiatry, vol. 54, no. 3, pp. 208-215, 2003.
[4] J. Schmidhuber, "Deep learning in neural networks: An overview," Neural Networks, vol. 61, pp. 85-117, 2015.
[5] A. Y. Hannun, C. Case, J. Casper, B. C. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, and A. Y. Ng, "Deep speech: Scaling up end-to-end speech recognition," CoRR, vol. abs/1412.5567, 2014. [Online]. Available: http://arxiv.org/abs/1412.5567
[6] R. Brueckner and B. Schuller, "Be at odds? Deep and hierarchical neural networks for classification and regression of conflict in speech," in Conflict and Multimodal Communication. Springer, 2015, pp. 403-429.
[7] M. Wöllmer, Y. Sun, F. Eyben, B. Schuller et al., "Long short-term memory networks for noise robust speech recognition," in INTERSPEECH, 2010, pp. 2966-2969.
[8] G. Trigeorgis, F. Ringeval, R. Brueckner, E. Marchi, M. A. Nicolaou, S. Zafeiriou et al., "Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 5200-5204.
[9] Q. Mao, M. Dong, Z. Huang, and Y. Zhan, "Learning salient features for speech emotion recognition using convolutional neural networks," IEEE Transactions on Multimedia, vol. 16, no. 8, pp. 2203-2213, 2014.
[10] L. Li, Y. Zhao, D. Jiang, Y. Zhang, F. Wang, I. Gonzalez, E. Valentin, and H. Sahli, "Hybrid deep neural network-hidden Markov model (DNN-HMM) based speech emotion recognition," in Affective Computing and Intelligent Interaction (ACII), 2013 Humaine Association Conference on. IEEE, 2013, pp. 312-317.
[11] M. Wöllmer, F. Eyben, S. Reiter, B. Schuller, C. Cox, E. Douglas-Cowie, R. Cowie et al., "Abandoning emotion classes: towards continuous emotion recognition with modelling of long-range dependencies," in INTERSPEECH, vol. 2008, 2008, pp. 597-600.
[12] F. Weninger, J. Bergmann, and B. Schuller, "Introducing CURRENNT: the Munich open-source CUDA recurrent neural network toolkit," Journal of Machine Learning Research, vol. 16, no. 3, pp. 547-551, 2015.
[13] J. Deng, Z. Zhang, F. Eyben, and B. Schuller, "Autoencoder-based unsupervised domain adaptation for speech emotion recognition," IEEE Signal Processing Letters, vol. 21, no. 9, pp. 1068-1072, 2014.
[14] P. Georgiev, N. D. Lane, K. K. Rachuri, and C. Mascolo, "DSP.Ear: Leveraging co-processor support for continuous audio sensing on smartphones," in Proceedings of the 12th ACM Conference on Embedded Network Sensor Systems. ACM, 2014, pp. 295-309.
[15] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[16] K. Cho, B. van Merriënboer, D. Bahdanau, and Y. Bengio, "On the properties of neural machine translation: Encoder-decoder approaches," arXiv preprint arXiv:1409.1259, 2014.
[17] R. Rana, "Poster: Context-driven mood mining," in Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services Companion. ACM, 2016, pp. 143-143.
[18] N. Yang, I. Demirkol, J. Yuan, W. Heinzelman, Y. Zhou, and M. Sturge-Apple, "How does noise impact speech-based emotion classification?" in Designing Speech and Language Interactions Workshop, the ACM CHI Conference on Human Factors in Computing Systems, 2014.
[19] J. Wei, E. Pei, D. Jiang, H. Sahli, L. Xie, and Z. Fu, "Multimodal continuous affect recognition based on LSTM and multiple kernel learning," in Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2014 Asia-Pacific. IEEE, 2014, pp. 1-4.
[20] M. Wöllmer, M. Kaiser, F. Eyben, B. Schuller, and G. Rigoll, "LSTM-modeling of continuous emotions in an audiovisual affect recognition framework," Image and Vision Computing, vol. 31, no. 2, pp. 153-163, 2013.
[21] L. Tian, J. D. Moore, and C. Lai, "Emotion recognition in spontaneous and acted dialogues," in Affective Computing and Intelligent Interaction (ACII), 2015 International Conference on. IEEE, 2015, pp. 698-704.
[22] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157-166, 1994.
[23] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807-814.
[24] T. Schaul, J. Bayer, D. Wierstra, Y. Sun, M. Felder, F. Sehnke, T. Rückstieß, and J. Schmidhuber, "PyBrain," Journal of Machine Learning Research, vol. 11, no. Feb, pp. 743-746, 2010.
[25] D. Neiberg, K. Elenius, and K. Laskowski, "Emotion recognition in spontaneous speech using GMMs," in INTERSPEECH, 2006.
[26] Y. Tang, Y. Huang, Z. Wu, H. Meng, M. Xu, and L. Cai, "Question detection from acoustic features using recurrent neural network with gated recurrent unit," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 6125-6129.
[27] Y. Tang, Z. Wu, H. Meng, M. Xu, and L. Cai, "Analysis on gated recurrent unit based question detection approach," Interspeech 2016, pp. 735-739, 2016.
[28] R. Rana, D. Austin, P. Jacobs, M. Karunanithi, and J. Kaye, "Gait velocity estimation using time-interleaved between consecutive passive IR sensor activation," 2013.
[29] R. Rana, M. Yang, T. Wark, C. T. Chou, and W. Hu, "SimpleTrack: Adaptive trajectory compression with deterministic projection matrix for mobile sensor networks," IEEE Sensors Journal, vol. 15, no. 1, pp. 365-373, 2015.
[30] R. Rana, B. Kusy, J. Wall, and W. Hu, "Novel activity classification and occupancy estimation methods for intelligent HVAC (heating, ventilation and air conditioning) systems," Energy, vol. 93, pp. 245-255, 2015.
[31] Z. Zhang, F. Ringeval, J. Han, J. Deng, E. Marchi, and B. Schuller, "Facing realism in spontaneous emotion recognition from speech: Feature enhancement by autoencoder with LSTM neural networks," Proceedings of INTERSPEECH, San Francisco, CA, 2016.

Dr Rajib Rana is a Vice Chancellor's research fellow in the Institute of Resilient Regions, University of Southern Queensland (USQ) and an honorary fellow at the University of Queensland. He received his PhD (2011) in Computer Science and Engineering from the University of New South Wales (UNSW), Sydney, Australia. He was the recipient of the President's and Prime Minister's Gold Medals for his extraordinary achievement in his Bachelor degree (2004). His current research interests are in Deep Learning Neural Networks, Quantified Self, Personal Health Informatics and the Internet of Things (IoT). He is the founder of the IoT-Health Lab at USQ, where he leads research on early detection of mental illness and autism spectrum disorder. Since 2011, Dr Rana has received a number of research grants, including the Queensland Health Innovation Grant (twice: 2014, 2016), a USQ major infrastructure grant and a UNSW postdoctoral writing fellowship. He is the lead guest editor of the Special Issue on "Sensors: Deep Learning and Wearable Sensing" (DLWS).

Dr Julien Epps is an Associate Professor in Digital Signal Processing with the School of Electrical Engineering and Telecommunications at The University of New South Wales, Sydney, Australia. He also has an appointment as a Scientific Advisor for the Boston-based startup Sonde Health, where he works on speech-based assessment of mental health, and an appointment as a Contributed Principal Researcher with CSIRO, in the ATP Laboratory, where he works on methods for automatic task analysis using behavioural and physiological signals. His research interests also include applications of speech modelling and processing, in particular to emotion and mental state recognition from speech, and genomic sequence processing. He has also worked on aspects of human-computer interaction, including multimodal interfaces and computer-supported cooperative work.

Dr Raja Jurdak is a Senior Principal Research Scientist at CSIRO, where he leads the Distributed Sensing Systems Group. He has a PhD in Information and Computer Science from the University of California, Irvine (2005), an MS in Computer Networks and Distributed Computing from the Electrical and Computer Engineering Department at UCI (2001), and a BE in Computer and Communications Engineering from the American University of Beirut (2000). His current research interests focus on energy efficiency and mobility in networks. He has over 100 peer-reviewed journal and conference publications, as well as a book published by Springer in 2007 titled Wireless Ad Hoc and Sensor Networks: A Cross-Layer Design Perspective. He regularly serves on the organizing and technical program committees of international conferences (DCOSS, RTSS, Sensapp, Percomm, EWSN, ICDCS). Dr Jurdak is an Adjunct Professor at Macquarie University and James Cook University, and an Adjunct Associate Professor at the University of Queensland and the University of New South Wales. He is a Senior Member of the IEEE.

Professor Xue Li is a Professor in Information Technology and Electrical Engineering at the University of Queensland (UQ) in Brisbane, Australia. He has more than 160 research articles published in ACM and IEEE journals, international conferences, books and book chapters since 1993. Dr Xue Li was an editorial board member (2004-2007) of the International Journal of Information Systems and Management & Technology (Elsevier) and a guest editor for three issues of the journals Journal of Global Information Management (JGIM), International Journal of Data Warehouse and Mining (IJDWM) and International Journal of Systems Science (IJSS). Xue has more than 29 years of experience in Information Technology. He is currently a guest professor at three Chinese 985 universities: Central South University of China, Chongqing University and Xian Jiaotong University. Xue is the founder and Steering Committee Chair of the Southeast Asian International Conference on Advanced Data Mining and Applications (ADMA), 2004-2015. He has been invited as a keynote speaker on Big Data Analytics, Data Mining and Web Information Systems at numerous international conferences.

Professor Roland Goecke is a Professor of Affective Computing at the newly created (merged) Faculty of Education, Science, Technology and Mathematics at the University of Canberra. Before that, he was an Associate Professor in Software Engineering at the same Faculty and, prior to that, an Assistant Professor (Senior Lecturer) at the Faculty of Information Sciences and Engineering, University of Canberra, from January 2010 until December 2012. He leads the Vision and Sensing Group and is the Deputy Director of the Human-Centred Computing Laboratory. His research focus continues to be in the areas of face and facial feature tracking and its applications, and more generally in Computer Vision, Affective Computing and Multimodal Human-Computer Interaction.

Professor Margot Brereton researches the participatory interaction design of ubiquitous computing technologies and their interfaces. She develops innovative designs, methods and theoretical understandings by designing to support real user communities in selected challenging contexts. Her approach is highly iterative and often involves growing user communities as the design evolves, by understanding and responding to socio-cultural factors. Her broad areas of research include Human-Computer Interaction, Participatory Design, Interaction Design, Computer Supported Cooperative Work, Design Methods and Ubiquitous Computing.

Professor Jeffrey Soar holds the Chair in Human-Centred Technology at the University of Southern Queensland, where he researches technology innovation for human benefit. He came to research from a career at the highest levels of ICT management within large public and private organisations. He was CIO in several government departments in Australia and New Zealand, including Director of Information and Technology for NZ Police, where he managed the largest-ever ICT project impacting emergency services. In academia he established two demonstration smart homes with the latest innovations for independent living, entertainment, security and energy. His research has been supported by 7 Australian Research Council grants as well as over 30 grants from national and international technology and service organisations. His current research projects are in Technology for Economic Development, E-Learning and M-Learning, E-Government, E-Health, Decision Support, Mobile, Cloud, Algorithms, Adoption and Benefits Realisation, Human-Computer Interaction and User Experience.