
Light Gated Recurrent Units for Speech Recognition

Mirco Ravanelli, Philemon Brakel, Maurizio Omologo, Yoshua Bengio

Abstract—A field that has directly benefited from the recent advances in deep learning is Automatic Speech Recognition (ASR). Despite the great achievements of the past decades, however, a natural and robust human-machine speech interaction still appears to be out of reach, especially in challenging environments characterized by significant noise and reverberation. To improve robustness, modern speech recognizers often employ acoustic models based on Recurrent Neural Networks (RNNs), that are naturally able to exploit large time contexts and long-term speech modulations. It is thus of great interest to continue the study of proper techniques for improving the effectiveness of RNNs in processing speech signals.

In this paper, we revise one of the most popular RNN models, namely Gated Recurrent Units (GRUs), and propose a simplified architecture that turned out to be very effective for ASR. The contribution of this work is two-fold: First, we analyze the role played by the reset gate, showing that a significant redundancy with the update gate occurs. As a result, we propose to remove the former from the GRU design, leading to a more efficient and compact single-gate model. Second, we propose to replace hyperbolic tangent with ReLU activations. This variation couples well with batch normalization and could help the model learn long-term dependencies without numerical issues.

Results show that the proposed architecture, called Light GRU (Li-GRU), not only reduces the per-epoch training time by more than 30% over a standard GRU, but also consistently improves the recognition accuracy across different tasks, input features, noisy conditions, as well as across different ASR paradigms, ranging from standard DNN-HMM speech recognizers to end-to-end CTC models.

Index Terms—speech recognition, deep learning, recurrent neural networks, LSTM, GRU

© 2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. DOI: 10.1109/TETCI.2017.2762739 URL: http://ieeexplore.ieee.org/document/8323308/

I. INTRODUCTION

Deep learning is an emerging technology that is considered one of the most promising directions for reaching higher levels of artificial intelligence [1]. This paradigm is rapidly evolving and some noteworthy achievements of the last years include, among others, the development of effective regularization methods [2], [3], improved optimization algorithms [4], and better architectures [5]–[8]. The exploration of generative models [9], deep reinforcement learning [10] as well as the evolution of sequence to sequence paradigms [11] also represent important milestones in the field. Deep learning is now being deployed in a wide range of domains, including bioinformatics, computer vision, machine translation, dialogue systems and natural language processing, just to name a few.

Another field that has been transformed by this technology is Automatic Speech Recognition (ASR) [12]. Thanks to modern Deep Neural Networks (DNNs), current speech recognizers are now able to significantly outperform previous GMM-HMM systems, allowing ASR to be applied in several contexts, such as web-search, intelligent personal assistants, car control and radiological reporting.

Despite the progress of the last decade, state-of-the-art speech recognizers are still far away from reaching satisfactory robustness and flexibility. This lack of robustness typically happens when facing challenging acoustic conditions [13], characterized by considerable levels of non-stationary noise and acoustic reverberation [14]–[22]. The development of robust ASR has been recently fostered by the great success of some international challenges such as CHiME [23], REVERB [24] and ASpIRE [25], which were also extremely useful to establish common evaluation frameworks among researchers.

Currently, the dominant approach to automatic speech recognition relies on a combination of a discriminative DNN and a generative Hidden Markov Model (HMM). The DNN is normally employed for acoustic modeling purposes to predict context-dependent phone targets. The acoustic-level predictions are later embedded in an HMM-based framework, that also integrates phone-transitions, lexicon, and language model information to retrieve the final sequence of words.

An emerging alternative is end-to-end speech recognition, which aims to drastically simplify the current ASR pipeline by using fully discriminative systems that learn everything from data without (ideally) any additional human effort. Popular end-to-end techniques are attention models and Connectionist Temporal Classification (CTC). Attention models are based on an encoder-decoder architecture coupled with an attention mechanism [26] that decides which input information to analyze at each decoding step. CTC [27] is based on a DNN predicting symbols from a predefined alphabet (characters, phones, words) to which an extra unit (blank) that emits no labels is added. Similarly to HMMs, the likelihood (and its gradient with respect to the DNN parameters) are computed with dynamic programming by summing over all the paths that are possible realizations of the ground-truth label sequence. This way, CTC allows one to optimize the likelihood of the desired output sequence directly, without the need for an explicit label alignment.

For both the aforementioned frameworks, Recurrent Neural Networks (RNNs) represent a valid alternative to standard feed-forward DNNs. RNNs, in fact, are more and more often employed in speech recognition, due to their capabilities to properly manage time contexts and capture long-term speech modulations.

In the machine learning community, the research of novel and powerful RNN models is a very active topic. General-purpose RNNs such as Long Short-Term Memories (LSTMs) [28] have been the subject of several studies and modifications over the past years [7], [29], [30]. This evolution has recently led to a novel architecture called Gated Recurrent Unit (GRU) [8], that simplifies the complex LSTM cell design.

Our work continues these efforts by further revising GRUs. Differently from previous efforts, our primary goal is not to derive a general-purpose RNN, but to modify the standard GRU design in order to better address speech recognition. In particular, the major contribution of this paper is twofold: First, we propose to remove the reset gate from the network design. Similarly to [31], we found that removing it does not significantly affect the system performance, also due to a certain redundancy observed between update and reset gates. Second, we propose to replace hyperbolic tangent (tanh) with Rectified Linear Unit (ReLU) activations [32] in the state update equation. ReLU units have been shown to be more effective than sigmoid non-linearities for feed-forward DNNs [33], [34]. Despite its recent success, this non-linearity has largely been avoided for RNNs, due to the numerical instabilities caused by the unboundedness of ReLU activations: repeatedly composing GRU layers (i.e., GRU units following an affine transformation) with sufficiently large weights can lead to arbitrarily large state values. However, when coupling our ReLU-based GRU with batch normalization [3], we did not experience such numerical issues. This allows us to take advantage of both techniques, that have been proven effective to mitigate the vanishing gradient problem as well as to speed up network training.

We evaluated our proposed architecture on different tasks, datasets, input features, noisy conditions as well as on different ASR frameworks (i.e., DNN-HMM and CTC). Results show that the revised architecture reduces the per-epoch training wall-clock time by more than 30%, while improving the recognition accuracy. Moreover, the proposed solution leads to a compact model, that is arguably easier to interpret, understand and implement, due to a simplified design based on a single gate.

The rest of the paper is organized as follows. Sec. II recalls the standard GRU architecture, while Sec. III illustrates in detail the proposed model and the related work. In Sec. IV, a description of the adopted corpora and experimental setup is provided. The results are then reported in Sec. V. Finally, our conclusions are drawn in Sec. VI.

II. GATED RECURRENT UNITS

The most suitable architecture able to learn short and long-term speech dependencies is represented by RNNs [12]. RNNs, indeed, can potentially capture temporal information in a very dynamic fashion, allowing the network to freely decide the amount of contextual information to use for each time step. Several works have already highlighted the effectiveness of RNNs in various speech processing tasks, such as speech recognition [35]–[39], speech enhancement [40], speech separation [41], [42] as well as speech activity detection [43].

Training RNNs, however, can be complicated by vanishing and exploding gradients, that might impair learning long-term dependencies [44]. Although exploding gradients can be tackled with simple clipping strategies [45], the vanishing gradient problem requires special architectures to be properly addressed. A common approach relies on the so-called gated RNNs, whose core idea is to introduce a gating mechanism for better controlling the flow of the information through the various time-steps. Within this family of architectures, vanishing gradient issues are mitigated by creating effective "shortcuts", in which the gradients can bypass multiple temporal steps.

The most popular gated RNNs are LSTMs [28], that often achieve state-of-the-art performance in several machine learning tasks, including speech recognition [35]–[40]. LSTMs rely on memory cells that are controlled by forget, input, and output gates. Despite their effectiveness, such a sophisticated gating mechanism might result in an overly complex model. On the other hand, computational efficiency is a crucial issue for RNNs and considerable research efforts have recently been devoted to the development of alternative architectures [7], [30], [46].

A noteworthy attempt to simplify LSTMs has recently led to a novel model called Gated Recurrent Unit (GRU) [8], [47], that is based on just two multiplicative gates. In particular, the standard GRU architecture is defined by the following equations:

  z_t = σ(W_z x_t + U_z h_{t−1} + b_z),                (1a)
  r_t = σ(W_r x_t + U_r h_{t−1} + b_r),                (1b)
  h̃_t = tanh(W_h x_t + U_h (h_{t−1} ⊙ r_t) + b_h),     (1c)
  h_t = z_t ⊙ h_{t−1} + (1 − z_t) ⊙ h̃_t,               (1d)

where z_t and r_t are vectors corresponding to the update and reset gates, respectively, while h_t represents the state vector for the current time frame t. Element-wise multiplications are denoted with ⊙. The activations of both gates are logistic sigmoid functions σ(·), that constrain z_t and r_t to take values ranging from 0 to 1. The candidate state h̃_t is processed with a hyperbolic tangent. The network is fed by the current input vector x_t (e.g., a vector of speech features), while the parameters of the model are the matrices W_z, W_r, W_h (the feed-forward connections) and U_z, U_r, U_h (the recurrent weights). The architecture finally includes trainable bias vectors b_z, b_r and b_h, that are added before the non-linearities are applied.

As shown in Eq. 1d, the current state vector h_t is a linear interpolation between the previous activation h_{t−1} and the current candidate state h̃_t. The weighting factors are set by the update gate z_t, that decides how much the units will update their activations. This linear interpolation is the key component for learning long-term dependencies. If z_t is close to one, in fact, the previous state is kept unaltered and can remain unchanged for an arbitrary number of time steps. On the other hand, if z_t is close to zero, the network tends to favor the candidate state h̃_t, that depends more heavily on the current input and on the closer hidden states. The candidate state h̃_t also depends on the reset gate r_t, that allows the model to possibly delete the past memory by forgetting the previously computed states.
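As a concrete reading aid, the following minimal NumPy sketch performs one forward step of Eqs. 1a–1d. The function name gru_step, the 40-dimensional input, the 200 hidden units, the small random initialization and the toy sequence are illustrative assumptions and do not reproduce the models trained in Sec. IV.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W_z, U_z, b_z, W_r, U_r, b_r, W_h, U_h, b_h):
    """One step of a standard GRU cell (Eqs. 1a-1d)."""
    z_t = sigmoid(W_z @ x_t + U_z @ h_prev + b_z)                # update gate (1a)
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev + b_r)                # reset gate (1b)
    h_cand = np.tanh(W_h @ x_t + U_h @ (h_prev * r_t) + b_h)     # candidate state (1c)
    return z_t * h_prev + (1.0 - z_t) * h_cand                   # linear interpolation (1d)

# Illustrative sizes: 40 input features (e.g., FBANK) and 200 hidden units.
rng = np.random.default_rng(0)
n_in, n_hid = 40, 200
params = [rng.standard_normal(s) * 0.01
          for s in [(n_hid, n_in), (n_hid, n_hid), (n_hid,)] * 3]  # Wz,Uz,bz, Wr,Ur,br, Wh,Uh,bh
h = np.zeros(n_hid)
for x in rng.standard_normal((5, n_in)):   # a toy 5-frame sequence
    h = gru_step(x, h, *params)
```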

III. A NOVEL GRU FRAMEWORK

The main changes to the standard GRU model concern the reset gate, ReLU activations, and batch normalization, as outlined in the next sub-sections.

A. Removing the reset gate

From the previous introduction to GRUs, it follows that the reset gate can be useful when significant discontinuities occur in the sequence. For language modeling, this may happen when moving from one text to another that is not semantically related. In such a situation, it is convenient to reset the stored memory to avoid taking a decision biased by an unrelated history.

Nevertheless, we believe that for some specific tasks like speech recognition this functionality might not be useful. In fact, a speech signal is a sequence that evolves rather slowly (the features are typically computed every 10 ms), in which the past history can virtually always be helpful. Even in the presence of strong discontinuities, for instance observable at the boundary between a vowel and a fricative, completely resetting the past memory can be harmful. On the other hand, it is helpful to memorize phonotactic features, since some phone transitions are more likely than others.

We also argue that a certain redundancy in the activations of reset and update gates might occur when processing speech sequences. For instance, when it is necessary to give more importance to the current information, the GRU model can set small values of r_t. A similar effect can be achieved with the update gate only, if small values are assigned to z_t. The latter solution tends to weight more the candidate state h̃_t, that depends heavily on the current input. Similarly, a high value can be assigned either to r_t or to z_t, in order to place more importance on past states. This redundancy is also highlighted in Fig. 1, where a temporal correlation in the average activations of update and reset gates can be readily appreciated for a GRU trained on TIMIT. This degree of redundancy will be analyzed in a quantitative way in Sec. V using the cross-correlation metric C(z, r):

  C(z, r) = z̄_t ⋆ r̄_t,                                 (2)

where z̄_t and r̄_t are the average activations (over the neurons) of the update and reset gates, respectively, and ⋆ is the cross-correlation operator.

Fig. 1: Average activations of the update and reset gates for a GRU trained on TIMIT in a chunk of the utterance "sx403" of speaker "faks0".

Based on these reasons, the first variation to standard GRUs thus concerns the removal of the reset gate r_t. This change leads to the following modification of Eq. 1c:

  h̃_t = tanh(W_h x_t + U_h h_{t−1} + b_h).             (3)

The main benefits of this intervention are related to the improved computational efficiency, which is achieved thanks to a more compact single-gate model.
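As a rough illustration of the metric in Eq. 2, the sketch below computes neuron-averaged gate activations and their cross-correlation, normalized by the maximum of the auto-correlation as in Fig. 2. The arrays z_acts and r_acts (time × neurons), the mean removal and the toy data are assumptions of this sketch, not the exact analysis script used for the paper.

```python
import numpy as np

def gate_cross_correlation(z_acts, r_acts):
    """Cross-correlation C(z, r) between neuron-averaged gate activations (Eq. 2).

    z_acts, r_acts: arrays of shape (T, n_hidden) holding the update and reset
    gate activations collected over one utterance.
    """
    z_bar = z_acts.mean(axis=1) - z_acts.mean()   # average over neurons; mean removal is an assumption
    r_bar = r_acts.mean(axis=1) - r_acts.mean()
    c_zr = np.correlate(z_bar, r_bar, mode="full")   # cross-correlation over all lags
    c_zz = np.correlate(z_bar, z_bar, mode="full")   # auto-correlation (upper bound)
    return c_zr / c_zz.max(), c_zz / c_zz.max()      # normalized as in Fig. 2

# Toy usage with artificially correlated "gates" (300 frames, 200 neurons):
rng = np.random.default_rng(0)
base = rng.random((300, 200))
c_zr, c_zz = gate_cross_correlation(base, 0.8 * base + 0.2 * rng.random((300, 200)))
print(c_zr.max(), np.argmax(c_zr) - (len(c_zr) // 2))   # peak height and its lag
```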

B. ReLU activations

The second modification consists in replacing the standard hyperbolic tangent with the ReLU activation. In particular, we modify the computation of the candidate state h̃_t (Eq. 1c) as follows:

  h̃_t = ReLU(W_h x_t + U_h h_{t−1} + b_h).             (4)

Standard tanh activations are less used in feedforward networks because they do not work as well as piecewise-linear activations when training deeper networks [48]. The adoption of ReLU-based neurons, that have shown to be effective in improving such limitations, was not so common in the past for RNNs. This was due to numerical instabilities originating from the unbounded ReLU functions applied over long time series. However, coupling this activation function with batch normalization turned out to be helpful for taking advantage of ReLU neurons without numerical issues, as will be discussed in the next sub-section.

C. Batch Normalization

Batch normalization [3] has been recently proposed in the machine learning community and addresses the so-called internal covariate shift problem by normalizing the mean and the variance of each layer's pre-activations for each training mini-batch. Several works have already shown that this technique is effective both to improve the system performance and to speed up the training procedure [20], [22], [37], [49], [50].

Batch normalization can be applied to RNNs in different ways. In [49], the authors suggest to apply it to feed-forward connections only, while in [50] the normalization step is extended to recurrent connections, using separate statistics for each time-step. In our work, we tried both approaches, but we did not observe substantial benefits when extending batch normalization to recurrent parameters (i.e., U_h and U_z). For this reason, we applied this technique to feed-forward connections only (i.e., W_h and W_z), obtaining a more compact model that is almost equally performing but significantly less computationally expensive. When batch normalization is limited to feed-forward connections, indeed, all the related computations become independent at each time step and they can be performed in parallel. This offers the possibility to apply it with reduced computational efforts. As outlined in the previous sub-section, coupling the proposed model with batch normalization [3] could also help in limiting the numerical issues of ReLU RNNs. Batch normalization, in fact, rescales the neuron pre-activations, inherently bounding the values of the ReLU neurons. In this way, our model concurrently takes advantage of the well-known benefits of both ReLU activation and batch normalization. In our experiments, we found that the latter technique helps against numerical issues also when it is limited to feed-forward connections only.

Formally, removing the reset gate, replacing the hyperbolic tangent function with the ReLU activation, and applying batch normalization now leads to the following model:

  z_t = σ(BN(W_z x_t) + U_z h_{t−1}),                   (5a)
  h̃_t = ReLU(BN(W_h x_t) + U_h h_{t−1}),                (5b)
  h_t = z_t ⊙ h_{t−1} + (1 − z_t) ⊙ h̃_t.                (5c)

The batch normalization BN(·) works as described in [3], and is defined as follows:

  BN(a) = γ (a − μ_b) / √(σ_b² + ε) + β,                (6)

where μ_b and σ_b are the minibatch mean and variance, respectively. A small constant ε is added for numerical stability. The variables γ and β are trainable scaling and shifting parameters, introduced to restore the network capacity. Note that the presence of β makes the biases b_h and b_z redundant. Therefore, they are omitted in Eq. 5a and 5b.

We called this architecture Light GRU (Li-GRU), to emphasize the simplification process conducted on a standard GRU.
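A minimal NumPy sketch of Eqs. 5a–6 is given below. The batch statistics are computed here over a minibatch of frames, the identity matrix stands in for a generic orthogonal recurrent initialization, and the layer sizes and ε value are illustrative assumptions; the gain initialized to 0.1 and the zero shift follow the settings later described in Sec. IV-B, but this is only a sketch, not the implementation used for the experiments.

```python
import numpy as np

def batch_norm(A, gamma, beta, eps=1e-5):
    """Eq. 6: normalize each pre-activation over the minibatch dimension."""
    mu, var = A.mean(axis=0), A.var(axis=0)
    return gamma * (A - mu) / np.sqrt(var + eps) + beta

def li_gru_step(X_t, H_prev, W_z, U_z, W_h, U_h, gamma_z, beta_z, gamma_h, beta_h):
    """One Li-GRU step (Eqs. 5a-5c) for a minibatch X_t of shape (batch, n_in)."""
    z_t = 1.0 / (1.0 + np.exp(-(batch_norm(X_t @ W_z.T, gamma_z, beta_z) + H_prev @ U_z.T)))
    h_cand = np.maximum(0.0, batch_norm(X_t @ W_h.T, gamma_h, beta_h) + H_prev @ U_h.T)  # ReLU
    return z_t * H_prev + (1.0 - z_t) * h_cand   # no reset gate, no biases (absorbed by beta)

# Illustrative sizes: minibatch of 8 frames, 40 features, 200 hidden units.
rng = np.random.default_rng(0)
B, n_in, n_hid = 8, 40, 200
W_z = rng.standard_normal((n_hid, n_in)) * 0.01
W_h = rng.standard_normal((n_hid, n_in)) * 0.01
U_z, U_h = np.eye(n_hid), np.eye(n_hid)       # identity as a simple orthogonal recurrent init
gamma = 0.1 * np.ones(n_hid)                  # gain 0.1, shift 0 (cf. Sec. IV-B)
H = np.zeros((B, n_hid))
H = li_gru_step(rng.standard_normal((B, n_in)), H,
                W_z, U_z, W_h, U_h, gamma, np.zeros(n_hid), gamma, np.zeros(n_hid))
```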

D. Related work

A first attempt to remove r_t from GRUs has recently led to a single-gate architecture called Minimal Gated Recurrent Unit (M-GRU) [31], that achieves a performance comparable to that obtained by standard GRUs in handwritten digit recognition as well as in a sentiment classification task. To the best of our knowledge, our contribution is the first attempt that explores this architectural variation in speech recognition. Recently, some attempts have also been made to embed ReLU units in the RNN framework. For instance, in [51] the authors replaced tanh activations with ReLU neurons in a vanilla RNN, showing the capability of this model to learn long-term dependencies when a proper orthogonal initialization is adopted. In this work, we extend the use of ReLU to a GRU architecture.

In summary, the novelty of our approach consists in the integration of three key design aspects (i.e., the removal of the reset gate, ReLU activations and batch normalization) in a single model, that turned out to be particularly suitable for speech recognition. The potential benefits of Li-GRUs have been preliminarily observed as part of a work on speech recognition described in [52]. This study extends our previous effort in several ways. First of all, we better analyze the correlation arising between reset and update gates. We then analyze some gradient statistics, and we better study the impact of batch normalization. Moreover, we assess our approach on a larger variety of speech recognition tasks, considering several different datasets as well as noisy and reverberant conditions. Finally, we extend our experimental validation to an end-to-end CTC model.

IV. EXPERIMENTAL SETUP

In the following sub-sections, the considered corpora, the RNN setting as well as the HMM-DNN and CTC setups are described.

A. Corpora and tasks

The main features of the corpora considered in this work are summarized in Tab. I.

TABLE I: Main features of the training datasets adopted in this work.
  Dataset:    TIMIT   DIRHA   CHiME   TED
  Hours          5      12      75     166
  Speakers     462      83      87     820
  Sentences   3696    7138   43690   272191

A first set of experiments with the TIMIT corpus was performed to test the proposed model in a close-talking scenario. These experiments are based on the standard phoneme recognition task, which is aligned with that proposed in the Kaldi s5 recipe [53].

To validate our model in a more realistic scenario, a set of experiments was also conducted in distant-talking conditions with the DIRHA-English corpus [54] (this dataset is being distributed by the Linguistic Data Consortium, LDC). The reference context was a domestic environment characterized by the presence of non-stationary noise (with an average SNR of about 10 dB) and acoustic reverberation (with an average reverberation time T60 of about 0.7 seconds). Training was based on the original WSJ-5k corpus (consisting of 7138 sentences uttered by 83 speakers) that was contaminated with a set of impulse responses measured in a real apartment. The test phase was carried out with both real and simulated datasets, each consisting of 409 WSJ sentences uttered by six native American speakers. A development set of 310 WSJ sentences uttered by six different speakers was also used for hyperparameter tuning. To test our approach in different reverberation conditions, other contaminated versions of the latter training and test data are generated with different impulse responses. These simulations are based on the image method [55] and correspond to four different reverberation times T60 (ranging from 250 to 1000 ms). More details on the realistic impulse responses adopted in this corpus can be found in [56]–[58].
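The contamination described above is essentially a convolution of the clean signals with measured impulse responses and, for the noisy sets, the addition of recorded noise at the desired SNR. The snippet below is only a simplified sketch of that idea under the stated assumptions (synthetic signals, a hypothetical contaminate function); it is not the actual DIRHA data-generation tool, whose details are given in [55]–[58].

```python
import numpy as np

def contaminate(clean, impulse_response, noise=None, snr_db=10.0):
    """Create a reverberated (and optionally noisy) copy of a clean utterance."""
    reverb = np.convolve(clean, impulse_response)[: len(clean)]   # reverberation by convolution
    if noise is None:
        return reverb
    noise = noise[: len(reverb)]
    # Scale the noise so that the resulting signal-to-noise ratio matches snr_db.
    sig_pow = np.mean(reverb ** 2)
    noise_pow = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(sig_pow / (noise_pow * 10 ** (snr_db / 10.0)))
    return reverb + scale * noise

# Toy usage with synthetic data (real data would be WSJ audio and measured IRs).
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)                                  # 1 s at 16 kHz
ir = np.exp(-np.linspace(0, 8, 4000)) * rng.standard_normal(4000)   # decaying synthetic IR
noisy_reverb = contaminate(clean, ir, noise=rng.standard_normal(16000), snr_db=10.0)
```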
Additional experiments were conducted with the CHiME 4 dataset [39], that is based on both real and simulated data recorded in four noisy environments (on a bus, cafe, pedestrian area, and street junction). The training set is composed of 43690 noisy WSJ sentences recorded by five microphones (arranged on a tablet) and uttered by a total of 87 speakers. The development set (DT) is based on 3280 WSJ sentences uttered by four speakers (1640 are real utterances, referred to as DT-real, and 1640 are simulated, denoted as DT-sim). The test set (ET) is based on 1320 real utterances (ET-real) and 1320 simulated sentences (ET-sim) from four other speakers. The experiments reported in this paper are based on the single channel setting, in which the test phase is carried out with a single microphone (randomly selected from the considered microphone setup). More information on CHiME data can be found in [23].

To evaluate the proposed model on a larger scale ASR task, some additional experiments were performed with the TED-talk dataset, that was released in the context of the IWSLT evaluation campaigns [59]. The training set is composed of 820 talks with a total of about 166 hours of speech. The development set is composed of 81 talks (16 hours), while the test sets (TST 2011 and TST 2012) are based on 8 talks (1.5 hours) and 32 talks (6.5 hours), respectively.

B. RNN setting

The architecture adopted for the experiments consisted of multiple recurrent layers, that were stacked together prior to the final softmax classifier. These recurrent layers were bidirectional RNNs [35], which were obtained by concatenating the forward hidden states (collected by processing the sequence from the beginning to the end) with backward hidden states (gathered by scanning the speech in the reverse time order). Recurrent dropout was used as regularization technique. Since extending standard dropout to recurrent connections hinders learning long-term dependencies, we followed the approach introduced in [60], [61], that tackles this issue by sharing the same dropout mask across all the time steps. Moreover, batch normalization was adopted exploiting the method suggested in [49], as discussed in Sec. III. The feed-forward connections of the architecture were initialized according to the Glorot scheme [48], while recurrent weights were initialized with orthogonal matrices [51]. Similarly to [20], the gain factor γ of batch normalization was initialized to 0.1 and the shift parameter β was initialized to 0.

Before training, the sentences were sorted in ascending order according to their lengths and, starting from the shortest utterance, minibatches of 8 sentences were progressively processed by the training algorithm. This sorting approach minimizes the need of zero-paddings when forming minibatches, which helps to avoid possible biases on batch normalization statistics. Moreover, the sorting approach exploits a curriculum learning strategy [62] that has been shown to slightly improve the performance and to ensure numerical stability of gradients. The optimization was done using the Adaptive Moment Estimation (Adam) algorithm [4] running for 22 epochs (35 for the TED-talk corpus) with β1 = 0.9, β2 = 0.999, ε = 10^−8. The performance on the development set was monitored after each epoch, while the learning rate was halved when the performance improvement went below a certain threshold (th = 0.001). Gradient truncation was not applied, allowing the system to learn arbitrarily long time dependencies.

The main hyperparameters of the model (i.e., learning rate, number of hidden layers, hidden neurons per layer, dropout factor) were optimized on the development data. In particular, we guessed some initial values according to our experience, and starting from them we performed a grid search to progressively explore better configurations. A total of 20-25 experiments were conducted for all the various RNN models.

C. DNN-HMM setup

In the DNN-HMM experiments, the DNN is trained to predict context-dependent phone targets. The feature extraction is based on blocking the signal into frames of 25 ms with an overlap of 10 ms. The experimental activity is conducted considering different acoustic features, i.e., 39 MFCCs (13 static+Δ+ΔΔ), 40 log-mel filter-bank features (FBANKS), as well as 40 fMLLR features (extracted as reported in the s5 recipe of Kaldi [53]).

The labels were derived by performing a forced alignment procedure on the original training datasets. See the standard s5 recipe of Kaldi for more details [53]. During test, the posterior probabilities generated for each frame by the RNN are normalized by their prior probabilities. The obtained likelihoods are processed by an HMM-based decoder, that, after integrating the acoustic, lexicon and language model information in a single search graph, finally estimates the sequence of words uttered by the speaker. The RNN part of the ASR system was implemented with Theano [63], that was coupled with the Kaldi decoder [53] to form a context-dependent RNN-HMM speech recognizer (the code is available at http://github.com/mravanelli/theano-kaldi-rnn/).
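The normalization of the RNN posteriors by the state priors mentioned above is commonly performed in the log domain before passing the scores to the HMM decoder. A hedged sketch of this step follows; the function name posteriors_to_loglikelihoods and the availability of priors counted on the forced alignments are assumptions of this illustration, not part of the Kaldi recipe itself.

```python
import numpy as np

def posteriors_to_loglikelihoods(posteriors, priors, floor=1e-10):
    """Convert per-frame posteriors P(s|x) into scaled log-likelihoods for the decoder.

    posteriors: (T, n_states) softmax outputs for one utterance.
    priors:     (n_states,) state priors estimated from the training alignments.
    Returns log P(s|x) - log P(s), i.e. the log of the scaled likelihood p(x|s)
    up to a constant, which is what the HMM-based decoder consumes.
    """
    return np.log(np.maximum(posteriors, floor)) - np.log(np.maximum(priors, floor))

# Toy usage: 3 frames, 4 context-dependent states.
post = np.array([[0.7, 0.1, 0.1, 0.1],
                 [0.2, 0.5, 0.2, 0.1],
                 [0.1, 0.1, 0.2, 0.6]])
priors = np.array([0.4, 0.3, 0.2, 0.1])
print(posteriors_to_loglikelihoods(post, priors))
```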
D. CTC setup

The models used for the CTC experiments consisted of 5 layers of bidirectional RNNs of either 250 or 465 units. Unlike in the other experiments, weight noise was used for regularization. The application of weight noise is a simplification of adaptive weight noise [64] and has been successfully used before to regularize CTC-LSTM models [65]. The weight noise was applied to all the weight matrices and sampled from a zero mean normal distribution with a standard deviation of 0.01. Batch normalization was used with the same initialization settings as in the other experiments. Glorot's scheme was used to initialize all the weights (also the recurrent ones). The input features for these experiments were 123-dimensional FBANK features (40 + energy + Δ + ΔΔ). These features were also used in the original work on CTC-LSTM models for speech recognition [65]. The CTC layer itself was trained on the 61 label set. Decoding was done using the best-path method [27], without adding any external phone-based language model. After decoding, the labels were mapped to the final 39 label set. The models were trained for 50 epochs using batches of 8 utterances.
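Weight noise, as used here, simply perturbs every weight matrix with zero-mean Gaussian noise (std 0.01) for the forward/backward pass, while the optimizer keeps updating the clean weights. The snippet below is a small sketch of that scheme under these assumptions (hypothetical helper name, no optimizer shown); it is not the original CTC training code.

```python
import numpy as np

def apply_weight_noise(weights, std=0.01, rng=None):
    """Return noisy copies of the weight matrices, leaving the originals intact."""
    rng = rng or np.random.default_rng()
    return [w + rng.normal(0.0, std, size=w.shape) for w in weights]

# Sketch of one training step with weight noise (gradient computation omitted):
rng = np.random.default_rng(0)
clean_weights = [rng.standard_normal((200, 123)), rng.standard_normal((200, 200))]
noisy_weights = apply_weight_noise(clean_weights, std=0.01, rng=rng)
# The forward/backward pass would use noisy_weights; the optimizer then updates clean_weights.
```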

V. RESULTS

In the following sub-sections, we describe the experimental activity conducted to assess the proposed model. Most of the experiments reported in the following are based on hybrid DNN-HMM speech recognizers, since the latter ASR paradigm typically reaches state-of-the-art performance. However, for the sake of comparison, we also extended the experimental validation to an end-to-end CTC model. More precisely, in sub-section V-A, we first quantitatively analyze the correlations between the update and reset gates in a standard GRU. In sub-section V-B, we extend our study with some analysis of gradient statistics. The role of batch normalization and the CTC experiments are described in sub-sections V-C and V-D, respectively. The speech recognition performance will then be reported for TIMIT, DIRHA-English, CHiME as well as for the TED-talk corpus in sub-sections V-E, V-F, V-G, and V-H, respectively. The computational benefits of Li-GRU are finally discussed in sub-section V-I.

A. Correlation analysis

The correlation between update and reset gates has been preliminarily discussed in Sec. II. In this sub-section, we take a step further by analyzing it in a quantitative way using the cross-correlation function defined in Eq. 2. In particular, the cross-correlation C(z, r) between the average activations of the update (z) and reset (r) gates is shown in Fig. 2. The gate activations are computed for all the input frames and, at each time step, an average over the hidden neurons is considered. The cross-correlation C(z, r) is displayed along with the auto-correlation C(z, z), that represents the upper-bound limit of the former function.

Fig. 2: Auto-correlation C(z, z) and cross-correlation C(z, r) between the average activations of the update (z) and reset (r) gates. Correlations are normalized by the maximum of C(z, z) for graphical convenience.

Fig. 2 clearly shows a high peak of C(z, r), revealing that update and reset gates end up being redundant. This peak is about 66% of the maximum of C(z, z) and it is centered at t = 0, indicating that almost no delay occurs between gate activations. This result is obtained with a single-layer GRU of 200 bidirectional neurons fed with MFCC features and trained with TIMIT. After the training step, the cross-correlation is averaged over all the development sentences.

It would be of interest to examine the evolution of this correlation over the epochs. With this regard, Fig. 3 reports the peak of C(z, r) for some training epochs, showing that the GRU attributes rather quickly a similar role to update and reset gates. In fact, after 3-4 epochs, the correlation peak reaches its maximum value, that is almost maintained for all the subsequent training iterations.

Fig. 3: Evolution of the peak of the cross-correlation C(z, r) over various training epochs.
B. Gradient analysis

The analysis of the main gradient statistics can give some preliminary indications on the role played by the various parameters. With this goal, Table II reports the L2 norm of the gradient for the main parameters of the considered GRU models.

TABLE II: L2 norm of the gradient for the main parameters of the GRU models. The norm is averaged over all the training sentences and epochs.
  Param.    GRU    M-GRU   Li-GRU
  ||W_h||   4.80    4.84    4.86
  ||W_z||   0.85    0.89    0.99
  ||W_r||   0.22     -       -
  ||U_h||   0.15    0.35    0.71
  ||U_z||   0.05    0.11    0.13
  ||U_r||   0.02     -       -

Results reveal that the reset gate weight matrices (i.e., W_r and U_r) have a smaller gradient norm when compared to the other parameters. This result is somewhat expected, since the reset gate parameters are processed by two different non-linearities (i.e., the sigmoid of Eq. 1b and the tanh of Eq. 1c), that can attenuate their gradients. This would anyway indicate that, on average, the reset gate has less impact on the final cost function, further supporting its removal from the GRU design. When avoiding it, the norm of the gradient tends to increase (see for instance the recurrent weights U_h of the M-GRU model). This suggests that the functionalities of the reset gate can be performed by other model parameters. The norm further increases in the case of Li-GRUs, due to the adoption of ReLU units. This non-linearity, indeed, improves the back-propagation of the gradient over both time-steps and hidden layers, making long-term dependencies easier to learn. Results are obtained with the same GRU used in subsection V-A, considering M-GRU and Li-GRU models with the same number of hidden units.
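Statistics like those in Table II can be gathered with a simple monitor that accumulates the L2 norm of each parameter's gradient after every backward pass and averages it over sentences and epochs. The sketch below assumes the gradients are exposed as a name-to-array dictionary (as most toolkits allow); the class name GradNormMonitor is hypothetical and this is not the original Theano code.

```python
import numpy as np
from collections import defaultdict

class GradNormMonitor:
    """Accumulate per-parameter L2 gradient norms, as reported in Table II."""

    def __init__(self):
        self.sums = defaultdict(float)
        self.counts = defaultdict(int)

    def update(self, grads):
        # grads: dict mapping parameter names (e.g. "W_h", "U_z") to gradient arrays.
        for name, g in grads.items():
            self.sums[name] += float(np.linalg.norm(g))   # L2 norm of this sentence's gradient
            self.counts[name] += 1

    def averages(self):
        return {name: self.sums[name] / self.counts[name] for name in self.sums}

# Toy usage with random "gradients" for two sentences:
rng = np.random.default_rng(0)
monitor = GradNormMonitor()
for _ in range(2):
    monitor.update({"W_h": rng.standard_normal((200, 40)),
                    "U_h": rng.standard_normal((200, 200))})
print(monitor.averages())
```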

C. Role of batch normalization

After the preliminary analysis on correlation and gradients done in the previous sub-sections, we now compare RNN models in terms of their final speech recognition performance. To highlight the importance of batch normalization, Table III compares the Phone Error Rate (PER%) achieved with and without this technique.

TABLE III: PER(%) of the GRU models with and without batch normalization (TIMIT dataset, MFCC features).
  Arch.                 GRU    M-GRU   Li-GRU
  without batch norm    18.4    18.6    20.4
  with batch norm       17.1    17.2    16.7

Results show that batch normalization is helpful to improve the ASR performance, leading to a relative improvement of about 7% for GRU and M-GRU and 18% for the proposed Li-GRU. The latter improvement confirms that our model couples particularly well with this technique, due to the adopted ReLU activations. Without batch normalization, the ReLU activations of the Li-GRU are unbounded and tend to cause numerical instabilities. According to our experience, the convergence of Li-GRU without batch normalization can be achieved only by setting rather small learning rate values. The latter setting, however, can lead to a poor performance and, as clearly emerged from this experiment, coupling Li-GRU with this technique is strongly recommended.

D. CTC results

Table IV summarizes the results of CTC on the TIMIT data set. In these experiments, the Li-GRU clearly outperforms the standard GRU, showing the effectiveness of the proposed model even in an end-to-end ASR framework. The improvement is obtained both with and without batch normalization and, similarly to what was observed for hybrid systems, the latter technique leads to better performance when coupled with Li-GRU. However, a smaller performance gain is observed when batch normalization is applied to the CTC. This result could also be related to the different choice of the regularizer, as weight noise was used instead of recurrent dropout.

TABLE IV: PER(%) obtained for the test set of TIMIT with various CTC RNN architectures.
  Arch.    Batch norm: False   True
  GRU                  22.1    22.0
  Li-GRU               21.1    20.9

In general, PERs are higher than those of the hybrid systems. End-to-end methods, in fact, are relatively young models, that are currently still less competitive than more complex (and mature) DNN-HMM approaches. We believe that the gap between CTC and hybrid speech recognizers could be partially reduced in our experiments with a more careful setting of the hyperparameters and with the introduction of an external phone-based language model. The main focus of the paper, however, is to show the effectiveness of the proposed Li-GRU model, and a fair comparison between CTC and hybrid systems is out of the scope of this work.

E. Other results on TIMIT

The results of Tables III and IV highlighted that the proposed Li-GRU model outperforms other GRU architectures. In this sub-section, we extend this study by performing a more detailed comparison with the most popular RNN architectures. To provide a fair comparison, batch normalization is hereinafter applied to all the considered RNN models. Moreover, at least five experiments varying the initialization seeds were conducted for each RNN architecture. The results are thus reported as the average PER with their corresponding standard deviation.

TABLE V: PER(%) obtained for the test set of TIMIT with various RNN architectures.
  Arch.      MFCC          FBANK         fMLLR
  relu-RNN   18.7 ± 0.18   18.3 ± 0.23   16.3 ± 0.11
  LSTM       18.1 ± 0.33   17.1 ± 0.36   15.7 ± 0.32
  GRU        17.1 ± 0.20   16.7 ± 0.36   15.3 ± 0.28
  M-GRU      17.2 ± 0.11   16.7 ± 0.19   15.2 ± 0.10
  Li-GRU     16.7 ± 0.26   15.8 ± 0.10   14.9 ± 0.27

Table V presents the ASR performance obtained with the TIMIT dataset. The first row reports the results achieved with a simple RNN with ReLU activations (no gating mechanisms are used here). Although this architecture has recently shown promising results in some machine learning tasks [51], our results confirm that gated recurrent networks (rows 2-5) outperform traditional RNNs. We also observe that GRUs tend to slightly outperform the LSTM model. As expected, M-GRU (i.e., the architecture without reset gate) achieves a performance very similar to that obtained with standard GRUs, further supporting our speculation on the redundant role played by the reset gate in a speech recognition application. The last row reports the performance achieved with the proposed model, in which, besides removing the reset gate, ReLU activations are used. The Li-GRU performance indicates that our architecture consistently outperforms the other RNNs over all the considered input features. A remarkable achievement is the average PER(%) of 14.9% obtained with fMLLR features. To the best of our knowledge, this result yields the best published performance on the TIMIT test-set.

In Table VI the PER(%) performance is split into five different phonetic categories (vowels, liquids, nasals, fricatives and stops), showing that Li-GRU exhibits the best results for all the considered classes.

Previous results are obtained after optimizing the main hyperparameters of the model on the development set. Table VII reports the outcome of this optimization process, with the corresponding best architectures obtained for each RNN architecture. For GRU models, the best performance is achieved with 5 hidden layers of 465 neurons. It is also worth noting that M-GRU and Li-GRU have about 30% fewer parameters compared to the standard GRU.

TABLE VI: PER(%) of the TIMIT dataset (MFCC features) split into five different phonetic categories. Silences (sil) are not considered here.
  Phonetic Cat.  Phone Lists                       GRU    Li-GRU
  Vowels         {iy,ih,eh,ae,...,oy,aw,ow,er}     23.2    23.0
  Liquids        {l,r,y,w,el}                      20.1    19.0
  Nasals         {en,m,n,ng}                       16.8    15.9
  Fricatives     {ch,jh,dh,z,v,f,th,s,sh,hh,zh}    17.6    17.0
  Stops          {b,d,g,p,t,k,dx,cl,vcl,epi}       18.5    17.9

TABLE VII: Optimal number of layers and neurons for each TIMIT RNN model. The outcome of the optimization process is similar for all considered features.
  Architecture   Layers   Neurons   # Params
  relu-RNN         4        607      6.1 M
  LSTM             5        375      8.8 M
  GRU              5        465     10.3 M
  M-GRU            5        465      7.4 M
  Li-GRU           5        465      7.4 M

F. Recognition performance on DIRHA English WSJ

After a first set of experiments on TIMIT, in this sub-section we assess our model on a more challenging and realistic distant-talking task, using the DIRHA English WSJ corpus. A challenging aspect of this dataset is the acoustic mismatch between training and testing conditions. Training, in fact, is performed with a reverberated version of WSJ, while the test is characterized by both non-stationary noises and reverberation. Tables VIII and IX summarize the results obtained with the simulated and real parts of this dataset.

TABLE VIII: Word Error Rate (%) obtained with the DIRHA English WSJ dataset (simulated part) for various RNN architectures.
  Arch.      MFCC          FBANK         fMLLR
  relu-RNN   23.7 ± 0.21   23.5 ± 0.30   18.9 ± 0.26
  LSTM       23.2 ± 0.46   23.2 ± 0.42   18.9 ± 0.24
  GRU        22.3 ± 0.39   22.5 ± 0.38   18.6 ± 0.23
  M-GRU      21.5 ± 0.43   22.0 ± 0.37   18.0 ± 0.21
  Li-GRU     21.3 ± 0.38   21.4 ± 0.32   17.6 ± 0.20

TABLE IX: Word Error Rate (%) obtained with the DIRHA English WSJ dataset (real part) for various RNN architectures.
  Arch.      MFCC          FBANK         fMLLR
  relu-RNN   29.7 ± 0.31   30.0 ± 0.38   24.7 ± 0.28
  LSTM       29.5 ± 0.41   29.1 ± 0.42   24.6 ± 0.35
  GRU        28.5 ± 0.37   28.4 ± 0.21   24.0 ± 0.27
  M-GRU      28.4 ± 0.34   28.1 ± 0.30   23.6 ± 0.21
  Li-GRU     27.8 ± 0.38   27.6 ± 0.36   22.8 ± 0.26

These results exhibit a trend comparable to that observed for TIMIT, confirming that Li-GRU still outperforms GRU even in a more challenging scenario. The results are consistent over both real and simulated data as well as across the different features considered in this study.

The reset gate removal seems to play a more crucial role in the addressed distant-talking scenario. If the close-talking performance reported in Table V highlights comparable error rates between standard GRU and M-GRU, in the distant-talking case we even observe a small performance gain when removing the reset gate. We suppose that this behaviour is due to reverberation, that implicitly introduces redundancy in the signal, due to the multiple delayed replicas of each sample. This results in a forward memory effect, that can make the reset gate ineffective.

In Fig. 4, we extend our previous experiments by generating simulated data with different reverberation times T60 ranging from 250 to 1000 ms, as outlined in Sec. IV-A. In order to simulate a more realistic situation, different impulse responses have been used for training and testing purposes. No additive noise is considered for these experiments. As expected, the performance degrades as the reverberation time increases. Similarly to previous achievements, we still observe that Li-GRU outperforms GRU under all the considered reverberation conditions.

Fig. 4: Evolution of the WER(%) for the DIRHA WSJ simulated data over different reverberation times T60.

G. Recognition performance on CHiME

In this sub-section we extend the results to the CHiME corpus, that is an important benchmark in the ASR field, thanks to the success of the CHiME challenges [23], [66]. In Tab. X a comparison across the various GRU architectures is presented. For the sake of comparison, the results obtained with the official CHiME 4 baselines are also reported in the first two rows. (These results are not directly comparable with the best systems of the CHiME 4 competition: due to the purpose of this work, techniques such as multi-microphone processing, data-augmentation, system combination as well as lattice rescoring are not used here.)

TABLE X: Speech recognition performance on the CHiME dataset (single channel, fMLLR features).
  Arch.      DT-sim        DT-real       ET-sim        ET-real
  DNN        17.8 ± 0.38   16.1 ± 0.31   26.1 ± 0.45   30.0 ± 0.48
  DNN+sMBR   14.7 ± 0.25   15.7 ± 0.23   24.0 ± 0.31   27.0 ± 0.35
  GRU        15.8 ± 0.30   14.8 ± 0.25   23.0 ± 0.38   23.3 ± 0.35
  M-GRU      15.9 ± 0.32   14.1 ± 0.31   22.8 ± 0.39   23.0 ± 0.41
  Li-GRU     13.5 ± 0.25   12.5 ± 0.22   20.3 ± 0.31   20.0 ± 0.33

TABLE XI: Comparison between GRU and Li-GRU for the four different noisy conditions considered in CHiME on the real evaluation set (ET-real).
  Arch.      BUS    CAF    PED    STR
  DNN        44.1   32.0   26.2   17.7
  DNN+sMBR   40.5   28.3   22.9   16.3
  GRU        33.5   25.6   19.5   14.6
  M-GRU      33.1   24.9   19.2   14.9
  Li-GRU     28.0   22.1   16.9   13.2

Results confirm the trend previously observed, highlighting a significant relative improvement of about 14% achieved when passing from GRU to the proposed Li-GRU. Similarly to our findings of the previous section, some small benefits can be observed when removing the reset gate. The largest performance gap, however, is reached when adopting ReLU units (see the M-GRU and Li-GRU rows), confirming the effectiveness of this architectural variation. Note also that the GRU systems significantly outperform the DNN baseline, even when the latter is based on sequence discriminative training (DNN+sMBR) [67].

Tab. XI splits the ASR performance of the real test set into the four noisy categories. Li-GRU outperforms GRU in all the considered environments, with a performance gain that is higher when more challenging acoustic conditions are met. For instance, we obtain a relative improvement of 16% in the bus (BUS) environment (the noisiest), against the relative improvement of 9.5% observed in the street (STR) recordings.

H. Recognition performance on TED-talks

Tab. XII reports a comparison between GRU and Li-GRU on the TED-talks corpus. The experiments are performed with standard MFCC features, and a four-gram language model is considered in the decoding step (see [68] for more details).

TABLE XII: Comparison between GRU and Li-GRU with the TED-talks corpus.
  Arch.    TST-2011      TST-2012
  GRU      16.3 ± 0.13   17.0 ± 0.16
  Li-GRU   13.8 ± 0.12   14.4 ± 0.15

Results on both test sets consistently show the performance gain achieved with the proposed architecture. This further confirms the effectiveness of Li-GRU, even for a larger scale ASR task. In particular, a relative improvement of about 14-17% is achieved. This improvement is statistically significant according to the Matched Pairs Sentence Segment Word Error Test (MPSSW) [69], that is conducted with the NIST sclite sc_stat tool with a p-value of 0.01.

I. Training time comparison

In the previous subsections, we reported several speech recognition results, showing that Li-GRU outperforms other RNNs. In this sub-section, we finally focus on another key aspect of the proposed architecture, namely its improved computational efficiency. In Table XIII, we compare the per-epoch wall-clock training time of GRU and Li-GRU models.

TABLE XIII: Per-epoch training time (in minutes) of GRU and Li-GRU models for the various datasets on an NVIDIA K40 GPU.
  Arch.    TIMIT     DIRHA    CHiME     TED
  GRU      9.6 min   40 min   312 min   590 min
  Li-GRU   6.5 min   25 min   205 min   447 min

The training time reduction achieved with the proposed architecture is about 30% for all the datasets. This reduction reflects the amount of parameters saved by Li-GRU, which is also around 30%. The reduction of the computational complexity, originated by a more compact model, also arises for testing purposes, making our model potentially suitable for small-footprint ASR [70]–[75], which studies DNNs designed for portable devices with small computational capabilities.

VI. CONCLUSIONS

In this paper, we revised standard GRUs for speech recognition purposes. The proposed Li-GRU architecture is a simplified version of a standard GRU, in which the reset gate is removed and ReLU activations are considered. Batch normalization is also used to further improve the system performance as well as to limit the numerical instabilities originated from ReLU non-linearities.

The experiments, conducted on different ASR paradigms, tasks, features and environmental conditions, have confirmed the effectiveness of the proposed model. The Li-GRU, in fact, not only yields a better recognition performance, but also reduces the computational complexity, with a reduction of more than 30% of the training time over a standard GRU. Future efforts will be focused on extending this work to other speech-based tasks, such as speech enhancement and speech separation, as well as to explore the use of Li-GRU in other possible fields.

ACKNOWLEDGMENTS

The authors would like to thank Piergiorgio Svaizer for his insightful suggestions on an earlier version of this paper. We would also like to thank the anonymous reviewers for their careful reading of our manuscript and their helpful comments.

We gratefully acknowledge the support of NVIDIA Corporation with the donation of a Tesla K40 GPU used for this research. Computations were also made on the Helios supercomputer from the University of Montreal, managed by Calcul Québec and Compute Canada.

REFERENCES

[1] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.
[2] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014.
[3] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proc. of ICML, 2015, pp. 448–456.
[4] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. of ICLR, 2015.

[5] O. Abdel-Hamid, A. R. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, "Convolutional neural networks for speech recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 10, pp. 1533–1545, Oct. 2014.
[6] V. Peddinti, D. Povey, and S. Khudanpur, "A time delay neural network architecture for efficient modeling of long temporal contexts," in Proc. of Interspeech, 2015, pp. 3214–3218.
[7] Y. Zhang, G. Chen, D. Yu, K. Yao, S. Khudanpur, and J. R. Glass, "Highway long short-term memory RNNs for distant speech recognition," in Proc. of ICASSP, 2016, pp. 5755–5759.
[8] K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio, "On the properties of neural machine translation: Encoder-decoder approaches," in Proc. of SSST, 2014.
[9] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Proc. of NIPS, 2014, pp. 2672–2680.
[10] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, Jan. 2016.
[11] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in Proc. of ICLR, 2015.
[12] D. Yu and L. Deng, Automatic Speech Recognition - A Deep Learning Approach. Springer, 2015.
[13] E. Hänsler and G. Schmidt, Speech and Audio Processing in Adverse Environments. Springer, 2008.
[14] J. Li, L. Deng, Y. Gong, and R. Haeb-Umbach, "An overview of noise-robust automatic speech recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 4, pp. 745–777, 2014.
[15] P. Swietojanski, A. Ghoshal, and S. Renals, "Hybrid acoustic models for distant and multichannel large vocabulary speech recognition," in Proc. of ASRU, 2013, pp. 285–290.
[16] Y. Liu, P. Zhang, and T. Hain, "Using neural network front-ends on far field multiple microphones based speech recognition," in Proc. of ICASSP, 2014, pp. 5542–5546.
[17] F. Weninger, S. Watanabe, J. Le Roux, J. Hershey, Y. Tachioka, J. Geiger, B. Schuller, and G. Rigoll, "The MERL/MELCO/TUM system for the REVERB challenge using deep recurrent neural network feature enhancement," in IEEE REVERB Workshop, 2014.
[18] M. Mimura, S. Sakai, and T. Kawahara, "Reverberant speech recognition combining deep neural networks and deep autoencoders," in IEEE REVERB Workshop, 2014.
[19] A. Schwarz, C. Huemmer, R. Maas, and W. Kellermann, "Spatial diffuseness features for DNN-based speech recognition in noisy and reverberant environments," in Proc. of ICASSP, 2015, pp. 4380–4384.
[20] M. Ravanelli, P. Brakel, M. Omologo, and Y. Bengio, "Batch-normalized joint training for DNN-based distant speech recognition," in Proc. of SLT, 2016, pp. 28–34.
[21] M. Ravanelli and M. Omologo, "Contaminated speech training methods for robust DNN-HMM distant speech recognition," in Proc. of Interspeech, 2015, pp. 756–760.
[22] M. Ravanelli, P. Brakel, M. Omologo, and Y. Bengio, "A network of deep neural networks for distant speech recognition," in Proc. of ICASSP, 2017, pp. 4880–4884.
[23] J. Barker, R. Marxer, E. Vincent, and S. Watanabe, "The third CHiME speech separation and recognition challenge: Dataset, task and baselines," in Proc. of ASRU, 2015, pp. 504–511.
[24] K. Kinoshita et al., "The REVERB challenge: A common evaluation framework for dereverberation and recognition of reverberant speech," in Proc. of WASPAA, 2013, pp. 1–4.
[25] M. Harper, "The Automatic Speech recognition In Reverberant Environments (ASpIRE) challenge," in Proc. of ASRU, 2015, pp. 547–554.
[26] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio, "End-to-end attention-based large vocabulary speech recognition," in Proc. of ICASSP, 2016, pp. 4945–4949.
[27] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in Proc. of ICML, 2006, pp. 369–376.
[28] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, Nov. 1997.
[29] F. A. Gers and J. Schmidhuber, "Recurrent nets that time and count," in Proc. of the INNS-ENNS, 2000, pp. 189–194, vol. 3.
[30] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, "LSTM: A search space odyssey," IEEE Transactions on Neural Networks and Learning Systems, vol. PP, no. 99, pp. 1–11, 2016.
[31] G. Zhou, J. Wu, C. Zhang, and Z. Zhou, "Minimal gated unit for recurrent neural networks," 2016. [Online]. Available: http://arxiv.org/abs/1603.09420
[32] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proc. of ICML, 2010, pp. 807–814.
[33] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. of NIPS, 2012, pp. 1097–1105.
[34] G. E. Dahl, D. Yu, L. Deng, and A. Acero, "Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 30–42, 2012.
[35] A. Graves, N. Jaitly, and A. Mohamed, "Hybrid speech recognition with deep bidirectional LSTM," in Proc. of ASRU, 2013.
[36] H. Sak, A. W. Senior, and F. Beaufays, "Long short-term memory architectures for large scale acoustic modeling," in Proc. of Interspeech, 2014, pp. 338–342.
[37] D. Amodei et al., "Deep Speech 2: End-to-end speech recognition in English and Mandarin," 2015. [Online]. Available: http://arxiv.org/abs/1512.02595
[38] Z. Chen, S. Watanabe, H. Erdogan, and J. Hershey, "Speech enhancement and recognition using multi-task learning of long short-term memory recurrent neural networks," in Proc. of Interspeech, 2015, pp. 3274–3278.
[39] T. Menne, J. Heymann, A. Alexandridis, K. Irie, A. Zeyer, M. Kitza, P. Golik, I. Kulikov, L. Drude, R. Schlüter, H. Ney, R. Haeb-Umbach, and A. Mouchtaris, "The RWTH/UPB/FORTH system combination for the 4th CHiME challenge evaluation," in CHiME 4 challenge, 2016, pp. 39–44.
[40] F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. L. Roux, J. R. Hershey, and B. W. Schuller, "Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR," in Proc. of LVA/ICA, 2015, pp. 91–99.
[41] F. Weninger, J. Le Roux, J. Hershey, and B. Schuller, "Discriminatively trained recurrent neural networks for single-channel speech separation," in Proc. of IEEE GlobalSIP, 2014.
[42] H. Erdogan, J. R. Hershey, S. Watanabe, and J. L. Roux, "Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks," in Proc. of ICASSP, 2015, pp. 708–712.
[43] F. Eyben, F. Weninger, S. Squartini, and B. Schuller, "Real-life voice activity detection with LSTM recurrent neural networks and an application to Hollywood movies," in Proc. of ICASSP, 2013, pp. 483–487.
[44] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157–166, Mar. 1994.
[45] R. Pascanu, T. Mikolov, and Y. Bengio, "On the difficulty of training recurrent neural networks," in Proc. of ICML, 2013, pp. 1310–1318.
[46] R. Jozefowicz, W. Zaremba, and I. Sutskever, "An empirical exploration of recurrent network architectures," in Proc. of ICML, 2015, pp. 2342–2350.
[47] J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," in Proc. of NIPS, 2014.
[48] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proc. of AISTATS, 2010, pp. 249–256.
[49] C. Laurent, G. Pereyra, P. Brakel, Y. Zhang, and Y. Bengio, "Batch normalized recurrent neural networks," in Proc. of ICASSP, 2016, pp. 2657–2661.
[50] T. Cooijmans, N. Ballas, C. Laurent, and A. Courville, "Recurrent batch normalization," 2016. [Online]. Available: http://arxiv.org/abs/1603.09025
[51] Q. Le, N. Jaitly, and G. Hinton, "A simple way to initialize recurrent networks of rectified linear units," 2015. [Online]. Available: http://arxiv.org/abs/1504.00941
[52] M. Ravanelli, P. Brakel, M. Omologo, and Y. Bengio, "Improving speech recognition by revising gated recurrent units," in Proc. of Interspeech, 2017.
[53] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, "The Kaldi speech recognition toolkit," in Proc. of ASRU, 2011.
[54] M. Ravanelli, L. Cristoforetti, R. Gretter, M. Pellin, A. Sosi, and M. Omologo, "The DIRHA-English corpus and related tasks for distant-speech recognition in domestic environments," in Proc. of ASRU, 2015, pp. 275–282.

[55] J. Allen and D. Berkley, "Image method for efficiently simulating small-room acoustics," in J. Acoust. Soc. Am., 1979, pp. 2425–2428.
[56] M. Ravanelli, P. Svaizer, and M. Omologo, "Realistic multi-microphone data simulation for distant speech recognition," in Proc. of Interspeech, 2016, pp. 2786–2790.
[57] M. Ravanelli, A. Sosi, P. Svaizer, and M. Omologo, "Impulse response estimation for robust speech recognition in a reverberant environment," in Proc. of EUSIPCO, 2012, pp. 1668–1672.
[58] M. Ravanelli and M. Omologo, "On the selection of the impulse responses for distant-speech recognition based on contaminated speech training," in Proc. of Interspeech, 2014, pp. 1028–1032.
[59] M. Federico, L. Bentivogli, M. Paul, and S. Stüker, "Overview of the IWSLT 2011 evaluation campaign," in Proc. of IWSLT, 2011.
[60] T. Moon, H. Choi, H. Lee, and I. Song, "RNNDROP: A novel dropout for RNNs in ASR," in Proc. of ASRU, 2015, pp. 65–70.
[61] Y. Gal and Z. Ghahramani, "A theoretically grounded application of dropout in recurrent neural networks," in Proc. of NIPS, 2016.
[62] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, "Curriculum learning," in Proc. of ICML, 2009, pp. 41–48.
[63] Theano Development Team, "Theano: A Python framework for fast computation of mathematical expressions," arXiv e-prints, vol. abs/1605.02688, May 2016.
[64] A. Graves, "Practical variational inference for neural networks," in Proc. of NIPS, 2011, pp. 2348–2356.
[65] A. Graves, A. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in Proc. of ICASSP, 2013, pp. 6645–6649.
[66] J. Barker, E. Vincent, N. Ma, H. Christensen, and P. Green, "The PASCAL CHiME speech separation and recognition challenge," Computer Speech and Language, vol. 27, no. 3, pp. 621–633, 2013.
[67] K. Veselý, A. Ghoshal, L. Burget, and D. Povey, "Sequence-discriminative training of deep neural networks," in Proc. of Interspeech, 2013, pp. 2345–2349.
[68] N. Ruiz, A. Bisazza, F. Brugnara, D. Falavigna, D. Giuliani, S. Jaber, R. Gretter, and M. Federico, "FBK-IWSLT 2011," in Proc. of IWSLT, 2011.
[69] L. Gillick and S. Cox, "Some statistical issues in the comparison of speech recognition algorithms," in Proc. of ICASSP, 1989, pp. 532–535.
[70] L. Lu and S. Renals, "Small-footprint deep neural networks with highway connections for speech recognition," in Proc. of Interspeech, 2016, pp. 12–16.
[71] Y. Wang, J. Li, and Y. Gong, "Small-footprint high-performance deep neural network-based speech recognition using split-VQ," in Proc. of ICASSP, 2015, pp. 4984–4988.
[72] G. Chen, C. Parada, and G. Heigold, "Small-footprint keyword spotting using deep neural networks," in Proc. of ICASSP, 2014, pp. 4087–4091.
[73] T. N. Sainath and C. Parada, "Convolutional neural networks for small-footprint keyword spotting," in Proc. of Interspeech, 2015, pp. 4087–4091.
[74] T. Hori, S. Araki, T. Yoshioka, M. Fujimoto, S. Watanabe, T. Oba, A. Ogawa, K. Otsuka, D. Mikami, K. Kinoshita, T. Nakatani, A. Nakamura, and J. Yamato, "Low-latency real-time meeting recognition and understanding using distant microphones and omni-directional camera," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 2, pp. 499–513, 2012.
[75] X. Lei, A. Senior, A. Gruenstein, and J. Sorensen, "Accurate and compact large vocabulary speech recognition on mobile devices," in Proc. of Interspeech, 2013, pp. 662–665.