Generalization Error Bounds for Deep Unfolding RNNs

Boris Joukovsky 1,2, Tanmoy Mukherjee 1,2, Huynh Van Luong 1,2, Nikos Deligiannis 1,2

1Department of Electronics and Informatics, Vrije Universiteit Brussel, Pleinlaan 2, B-1050 Brussels, Belgium 2imec, Kapeldreef 75, B-3001 Leuven, Belgium

Accepted for the 37th Conference on Uncertainty in Artificial Intelligence (UAI 2021).

Abstract

Recurrent Neural Networks (RNNs) are powerful models with the ability to model sequential data. However, they are often viewed as black-boxes and lack in interpretability. Deep unfolding methods take a step towards interpretability by designing deep neural networks as learned variations of iterative optimization algorithms to solve various signal processing tasks. In this paper, we explore theoretical aspects of deep unfolding RNNs in terms of their generalization ability. Specifically, we derive generalization error bounds for a class of deep unfolding RNNs via Rademacher complexity analysis. To our knowledge, these are the first generalization bounds proposed for deep unfolding RNNs. We show theoretically that our bounds are tighter than similar ones for other recent RNNs, in terms of the number of timesteps. By training models in a classification setting, we demonstrate that deep unfolding RNNs can outperform traditional RNNs in standard sequence classification tasks. These experiments allow us to relate the empirical generalization error to the theoretical bounds. In particular, we show that over-parametrized deep unfolding models like reweighted-RNN achieve tight theoretical error bounds with minimal decrease in accuracy, when trained with explicit regularization.

1 INTRODUCTION

The past few years have seen an undeniable success of Recurrent Neural Networks (RNNs) in applications ranging from machine translation [Bahdanau et al., 2015] and image captioning [Karpathy and Fei-Fei, 2017] to speech processing [Oord et al., 2016]. Despite their impressive performance, existing approaches are typically designed based on heuristics of the task and rely on engineering experience. As a result, such models lack in theoretical understanding and interpretability. A step towards post-hoc interpretability of RNNs was made by Karpathy et al. [2015], who observed the advantages of the internal states of the long short-term memory (LSTM) [Hochreiter and Schmidhuber, 1997] over traditional RNNs in modelling interpretable patterns in the input data; this study was further complemented by Goldberg [2015]. Subsequent works in explainable AI have also proposed methods to explain the decision-making process of a trained LSTM for a given prediction, in terms of the most influential inputs [Li et al., 2016, Arras et al., 2019].

Efforts toward a theoretical understanding of RNNs also involve the derivation of bounds on the generalization error, i.e., the difference between the empirical loss and the expected loss. Such bounds rely on measures such as the Rademacher complexity [Bartlett and Mendelson, 2002], the PAC-Bayes theory [McAllester, 2003], algorithmic stability [Bousquet and Elisseeff, 2002] and robustness [Xu and Mannor, 2012], and aim at understanding how accurately a model is able to generalise to unseen data in relation to the optimization algorithm used for training, the network architecture, or the underlying data structure.

Generalization error bounds (GEBs) for RNN models were studied by Arora et al. [2018], Wei and Ma [2019] and Akpinar et al. [2019]. The theoretical result by Arora et al. [2018] for low-dimensional embeddings using LSTM models was derived for classification tasks. That work showed an interesting property of embeddings such as GloVe and word2vec, namely, that forming a good sensing matrix for text leads to better representations than those obtained with random matrices. The data-dependent sample complexity of deep neural networks was studied by Wei and Ma [2019].

Unlike the existing Rademacher complexity bounds that depend on the norms of the weight matrices and depend exponentially on the model depth [Bartlett and Mendelson, 2002, Dziugaite and Roy, 2017, Neyshabur et al., 2018], the study by Wei and Ma [2019] derived tighter Rademacher bounds by considering additional data-dependent properties controlling the norms of the hidden layers and the norms of the Jacobians of each layer with respect to all previous layers. The derivation methods by Wei and Ma [2019] were extended to provide GEBs for RNNs scaling polynomially with respect to the depth of the model. Specifically, generalization abilities were explored by Akpinar et al. [2019], where explicit sample complexity bounds for single- and multi-layer RNNs were derived.

The focus of this work is in studying how model interpretability, by means of designing networks based on deep unfolding, interacts with model generalization capacity. Deep unfolding methods design deep models as unrolled iterations of optimization algorithms; a key result in this direction is the learned iterative shrinkage thresholding algorithm (LISTA) [Gregor and LeCun, 2010] that solves the sparse coding problem. Iterative optimization algorithms are usually highly interpretable because they are developed via modeling the physical processes underlying the problem and/or capturing prior domain knowledge [Lucas et al., 2018]. Hence, deep networks designed by deep unfolding capture domain knowledge and promote model-based structure in the data; in other words, such networks can be naturally interpreted as a parameter-optimized algorithm [Monga et al., 2019]. Deep unfolding has been used to develop interpretable RNNs: SISTA-RNN [Wisdom et al., 2017] is an extension of LISTA that solves a sequential signal reconstruction task based on sparse modelling with the aid of side information. Alternatively, Le et al. [2019] designed ℓ1-ℓ1-RNN by unfolding a proximal method that solves an ℓ1-ℓ1 minimization problem [Mota et al., 2017]. In our recent work [Luong et al., 2021], we proposed reweighted-RNN, which unfolds a reweighted version of the ℓ1-ℓ1 minimization problem and incorporates additional weights so as to increase the model expressivity.

Deep unfolding RNNs excel in solving the underlying signal reconstruction tasks, outperforming traditional RNN baselines while having a substantially lower parameter count than the LSTM or gated recurrent unit (GRU) [Cho et al., 2014] models; however, their ability to leverage the sparse structure of data as a means to solve traditional RNN tasks (e.g., sequence classification) is relatively unexplored. Furthermore, despite the progress in deep unfolding models, their generalization ability has received no such attention; in particular, no work exists studying the GEBs of deep unfolding RNN models. In this paper, we study theoretical aspects of deep unfolding RNNs by means of their GEBs. We also benchmark deep unfolding RNNs in classification problems and relate the empirical generalization error to the theoretical one.

The contributions of this work are as follows:

• We derive generalization error bounds (GEBs) for deep unfolding RNNs by means of Rademacher complexity analysis, taking the reweighted-RNN model [Luong et al., 2021] as a prototype, which also subsumes the ℓ1-ℓ1-RNN [Le et al., 2019]. We also present an extension of the GEBs for the case where the RNNs are used in a classification setting. We show theoretically that the proposed bounds are tighter than those of other recent lightweight RNNs with generalization guarantees, namely, SpectralRNN [Zhang et al., 2018] and FastGRNN [Kusupati et al., 2018], in terms of the number of time steps T.

• We assess the performance of deep unfolding RNNs in classification settings, namely in speech command detection and language modelling tasks. For the first task, our results show that reweighted-RNN outperforms other lightweight RNNs, and that deep unfolding RNNs are competitive with traditional RNN baselines (i.e., vanilla RNN, LSTM and GRU) while being smaller in size. For the second task, we observe a significant improvement of reweighted-RNN over all other models.

• By taking speech command detection as an example, we show that the proposed GEB for reweighted-RNN is tight and agrees with the empirical generalization gap. This can be achieved when the model is sufficiently regularized, while maintaining high classification accuracy.

The remainder of this paper is as follows: Section 2 reviews traditional stacked RNN models, followed by an introduction to deep unfolding RNNs. Section 3 presents the proposed GEBs for deep unfolding RNNs, which are obtained by studying the complexity of their latent representation stage. The bound is then extended to the classification problem. In Section 4, we experimentally compare reweighted-RNN to other deep unfolding and traditional RNN models on classification tasks. We also evaluate the GEB experimentally and relate the theoretical aspects to empirical observations.

2 BACKGROUND TO RNN MODELS

2.1 STACKED RNNS

For each time step t = 1, ..., T, the vanilla RNN recursively updates the latent representation h_t through a linear transformation of the input sample x_t and the previous state h_{t-1}, followed by a non-linear activation σ(·). To achieve higher expressivity, one may use a stacked vanilla RNN [Pascanu et al., 2014] by juxtaposing multiple RNN layers. In this setting, the representation h_t^{(l)} at layer l and time step t is computed by

h_t^{(l)} = \begin{cases} \sigma\big(V_1 h_{t-1}^{(l)} + U_1 x_t\big), & \text{if } l = 1, \\ \sigma\big(W_l h_t^{(l-1)} + V_l h_{t-1}^{(l)}\big), & \text{if } l > 1, \end{cases}    (1)

where U_1, W_l, V_l are the weight matrices learned per layer l and σ(·) is the activation function. For sequence classification tasks, RNN models can be trained end-to-end with a softmax-activated linear layer to obtain output class probabilities. It is recognized that the vanilla RNN is prone to gradient vanishing, resulting in training instabilities in the case of long data sequences and deep architectures. Popular RNN variants, including the LSTM [Hochreiter and Schmidhuber, 1997] and the GRU [Cho et al., 2014], alleviate this issue by leveraging gating mechanisms to better capture long-term dependencies in the data. More recent lightweight models, such as FastGRNN [Kusupati et al., 2018] and SpectralRNN [Zhang et al., 2018], make use of weight factorization to achieve a low parameter count; such models improve the stability of deeper RNNs and are highly competitive with standard RNNs on various prediction tasks.

2.2 DEEP UNFOLDING RNNS

Unlike traditional RNNs, the considered deep unfolding RNNs follow a signal reconstruction model with low-complexity data priors. How these models generalize to classification tasks, e.g., by training a linear classifier on the latent representation h_T^{(d)}, remains an open problem. In what follows, we review our recent reweighted-RNN [Luong et al., 2021], which unfolds a proximal algorithm to solve a generic sequential signal reconstruction problem.

2.2.1 Reweighted-RNN

Problem Formulation: Consider the problem of reconstructing a sequence of signals s_t ∈ R^{n_0}, t = 1, 2, ..., T, from a sequence of noisy measurement vectors x_t = A s_t + η_t, where x_t ∈ R^n and A ∈ R^{n×n_0} (n ≪ n_0). According to domain knowledge in sparse signal reconstruction with side information and video-frame reconstruction [Deligiannis et al., 2013, Mota et al., 2017], it is assumed that: (i) s_t has a sparse representation h_t ∈ R^h in a dictionary D ∈ R^{n_0×h}, that is, s_t = D h_t; and (ii) the difference of two successive sparse representations is also sparse up to an affine transform P ∈ R^{h×h} promoting the temporal correlation, i.e., h_t − P h_{t-1} is small in the ℓ1-norm sense. Assumptions (i) and (ii) define a sequential ℓ1-ℓ1 reconstruction problem; it is however recognized that ℓ1-ℓ1 minimization can be improved by reweighted minimization [Candès et al., 2008, Van Luong et al., 2018], which leads to sparser representations. The reweighting scheme introduces positive weights g ∈ R^h to individually penalize the magnitude of each element of h_t and h_t − P h_{t-1}. To further generalize, a reweighting matrix Z ∈ R^{h×h} is introduced as an extra transformation of h_t. The considered reweighted ℓ1-ℓ1 minimization problem is:

\min_{h_t} \Big\{ \frac{1}{2} \|x_t - A D Z h_t\|_2^2 + \lambda_1 \|g \circ Z h_t\|_1 + \lambda_2 \|g \circ (Z h_t - P h_{t-1})\|_1 \Big\}, \quad t = 1, ..., T.    (2)

The first term in Problem (2) corresponds to the data-fidelity term and λ_1, λ_2 > 0 are regularization parameters controlling the weighted-ℓ1 penalization terms. Note that the presence of Z in Problem (2) is a deviation from the reweighted minimization scheme of Candès et al. [2008]. Nevertheless, the study in [Luong et al., 2021] shows that Z enables the unfolded RNN to achieve higher expressivity by following more closely the vanilla RNN architecture.

Proximal Algorithm: We use a proximal gradient method [Beck and Teboulle, 2009] to solve Problem (2), which is given in Algorithm 1. At iteration l, the algorithm performs a proximal gradient update in two steps: first, it updates h_t^{(l-1)} using the gradient of the fidelity term of Problem (2); second, it applies the proximal operator Φ_{(λ1/c) g_l, (λ2/c) g_l, ℏ} element-wise to the result, with ℏ ≡ P h_{t-1} denoting the side information from the previous time step.

Algorithm 1: Reweighted ℓ1-ℓ1 algorithm for sequential signal reconstruction.
1:  Input: measurements x_1, ..., x_T, measurement matrix A, dictionary D, affine transform P, initial h_0^{(d)} ≡ h_0, reweighting matrices Z_1, ..., Z_d and vectors g_1, ..., g_d, c, λ_1, λ_2.
2:  Output: sequence of sparse codes h_1, ..., h_T.
3:  for t = 1, ..., T do
4:      h_t^{(0)} = P h_{t-1}^{(d)}
5:      for l = 1 to d do
6:          u = [Z_l − (1/c) Z_l D^T A^T A D] h_t^{(l-1)} + (1/c) Z_l D^T A^T x_t
7:          h_t^{(l)} = Φ_{(λ1/c) g_l, (λ2/c) g_l, P h_{t-1}^{(d)}}(u)
8:      end
9:  end
10: return h_1^{(d)}, ..., h_T^{(d)}

The closed-form solution of the proximal operator is given in the supplementary material and we refer to Luong et al. [2021] for extended proofs. Fig. 1 depicts the proximal operator for ℏ ≥ 0, which is essentially a two-plateau soft-thresholding function. Notice that each entry of g_l controls the width of the two thresholds, independently for each element. The plateau at height ℏ in Fig. 1 promotes the correlation of the signal with the side information.

Deep Unfolded RNN: The reweighted-RNN model [Luong et al., 2021] is built by unrolling Algorithm 1 across the iterations l = 1, ..., d (yielding the hidden layers) and the time steps t = 1, ..., T. The proximal operator Φ_{(λ1/c) g_l, (λ2/c) g_l, ℏ}(u) also serves as the non-linear activation function. The form of this function depends on the learnable parameters λ_1, λ_2, c, and g_l, ∀l = 1, ..., d, allowing reweighted-RNN to learn per-neuron adjustable activations.
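To make the role of the proximal operator concrete, the following NumPy sketch implements one inner iteration of Algorithm 1 (lines 6-7) together with a two-plateau soft-thresholding operator for the case ℏ ≥ 0 shown in Fig. 1. The piecewise form of the operator is derived here from the scalar subgradient conditions of Problem (2) and is only an illustration; the paper defers its exact closed-form expression to the supplementary material, and all function and variable names below are ours.

```python
import numpy as np

def prox_two_plateau(u, a, b, hbar):
    """Element-wise argmin_h 0.5*(h - u)^2 + a*|h| + b*|h - hbar|, assuming hbar >= 0.

    Here a = (lambda1 / c) * g_l and b = (lambda2 / c) * g_l are element-wise weights.
    The result has plateaus at 0 and at hbar, as sketched in Fig. 1.
    """
    out = np.where(u < -(a + b), u + a + b, 0.0)                          # left linear branch / plateau at 0
    mid = (u > a - b) & (u < hbar + a - b)
    out = np.where(mid, u - a + b, out)                                   # linear segment between the plateaus
    out = np.where((u >= hbar + a - b) & (u <= hbar + a + b), hbar, out)  # plateau at hbar
    out = np.where(u > hbar + a + b, u - a - b, out)                      # right linear branch
    return out

def reweighted_l1l1_layer(h_prev_layer, h_prev_time, x_t, A, D, P, Z_l, g_l, c, lam1, lam2):
    """One inner iteration (layer l) of Algorithm 1 for a single time step t."""
    G = D.T @ A.T @ A @ D                                                 # D^T A^T A D
    u = (Z_l - Z_l @ G / c) @ h_prev_layer + (Z_l @ D.T @ A.T @ x_t) / c  # gradient step (line 6)
    hbar = P @ h_prev_time                                                # side information P h_{t-1}^{(d)}
    return prox_two_plateau(u, lam1 * g_l / c, lam2 * g_l / c, hbar)      # proximal step (line 7)
```

Stacking this update for l = 1, ..., d and t = 1, ..., T reproduces the recursion that reweighted-RNN unrolls into its hidden layers.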

Figure 1: The generic form of the proximal operator Φ_{(λ1/c) g_l, (λ2/c) g_l, ℏ}(u) for Algorithm 1 (particular case of ℏ ≥ 0), but also the activation function in the reweighted-RNN. Note that a per-unit, per-layer g_l leads to a different activation function.
In summary, the layers of reweighted-RNN realise the following updates:

h_t^{(l)} = \begin{cases} \Phi_{\frac{\lambda_1}{c} g_1, \frac{\lambda_2}{c} g_1, P h_{t-1}^{(d)}}\big(W_1 h_{t-1}^{(d)} + U_1 x_t\big), & \text{if } l = 1, \\ \Phi_{\frac{\lambda_1}{c} g_l, \frac{\lambda_2}{c} g_l, P h_{t-1}^{(d)}}\big(W_l h_t^{(l-1)} + U_l x_t\big), & \text{if } l > 1, \end{cases}    (3)

where U_l, W_1, W_l are defined as

U_l = \frac{1}{c} Z_l D^T A^T, \quad \forall l,    (4)

W_1 = Z_1 P - \frac{1}{c} Z_1 D^T A^T A D P,    (5)

W_l = Z_l - \frac{1}{c} Z_l D^T A^T A D, \quad l > 1,    (6)

and the trainable parameters of the model are Θ = {A, D, P, h_0, Z_1, ..., Z_d, g_1, ..., g_d, c, λ_1, λ_2}, with the parameters A, D and P corresponding to the sensing matrix, the sparsifying dictionary and the correlation matrix, respectively. An illustration of reweighted-RNN is given in Fig. 2(a), in contrast to the stacked vanilla RNN in Fig. 2(b): it can be observed that reweighted-RNN is characterized by skip connections, since the input x_t is incorporated into each layer, while the side information P h_{t-1}^{(d)} modulates the activation function Φ_{(λ1/c) g_l, (λ2/c) g_l, ℏ}.

Figure 2: The architecture of (a) reweighted-RNN vis-à-vis (b) the vanilla stacked RNN [Pascanu et al., 2014].
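A minimal PyTorch sketch of how the weights in (4)-(6) can be formed from the trainable parameters Θ and used in the update (3) is given below. This is an illustration under our own naming conventions, not the authors' released implementation; the proximal activation is kept abstract, so any implementation of Φ (such as the NumPy sketch above) can be substituted.

```python
import torch

def build_unfolded_weights(A, D, P, Z_list, c):
    """Form U_l, W_1 and W_l (l > 1) from Theta according to (4)-(6)."""
    G = D.t() @ A.t() @ A @ D                                   # D^T A^T A D, shape (h, h)
    U = [Z_l @ D.t() @ A.t() / c for Z_l in Z_list]             # (4): one U_l per layer
    W = [Z_list[0] @ P - Z_list[0] @ G @ P / c]                 # (5): first layer
    W += [Z_l - Z_l @ G / c for Z_l in Z_list[1:]]              # (6): layers l > 1
    return U, W

def reweighted_rnn_step(x_t, h_prev_time, U, W, P, g_list, c, lam1, lam2, prox):
    """One time step of reweighted-RNN: d unfolded layers following (3)."""
    side = P @ h_prev_time                                      # side information P h_{t-1}^{(d)}
    h = h_prev_time                                             # the l = 1 update uses h_{t-1}^{(d)}
    for U_l, W_l, g_l in zip(U, W, g_list):
        pre = W_l @ h + U_l @ x_t                               # linear part of layer l
        h = prox(pre, lam1 * g_l / c, lam2 * g_l / c, side)     # per-neuron proximal activation
    return h                                                    # h_t^{(d)}, fed to the next time step
```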

2.2.2 ℓ1-ℓ1-RNN

ℓ1-ℓ1-RNN was proposed by Le et al. [2019] as a deep unfolding RNN that learns the iterations of a proximal algorithm solving an ℓ1-ℓ1 minimization problem, that is, Problem (2) with Z = I and g = 1. Thus, ℓ1-ℓ1-RNN can be seen as an instance of reweighted-RNN with update equations obtained by setting Z_1, ..., Z_d = I and g_1, ..., g_d = 1 in (3), (4), (5), (6). As a result, one layer of ℓ1-ℓ1-RNN updates h_t^{(l)} according to:

h_t^{(l)} = \begin{cases} \Phi_{\frac{\lambda_1}{c}, \frac{\lambda_2}{c}, P h_{t-1}^{(d)}}\big(W_1 h_{t-1}^{(d)} + U_1 x_t\big), & \text{if } l = 1, \\ \Phi_{\frac{\lambda_1}{c}, \frac{\lambda_2}{c}, P h_{t-1}^{(d)}}\big(W_2 h_t^{(l-1)} + U_1 x_t\big), & \text{if } l > 1, \end{cases}    (7)

with

U_1 = \frac{1}{c} D^T A^T,    (8)

W_1 = P - \frac{1}{c} D^T A^T A D P,    (9)

W_2 = I - \frac{1}{c} D^T A^T A D.    (10)

ℓ1-ℓ1-RNN and reweighted-RNN have similar update equations and activation functions, with the difference that the absence of a trainable weight Z_l in ℓ1-ℓ1-RNN leads to shared weights W_l and U_l starting from l = 2, reducing the overall expressivity of the model. Also, the shape of its activation function cannot be controlled per element, since g_l is fixed to the all-ones vector.

3 GENERALIZATION ERROR BOUNDS FOR THE DEEP UNFOLDING RNNS

We briefly present the theoretical background regarding Rademacher complexity and the related generalization error bound (GEB). We then propose an upper bound on the empirical Rademacher complexity of reweighted-RNN, along with a similar bound for ℓ1-ℓ1-RNN, which is subsumed by the former. This bound is derived for the latent representation stage. We then formulate a GEB for the same models when used in a classification setting, i.e., by appending a single-layer linear classification layer to the models.

Notation: Let f_W^{(d)}: R^n → R^h be the function computed by a d-layer network with weight parameters W. The network f_W^{(d)} maps an input sample x_i ∈ R^n (from an input space X) to an output y_i ∈ R^h (from an output space Y), i.e., y_i = f_W^{(d)}(x_i). Let S denote a training set of size m, i.e., S = {(x_i, y_i)}_{i=1}^m, and let E_{(x_i,y_i)∼S}[·] denote an expectation over (x_i, y_i) from S. The set S is drawn i.i.d. from a distribution D, denoted as S ∼ D^m, over a space Z = X × Y. Let F be a (class) set of functions. Let ℓ: F × Z → R denote the loss function and ℓ ◦ F = {z ↦ ℓ(f, z): f ∈ F}. We define the true loss and the empirical (training) loss by L_D(f) and L_S(f), respectively, as follows:

L_D(f) = E_{(x_i, y_i) ∼ D}\big[\ell\big(f(x_i), y_i\big)\big],    (11)

and

L_S(f) = E_{(x_i, y_i) ∼ S}\big[\ell\big(f(x_i), y_i\big)\big].    (12)

The generalization error, calculated as L_D(f) − L_S(f), measures how accurately a learned algorithm is able to predict outcome values for unseen data.

Definition 3.1 (Empirical Rademacher Complexity). Let F be a hypothesis set of functions (e.g., neural networks). The empirical Rademacher complexity of F [Shalev-Shwartz and Ben-David, 2014] for a training sample set S is defined as follows:

R_S(F) = E_{\epsilon \in \{\pm 1\}^m}\Big[\sup_{f \in F} \frac{1}{m} \sum_{i=1}^m \epsilon_i f(x_i)\Big],    (13)

where ε = (ε_1, ..., ε_m) and the ε_i are independent, uniformly distributed random (Rademacher) variables from {±1}, i.e., P[ε_i = 1] = P[ε_i = −1] = 1/2.

The GEB for the functional family F is derived based on the Rademacher complexity in the following theorem:

Theorem 3.2. [Shalev-Shwartz and Ben-David, 2014, Theorem 26.5] Assume that |ℓ(f, z)| ≤ η for all f ∈ F and z. Then, for any δ > 0, with probability at least 1 − δ,

L_D(f) - L_S(f) \le 2 R_S(\ell \circ F) + 4\eta \sqrt{\frac{2 \log(4/\delta)}{m}}.    (14)

It can be remarked that the bound in (14) depends on the training set S and can be applied to a number of learning problems, e.g., regression and classification, given a loss ℓ.

3.1 DERIVED GENERALIZATION ERROR BOUNDS

The following theorem presents an upper bound on the empirical Rademacher complexity of the family of functions given by reweighted-RNN:

Theorem 3.3 (Empirical Rademacher Complexity of reweighted-RNN). Let F_{d,T}: R^h × R^n → R^h denote the functional class of reweighted-RNN with T time steps, where ‖W_l‖_{1,∞} ≤ α_l, ‖U_l‖_{1,∞} ≤ β_l, and 1 ≤ l ≤ d. Assume that the input data satisfies ‖X_t‖_{2,∞} ≤ √m B_x, and consider an initial hidden state h_0. Then,

R_S(F_{d,T}) \le \sqrt{\frac{2(4dT \log 2 + \log n + \log h)}{m}} \cdot \sqrt{\sum_{l=1}^{d} \beta_l^2 \Lambda_l^2 \, \frac{\Lambda_0^{2T} - 1}{\Lambda_0^2 - 1} B_x^2 + \Lambda_0^{2T} \|h_0\|_\infty^2},    (15)

with Λ_l defined as follows: Λ_l = \prod_{k=l+1}^{d} α_k for 0 ≤ l ≤ d − 1, and Λ_d = 1.

The proof of Theorem 3.3 is given in the supplementary material. The Rademacher complexity in Theorem 3.3 can be used in combination with the GEB of Theorem 3.2 for any choice of Lipschitz-continuous and bounded loss function. If the complexity measure is small, the network can be learned with a small generalization error. The bound in (15) is in the order of the square root of the network depth d multiplied by the number of time steps T. It also depends on the logarithm of the number of measurements n and the number of hidden units h. It is worth mentioning that the second square root in (15) only depends on the norm constraints and the input training data, and it is independent of the network depth d and the number of time steps T under appropriate norm constraints.
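The following Python helper evaluates the right-hand side of (15) for given norm constraints. It is used here only to illustrate how the bound scales with d, T and the weight norms, under our reading of the reconstructed expression (the fraction (Λ_0^{2T} − 1)/(Λ_0^2 − 1) being the geometric sum over time steps); the function and argument names are ours.

```python
import math

def rademacher_bound_eq15(alphas, betas, B_x, h0_inf, T, m, n, h):
    """Upper bound (15) on the empirical Rademacher complexity of reweighted-RNN.

    alphas[l-1] bounds ||W_l||_{1,inf}, betas[l-1] bounds ||U_l||_{1,inf} (l = 1..d),
    B_x bounds the input scale, h0_inf = ||h_0||_inf, m is the training-set size.
    """
    d = len(alphas)
    # Lambda_l = prod_{k=l+1}^{d} alpha_k, with Lambda_d = 1.
    Lambda = [1.0] * (d + 1)
    for l in range(d - 1, -1, -1):
        Lambda[l] = Lambda[l + 1] * alphas[l]
    L0 = Lambda[0]
    # Geometric sum over time steps: (L0^(2T) - 1) / (L0^2 - 1), which equals T when L0 = 1.
    geo = T if abs(L0 - 1.0) < 1e-12 else (L0 ** (2 * T) - 1.0) / (L0 ** 2 - 1.0)
    first = math.sqrt(2.0 * (4 * d * T * math.log(2) + math.log(n) + math.log(h)) / m)
    second = math.sqrt(sum(betas[l] ** 2 * Lambda[l + 1] ** 2 for l in range(d)) * geo * B_x ** 2
                       + L0 ** (2 * T) * h0_inf ** 2)
    return first * second
```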
We also derive a similar bound for ℓ1-ℓ1-RNN, considering that it is an instance of reweighted-RNN (cf. Section 2.2.2).

Corollary 3.3.1 (The empirical Rademacher complexity of ℓ1-ℓ1-RNN). Let F_{d,T}: R^h × R^n → R^h denote the functional class of ℓ1-ℓ1-RNN with T time steps, where ‖W_1‖_{1,∞} ≤ α_1, ‖W_2‖_{1,∞} ≤ α_2, ‖U_1‖_{1,∞} ≤ β_1, and 1 ≤ l ≤ d. Assume that the input data satisfies ‖X_t‖_{2,∞} ≤ √m B_x, and consider an initial hidden state h_0. Then,

R_S(F_{d,T}) \le \sqrt{\frac{2(4dT \log 2 + \log n + \log h)}{m}} \cdot \sqrt{\frac{\alpha_2^{2d} - 1}{\alpha_2^2 - 1} \cdot \frac{(\alpha_1 \alpha_2^{d-1})^{2T} - 1}{(\alpha_1 \alpha_2^{d-1})^2 - 1} \, \beta_1^2 B_x^2 + \alpha_1^{2T} \alpha_2^{2T(d-1)} \|h_0\|_\infty^2}.    (16)

By contrasting (15) with (16) under the assumption of constrained weights, we see that the complexity of ℓ1-ℓ1-RNN has a polynomial dependence on α_1, β_1 and α_2 (the norms of the first two layers), whereas the complexity of reweighted-RNN has a polynomial dependence on α_1, ..., α_d and β_1, ..., β_d (the norms of all layers). This over-parameterization offers a flexible way to control the generalization error of reweighted-RNN.

A comparison can be made between our bounds and the ones proposed by Zhang et al. [2018] for SpectralRNN and by Kusupati et al. [2018] for FastRNN (the non-gated version of FastGRNN). Overall, all bounds, including ours, exhibit a T-exponential dependency on the weight norms, which suggests that regularization mechanisms must be used to achieve low model capacity and tight generalization guarantees. If we assume that the weights are bounded so as to neglect the T-exponential dependency, the GEB of SpectralRNN is still dependent on T^2. Likewise, Kusupati et al. [2018] show that the GEB of FastRNN can be made linear in T by restricting the range of some of the model's parameters. With constrained weights, our GEBs depend on √T, meaning that our bounds are tighter in terms of the number of time steps.

3.2 EXTENSION TO CLASSIFICATION

We present an extension of our bounds for the linear classification task with c classes, using the final latent vector h_T^{(d)} as input. We particularize the GEB for the ramp loss r_γ, which is parameterized by γ > 0 and defined as

r_\gamma(x) = \begin{cases} 0, & x < -\gamma, \\ 1 + \frac{x}{\gamma}, & -\gamma \le x \le 0, \\ 1, & 0 < x. \end{cases}    (17)

In practice, we evaluate r_γ on a data sample y with ground-truth label c using the classification margin M(y, c) = y[c] − max_{c' ≠ c} y[c'], by computing r_γ(−M(y, c)). For convenience, we define ℓ_γ^c(y) ≡ r_γ(−M(y, c)). Intuitively, ℓ_γ^c(y) penalizes samples that are classified incorrectly or correctly with a low margin. It can be noted that ℓ_γ^c(y) is uniformly bounded by η = 1 and is 1/γ-Lipschitz, which is required by the assumptions of Theorem 3.2.

We also assume that the classifier output y is given by Y h_T^{(d)} ≡ y(h_T^{(d)}) with Y ∈ R^{c×h}; it follows that y(h_T^{(d)}) is ρ-Lipschitz in its input with ρ = min(max_i ‖Y_i‖_2, max_i ‖Y_i‖_1), where Y_i denotes the i-th row of Y. Following these properties and by using the contraction lemma [Shalev-Shwartz and Ben-David, 2014], Theorem 3.3 can be reformulated into the following proposition:

Proposition 3.4 (Generalization Error Bound Extended to the RNN with Linear Classifier). Consider the functional family of classifiers F = {ℓ_γ^c ◦ y ◦ f: X → R | f ∈ F_{d,T}} obtained by composing reweighted-RNN (or ℓ1-ℓ1-RNN) with the linear classifier y = Y h_T^{(d)} and with the ℓ_γ^c loss. Consider ρ = min(max_i ‖Y_i‖_2, max_i ‖Y_i‖_1), where Y_i denotes the i-th row of Y. Then,

L_{D,\gamma}(f) - L_{S,\gamma}(f) \le \frac{2\rho}{\gamma} R_S(F_{d,T}) + 4 \sqrt{\frac{2 \log(4/\delta)}{m}}.    (18)
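As an illustration of the loss used in Proposition 3.4, the short sketch below computes the multi-class margin M(y, c) and the ramp loss ℓ_γ^c(y) = r_γ(−M(y, c)) for a batch of score vectors; the function names are ours.

```python
import torch

def ramp_loss(scores, labels, gamma):
    """Ramp loss l_gamma^c(y) = r_gamma(-M(y, c)), averaged over a batch.

    scores: (batch, n_classes) classifier outputs y = Y h_T^(d).
    labels: (batch,) ground-truth class indices.
    """
    true = scores.gather(1, labels.unsqueeze(1)).squeeze(1)            # y[c]
    masked = scores.clone()
    masked.scatter_(1, labels.unsqueeze(1), float("-inf"))             # exclude the true class
    margin = true - masked.max(dim=1).values                           # M(y, c) = y[c] - max_{c' != c} y[c']
    return torch.clamp(1.0 - margin / gamma, min=0.0, max=1.0).mean()  # r_gamma(-M), clipped to [0, 1]
```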
4 EXPERIMENTAL RESULTS

In this section, we benchmark deep unfolding RNNs, specifically reweighted-RNN, ℓ1-ℓ1-RNN and SISTA-RNN, on a speech command classification task using the Google-12 and Google-30 datasets [Warden, 2018] and on a language modelling task using the Wikitext2 dataset [Merity et al., 2016]. A comparison is made with various RNN models, including the vanilla RNN, LSTM and GRU baselines, and the lightweight SpectralRNN and FastGRNN models. Although deep unfolding RNNs have traditionally been used for signal reconstruction tasks, our experiments show promising results in solving traditional RNN tasks, with reweighted-RNN achieving superior performance. Finally, we relate the experimental generalization gap observed on the Google-30 dataset with the derived GEB, using the ℓ_γ^c loss defined in Section 3.2 and an appropriate weight regularization scheme. Our theoretical bound is shown to follow the empirical generalization performance of reweighted-RNN.

4.1 EXPERIMENTAL SETUP

Concerning reweighted-RNN, the overcomplete dictionary D is initialized according to a random uniform distribution, the weights Z_1, ..., Z_d and P are initialized to the identity I, and the reweighting vectors g_1, ..., g_d to the all-ones vectors. We initialize {c, λ_1, λ_2} to {1.0, 0.003, 0.02}. The measurement operator A is fixed to I. We add a trainable weight matrix Y ∈ R^{n_c×h} to obtain classification scores based on the last hidden state according to y = Y h_T^{(d)}, with n_c corresponding to the number of classes (or the number of tokens in the case of language modeling). For all models in general, we use h = 100 hidden units and a varying number of hidden layers d = {1, 2, 3, 4, 5, 6}. Models are trained in PyTorch by optimizing the cross-entropy loss using the Adam optimizer. The initial learning rate is set to 3×10^{-4} and follows a step decay policy with a factor 0.3 every 50 epochs, except for GRU and LSTM, which use an initial learning rate of 3×10^{-3}. The batch size is set to 100 and models are trained for 200 epochs. While deep unfolding RNNs have their own layer-stacking schemes derived from unfolding minimization algorithms, we use the stacking rule in [Pascanu et al., 2014] to build deep networks for the other RNN architectures.
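For concreteness, the training configuration described above maps to the following PyTorch setup; this is a sketch assuming a generic `model`, since the authors' actual training loop is not part of the paper.

```python
import torch

def configure_training(model, is_gated=False):
    """Optimizer and schedule matching Section 4.1: Adam with step decay of 0.3 every 50 epochs."""
    lr = 3e-3 if is_gated else 3e-4          # GRU/LSTM use 3e-3, all other models 3e-4
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.3)
    criterion = torch.nn.CrossEntropyLoss()
    return optimizer, scheduler, criterion
```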

Table 1: Test accuracy (%) on the Google-12 dataset for different network depths d. (a): RNN baselines, (b): lightweight RNNs, (c): deep unfolding RNNs.

    Model            d=1    d=2    d=3    d=4    d=5    d=6
(a) RNN              88.60  89.40  88.78  89.46  90.48  90.86
    LSTM             90.42  91.30  92.24  91.30  92.02  91.46
    GRU              91.07  92.05  92.28  92.24  92.15  90.94
(b) SpectralRNN      89.80  91.20  91.20  91.20  91.50  91.80
    FastGRNN         91.21  91.34  91.42  91.53  91.66  91.72
(c) SISTA-RNN        90.14  90.88  91.12  91.50  92.02  92.12
    ℓ1-ℓ1-RNN        90.27  91.02  91.34  91.78  92.14  92.20
    reweighted-RNN   90.49  91.30  91.23  91.66  92.37  92.41

Table 3: Test perplexity on the Wikitext2 dataset for different network depths d. (a): RNN baselines, (b): lightweight RNNs, (c): deep unfolding RNNs.

    Model            d=1    d=2    d=3    d=4    d=5    d=6
(a) RNN              145.6  140.3  137.3  134.2  122.3  121.2
    LSTM             127.3  119.8  118.3  121.4  112.3  110.4
    GRU              143.4  112.2  117.0  108.0  112.6  109.4
(b) SpectralRNN      69.1   68.5   67.4   65.8   63.1   59.8
    FastGRNN         69.3   69.2   67.8   66.2   66.1   64.7
(c) SISTA-RNN        128.2  121.6  116.3  109.3  94.3   88.7
    ℓ1-ℓ1-RNN        67.7   67.3   65.6   63.7   59.5   56.6
    reweighted-RNN   67.1   65.4   67.5   59.5   56.4   52.1

Table 2: Test accuracy (%) on the Google-30 dataset for different network depths d. (a): RNN baselines, (b): lightweight RNNs, (c): deep unfolding RNNs.

    Model            d=1    d=2    d=3    d=4    d=5    d=6
(a) RNN              35.26  38.27  45.28  47.80  48.90  54.27
    LSTM             92.14  93.83  93.69  93.80  92.68  92.48
    GRU              91.60  92.77  92.85  92.51  92.86  91.82
(b) SpectralRNN      91.30  91.81  92.00  92.17  93.50  93.72
    FastGRNN         89.34  88.65  90.68  91.51  92.56  93.51
(c) SISTA-RNN        92.18  92.50  93.17  93.22  93.64  93.78
    ℓ1-ℓ1-RNN        92.32  92.78  93.05  93.70  93.80  93.92
    reweighted-RNN   92.51  92.95  93.97  93.81  94.10  94.00

4.2 SPEECH COMMAND DETECTION

The Google-12 and Google-30 datasets [Warden, 2018] contain utterances of short speech commands with background noise, belonging to 12 and 30 classes, respectively. We adopt the same featurization technique as Kusupati et al. [2018], consisting of the standard log Mel-filter-bank featurization with 32 filters over T = 99 time steps.

We report our test classification accuracies in Tables 1 and 2. For Google-12, reweighted-RNN outperforms both the traditional RNN baselines and the lightweight models with 5 and 6 layers, whereas GRU achieves superior accuracy for lower numbers of layers. For Google-30, reweighted-RNN achieves higher performance almost systematically. The improved accuracy of reweighted-RNN is due to the additional learnable weights Z_l and the per-neuron adaptive activations, which increase the expressivity of the model. The results are also in line with the ablation study in [Luong et al., 2021], which demonstrated that reweighted-RNN improves video-frame reconstruction over other deep unfolding RNNs and traditional ones. Although standard RNNs outperform deep unfolding RNNs in some configurations, deep unfolding architectures are still interesting in terms of model size, especially compared to the well-performing GRU and LSTM models. For d = 3 layers, we observe that reweighted-RNN, ℓ1-ℓ1-RNN and SISTA-RNN have 48K, 17K and 8K parameters respectively, while FastGRNN and SpectralRNN have 43K and 50K parameters respectively. Finally, the vanilla RNN, LSTM and GRU models have respectively 61K, 223K and 169K parameters.

4.3 LANGUAGE MODELLING TASK

Language modelling (LM) serves as a foundation for natural language processing. The task involves predicting the (n+1)-th token given a history of n such tokens. Trained LMs have been applied to speech recognition [Yu and Deng, 2014], machine translation [Cho et al., 2014], natural language generation, and as feature extractors for general downstream tasks [Peters et al., 2018]. LMs act at multiple levels of granularity, i.e., at the word, sub-word or character level. While the objective remains the same, each of them adds a layer of complexity and challenges. We test the different RNNs on the Wikitext2 dataset [Merity et al., 2016], a commonly used dataset consisting of 2 million tokens. As the Wikipedia articles are relatively long, capturing long-range dependencies is key to achieving reasonable performance. Table 3 shows the test perplexity obtained for different numbers of hidden layers d. Similar to the previous experiments, we observe that reweighted-RNN outperforms the other considered RNN models.

4.4 ERROR BOUND EVALUATION

We perform an empirical evaluation of the generalization error bound (GEB) for the classification task on the Google-30 dataset. It can be observed that the GEB based on the Rademacher complexity of reweighted-RNN in (15) depends on the T-exponential of the norms of W_l. However, in Section 4.2, models are trained without explicit regularization of the weights' norms, leading to potentially high GEBs that do not provide useful guarantees on the performance of the model on unseen data. Consequently, we introduce an explicit regularization term in the training loss function by multiplying the maximum ℓ1-norm of the rows of W_l with a regularization parameter λ, to ensure that the T-exponential dependency on the weights becomes negligible.
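A sketch of this regularizer in PyTorch is shown below: the penalty is λ times the maximum ℓ1-norm over the rows of each W_l, added to the cross-entropy objective. This is our own minimal rendering of the description above, not the authors' code (for the single-layer model of this section there is only one W).

```python
import torch

def max_row_l1_penalty(W_list, lam):
    """Regularizer: lam * sum over unfolded layers of max_i ||W_l[i, :]||_1."""
    return lam * sum(W.abs().sum(dim=1).max() for W in W_list)

# Usage inside a training step (ce_loss is the cross-entropy on a batch):
# loss = ce_loss + max_row_l1_penalty([layer.W for layer in model.layers], lam=0.52)
# loss.backward()
```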

In practice, we train a single-layer reweighted-RNN model using three different regularization levels λ = {0.52, 1.40, 3.73}, using the categorical cross-entropy as the base loss, according to the experimental protocol described in Section 4.1. The classification GEB (18) is then calculated for the ℓ_γ^c loss, for different margins γ and for δ = 0.05 (i.e., a probability of 95% that the bound is met).

Figure 3: Evaluation of the error bound on the ramp loss for various values of γ, for three regularization levels λ, for reweighted-RNN trained on the Google-30 dataset. (Panels: λ = 0.52, 1.40 and 3.73, with test accuracies of 90.9%, 89.9% and 88.5%; each panel plots the train loss, test loss and test-loss upper bound against γ ∈ [0, 10].)

Table 4: Evaluation of the Rademacher complexity and test accuracy of the single-layer reweighted-RNN and ℓ1-ℓ1-RNN models trained on the Google-30 dataset for various regularization levels.

    Model                        Test acc. (%)   R_S(F_{d,T}) (upper bound)
    reweighted-RNN   λ = 0.52    90.9            2.92 × 10^{-1}
                     λ = 1.40    89.9            1.83 × 10^{-1}
                     λ = 3.73    88.5            1.61 × 10^{-1}
    ℓ1-ℓ1-RNN        λ = 0.52    84.8            5.45 × 10^{14}
                     λ = 1.40    82.4            3.59 × 10^{6}
                     λ = 3.73    73.5            3.94 × 10^{0}
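The sketch below shows how such a test-loss upper bound can be assembled from (18): the empirical ramp loss on the training set plus 2ρ/γ times a Rademacher-complexity bound (e.g., the value returned by `rademacher_bound_eq15` above) plus the confidence term. The names and this composition are ours, illustrating the computation behind Fig. 3 rather than reproducing the authors' evaluation code.

```python
import math

def classification_geb(train_ramp_loss, R_S, rho, gamma, delta, m):
    """Upper bound on the expected ramp loss via (18):
    L_D <= L_S + (2*rho/gamma) * R_S + 4*sqrt(2*log(4/delta)/m)."""
    return train_ramp_loss + (2.0 * rho / gamma) * R_S + 4.0 * math.sqrt(2.0 * math.log(4.0 / delta) / m)

# Example: sweep the margin gamma as in Fig. 3, with delta = 0.05.
# bounds = [classification_geb(L_S, R_S, rho, g, 0.05, m) for g in (0.5, 1.0, 2.0, 4.0, 8.0, 10.0)]
```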

Empirical evaluations of the GEB are reported in Fig. 3, along with the respective losses observed in practice on the training and test sets. For small values of γ, the theoretical GEB is close to unity (note that ℓ_γ^c cannot exceed 1 in practice). By increasing γ, the empirical generalization gap decreases and the theoretical bound becomes tighter. Around γ = 4.0, the theoretical bound on the test error is minimal. For very large values of γ, the empirical training and testing errors become much larger as a result of the large margin γ; even if the bound is tighter, this setting is less indicative of the true performance of the model. Nevertheless, for reasonable choices of γ, our GEB is tight and shows that our model has good generalization properties with a small test loss.

As shown in Table 4, the accuracy of reweighted-RNN slightly decreases with higher regularization levels, with a decrease of 2.4% in test accuracy from λ = 0.52 to λ = 3.73; it is however necessary to perform weight regularization to ensure that the theoretical GEB is tight and informative. Moreover, it was found that ℓ1-ℓ1-RNN is hardly able to maintain a sufficiently high test accuracy while keeping the theoretical GEB in an acceptable range.
In Table 4, the test accuracy of ℓ1-ℓ1-RNN drops to 73.5% for the upper bound on R_S to lead to a sufficiently tight GEB. This further confirms that reweighted-RNN offers more flexibility than ℓ1-ℓ1-RNN as a result of the additional learnable weights Z_l and per-neuron learnable activations, while ensuring good generalization abilities according to the GEB (18).

5 CONCLUSIONS

In this paper, we have explored theoretical aspects of deep unfolding RNNs by establishing generalization error bounds obtained via Rademacher complexity analysis of a class of deep unfolding RNNs, namely, the reweighted-RNN [Luong et al., 2021] and ℓ1-ℓ1-RNN [Le et al., 2019] models. For sufficiently constrained weights, the GEB is polynomially dependent on the norm of the weights and on the over-parametrization scheme of reweighted-RNN, allowing for a finer control of the bound. Moreover, our proposed bounds are tighter than similar ones obtained for state-of-the-art lightweight RNN models, in terms of the number of time steps T. Furthermore, our experiments have demonstrated that deep unfolding RNNs can outperform state-of-the-art traditional RNNs in classification problems, leading to interesting perspectives for deep unfolding RNNs outside of their original application fields. These experiments also show that it is possible to tighten the gap between the theoretical bound of reweighted-RNN and the empirical generalization error, if the model is sufficiently regularized while not affecting its accuracy.

6 ACKNOWLEDGEMENT

We would like to thank the reviewers for their comments, which helped improve the manuscript. This research received funding from FWO, Belgium (research project G093817N and PhD fellowship strategic basic research 1SB5721N), and Innoviris, Belgium (REFINED project).

References

Nil-Jana Akpinar, Bernhard Kratzwald, and Stefan Feuerriegel. Sample complexity bounds for recurrent neural networks with application to combinatorial graph problems. arXiv preprint arXiv:1901.10289, 2019.

Sanjeev Arora, Mikhail Khodak, Nikunj Saunshi, and Kiran Vodrahalli. A compressed sensing view of unsupervised text embeddings, bag-of-n-grams, and LSTMs. In International Conference on Learning Representations, 2018.

Leila Arras, José Arjona-Medina, Michael Widrich, Grégoire Montavon, Michael Gillhofer, Klaus-Robert Müller, Sepp Hochreiter, and Wojciech Samek. Explaining and interpreting LSTMs. In Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, pages 211–238. Springer, 2019.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, 2015.

Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.

Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

Olivier Bousquet and André Elisseeff. Stability and generalization. The Journal of Machine Learning Research, 2:499–526, 2002.

Emmanuel J. Candès, Michael B. Wakin, and Stephen P. Boyd. Enhancing sparsity by reweighted ℓ1 minimization. Journal of Fourier Analysis and Applications, 14(5):877–905, 2008.

Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.

Nikos Deligiannis, Adrian Munteanu, Shuang Wang, Samuel Cheng, and Peter Schelkens. Maximum likelihood Laplacian correlation channel estimation in layered Wyner-Ziv coding. IEEE Transactions on Signal Processing, 62(4):892–904, 2013.

Gintare Karolina Dziugaite and Daniel M. Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. In Proceedings of the 33rd Conference on Uncertainty in Artificial Intelligence, 2017.

Yoav Goldberg. The unreasonable effectiveness of character-level language models, 2015. URL https://nbviewer.jupyter.org/gist/yoavg/d76121dfde2618422139.

Karol Gregor and Yann LeCun. Learning fast approximations of sparse coding. In Proceedings of the 27th International Conference on Machine Learning, pages 399–406, 2010.

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4):664–676, April 2017.

Andrej Karpathy, Justin Johnson, and Li Fei-Fei. Visualizing and understanding recurrent networks. arXiv preprint arXiv:1506.02078, 2015.

Aditya Kusupati, Manish Singh, Kush Bhatia, Ashish Kumar, Prateek Jain, and Manik Varma. FastGRNN: A fast, accurate, stable and tiny kilobyte sized gated recurrent neural network. In Advances in Neural Information Processing Systems 31, 2018.

Hung Duy Le, Huynh Van Luong, and Nikos Deligiannis. Designing recurrent neural networks by unfolding an l1-l1 minimization algorithm. In Proceedings of the IEEE International Conference on Image Processing, 2019.

Jiwei Li, Will Monroe, and Dan Jurafsky. Understanding neural networks through representation erasure. arXiv preprint arXiv:1612.08220, 2016.

Alice Lucas, Michael Iliadis, Rafael Molina, and Aggelos K. Katsaggelos. Using deep neural networks for inverse problems in imaging: Beyond analytical methods. IEEE Signal Processing Magazine, 35(1):20–36, 2018.

Huynh Van Luong, Boris Joukovsky, and Nikos Deligiannis. Designing interpretable recurrent neural networks for video reconstruction via deep unfolding. IEEE Transactions on Image Processing, 30:4099–4113, 2021.

David McAllester. Simplified PAC-Bayesian margin bounds. In Learning Theory and Kernel Machines, pages 203–215. Springer, 2003.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.

Vishal Monga, Yuelong Li, and Yonina C. Eldar. Algorithm unrolling: Interpretable, efficient deep learning for signal and image processing. arXiv preprint arXiv:1912.10557, 2019.

João F. C. Mota, Nikos Deligiannis, and Miguel R. D. Rodrigues. Compressed sensing with prior information: Strategies, geometry, and bounds. IEEE Transactions on Information Theory, 63(7):4472–4496, 2017.

Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. In International Conference on Learning Representations, 2018.

Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.

Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. How to construct deep recurrent neural networks. In International Conference on Learning Representations, 2014.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237. Association for Computational Linguistics, June 2018.

Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, New York, NY, USA, 2014.

Huynh Van Luong, Nikos Deligiannis, Jürgen Seiler, Søren Forchhammer, and Andre Kaup. Compressive online robust principal component analysis via n-ℓ1 minimization. IEEE Transactions on Image Processing, 27(9):4314–4329, 2018.

Pete Warden. Speech commands: A dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209, 2018.

Colin Wei and Tengyu Ma. Data-dependent sample complexity of deep neural networks via Lipschitz augmentation. In Advances in Neural Information Processing Systems 32, pages 9725–9736, 2019.

Scott Wisdom, Thomas Powers, James Pitton, and Les Atlas. Building recurrent networks by unfolding iterative thresholding for sequential sparse recovery. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 4346–4350. IEEE, 2017.

Huan Xu and Shie Mannor. Robustness and generalization. Machine Learning, 86(3):391–423, 2012.

Dong Yu and Li Deng. Automatic Speech Recognition: A Deep Learning Approach. Springer Publishing Company, Incorporated, 2014. ISBN 1447157788.

Jiong Zhang, Qi Lei, and Inderjit S. Dhillon. Stabilizing gradients for deep neural networks via efficient SVD parameterization. In Proceedings of the 35th International Conference on Machine Learning, 2018.