Generalization Error Bounds for Deep Unfolding RNNs

Boris Joukovsky1,2, Tanmoy Mukherjee1,2, Huynh Van Luong1,2, Nikos Deligiannis1,2
1 Department of Electronics and Informatics, Vrije Universiteit Brussel, Pleinlaan 2, B-1050 Brussels, Belgium
2 imec, Kapeldreef 75, B-3001 Leuven, Belgium

Accepted for the 37th Conference on Uncertainty in Artificial Intelligence (UAI 2021).

Abstract

Recurrent Neural Networks (RNNs) are powerful models with the ability to model sequential data. However, they are often viewed as black boxes and lack interpretability. Deep unfolding methods take a step towards interpretability by designing deep neural networks as learned variations of iterative optimization algorithms that solve various signal processing tasks. In this paper, we explore theoretical aspects of deep unfolding RNNs in terms of their generalization ability. Specifically, we derive generalization error bounds for a class of deep unfolding RNNs via Rademacher complexity analysis. To our knowledge, these are the first generalization bounds proposed for deep unfolding RNNs. We show theoretically that our bounds are tighter than similar ones for other recent RNNs in terms of the number of time steps. By training models in a classification setting, we demonstrate that deep unfolding RNNs can outperform traditional RNNs in standard sequence classification tasks. These experiments allow us to relate the empirical generalization error to the theoretical bounds. In particular, we show that over-parametrized deep unfolding models like reweighted-RNN achieve tight theoretical error bounds with minimal decrease in accuracy, when trained with explicit regularization.

1 INTRODUCTION

The past few years have seen an undeniable success of Recurrent Neural Networks (RNNs) in applications ranging from machine translation [Bahdanau et al., 2015] and image captioning [Karpathy and Fei-Fei, 2017] to speech processing [Oord et al., 2016]. Despite their impressive performance, existing approaches are typically designed based on heuristics of the task and rely on engineering experience. As a result, such models lack theoretical understanding and interpretability. A step towards post-hoc interpretability of RNNs was made by Karpathy et al. [2015], who observed the advantages of the internal states of the long short-term memory (LSTM) [Hochreiter and Schmidhuber, 1997] over traditional RNNs in modelling interpretable patterns in the input data; this study was further complemented by Goldberg [2015]. Subsequent works in explainable deep learning have also proposed methods to explain the decision-making process of a trained LSTM for a given prediction, in terms of the most influential inputs [Li et al., 2016, Arras et al., 2019].

Efforts toward a theoretical understanding of RNNs also involve the derivation of bounds on the generalization error, i.e., the difference between the empirical loss and the expected loss. Such bounds rely on measures such as the Rademacher complexity [Bartlett and Mendelson, 2002], the PAC-Bayes theory [McAllester, 2003], algorithmic stability [Bousquet and Elisseeff, 2002], and robustness [Xu and Mannor, 2012], and they aim at understanding how accurately a model is able to generalise to unseen data in relation to the optimization algorithm used for training, the network architecture, or the underlying data structure.
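To fix ideas, the two quantities referred to above can be stated in their standard textbook form; the notation here is generic and does not necessarily match the notation used later in the paper. Given a sample $S = \{(x_i, y_i)\}_{i=1}^{m}$ drawn i.i.d. from a distribution $\mathcal{D}$, a hypothesis class $\mathcal{F}$, and a loss $\ell$, the generalization error of $f \in \mathcal{F}$ and the empirical Rademacher complexity of $\mathcal{F}$ read

$$ \mathrm{GE}(f) = \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\ell(f(x),y)\big] - \frac{1}{m}\sum_{i=1}^{m} \ell(f(x_i),y_i), \qquad \widehat{\mathfrak{R}}_S(\mathcal{F}) = \mathbb{E}_{\boldsymbol{\epsilon}}\Big[\sup_{f\in\mathcal{F}} \frac{1}{m}\sum_{i=1}^{m} \epsilon_i f(x_i)\Big], $$

where the $\epsilon_i$ are i.i.d. Rademacher variables taking the values $\pm 1$ with equal probability. Bounds of the kind derived in this paper control the former quantity in terms of the latter [Bartlett and Mendelson, 2002].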
Generalization error bounds (GEBs) for RNN models were studied by Arora et al. [2018], Wei and Ma [2019] and Akpinar et al. [2019]. The theoretical result by Arora et al. [2018] on low-dimensional embeddings using LSTM models was derived for classification tasks. That work showed an interesting property of embeddings such as GloVe and word2vec, namely that forming a good sensing matrix for text leads to better representations than those obtained with random matrices. The data-dependent sample complexity of deep neural networks was studied by Wei and Ma [2019]. Unlike existing Rademacher complexity bounds that depend on norms of the weight matrices and scale exponentially with the model depth [Bartlett and Mendelson, 2002, Dziugaite and Roy, 2017, Neyshabur et al., 2018], the study by Wei and Ma [2019] derived tighter Rademacher bounds by considering additional data-dependent properties that control the norms of the hidden layers and the norms of the Jacobians of each layer with respect to all previous layers. The derivation methods of Wei and Ma [2019] were extended to provide GEBs for RNNs that scale polynomially with the depth of the model. Specifically, generalization abilities were explored by Akpinar et al. [2019], where explicit sample complexity bounds for single- and multi-layer RNNs were derived.

The focus of this work is on studying how model interpretability, obtained by designing networks via deep unfolding, interacts with model generalization capacity. Deep unfolding methods design deep models as unrolled iterations of optimization algorithms; a key result in this direction is the learned iterative shrinkage-thresholding algorithm (LISTA) [Gregor and LeCun, 2010], which solves the sparse coding problem. Iterative optimization algorithms are usually highly interpretable because they are developed by modeling the physical processes underlying the problem and/or capturing prior domain knowledge [Lucas et al., 2018]. Hence, deep networks designed by deep unfolding capture domain knowledge and promote model-based structure in the data; in other words, such networks can be naturally interpreted as a parameter-optimized algorithm [Monga et al., 2019].
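To make the unrolling idea concrete, the following minimal sketch shows how a few iterations of ISTA become the layers of a LISTA-style network. The dimensions, random initialization, and function names are illustrative assumptions and do not correspond to any model evaluated in this paper; in an actual deep unfolding network, the matrices and thresholds below would be learned from data rather than fixed analytically.

```python
import numpy as np

def soft_threshold(v, theta):
    """Elementwise soft-thresholding: the proximal operator of the l1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def lista_forward(x, W_e, S, theta, n_layers=3):
    """Forward pass of a LISTA-style unrolled network (illustrative sketch).

    Each "layer" is one unrolled ISTA iteration; in a deep unfolding model,
    W_e, S and theta are trainable parameters shared with the network layers.
    """
    h = soft_threshold(W_e @ x, theta)               # layer 1: code from the input alone
    for _ in range(n_layers - 1):
        h = soft_threshold(W_e @ x + S @ h, theta)   # deeper layers refine the previous code
    return h

# Toy usage with hypothetical dimensions.
rng = np.random.default_rng(0)
n, m = 20, 50                                        # signal and code dimensions
x = rng.standard_normal(n)
W_e = 0.1 * rng.standard_normal((m, n))
S = 0.1 * rng.standard_normal((m, m))
code = lista_forward(x, W_e, S, theta=0.05)          # sparse code of length m
```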
Deep unfolding has been used to develop interpretable RNNs: SISTA-RNN [Wisdom et al., 2017] is an extension of LISTA that solves a sequential signal reconstruction task based on sparse modelling with the aid of side information. Alternatively, Le et al. [2019] designed the $\ell_1$-$\ell_1$-RNN by unfolding a proximal method that solves an $\ell_1$-$\ell_1$ minimization problem [Mota et al., 2017]. In our recent work [Luong et al., 2021], we proposed reweighted-RNN, which unfolds a reweighted version of the $\ell_1$-$\ell_1$ minimization problem and incorporates additional weights so as to increase the model expressivity. Deep unfolding RNNs excel in solving the underlying signal reconstruction tasks, outperforming traditional RNN baselines while having a substantially lower parameter count than LSTM or gated recurrent unit (GRU) [Cho et al., 2014] models; however, their ability to leverage the sparse structure of data as a means to solve traditional RNN tasks (e.g., sequence classification) is relatively unexplored. Furthermore, despite the progress in deep unfolding models, their generalization ability has received no such attention; in particular, no work exists studying the GEBs of deep unfolding RNN models. In this paper, we study theoretical aspects of deep unfolding RNNs by means of their GEBs. We also benchmark deep unfolding RNNs in classification problems and relate the empirical generalization error to the theoretical one.

The contributions of this work are as follows:

• We derive generalization error bounds (GEBs) for deep unfolding RNNs by means of Rademacher complexity analysis, taking the reweighted-RNN model [Luong et al., 2021] as a generic model that also covers SISTA-RNN [Wisdom et al., 2017] and the $\ell_1$-$\ell_1$-RNN [Le et al., 2019]. We also present an extension of the GEBs to the case where the RNNs are used in a classification setting. We show theoretically that the proposed bounds are tighter than those of other recent lightweight RNNs with generalization guarantees, namely SpectralRNN [Zhang et al., 2018] and FastGRNN [Kusupati et al., 2018], in terms of the number of time steps $T$.

• We assess the performance of deep unfolding RNNs in classification settings, namely in speech command detection and language modelling tasks. For the first task, our results show that reweighted-RNN outperforms other lightweight RNNs, and that deep unfolding RNNs are competitive with traditional RNN baselines (i.e., vanilla RNN, LSTM and GRU) while being smaller in size. For the second task, we observe a significant improvement of reweighted-RNN over all other models.

• By taking speech command detection as an example, we show that the proposed GEB for reweighted-RNN is tight and agrees with the empirical generalization gap. This can be achieved when the model is sufficiently regularized, while maintaining high classification accuracy.

The remainder of this paper is organized as follows: Section 2 reviews traditional stacked RNN models, followed by an introduction to deep unfolding RNNs. Section 3 presents the proposed GEBs for deep unfolding RNNs, which are obtained by studying the complexity of their latent representation stage; the bound is then extended to the classification problem. In Section 4, we experimentally compare reweighted-RNN to other deep unfolding and traditional RNN models on classification tasks. We also evaluate the GEB experimentally and relate the theoretical aspects to empirical observations.

2 BACKGROUND TO RNN MODELS

2.1 STACKED RNNS

For each time step $t = 1, \dots, T$, the vanilla RNN recursively updates the latent representation $h_t$ through a linear transformation of the input sample $x_t$ and the previous state $h_{t-1}$, followed by a nonlinear activation $\sigma(\cdot)$. To achieve higher expressivity, one may use a stacked vanilla RNN [Pascanu et al., 2014] by juxtaposing multiple RNN layers. In this setting, the representation $h_t^{(l)}$ at layer $l$ and time step $t$ is computed by

$$ h_t^{(l)} = \begin{cases} \sigma\big(V_1 h_{t-1}^{(l)} + U_1 x_t\big), & \text{if } l = 1, \\ \sigma\big(W_l h_t^{(l-1)} + V_l h_{t-1}^{(l)}\big), & \text{if } l > 1, \end{cases} \tag{1} $$

where $U_1$, $W_l$, $V_l$ are the weight matrices learned per layer.
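As an illustration of recursion (1), the sketch below implements the stacked vanilla RNN update with NumPy. The tanh activation, the dimensions, and the random weight initialization are illustrative assumptions and do not correspond to any particular model evaluated in the paper.

```python
import numpy as np

def stacked_rnn_states(X, U1, W, V, sigma=np.tanh):
    """Hidden states h_t^{(l)} of a stacked vanilla RNN, following Eq. (1).

    X : array of shape (T, n)    -- input sequence x_1, ..., x_T
    U1: array of shape (h, n)    -- input weights of the first layer
    W : list of L arrays (h, h)  -- layer-to-layer weights W_l (W[0] is unused)
    V : list of L arrays (h, h)  -- recurrent weights V_l per layer
    """
    T, L, hdim = X.shape[0], len(V), V[0].shape[0]
    H = np.zeros((T + 1, L, hdim))   # H[t, l] stores h_t^{(l+1)}; H[0] is the zero initial state
    for t in range(1, T + 1):
        for l in range(L):
            if l == 0:
                H[t, l] = sigma(V[0] @ H[t - 1, l] + U1 @ X[t - 1])       # first layer (l = 1 in Eq. 1)
            else:
                H[t, l] = sigma(W[l] @ H[t, l - 1] + V[l] @ H[t - 1, l])  # deeper layers (l > 1 in Eq. 1)
    return H[1:]

# Toy usage with hypothetical sizes.
rng = np.random.default_rng(0)
T, n, hdim, L = 5, 8, 16, 3
X = rng.standard_normal((T, n))
U1 = 0.1 * rng.standard_normal((hdim, n))
W = [0.1 * rng.standard_normal((hdim, hdim)) for _ in range(L)]
V = [0.1 * rng.standard_normal((hdim, hdim)) for _ in range(L)]
states = stacked_rnn_states(X, U1, W, V)    # shape (T, L, hdim)
```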
