Reservoir Transformers

Sheng Shen†, Alexei Baevski‡, Ari S. Morcos‡, Kurt Keutzer†, Michael Auli‡, Douwe Kiela‡
†UC Berkeley; ‡Facebook AI Research
[email protected], [email protected]

Abstract

We demonstrate that transformers obtain impressive performance even when some of the layers are randomly initialized and never updated. Inspired by old and well-established ideas in machine learning, we explore a variety of non-linear "reservoir" layers interspersed with regular transformer layers, and show improvements in wall-clock compute time until convergence, as well as overall performance, on various machine translation and (masked) language modelling tasks.

1 Introduction

Transformers (Vaswani et al., 2017) have dominated natural language processing (NLP) in recent years, from large scale machine translation (Ott et al., 2018) to pre-trained (masked) language modeling (Devlin et al., 2018; Radford et al., 2018), and are becoming more popular in other fields as well, from reinforcement learning (Vinyals et al., 2019) to speech recognition (Baevski et al., 2019) and computer vision (Carion et al., 2020). Their success is enabled in part by ever increasing computational demands, which has naturally led to an increased interest in improving their efficiency. Scalability gains in transformers could facilitate bigger, deeper networks with longer contexts (Kitaev et al., 2020; Wang et al., 2020; Beltagy et al., 2020; Kaplan et al., 2020; Tay et al., 2020b). Conversely, improved efficiency could reduce environmental costs (Strubell et al., 2019) and hopefully help democratize the technology.

In this work, we explore a simple question: if some layers of the transformer are kept frozen (i.e., never updated after random initialization), can we match the performance of fully learned transformers, while being more efficient?
Surprisingly, the answer is resoundingly yes; and what is more, we find that freezing layers may actually improve performance.

Beyond desirable efficiency gains, random layers are interesting for several additional reasons. Fixed randomly initialized networks (Gallicchio and Scardapane, 2020) converge to Gaussian processes in the limit of infinite width (Daniely et al., 2016), have intriguing interpretations in metric learning (Rosenfeld and Tsotsos, 2019; Giryes et al., 2016), and have been shown to provide excellent "priors" either for subsequent learning (Ulyanov et al., 2018) or pruning (Frankle and Carbin, 2018). Fixed layers allow for efficient low-cost hardware implementations (Schrauwen et al., 2007) and can be characterized using only a random number generator and its seed. This could facilitate distributed training and enables highly efficient deployment to edge devices, since it only requires transmission of a single number. The strong performance of networks with fixed layers also sheds new light on the inner workings of BERT (Devlin et al., 2018), and layer-wise interpretations of such models (Rogers et al., 2020; Tenney et al., 2019). It appears that "not all layers are created equal" (Zhang et al., 2019) is true to such an extent that some layers can simply remain random and fixed.

Random projections have a long history in machine learning. By Cover's theorem (Cover, 1965), any high-dimensional non-linear transformation is more likely to be linearly separable than its lower-or-equal-dimensional input space. By Johnson-Lindenstrauss (Johnson and Lindenstrauss, 1984), random projections distort Euclidean distances very little under mild assumptions, which is useful e.g. for dimensionality reduction and random indexing (Sahlgren, 2005). Fixed random layers in neural networks pre-date deep learning by far (Gamba et al., 1961; Baum, 1988). Indeed, random kernel methods have long been influential in machine learning (Rahimi and Recht, 2008, 2009).

One way to think of such layers is as "reservoirs" (Lukoševičius and Jaeger, 2009), where a highly non-linear high-dimensional black box representation is provided to a lightweight "readout" network, as in echo state networks (Jaeger, 2003) and liquid state machines (Maass et al., 2002). The benefit of such an approach is that the reservoir has fixed parameters and is computationally efficient, as it can be pre-computed and does not (necessarily) require backpropagation.

In NLP, Wieting and Kiela (2019) showed that random sentence encoders present a strong baseline for text classification, with subsequent work showing applications in a variety of tasks from summarization to machine translation (Enguehard et al., 2019; Garg et al., 2020; Pilault et al., 2020). To our knowledge, this work is the first to examine this phenomenon in transformers, and the first to recursively alternate reservoirs with subsequent transformer layers acting as readout functions. We introduce "reservoir transformers", wherein fixed random reservoir layers are interspersed with regular updateable transformer layers. The goal of this work is to put our understanding of transformer models on a more solid footing by providing empirical evidence of their capabilities even when some of their parameters are fixed. Our contributions are as follows:

• We introduce an area under the convergence curve metric for measuring performance-efficiency trade-offs, and show that replacing regular transformer layers with reservoir layers leads to improvements.

• We show that the addition of reservoir layers leads to improved test set generalization on a variety of tasks in a variety of settings.

• We show that pre-trained masked language modelling architectures like BERT and RoBERTa (Liu et al., 2019) can benefit from having some of their layers frozen, both during pre-training as well as when fine-tuning on downstream tasks.

• We experiment with different types of reservoir layers, including convolutional and recurrent neural network-based ones.

• We show empirical evidence that the backward pass can be skipped in its entirety by approximating upstream gradients using an approach we call backskipping, which can reduce the training compute further without sacrificing performance.

2 Approach

This paper is based on a very simple idea. Neural networks are trained via backpropagation, which involves consecutive steps of matrix addition and multiplication, i.e.,

$$\theta_{t+1} \leftarrow \theta_t - \eta \frac{\partial J}{\partial \theta_t}; \qquad \frac{\partial J}{\partial \theta_t} = \frac{\partial J}{\partial L_n}\,\frac{\partial L_n}{\partial L_{n-1}} \cdots \frac{\partial L_0}{\partial x}$$

for some objective J, parameterization θ and learning rate η, with the gradient computed via the chain rule, where L_i is the i-th layer of the neural network and x is the input. Let L = Transformer(X) be a single layer in a Transformer network (Vaswani et al., 2017), i.e.,

H = MultiHeadSelfAttn(LayerNorm(X)) + X
L = FFN(LayerNorm(H)) + H

Now, during every "backward pass", we compute the Jacobian for parameters $\theta_L$ at layer L, which is used to update the parameters of L, $\theta_t^L$, as well as to compute the next layer's Jacobian, thus back-propagating the gradients. In this work however, for some of the layers, we still backpropagate through them to compute gradients for earlier layers, but we never apply the parameter update. As a result, these layers stay fixed at their initialization, saving computational resources.
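To make this concrete, the minimal PyTorch sketch below (our own illustration, not the paper's fairseq implementation; the layer sizes are arbitrary) builds a pre-norm transformer layer following the two equations above and then freezes one layer: gradients still flow through the frozen layer to earlier layers, but its parameters are excluded from optimizer updates.

```python
import torch
import torch.nn as nn

class PreNormTransformerLayer(nn.Module):
    """H = MultiHeadSelfAttn(LayerNorm(X)) + X ;  L = FFN(LayerNorm(H)) + H"""
    def __init__(self, d_model=512, n_heads=8, d_ffn=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ffn), nn.ReLU(),
                                 nn.Linear(d_ffn, d_model))

    def forward(self, x):
        h_norm = self.norm1(x)
        h = self.attn(h_norm, h_norm, h_norm, need_weights=False)[0] + x
        return self.ffn(self.norm2(h)) + h

def freeze(layer: nn.Module) -> nn.Module:
    """Turn a layer into a reservoir: its parameters receive no updates,
    but the backward pass still propagates gradients *through* it."""
    for p in layer.parameters():
        p.requires_grad_(False)
    return layer

layers = nn.ModuleList([PreNormTransformerLayer() for _ in range(6)])
freeze(layers[2])  # keep, e.g., the third layer fixed at its random initialization
# The optimizer only ever sees the updateable parameters:
optimizer = torch.optim.Adam(p for p in layers.parameters() if p.requires_grad)
```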
2.1 Background

Naturally, never updating some of the parameters is computationally more efficient, as some matrix addition operations can be skipped in the backward pass, but why is this not detrimental to the performance of the network?

In the early days of neural networks, the bottom layers were often kept fixed as "associators" (Block, 1962), or what Minsky and Papert (2017) called the Gamba perceptron (Gamba et al., 1961; Borsellino and Gamba, 1961). Fixed random networks (Baum, 1988; Schmidt et al., 1992; Pao et al., 1994) have been explored from many angles, including as "random kitchen sink" kernel machines (Rahimi and Recht, 2008, 2009), "extreme learning machines" (Huang et al., 2006) and reservoir computing (Jaeger, 2003; Maass et al., 2002; Lukoševičius and Jaeger, 2009). In reservoir computing, input data are represented through fixed random high-dimensional non-linear representations, called "reservoirs", which are followed by a regular (often but not necessarily linear) "readout" network to make the final classification decision.

The theoretical justification for these approaches lies in two well-known results in machine learning: Cover's theorem (Cover, 1965) on the separability of patterns states that high-dimensional non-linear transformations are more likely to be linearly separable; and the Johnson-Lindenstrauss lemma (Johnson and Lindenstrauss, 1984) shows that (most) random projections distort Euclidean distances very little.

Practically, random layers can be seen as a cheap way to increase network depth. There are interesting advantages to this approach. Fixed layers are known to have particularly low-cost hardware requirements and can be easily implemented on high-bandwidth FPGAs with low power consumption (Hadaeghi et al., 2017; Tanaka et al., 2019), or on optical devices (Hicke et al., 2013). This might yield interesting possibilities for training in a distributed fashion across multiple devices, as well as for neuromorphic hardware (Neftci et al., 2017).
This approach also facilitates lower-latency deployment of neural networks to edge devices, since weights can be shared simply by sending the seed number, assuming the random number generator is known on both ends.

2.2 Reservoir Transformers

This work explores inserting random non-linear transformations, or what we call reservoir layers, into transformer networks. Specifically, we experiment with a variety of reservoir layers:

• Transformer Reservoir: The standard transformer layer as described above, but with all parameters fixed after initialization, including the self-attention module.

• FFN Reservoir: A transformer-style fixed feed-forward layer without any self-attention, i.e., FFN(LayerNorm(Previous layer)) + Previous layer (a minimal sketch of this variant follows the list).

• BiGRU Reservoir: A fixed bidirectional Gated Recurrent Unit (Cho et al., 2014) layer, which is closer in spirit to previous work on reservoir computing, most of which builds on recurrent architectures.

• CNN Reservoir: A fixed Convolutional Neural Network (LeCun et al., 1998) layer, specifically light dynamical convolution layers (Wu et al., 2019), which are known to be competitive with transformers in sequence-to-sequence tasks.
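For concreteness, here is a minimal sketch of the FFN Reservoir variant (our own illustration; the dimensions, the seeding scheme and the use of orthogonal initialization for both projections are assumptions rather than the paper's exact configuration). Because the block is fully determined by a seed, it could in principle be reproduced on another device by transmitting that single number.

```python
import torch
import torch.nn as nn

class FFNReservoir(nn.Module):
    """Fixed FFN(LayerNorm(x)) + x block whose parameters are never updated."""
    def __init__(self, d_model=512, d_ffn=2048, seed=0):
        super().__init__()
        torch.manual_seed(seed)  # the layer is fully reproducible from this one number
        self.norm = nn.LayerNorm(d_model)
        self.w1 = nn.Linear(d_model, d_ffn)
        self.w2 = nn.Linear(d_ffn, d_model)
        # Orthogonal random projections (assumption: mirroring the paper's use of
        # orthogonal initialization for the random projection weights).
        nn.init.orthogonal_(self.w1.weight)
        nn.init.orthogonal_(self.w2.weight)
        for p in self.parameters():
            p.requires_grad_(False)  # a reservoir: frozen at initialization

    def forward(self, x):
        # FFN(LayerNorm(previous layer)) + previous layer
        return self.w2(torch.relu(self.w1(self.norm(x)))) + x
```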

We find that all these approaches work well, to a certain extent. For clarity, we focus primarily on the first two reservoir layers, but include a broader comparison in Appendix A.

In each case, contrary to traditional reservoir computing, our reservoir layers are interspersed throughout a regular transformer network, or what we call a reservoir transformer. Since random projections are not learned and might introduce noise, subsequent normal transformer "readout" layers might be able to benefit from additional depth while allowing us to recover from any adverse effects of randomness. For example, previous work has shown that ResNets, with all of their parameters fixed except for the scale and shift parameters of batch normalization, can still achieve high performance, simply by scaling and shifting random features (Frankle et al., 2020). Adding some form of noise to the parameters is also known to help convergence and generalization (Jim et al., 1995, 1996; Gulcehre et al., 2016; Noh et al., 2017).

3 Evaluation

We evaluate the proposed approach on a variety of well-known tasks in natural language processing, namely: machine translation, language modelling and masked language model pre-training.

We set out to do this work with the main objective of examining any potential efficiency gains, i.e. the relationship between compute time and task performance. This is closely related to efforts in Green AI, which are concerned with the trade-offs between compute, data, and performance (Schwartz et al., 2019). We propose to measure this trade-off via the area under the convergence curve (AUCC): similarly to how the area under the receiver operating characteristic (AUC-ROC; Bradley, 1997) measures a classifier's performance independent of the classification threshold, AUCC measures a model's performance independent of the specific compute budget. Specifically, AUCC is computed as follows:

$$\int_{t=0}^{\hat{T}} \sum_{x,y \in D} g_t(f(x), y) \qquad (1)$$
where f is the network and g is the evaluation metric, measured until convergence time T̂, which is the maximum convergence time of all models included in the comparison. Note that time here is wall-clock time, not iterations. By convergence, we mean that validation performance has stopped improving, and hence the convergence curve whose area we measure plots the desired metric over time. Runs are averaged over multiple seeds and reported with standard deviation. We normalize raw AUCC scores by their maximum to ensure a more interpretable [0-1] range.

One potential downside of this approach is that the AUCC metric could lead to higher scores for a model that converges quickly but to ultimately worse performance, if measured in a small window. This can be solved by making sure that T̂ is set sufficiently high. We include the raw validation curves in the appendix to demonstrate that the chosen window sizes are sufficient and the results are not influenced by this limitation. In addition, we report the number of trainable parameters and the wall-clock training time until maximum performance (plus 95% and 99% convergence results in the appendix). Finally, we show test set generalization in each experiment. Overall, this gives us a wide set of axes along which to examine models.
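As a concrete reading of Equation 1, the sketch below (a hypothetical helper, not the paper's evaluation code; the logged curves are made-up numbers) computes the raw area under a validation-metric-vs-wall-clock-time curve up to the shared cut-off T̂ via the trapezoidal rule, and then normalizes scores by their maximum across the models being compared.

```python
import numpy as np

def raw_aucc(times_hours, metric_values, t_max_hours):
    """Area under a validation-metric-vs-wall-clock-time curve, integrated up to
    the shared cut-off T-hat (Equation 1), via the trapezoidal rule."""
    t = np.asarray(times_hours, dtype=float)
    m = np.asarray(metric_values, dtype=float)
    keep = t <= t_max_hours
    t, m = t[keep], m[keep]
    if t[-1] < t_max_hours:          # hold the last observed value until the cut-off
        t = np.append(t, t_max_hours)
        m = np.append(m, m[-1])
    return np.trapz(m, t)

# Normalizing raw scores by their maximum gives values in the [0-1] range.
curves = {
    "transformer":   ([1, 2, 3, 4], [28.0, 31.5, 33.0, 34.0]),  # illustrative BLEU logs
    "ffn_reservoir": ([1, 2, 3, 4], [29.5, 32.5, 33.8, 34.4]),
}
raw = {name: raw_aucc(t, m, t_max_hours=4) for name, (t, m) in curves.items()}
top = max(raw.values())
normalized = {name: score / top for name, score in raw.items()}
```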

3.1 Experimental Settings

We evaluate on IWSLT de-en (Cettolo et al., 2015) and WMT en-de (Bojar et al., 2014) for machine translation; enwiki8 (LLC, 2009) for language modelling; and experiment with RoBERTa (Liu et al., 2019) in our pretraining experiments. For IWSLT, we follow the pre-processing steps in Edunov et al. (2018). The train/val/test split is 129k/10k/6.8k sentences. For WMT, we follow the pre-processing in Ott et al. (2018), with 4.5M/16.5k/3k sentences in train/val/test. For enwiki8, we follow the pre-processing steps in Dai et al. (2019). The train/val/test split is 1M/54k/56k sentences. For RoBERTa pretraining, we follow the pre-processing steps in Liu et al. (2019).

We use 8 Volta V100 GPUs for WMT and enwik8, 32 V100 GPUs for RoBERTa and a single V100 for IWSLT. The hyperparameters for IWSLT14 and WMT16 were set to the best-performing values from Ott et al. (2018) and Kasai et al. (2020) respectively. The enwik8 experiment settings followed Bachlechner et al. (2020) and the RoBERTa experiments followed Liu et al. (2019). All the experiments in this paper were run with 3 random seeds and the mean and standard deviation are reported. For the relatively small IWSLT, the T̂ value in the AUCC metric was set to 4 hours. For the larger WMT, we set it to 20 hours. For enwiki8, it was 30 hours; and for the RoBERTa pre-training experiments, it was set to 60 hours.

The projection weights in random layers were initialized using orthogonal initialization (Saxe et al., 2013), since random orthogonal projections should ideally be maximally information-preserving, and this was found to work well empirically for initializing fixed random representations in previous work (Wieting and Kiela, 2019). Biases and layer norm parameters were initialized using their respective PyTorch defaults (based on Xavier init; Glorot and Bengio, 2010).

We intersperse reservoir layers in alternating fashion starting from the middle. Specifically, we alternate one reservoir layer with one transformer layer, and place the alternating block in the middle. For example: a 7-layer encoder LLLLLLL in which we replace three layers with reservoirs becomes LRLRLRL, and with two becomes LLRLRLL. See Appendix C for a study comparing this strategy to alternative approaches (e.g., freezing in the bottom, middle or top).
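A small sketch of this interleaving rule (an illustrative helper consistent with the examples above; the released code may compute the placement differently): given the total number of layers and the number of reservoirs, the alternating block of reservoirs and regular layers is centered in the stack.

```python
def reservoir_pattern(n_layers: int, n_reservoirs: int) -> str:
    """Return a layer pattern such as 'LLRLRLL': 'L' is a regular (updateable)
    transformer layer, 'R' a frozen reservoir. Reservoirs are alternated with
    regular layers and the alternating block is placed in the middle."""
    block = "R" + "LR" * (n_reservoirs - 1)   # e.g. 'RLR' for two reservoirs
    start = (n_layers - len(block)) // 2      # centre the alternating block
    pattern = ["L"] * n_layers
    pattern[start:start + len(block)] = list(block)
    return "".join(pattern)

assert reservoir_pattern(7, 3) == "LRLRLRL"
assert reservoir_pattern(7, 2) == "LLRLRLL"
```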

4 Experiments

In what follows, we first show our main result, on a variety of tasks: reservoir transformers mostly have better AUCC metrics; less training time per epoch; less convergence time until the best validation performance is achieved; and even improved test set generalization metrics. As a strong baseline method, we compare to LayerDrop (Fan et al., 2019). LayerDrop can also be seen as a method that dynamically bypasses parts of the computation during Transformer training in an attempt to improve efficiency, making it a strong comparison for examining our methods. Then, we examine whether we can minimize the expectation over the gradients of upstream layers in the network such that we do not have to pass gradients through the reservoir layers at all, skipping their backward pass.

Figure 1: Validation (top) and test (bottom) results for IWSLT (left), WMT (middle) and enwiki8 language modelling (right). IWSLT and WMT are BLEU (high is good); enwiki8 is BPC (low is good). Comparison of regular transformer (blue) and reservoir transformer with FFN (green) or Transformer reservoir (orange) layers added.

4.1 Machine Translation

Machine translation (MT) is one of the core tasks of NLP. We demonstrate on two well-known MT datasets, IWSLT'14 German-English and WMT'16 English-German, that reservoir transformers obtain a better AUCC. For the raw validation plots over time that were used to calculate the AUCC, please refer to Appendix F.

Following Kasai et al. (2020), the architecture of the network is an N-layer reservoir transformer encoder, followed by a regular shallow one- or two-layer decoder. This design choice has been shown to lead to very good speed and efficiency trade-offs, and serves as a good baseline for our experiments. Moreover, shallow decoders make it easier to decide where to place reservoir layers (in the encoder) and make it more straightforward to identify where performance gains come from.

Figure 1 shows the results for IWSLT (left) and WMT (middle). On the y-axis we show validation AUCC for the BLEU metric; on the x-axis we show the number of updatable layers in the encoder. The performance of a regular transformer encoder with 6 layers and a reservoir transformer encoder with 6 layers plus N additional reservoir layers are plotted at the same x-axis value to show the total number of updated layers. Plots for the total number of layers (updatable plus not-updatable, so essentially shifted versions of the plots) are shown in Appendix E.

WMT is much larger and requires a much deeper encoder, as illustrated by the fact that a certain minimum depth is required for reservoir transformers to achieve a comparable validation AUCC. At test time, reservoir transformers outperform regular transformers for almost all encoder depths. The FFN Reservoir seems to work best in both cases, which is surprising because it does not have any self-attention component at all. This finding shows that self-attention, or the mechanism to summarize context information, should be learned if present. Once the context features have been gathered, a random projection via a fixed FFN module appears to be beneficial.

Tables 1 and 2 show the time it took to achieve the maximum validation BLEU score and how that relates to the regular transformer, demonstrating that reservoir transformers consistently converge faster in terms of wall-clock time. We save up to 22% of convergence wall-clock time using reservoir transformers with the same number of updateable layers, and as much as 27% of the time until convergence for a 24-layer model on WMT, as shown in Table 2. One other noticeable point is that the T Reservoir achieves similar performance to LayerDrop on IWSLT and WMT in terms of wall-clock time per epoch and wall-clock time to the best performance. However, on both tasks, the FFN Reservoir performs much better than LayerDrop in terms of efficiency per epoch and achieves better or similar performance in less time in each case. As a point of reference, a half-hour gain on IWSLT would translate to a gain of several days in the training of bigger transformer models like GPT-3 (Brown et al., 2020).

We observe that reservoir transformers consistently perform better than, or are competitive with, regular transformers, both in terms of validation BLEU AUCC as well as test-time BLEU, for all examined encoder depths.
| Model | # Layers | Frozen | Max BLEU | Train time until max (hours) | Ratio | # Params Trainable (Total) | Train time per epoch (seconds) |
| Transformer | 6 | 0 | 34.52 ± 0.07 | 2.548 ± 0.06 | 1 | 26.8M | 122.73 ± 1.16 |
| Transformer | 8 | 0 | 34.59 ± 0.11 | 2.557 ± 0.05 | 1 | 31.1M | 142.28 ± 1.87 |
| Transformer | 10 | 0 | 34.56 ± 0.05 | 3.173 ± 0.04 | 1 | 35.3M | 161.66 ± 1.54 |
| Transformer | 12 | 0 | 34.29 ± 0.12 | 3.521 ± 0.09 | 1 | 39.5M | 172.45 ± 1.98 |
| T Reservoir | 6 | 2 | 34.37 ± 0.12 | 2.422 ± 0.03 | 0.95 | 22.6M (26.8M) | 120.59 ± 1.32 |
| T Reservoir | 8 | 2 | 34.80 ± 0.07 | 2.450 ± 0.06 | 0.96 | 26.8M (31.1M) | 134.49 ± 1.76 |
| T Reservoir | 10 | 2 | 34.70 ± 0.03 | 2.831 ± 0.05 | 0.89 | 31.1M (35.3M) | 144.42 ± 1.98 |
| T Reservoir | 12 | 2 | 34.78 ± 0.04 | 3.476 ± 0.04 | 0.98 | 35.3M (39.5M) | 159.43 ± 1.67 |
| FFN Reservoir | 6 | 2 | 34.43 ± 0.15 | 2.120 ± 0.04 | 0.83 | 22.6M (25.8M) | 107.71 ± 1.73 |
| FFN Reservoir | 8 | 2 | 34.56 ± 0.16 | 2.203 ± 0.06 | 0.86 | 26.8M (29.1M) | 120.07 ± 1.65 |
| FFN Reservoir | 10 | 2 | 34.66 ± 0.02 | 2.493 ± 0.05 | 0.79 | 31.1M (33.3M) | 130.11 ± 1.43 |
| FFN Reservoir | 12 | 2 | 34.76 ± 0.03 | 3.241 ± 0.04 | 0.92 | 35.3M (37.5M) | 156.32 ± 1.87 |
| LayerDrop | 6 | 2 | 34.59 ± 0.15 | 2.364 ± 0.08 | 0.92 | 22.6M (26.8M) | 119.30 ± 1.36 |
| LayerDrop | 8 | 2 | 34.58 ± 0.16 | 2.554 ± 0.05 | 0.99 | 26.8M (31.1M) | 138.62 ± 1.44 |
| LayerDrop | 10 | 2 | 34.57 ± 0.07 | 3.404 ± 0.06 | 1.07 | 31.1M (35.3M) | 140.88 ± 1.62 |
| LayerDrop | 12 | 2 | 33.65 ± 0.24 | 3.251 ± 0.04 | 0.92 | 35.3M (39.5M) | 160.85 ± 1.49 |

Table 1: Wall-clock time (averaged over multiple runs) saved for IWSLT for different model types and encoder depths. Max BLEU is for validation. Number of layers is for encoder, decoder depth is kept fixed at 2. The ratio is computed compared to the corresponding number of layers in the regular transformer case.

| Model | # Layers | Frozen | Max BLEU | Train time until max (hours) | Ratio | # Params Trainable (Total) | Train time per epoch (hours) |
| Transformer | 12 | 0 | 24.46 ± 0.04 | 15.15 ± 0.15 | 1 | 75.6M | 0.505 ± 0.005 |
| Transformer | 16 | 0 | 24.52 ± 0.03 | 16.05 ± 0.18 | 1 | 88.2M | 0.643 ± 0.006 |
| Transformer | 24 | 0 | 24.69 ± 0.05 | 17.61 ± 0.85 | 1 | 113.4M | 0.877 ± 0.029 |
| Transformer | 32 | 0 | 24.83 ± 0.04 | 18.42 ± 0.28 | 1 | 138.6M | 1.036 ± 0.010 |
| T Reservoir | 12 | 4 | 24.26 ± 0.08 | 14.11 ± 0.21 | 0.93 | 72.4M (75.6M) | 0.472 ± 0.007 |
| T Reservoir | 16 | 4 | 24.50 ± 0.05 | 15.25 ± 0.28 | 0.95 | 75.6M (88.2M) | 0.596 ± 0.009 |
| T Reservoir | 24 | 4 | 25.11 ± 0.07 | 15.89 ± 0.74 | 0.90 | 100.8M (113.4M) | 0.776 ± 0.024 |
| T Reservoir | 32 | 4 | 24.66 ± 0.04 | 16.38 ± 0.24 | 0.88 | 126.0M (138.6M) | 0.998 ± 0.009 |
| FFN Reservoir | 12 | 4 | 24.42 ± 0.05 | 14.01 ± 0.09 | 0.92 | 72.4M (71.4M) | 0.441 ± 0.003 |
| FFN Reservoir | 16 | 4 | 24.65 ± 0.07 | 14.53 ± 0.17 | 0.91 | 75.6M (83.9M) | 0.524 ± 0.006 |
| FFN Reservoir | 24 | 4 | 24.93 ± 0.04 | 12.62 ± 1.53 | 0.71 | 100.8M (109.2M) | 0.743 ± 0.018 |
| FFN Reservoir | 32 | 4 | 24.98 ± 0.03 | 13.96 ± 0.19 | 0.73 | 126.0M (134.4M) | 0.964 ± 0.007 |
| LayerDrop | 12 | 4 | 24.27 ± 0.03 | 14.61 ± 0.14 | 0.96 | 72.4M (75.6M) | 0.489 ± 0.006 |
| LayerDrop | 16 | 4 | 24.15 ± 0.06 | 15.55 ± 0.54 | 0.97 | 75.6M (88.2M) | 0.597 ± 0.017 |
| LayerDrop | 24 | 4 | 24.37 ± 0.05 | 16.25 ± 0.36 | 0.92 | 100.8M (113.4M) | 0.823 ± 0.013 |
| LayerDrop | 32 | 4 | 23.84 ± 0.03 | 15.27 ± 0.38 | 0.83 | 126.0M (138.6M) | 1.028 ± 0.012 |

Table 2: Wall-clock time (averaged over multiple runs) saved for WMT for different model types and encoder depths. Decoder depth is kept fixed at 1.

4.2 Language Modelling

To examine whether the same findings hold for other tasks, we evaluate on the enwiki8 (LLC, 2009) language modelling task. We examine the BPC (bits per character) rate for a variety of network depths (since the task is language modelling, these layers are in the decoder). The results show that except for the 64-layer regular transformer, which appears to be particularly optimal for this task, we obtain consistently better BPC for all depths. We observe similar trends during test time.

4.3 Masked Language Model Pretraining

We train RoBERTa (Liu et al., 2019) models from scratch at a variety of depths, both in the normal and reservoir setting. We find that these networks show minor differences in their best perplexity and similar AUCC perplexity (see Appendix D).


Figure 2: Downstream RoBERTa performance on SST-2 (left) and MultiNLI-matched (right).

We then examine the performance of these models when fine-tuned on downstream tasks, specifically the well known SST-2 (Socher et al., 2013) and MultiNLI-matched (Williams et al., 2017) tasks. When fine-tuning the reservoir models, we keep the reservoir layers fixed (also fine-tuning them did not work very well, see Appendix D).

Figure 2 shows the results of fine-tuning. We observe that the reservoir transformer outperforms normal RoBERTa at all depths in both tasks. At lower depth, the improvements are substantial. As a sanity check, we also experiment with freezing some of the layers in a regular pre-trained RoBERTa model during fine-tuning only (Transformer "frozen finetuned" in the figure) and show that this helps a little but is still outperformed by the reservoir transformer.

These findings suggest that we can train a RoBERTa model without updating all of the layers, achieving similar perplexity at a similar computational cost, but with better downstream performance. This strategy could prove to be beneficial in a wide variety of pre-training scenarios.

We follow Jawahar et al. (2019) and investigate what the frozen layers in the Reservoir Transformer have actually "learned" (while being frozen) as measured by probing tasks, reported in Table 4. The set of tasks comprises one surface task, three syntactic tasks, and five semantic tasks. From the table, we can see that generally probing performance is quite similar between the Transformer and the T Reservoir model. We also noticed that the representations collected after the reservoir layers (3, 5, 7, 9) in the T Reservoir actually have significantly better performance than the regular Transformer representations across all the probing tasks. Related to our findings, Voita and Titov (2020) show that wholly-randomly-initialized model representations can still have reasonable probing accuracy if they are contextualized, though the accuracy is strictly worse than a trained one. These findings raise interesting repercussions for the study of "BERTology", as it clearly shows that even completely random and frozen layers can represent linguistic phenomena.

| Model | Max BLEU | AUCC | Train time |
| Transformer | 34.59 ± 0.11 | 114.57 ± 0.08 | 142.28 ± 1.87 |
| T Reservoir | 34.80 ± 0.07 | 115.26 ± 0.26 | 134.49 ± 1.70 |
| Backskip Reservoir | 34.75 ± 0.05 | 115.99 ± 0.23 | 119.54 ± 1.78 |

Table 3: Validation max BLEU, AUCC at 4h and wall-clock time per epoch (averaged over multiple runs, in seconds) on IWSLT comparing backskipping with regular and reservoir transformers.

4.4 Backskipping

With the reservoir transformers as described above, we obtain better efficiency by skipping the "gradient application" matrix addition step in some of the layers (i.e., updating the weights). One step further would be to investigate skipping the entire backward pass for reservoirs altogether, which would save us from having to do the much more expensive matrix multiplication for these layers that is required for the propagation of gradients through a regular layer.

We report on preliminary experiments where in the backward pass we replace the gradients for the layer L_i going into the reservoir L_{i+1} with a noisy estimate (Jaderberg et al., 2017; Czarnecki et al., 2017). Promisingly, Oktay et al. (2020) recently asked "why spend resources on exact gradients when we're going to use stochastic optimization?" and show that we can do randomized auto-differentiation quite successfully.
Model Layer SentLen TreeDepth TopConst BShift Tense SubjNum ObjNum SOMO CoordInv (Surface) (Syntactic) (Syntactic) (Syntactic) (Semantic) (Semantic) (Semantic) (Semantic) (Semantic) 1 84.56 ± 0.54 32.30 ± 0.41 54.40 ± 0.33 49.99 ± 0.01 80.98 ± 0.32 76.26 ± 0.09 50.01 ± 0.19 76.38 ± 0.61 54.33 ± 0.47 2 87.22 ± 0.07 33.63 ± 0.57 58.38 ± 0.20 50.12 ± 0.17 82.84 ± 0.68 78.65 ± 0.19 51.47 ± 0.53 78.00 ± 1.12 54.66 ± 0.55 3 84.25 ± 0.16 32.60 ± 0.17 54.41 ± 0.10 50.02 ± 0.01 81.72 ± 0.59 77.00 ± 0.13 51.32 ± 0.64 76.57 ± 1.13 54.13 ± 0.51 4 87.37 ± 0.20 32.59 ± 0.29 50.06 ± 0.21 69.76 ± 0.26 81.63 ± 1.17 76.47 ± 0.09 52.41 ± 1.49 76.15 ± 0.84 52.62 ± 1.34 5 84.61 ± 0.24 31.14 ± 0.48 44.76 ± 0.38 74.82 ± 0.11 80.16 ± 0.19 73.66 ± 0.16 52.95 ± 1.77 72.90 ± 0.21 51.26 ± 1.14 6 82.56 ± 0.25 30.31 ± 0.40 39.30 ± 0.40 78.80 ± 0.38 81.88 ± 0.47 75.30 ± 0.07 56.21 ± 1.26 74.37 ± 0.16 51.44 ± 1.04 Transformer 7 70.85 ± 0.13 26.65 ± 0.72 40.70 ± 0.13 78.98 ± 0.32 85.11 ± 0.31 72.03 ± 0.46 58.15 ± 0.46 68.71 ± 0.91 55.39 ± 0.27 8 66.23 ± 1.33 23.46 ± 0.44 25.19 ± 1.02 77.42 ± 0.27 80.35 ± 0.45 67.55 ± 0.99 54.94 ± 2.04 63.69 ± 2.32 50.58 ± 0.83 9 71.17 ± 0.29 31.21 ± 0.31 58.42 ± 0.29 85.55 ± 0.44 86.77 ± 0.19 80.30 ± 0.08 64.36 ± 1.20 81.68 ± 0.45 66.90 ± 0.49 10 73.19 ± 0.50 27.74 ± 0.53 41.01 ± 0.22 83.56 ± 0.96 86.13 ± 0.35 83.04 ± 0.04 62.01 ± 0.59 79.73 ± 0.21 62.60 ± 1.04 11 71.37 ± 0.42 30.22 ± 0.28 48.58 ± 0.35 84.40 ± 0.44 87.28 ± 0.59 82.34 ± 0.15 61.10 ± 0.14 80.00 ± 0.40 64.44 ± 0.38 12 71.66 ± 0.12 33.43 ± 0.18 64.38 ± 0.20 87.38 ± 0.02 88.41 ± 0.09 84.46 ± 0.25 63.01 ± 0.05 81.80 ± 0.27 65.72 ± 0.16 1 87.75 ± 0.10 31.60 ± 0.21 50.38 ± 0.23 50.00 ± 0.00 80.40 ± 0.18 76.47 ± 0.20 50.53 ± 0.14 73.48 ± 0.15 53.55 ± 0.70 2 81.28 ± 0.23 34.20 ± 0.41 61.41 ± 0.42 60.64 ± 0.65 81.50 ± 0.77 76.33 ± 0.08 50.73 ± 0.34 74.28 ± 0.67 56.82 ± 0.10 3 89.28 ± 0.09 36.42 ± 0.11 67.36 ± 0.45 75.64 ± 0.52 85.42 ± 0.18 80.53 ± 0.02 52.50 ± 1.80 78.47 ± 1.81 57.16 ± 0.27 4 74.31 ± 0.32 32.42 ± 0.83 55.19 ± 0.33 73.41 ± 0.00 79.56 ± 0.00 75.15 ± 0.08 53.68 ± 0.66 75.02 ± 0.19 56.89 ± 0.08 5 88.03 ± 0.22 38.34 ± 0.64 68.65 ± 0.29 82.25 ± 0.12 86.80 ± 0.02 82.27 ± 0.33 57.95 ± 0.24 80.82 ± 0.91 58.05 ± 0.10 6 74.55 ± 0.37 33.13 ± 0.29 52.70 ± 0.81 79.21 ± 0.13 85.70 ± 0.36 77.43 ± 0.03 57.26 ± 0.19 75.38 ± 0.66 51.95 ± 1.30 T Reservoir 7 85.82 ± 0.37 37.63 ± 0.13 70.43 ± 0.05 84.12 ± 0.35 86.88 ± 0.07 82.86 ± 0.30 61.17 ± 0.21 80.79 ± 0.17 61.83 ± 0.95 8 71.69 ± 0.71 30.32 ± 0.01 48.44 ± 0.30 79.12 ± 0.12 84.75 ± 0.09 79.23 ± 0.11 59.53 ± 0.16 76.80 ± 0.41 57.34 ± 0.14 9 85.86 ± 0.12 37.89 ± 0.03 69.53 ± 0.37 85.55 ± 0.12 87.98 ± 0.22 84.13 ± 0.01 63.06 ± 0.01 82.55 ± 0.31 66.07 ± 0.05 10 69.22 ± 0.23 25.58 ± 0.35 29.20 ± 0.58 78.57 ± 0.09 85.02 ± 0.03 75.68 ± 0.16 57.55 ± 1.57 74.70 ± 0.02 55.02 ± 0.64 11 65.70 ± 0.05 30.57 ± 0.03 47.56 ± 0.02 81.20 ± 0.00 86.78 ± 0.02 83.73 ± 0.05 60.38 ± 0.17 80.59 ± 0.15 62.50 ± 0.11 12 70.61 ± 0.18 34.45 ± 0.20 64.19 ± 0.10 84.53 ± 0.03 87.48 ± 0.16 84.86 ± 0.14 62.75 ± 0.14 82.08 ± 0.03 64.73 ± 0.06

Table 4: RoBERTa probing results. The lines in bold text are the frozen layers in the T Reservoir. Mean accuracy with standard deviation, gathered over 3 random seeds.

Here, rather than minimizing the actual gradients $\partial L_i / \partial \theta_{L_i}$, we minimize their expectation and train via continuous-action REINFORCE (Williams, 1992). That is, $L_i$ becomes a policy $\pi_a: s \rightarrow \mu$ where we sample actions $a \sim \mathcal{N}(\mu, 1)$. We train to minimize the gradient prediction loss via MSE, i.e., $\frac{1}{n}\sum_{i=0}^{n}(R^i - V(a^i))^2$, and the REINFORCE loss $\mathbb{E}_a[\log(a)(R - V(a))]$, where the value network $V$ acts as the baseline. $R$ is defined as the mean of the gradients of the top layer $L_{i+2}$, with the sign flipped. Thus, simply put, we train to minimize the expectation of the true gradients at the layer directly following the reservoir. We employ an annealing scheme where we first train the value network and propagate the true gradients during warmup. Afterwards, we anneal the probability of backskipping instead of doing a true backward pass (multiplying the probability by 0.99 every iteration until we only backskip). We experimented with setting $R$ to the negation of the total loss but found the mean upstream gradient reward to work better. We call this approach backskipping.

As shown in Table 3, the backskip reservoir approach leads to a higher maximum BLEU score than the regular transformer, with a much higher AUCC and much lower training time. The encoder depth is 8 with 2 frozen. Appendix G shows the raw validation BLEU curves over time. We observe that this approach helps especially during the earlier stages of training. This finding opens up intriguing possibilities for having parts of neural networks be completely frozen both in the forward as well as in the backward pass, while still contributing to the overall model computation. The computational cost is heavily reduced given that we completely bypass the expensive backpropagation computation in the reservoir layers. Backskipping is shown to be a promising approach to further reduce computational costs, and would be even more efficient from a hardware perspective since the circuitry for such layers (which do not need to propagate gradients) can be hardwired.
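To make the mechanics more tangible, here is a highly simplified sketch of skipping a reservoir's backward pass (our own illustration in the spirit of synthetic gradients (Jaderberg et al., 2017), not the paper's REINFORCE formulation): the frozen block runs without building an autograd graph, and a small trainable estimator supplies the gradient that would otherwise have to be backpropagated through it.

```python
import torch

class BackskipFunction(torch.autograd.Function):
    """Forward: run the frozen reservoir without building an autograd graph.
    Backward: do NOT backpropagate through the reservoir; instead return a cheap
    estimate of the gradient w.r.t. its input, produced by a small predictor
    that maps the input to a tensor of the same shape."""

    @staticmethod
    def forward(ctx, x, reservoir, predictor):
        ctx.save_for_backward(x)
        ctx.predictor = predictor
        with torch.no_grad():                  # no graph through the reservoir
            return reservoir(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        with torch.no_grad():
            grad_input = ctx.predictor(x)      # estimated upstream gradient
        # Only the input receives a gradient; reservoir and predictor get None.
        return grad_input, None, None

# Usage: y = BackskipFunction.apply(x, frozen_block, grad_predictor)
# During a warmup phase one would propagate true gradients and fit the predictor
# to them; the paper instead trains a value network and a continuous-action
# REINFORCE policy, annealing the probability of backskipping towards 1.
```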
5 Related Work

Recent work has shown that modern NLP models are able to function with different numbers of layers for different examples (Elbayad et al., 2019; Fan et al., 2019; He et al., 2021); that different layers specialize for different purposes (Zhang et al., 2019); that layers can be compressed (Li et al., 2020; Zhu et al., 2019; Shen et al., 2020; Sun et al., 2020); and that layers can be re-ordered (Press et al., 2019). There is a growing body of work in efficient self-attention networks (Tay et al., 2020b), such as linear attention (Wang et al., 2020), on how to process long context information (Beltagy et al., 2020; Ainslie et al., 2020) and on approximations to make transformers more scalable (Kitaev et al., 2020; Katharopoulos et al., 2020). BigBIRD (Zaheer et al., 2020) provides random keys as additional inputs to its attention mechanism. Locality sensitive hashing (LSH) as employed e.g. in Reformer (Kitaev et al., 2020) utilizes a fixed random projection. Random Feature Attention (Peng et al., 2021) uses random feature methods to approximate the softmax function. Performer (Choromanski et al., 2020) computes the transformer's multi-head attention weights as a fixed orthogonal random projection. Closely related to this work, Tay et al. (2020a) showed that randomized alignment matrices in their "Synthesizer" architecture are sufficient for many NLP tasks. While these works focus on random attention, we show that entire layers can be random and fixed. We also show that entire layers can be replaced by fixed random projections that do not have any attention whatsoever.

Beyond transformers, random features have been extensively explored. Examples of this include FreezeOut (Brock et al., 2017), deep reservoir computing networks (Scardapane and Wang, 2017; Gallicchio and Micheli, 2017), as well as applications in domains as varied as text classification (Conneau et al., 2017; Zhang and Bowman, 2018; Wieting and Kiela, 2019) or music classification (Pons and Serra, 2019). It is well known that randomly initialized networks can display impressive performance on their own (Ulyanov et al., 2018; Rosenfeld and Tsotsos, 2019; Ramanujan et al., 2020; Voita and Titov, 2020), which underlies, for example, the recently popularized lottery ticket hypothesis (Frankle and Carbin, 2018; Zhou et al., 2019). We know that learning deep over-parameterized networks appears to help in general (Li and Liang, 2018; Du et al., 2019). Our method constitutes a way to add both depth and parameters to transformer networks without much computational cost.

6 Conclusion

This work demonstrated that state-of-the-art transformer architectures can be trained without updating all of the layers. This complements a long history in machine learning of harnessing the power of random features. We use the "area under the convergence curve" (AUCC) metric to demonstrate that on a variety of tasks, and in a variety of settings, "reservoir transformers" achieve better performance-efficiency trade-offs. We show that such reservoir transformers show better convergence rates and test-set generalization. We demonstrated that the backward pass can be skipped altogether, opening up exciting avenues for future research. Future work includes further investigating hybrid networks and backskipping strategies, as well as utilizing pruning.

Acknowledgements

We thank Eric Wallace, Zhewei Yao, Kevin Lin, Zhiqing Sun, Zhuohan Li, Angela Fan, Shaojie Bai, and anonymous reviewers for their comments and suggestions. SS and KK were supported by grants from Samsung, Facebook, and the Berkeley Deep Drive Consortium.

References

Joshua Ainslie, Santiago Ontanon, Chris Alberti, Vaclav Cvicek, Zachary Fisher, Philip Pham, Anirudh Ravula, Sumit Sanghai, Qifan Wang, and Li Yang. 2020. ETC: Encoding long and structured inputs in transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Thomas Bachlechner, Bodhisattwa Prasad Majumder, Huanru Henry Mao, Garrison W Cottrell, and Julian McAuley. 2020. ReZero is all you need: Fast convergence at large depth. arXiv preprint arXiv:2003.04887.

Alexei Baevski, Steffen Schneider, and Michael Auli. 2019. vq-wav2vec: Self-supervised learning of discrete speech representations. arXiv preprint arXiv:1910.05453.

Eric B Baum. 1988. On the capabilities of multilayer perceptrons. Journal of Complexity, 4(3):193–215.

Iz Beltagy, Matthew E Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.

Hans-Dieter Block. 1962. The perceptron: A model for brain functioning. I. Reviews of Modern Physics, 34(1):123.

Ondřej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, Radu Soricut, Lucia Specia, and Aleš Tamchyna. 2014. Findings of the 2014 workshop on statistical machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, Baltimore, Maryland, USA. Association for Computational Linguistics.

A Borsellino and A Gamba. 1961. An outline of a mathematical theory of PAPA. Il Nuovo Cimento (1955-1965), 20(2):221–231.

Andrew P Bradley. 1997. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7):1145–1159.

Andrew Brock, Theodore Lim, James M Ritchie, and Nick Weston. 2017. FreezeOut: Accelerate training by progressively freezing layers. arXiv preprint arXiv:1706.04983.

Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.

Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. 2020. End-to-end object detection with transformers. arXiv preprint arXiv:2005.12872.

M. Cettolo, J. Niehues, S. Stüker, L. Bentivogli, and Marcello Federico. 2015. Report on the 11th IWSLT evaluation campaign, IWSLT 2014. In Proceedings of IWSLT.

Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Jared Davis, Tamas Sarlos, David Belanger, Lucy Colwell, and Adrian Weller. 2020. Masked language modeling for proteins via linearly scalable long-context transformers. arXiv preprint arXiv:2006.03555.

Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. 2019. Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning, pages 1675–1685.

Sergey Edunov, Myle Ott, Michael Auli, David Grangier, and Marc'Aurelio Ranzato. 2018. Classical structured prediction losses for sequence to sequence learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana. Association for Computational Linguistics.

Maha Elbayad, Jiatao Gu, Edouard Grave, and Michael Auli. 2019. Depth-adaptive transformer. arXiv preprint arXiv:1910.10073.

Joseph Enguehard, Dan Busbridge, Vitalii Zhelezniak, and Nils Hammerla. 2019. Neural language priors. arXiv preprint arXiv:1910.03492.

Angela Fan, Edouard Grave, and Armand Joulin. 2019. Reducing transformer depth on demand with structured dropout. arXiv preprint arXiv:1909.11556.
David Belanger, Lucy Colwell, and Adrian Weller. Reducing transformer depth on demand with struc- 2020. Masked language modeling for proteins via tured dropout. arXiv preprint arXiv:1909.11556. linearly scalable long-context transformers. arXiv preprint arXiv:2006.03555. Jonathan Frankle and Michael Carbin. 2018. The lot- tery ticket hypothesis: Finding sparse, trainable neu- Alexis Conneau, Douwe Kiela, Holger Schwenk, Loic ral networks. arXiv preprint arXiv:1803.03635. Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from Jonathan Frankle, David J Schwab, and Ari S Mor- natural language inference data. arXiv preprint cos. 2020. Training batchnorm and only batchnorm: arXiv:1705.02364. On the expressive power of random features in cnns. arXiv preprint arXiv:2003.00152. Thomas M Cover. 1965. Geometrical and statistical properties of systems of linear inequalities with ap- Claudio Gallicchio and Alessio Micheli. 2017. Echo plications in pattern recognition. IEEE transactions state property of deep reservoir computing networks. on electronic computers, (3):326–334. Cognitive Computation, 9(3):337–350. ´ Wojciech Marian Czarnecki, Grzegorz Swirszcz, Max Claudio Gallicchio and Simone Scardapane. 2020. Jaderberg, Simon Osindero, Oriol Vinyals, and Ko- Deep randomized neural networks. In Recent Trends ray Kavukcuoglu. 2017. Understanding synthetic in Learning From Data, pages 43–68. Springer. gradients and decoupled neural interfaces. arXiv preprint arXiv:1703.00522. A. Gamba, L. Gamberini, G. Palmieri, and R. Sanna. 1961. Further experiments with papa. Il Nuovo Ci- Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Car- mento (1955-1965), 20(2):112–115. bonell, Quoc Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Ankush Garg, Yuan Cao, and Qi Ge. 2020. Echo Annual Meeting of the Association for Computa- state neural machine translation. arXiv preprint tional Linguistics, Florence, Italy. Association for arXiv:2002.11847. Computational Linguistics. Raja Giryes, Guillermo Sapiro, and Alex M Bronstein. Amit Daniely, Roy Frostig, and Yoram Singer. 2016. 2016. Deep neural networks with random gaussian Toward deeper understanding of neural networks: weights: A universal classification strategy? IEEE The power of initialization and a dual view on ex- Transactions on Signal Processing, 64(13):3444– pressivity. In Advances In Neural Information Pro- 3457. cessing Systems, pages 2253–2261. Xavier Glorot and Yoshua Bengio. 2010. Understand- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and ing the difficulty of training deep feedforward neu- Kristina Toutanova. 2018. Bert: Pre-training of deep ral networks. In Proceedings of the thirteenth in- bidirectional transformers for language understand- ternational conference on artificial intelligence and ing. arXiv preprint arXiv:1810.04805. statistics, pages 249–256. Caglar Gulcehre, Marcin Moczulski, Misha Denil, and Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pap- Yoshua Bengio. 2016. Noisy activation functions. pas, and Franc¸ois Fleuret. 2020. Transformers are In International conference on machine learning, rnns: Fast autoregressive transformers with linear at- pages 3059–3068. tention. arXiv preprint arXiv:2006.16236.

Fatemeh Hadaeghi, Xu He, and Herbert Jaeger. 2017. Yoon Kim. 2014. Convolutional neural net- Unconventional Information Processing Systems, works for sentence classification. arXiv preprint Novel Hardware: A Tour D’Horizon. arXiv:1408.5882. Chaoyang He, Shen Li, Mahdi Soltanolkotabi, and Salman Avestimehr. 2021. Pipetransformer: Auto- Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. mated elastic pipelining for distributed training of 2020. Reformer: The efficient transformer. arXiv transformers. In ICML. preprint arXiv:2001.04451. Konstantin Hicke, Miguel Escalona-Moran, Daniel Yann LeCun, Leon´ Bottou, Yoshua Bengio, and Patrick Brunner, Miguel Soriano, Ingo Fischer, and Claudio Haffner. 1998. Gradient-based learning applied to Mirasso. 2013. Information processing using tran- document recognition. Proceedings of the IEEE, sient dynamics of semiconductor lasers subject to 86(11):2278–2324. delayed feedback. Selected Topics in Quantum Elec- tronics, IEEE Journal of, 19:1501610–1501610. Yuanzhi Li and Yingyu Liang. 2018. Learning overpa- rameterized neural networks via stochastic gradient Guang-Bin Huang, Qin-Yu Zhu, and Chee-Kheong descent on structured data. In Advances in Neural Siew. 2006. : theory and Information Processing Systems, pages 8157–8166. applications. Neurocomputing, 70(1-3):489–501. Max Jaderberg, Wojciech Marian Czarnecki, Simon Zhuohan Li, Eric Wallace, Sheng Shen, Kevin Lin, Osindero, Oriol Vinyals, Alex Graves, David Sil- Kurt Keutzer, Dan Klein, and Joseph E Gonzalez. ver, and Koray Kavukcuoglu. 2017. Decoupled neu- 2020. Train large, then compress: Rethinking model ral interfaces using synthetic gradients. In Inter- size for efficient training and inference of transform- national Conference on Machine Learning, pages ers. arXiv preprint arXiv:2002.11794. 1627–1635. PMLR. Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man- Herbert Jaeger. 2003. Adaptive nonlinear system iden- dar Joshi, Danqi Chen, Omer Levy, Mike Lewis, tification with echo state networks. In Advances in Luke Zettlemoyer, and Veselin Stoyanov. 2019. neural information processing systems. Roberta: A robustly optimized bert pretraining ap- proach. arXiv preprint arXiv:1907.11692. Ganesh Jawahar, Benoˆıt Sagot, and Djame´ Seddah. 2019. What does BERT learn about the structure MultiMedia LLC. 2009. Large text compression of language? In Proceedings of the 57th Annual benchmark. Meeting of the Association for Computational Lin- guistics. Mantas Lukoseviˇ ciusˇ and Herbert Jaeger. 2009. Reser- Kam Jim, Bill G Horne, and C Lee Giles. 1995. Effects voir computing approaches to recurrent neural net- of noise on convergence and generalization in recur- work training. Computer Science Review, 3(3). rent networks. In Advances in neural information processing systems, pages 649–656. Wolfgang Maass, Thomas Natschlager,¨ and Henry Markram. 2002. Real-time computing without sta- Kam-Chuen Jim, C Lee Giles, and Bill G Horne. 1996. ble states: A new framework for neural computa- An analysis of noise in recurrent neural networks: tion based on perturbations. Neural computation, convergence and generalization. IEEE Transactions 14(11):2531–2560. on neural networks, 7(6):1424–1438. Marvin Minsky and Seymour A Papert. 2017. Percep- William B Johnson and Joram Lindenstrauss. 1984. trons: An introduction to computational geometry. Extensions of lipschitz mappings into a hilbert MIT press. space. Contemporary mathematics, 26(189-206):1. 
Jared Kaplan, Sam McCandlish, Tom Henighan, Emre O Neftci, Charles Augustine, Somnath Paul, Tom B Brown, Benjamin Chess, Rewon Child, Scott and Georgios Detorakis. 2017. Event-driven ran- Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. dom back-propagation: Enabling neuromorphic 2020. Scaling laws for neural language models. deep learning machines. Frontiers in neuroscience, arXiv preprint arXiv:2001.08361. 11:324. Jungo Kasai, Nikolaos Pappas, Hao Peng, James Hyeonwoo Noh, Tackgeun You, Jonghwan Mun, and Cross, and Noah A Smith. 2020. Deep encoder, Bohyung Han. 2017. Regularizing deep neural net- shallow decoder: Reevaluating the speed-quality works by noise: Its interpretation and optimization. tradeoff in machine translation. arXiv preprint In Advances in Neural Information Processing Sys- arXiv:2006.10369. tems, pages 5109–5118. Deniz Oktay, Nick McGreivy, Joshua Aduol, Alex Magnus Sahlgren. 2005. An introduction to random Beatson, and Ryan P Adams. 2020. Random- indexing. In Methods and applications of semantic ized automatic differentiation. arXiv preprint indexing workshop at the 7th international confer- arXiv:2007.10412. ence on terminology and knowledge engineering.

Myle Ott, Sergey Edunov, David Grangier, and Andrew M Saxe, James L McClelland, and Surya Gan- Michael Auli. 2018. Scaling neural machine trans- guli. 2013. Exact solutions to the nonlinear dynam- lation. arXiv preprint arXiv:1806.00187. ics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120. Yoh-Han Pao, Gwang-Hoon Park, and Dejan J Sobajic. 1994. Learning and generalization characteristics of Simone Scardapane and Dianhui Wang. 2017. Ran- the random vector functional-link net. Neurocom- domness in neural networks: an overview. Wiley puting, 6(2):163–180. Interdisciplinary Reviews: Data Mining and Knowl- edge Discovery, 7(2):e1200. Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah Smith, and Lingpeng Kong. 2021. Wouter F Schmidt, Martin A Kraaijveld, and Random feature attention. In International Confer- Robert PW Duin. 1992. Feedforward neural net- ence on Learning Representations. works with random weights. In Proceedings of the 11th International Conference on Pattern Recogni- Jonathan Pilault, Jaehong Park, and Christopher Pal. tion, 1992. Vol. II. Conference B: Pattern Recogni- 2020. On the impressive performance of randomly tion Methodology and Systems, pages 1–4. weighted encoders in summarization tasks. arXiv preprint arXiv:2002.09084. Benjamin Schrauwen, Michiel D’Haene, David Ver- straeten, and Jan Campenhout. 2007. Compact hard- Jordi Pons and Xavier Serra. 2019. Randomly ware for real-time speech recognition using a liquid weighted cnns for (music) audio classification. state machine. pages 1097 – 1102. In ICASSP 2019-2019 IEEE international confer- ence on acoustics, speech and signal processing Roy Schwartz, Jesse Dodge, Noah A Smith, and (ICASSP), pages 336–340. IEEE. Oren Etzioni. 2019. Green ai. arXiv preprint arXiv:1907.10597. Ofir Press, Noah A Smith, and Omer Levy. 2019. Im- proving transformer models by reordering their sub- Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei layers. arXiv preprint arXiv:1911.03864. Yao, Amir Gholami, Michael W Mahoney, and Kurt Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Keutzer. 2020. Q-bert: Hessian based ultra low Dario Amodei, and Ilya Sutskever. 2018. Language precision quantization of bert. In Proceedings of models are unsupervised multitask learners. the AAAI Conference on Artificial Intelligence, vol- ume 34, pages 8815–8821. Ali Rahimi and Benjamin Recht. 2008. Random fea- tures for large-scale kernel machines. In Advances Richard Socher, Alex Perelygin, Jean Wu, Jason in neural information processing systems, pages Chuang, Christopher D Manning, Andrew Y Ng, 1177–1184. and Christopher Potts. 2013. Recursive deep mod- els for semantic compositionality over a sentiment Ali Rahimi and Benjamin Recht. 2009. Weighted sums treebank. In Proceedings of the 2013 conference on of random kitchen sinks: Replacing minimization empirical methods in natural language processing, with randomization in learning. In Advances in pages 1631–1642. neural information processing systems, pages 1313– 1320. Emma Strubell, Ananya Ganesh, and Andrew Mc- Callum. 2019. Energy and policy considera- Vivek Ramanujan, Mitchell Wortsman, Aniruddha tions for deep learning in nlp. arXiv preprint Kembhavi, Ali Farhadi, and Mohammad Rastegari. arXiv:1906.02243. 2020. What’s hidden in a randomly weighted neural network? In Proceedings of the IEEE/CVF Confer- Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, ence on Computer Vision and Pattern Recognition, Yiming Yang, and Denny Zhou. 2020. 
Mobilebert: a pages 11893–11902. compact task-agnostic bert for resource-limited de- vices. In Proceedings of the 58th Annual Meeting Anna Rogers, Olga Kovaleva, and Anna Rumshisky. of the Association for Computational Linguistics, 2020. A primer in bertology: What we know about pages 2158–2170. how bert works. arXiv preprint arXiv:2002.12327. Gouhei Tanaka, Toshiyuki Yamane, Jean Benoit Amir Rosenfeld and John K Tsotsos. 2019. Intriguing Heroux,´ Ryosho Nakane, Naoki Kanazawa, Seiji properties of randomly weighted networks: Gener- Takeda, Hidetoshi Numata, Daiju Nakano, and alizing while learning next to nothing. In 2019 16th Akira Hirose. 2019. Recent advances in physical Conference on Computer and Robot Vision (CRV), reservoir computing: A review. Neural Networks, pages 9–16. IEEE. 115:100 – 123. Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Felix Wu, Angela Fan, Alexei Baevski, Yann N Zhe Zhao, and Che Zheng. 2020a. Synthesizer: Re- Dauphin, and Michael Auli. 2019. Pay less attention thinking self-attention in transformer models. arXiv with lightweight and dynamic convolutions. arXiv preprint arXiv:2005.00743. preprint arXiv:1901.10430.

Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. 2020b. Efficient transformers: A survey. arXiv preprint arXiv:2009.06732.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. arXiv preprint arXiv:1905.05950.

Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. 2018. Deep image prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9446–9454.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Oriol Vinyals, Igor Babuschkin, Wojciech M. Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H. Choi, Richard Powell, Timo Ewalds, Petko Georgiev, Junhyuk Oh, Dan Horgan, Manuel Kroiss, Ivo Danihelka, Aja Huang, Laurent Sifre, Trevor Cai, John P. Agapiou, Max Jaderberg, Alexander S. Vezhnevets, Rémi Leblond, Tobias Pohlen, Valentin Dalibard, David Budden, Yury Sulsky, James Molloy, Tom L. Paine, Caglar Gulcehre, Ziyu Wang, Tobias Pfaff, Yuhuai Wu, Roman Ring, Dani Yogatama, Dario Wünsch, Katrina McKinney, Oliver Smith, Tom Schaul, Timothy Lillicrap, Koray Kavukcuoglu, Demis Hassabis, Chris Apps, and David Silver. 2019. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354.

Elena Voita and Ivan Titov. 2020. Information-theoretic probing with minimum description length. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 183–196.

Sinong Wang, Belinda Li, Madian Khabsa, Han Fang, and Hao Ma. 2020. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768.

John Wieting and Douwe Kiela. 2019. No training required: Exploring random encoders for sentence classification. arXiv preprint arXiv:1901.10444.

Adina Williams, Nikita Nangia, and Samuel R Bowman. 2017. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426.

Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256.

Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. 2020. Big bird: Transformers for longer sequences. arXiv preprint arXiv:2007.14062.

Chiyuan Zhang, Samy Bengio, and Yoram Singer. 2019. Are all layers created equal? arXiv preprint arXiv:1902.01996.

Kelly Zhang and Samuel Bowman. 2018. Language modeling teaches you more than translation does: Lessons learned through auxiliary syntactic task analysis. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP.

Hattie Zhou, Janice Lan, Rosanne Liu, and Jason Yosinski. 2019. Deconstructing lottery tickets: Zeros, signs, and the supermask. In Advances in Neural Information Processing Systems, pages 3597–3607.

Wei Zhu, Xiaofeng Zhou, Keqiang Wang, Xun Luo, Xiepeng Li, Yuan Ni, and Guotong Xie. 2019. PANLP at MEDIQA 2019: Pre-trained language models, transfer learning and knowledge distillation. In Proceedings of the 18th BioNLP Workshop and Shared Task, pages 380–388, Florence, Italy. Association for Computational Linguistics.

A Hybrid Networks and Non-Transformer Reservoirs

We investigate whether reservoir layers need to be transformer-based (or transformers-without-attention, i.e., FFN). We examine two different alternatives: bidirectional Gated Recurrent Units (Cho et al., 2014) and Convolutional Neural Networks (LeCun et al., 1998; Kim, 2014), specifically light dynamical convolutions (Wu et al., 2019). Figure 3 shows the results for these hybrids: depending on the setting, they may obtain a better AUCC than the regular transformer, but this is less consistent than with the other reservoir layers, most likely because these layers have different computational properties. It is possible that these hybrids simply require further tuning, as we found e.g. up-projecting to help for BiGRUs, but studying this is outside of the scope of the current work.

| Model | # Layers | Frozen | Max BLEU | Train time until max (hours) | Ratio | # Params Trainable (Total) | Train time per epoch (seconds) |
| Transformer | 6 | 0 | 34.97 ± 0.05 | 1.984 ± 0.02 | 1 | 39.5M | 177.84 ± 2.98 |
| Transformer | 8 | 0 | 34.99 ± 0.08 | 2.161 ± 0.03 | 1 | 43.7M | 206.59 ± 3.47 |
| Transformer | 10 | 0 | 34.98 ± 0.04 | 2.345 ± 0.02 | 1 | 47.9M | 236.72 ± 3.52 |
| Transformer | 12 | 0 | 34.78 ± 0.11 | 2.535 ± 0.05 | 1 | 52.0M | 265.90 ± 4.97 |
| T Reservoir | 6 | 2 | 34.73 ± 0.11 | 1.838 ± 0.01 | 0.92 | 35.3M (39.5M) | 166.11 ± 2.21 |
| T Reservoir | 8 | 2 | 35.07 ± 0.05 | 1.912 ± 0.03 | 0.88 | 39.5M (43.7M) | 190.08 ± 3.73 |
| T Reservoir | 10 | 2 | 35.02 ± 0.01 | 1.970 ± 0.04 | 0.84 | 43.7M (47.9M) | 204.42 ± 2.89 |
| T Reservoir | 12 | 2 | 35.06 ± 0.02 | 2.429 ± 0.02 | 0.95 | 47.8M (52.0M) | 236.41 ± 4.35 |
| FFN Reservoir | 6 | 2 | 34.85 ± 0.10 | 1.729 ± 0.03 | 0.87 | 35.3M (37.4M) | 161.72 ± 2.32 |
| FFN Reservoir | 8 | 2 | 34.99 ± 0.11 | 1.751 ± 0.02 | 0.81 | 39.5M (41.6M) | 180.21 ± 2.68 |
| FFN Reservoir | 10 | 2 | 34.92 ± 0.03 | 1.907 ± 0.02 | 0.81 | 43.7M (45.8M) | 191.40 ± 2.49 |
| FFN Reservoir | 12 | 2 | 35.16 ± 0.04 | 2.395 ± 0.01 | 0.94 | 47.8M (49.9M) | 216.08 ± 2.57 |
| LayerDrop | 6 | 2 | 34.51 ± 0.12 | 1.908 ± 0.04 | 0.96 | 35.3M (39.5M) | 169.62 ± 3.16 |
| LayerDrop | 8 | 2 | 34.77 ± 0.11 | 2.023 ± 0.02 | 0.94 | 39.5M (43.7M) | 186.71 ± 2.17 |
| LayerDrop | 10 | 2 | 34.06 ± 0.05 | 1.912 ± 0.02 | 0.97 | 43.7M (47.9M) | 205.52 ± 3.31 |
| LayerDrop | 12 | 2 | 34.08 ± 0.13 | 2.524 ± 0.01 | 0.99 | 47.8M (52.0M) | 222.45 ± 2.21 |

Table 5: Wall-clock training time (averaged over multiple runs) for IWSLT for different model types and encoder depths. Max BLEU is validation BLEU. The number of layers refers to the encoder; decoder depth is kept fixed at 6. Ratio is the training time until max BLEU relative to the vanilla transformer with the same number of layers (e.g., 1.912/2.161 ≈ 0.88 for the 8-layer T Reservoir).
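To make the hybrid setup of Appendix A concrete, the minimal PyTorch sketch below shows one way a frozen BiGRU block with an up-projection could be interleaved with trainable transformer layers. All module names, the residual/LayerNorm wrapping, and the configuration values (dimensions, which layers are frozen) are our own illustrative assumptions, not the implementation used for the experiments.

```python
import torch
import torch.nn as nn

class BiGRUReservoir(nn.Module):
    """A frozen bidirectional-GRU block used as a reservoir layer.

    The GRU is randomly initialized, up-projected to a wider hidden size
    (which we found helpful for BiGRUs), and never updated.
    """

    def __init__(self, d_model: int, up_factor: int = 2):
        super().__init__()
        hidden = d_model * up_factor
        self.gru = nn.GRU(d_model, hidden, bidirectional=True, batch_first=True)
        self.down = nn.Linear(2 * hidden, d_model)  # project back to d_model
        self.norm = nn.LayerNorm(d_model)
        for p in self.parameters():          # freeze: reservoir weights stay fixed
            p.requires_grad_(False)

    def forward(self, x):                    # x: (batch, seq, d_model)
        out, _ = self.gru(x)
        return self.norm(x + self.down(out))  # residual around the reservoir


def build_encoder(d_model=512, n_layers=8, frozen=(2, 5)):
    """Interleave trainable transformer layers with frozen BiGRU reservoirs.

    `frozen` is an illustrative choice of reservoir positions, not the
    configuration from the paper.
    """
    layers = []
    for i in range(n_layers):
        if i in frozen:
            layers.append(BiGRUReservoir(d_model))
        else:
            layers.append(nn.TransformerEncoderLayer(d_model, nhead=8,
                                                     batch_first=True))
    return nn.Sequential(*layers)


if __name__ == "__main__":
    enc = build_encoder()
    x = torch.randn(4, 16, 512)              # (batch, seq, d_model)
    print(enc(x).shape)                      # torch.Size([4, 16, 512])
```

A convolutional hybrid would follow the same pattern, swapping the GRU for a frozen lightweight/dynamic convolution block with the same input/output interface.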

Figure 3: IWSLT comparison of different hybrid architectures with different reservoir layers (Transformer, T Reservoir, FFN Reservoir, GRU Reservoir, Conv Reservoir); validation BLEU AUCC and test BLEU are plotted against the number of updatable encoder layers.

B Deep Decoders

We show that the same results hold for a 6-layer decoder on IWSLT (although less pronounced for AUCC, probably because the decoder is computationally heavier). See Figure 4 and Table 5.

Figure 4: IWSLT validation AUCC and test BLEU with 6-layer decoder, plotted against the number of updatable encoder layers.

C Freezing Strategy

We explored different strategies for the placement of reservoir layers and found the “alternating” strategy reported in the main body of the paper to work best. Generally, we found repetitive application of reservoirs to yield diminishing returns, as might be expected. See Figure 5.

Figure 5: IWSLT with 2-layer decoder using different freezing strategies (alternating, middle, top and bottom placement of the reservoir layers); validation BLEU AUCC vs. the number of updatable encoder layers.

D RoBERTa Results

Here we present the additional results for RoBERTa, i.e., convergence plots and AUCCs for various depth settings, in Figure 7. As stated in the main paper, the differences in terms of AUCC and convergence between RoBERTa models with and without reservoir layers are limited. Moreover, we plot downstream task performance for SST-2 and MNLI against pretraining wall-clock time in Figure 6. It can be seen that the FFN Reservoir achieves up to 25% and 10% pretraining time savings while matching the best performance of vanilla transformers on MNLI-m and SST-2, respectively.

Model | IWSLT-Dec2 (# layers / hours to 95% max / BLEU) | IWSLT-Dec6 (# layers / hours to 95% max / BLEU) | WMT-Dec1 (# layers / hours to 95% max / BLEU)
Transformer | 6 / 0.647 ± 0.03 / 32.89 ± 0.04 | 6 / 0.642 ± 0.02 / 33.36 ± 0.03 | 12 / 3.788 ± 0.053 / 23.36 ± 0.06
Transformer | 8 / 0.711 ± 0.05 / 33.04 ± 0.03 | 8 / 0.765 ± 0.03 / 33.41 ± 0.08 | 16 / 3.820 ± 0.072 / 23.41 ± 0.05
Transformer | 10 / 0.808 ± 0.02 / 33.96 ± 0.08 | 10 / 0.898 ± 0.04 / 33.32 ± 0.07 | 24 / 5.262 ± 0.607 / 23.50 ± 0.03
Transformer | 12 / 1.037 ± 0.03 / 33.07 ± 0.09 | 12 / 1.037 ± 0.03 / 33.07 ± 0.11 | 32 / 6.212 ± 0.232 / 23.81 ± 0.04
T Reservoir | 6 / 0.569 ± 0.02 / 32.78 ± 0.03 | 6 / 0.599 ± 0.01 / 33.09 ± 0.05 | 12 / 3.563 ± 0.061 / 23.21 ± 0.04
T Reservoir | 8 / 0.619 ± 0.04 / 33.12 ± 0.05 | 8 / 0.726 ± 0.02 / 33.38 ± 0.09 | 16 / 3.603 ± 0.056 / 23.80 ± 0.06
T Reservoir | 10 / 0.729 ± 0.04 / 33.13 ± 0.07 | 10 / 0.738 ± 0.03 / 33.37 ± 0.04 | 24 / 4.923 ± 0.771 / 23.75 ± 0.02
T Reservoir | 12 / 0.982 ± 0.02 / 33.03 ± 0.11 | 12 / 0.958 ± 0.01 / 33.46 ± 0.09 | 32 / 5.780 ± 0.214 / 23.71 ± 0.03
FFN Reservoir | 6 / 0.521 ± 0.05 / 32.85 ± 0.02 | 6 / 0.594 ± 0.03 / 33.13 ± 0.04 | 12 / 3.417 ± 0.046 / 23.22 ± 0.07
FFN Reservoir | 8 / 0.533 ± 0.03 / 33.84 ± 0.04 | 8 / 0.651 ± 0.04 / 33.36 ± 0.06 | 16 / 3.527 ± 0.063 / 23.54 ± 0.05
FFN Reservoir | 10 / 0.614 ± 0.01 / 33.05 ± 0.08 | 10 / 0.627 ± 0.05 / 33.26 ± 0.03 | 24 / 4.197 ± 0.697 / 23.74 ± 0.06
FFN Reservoir | 12 / 0.811 ± 0.02 / 33.26 ± 0.10 | 12 / 0.780 ± 0.02 / 33.46 ± 0.08 | 32 / 4.984 ± 0.321 / 23.82 ± 0.02
LayerDrop | 6 / 0.837 ± 0.08 / 32.87 ± 0.05 | 6 / 0.706 ± 0.01 / 33.08 ± 0.03 | 12 / 3.912 ± 0.068 / 23.33 ± 0.08
LayerDrop | 8 / 0.934 ± 0.07 / 33.12 ± 0.03 | 8 / 0.753 ± 0.04 / 33.14 ± 0.05 | 16 / 3.581 ± 0.076 / 23.17 ± 0.04
LayerDrop | 10 / 0.901 ± 0.06 / 33.18 ± 0.02 | 10 / 0.691 ± 0.03 / 32.39 ± 0.05 | 24 / 4.875 ± 0.728 / 23.43 ± 0.07
LayerDrop | 12 / 0.914 ± 0.01 / 32.33 ± 0.06 | 12 / 0.803 ± 0.02 / 32.94 ± 0.10 | 32 / 5.980 ± 0.219 / 22.97 ± 0.08

Table 6: Wall-clock training time (averaged over multiple runs) until reaching 95% of the maximum validation BLEU, for IWSLT/WMT across different model types and encoder depths.
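To make the freezing strategies of Appendix C concrete, the helper below shows one possible mapping from the four placements compared in Figure 5 to encoder layer indices. The function and its index conventions are our own illustrative reading of the strategy names, not the configuration code used for the experiments.

```python
def frozen_layer_indices(strategy: str, n_layers: int, n_frozen: int):
    """Return the encoder layer indices to freeze (0 = bottom layer).

    One possible reading of the variants in Figure 5: 'alternate'
    interleaves reservoirs with regular layers (up to the freezing
    budget), while 'bottom', 'mid' and 'top' place them at the ends or
    around the middle of the stack.
    """
    if strategy == "bottom":
        return list(range(n_frozen))
    if strategy == "top":
        return list(range(n_layers - n_frozen, n_layers))
    if strategy == "mid":
        start = (n_layers - n_frozen) // 2
        return list(range(start, start + n_frozen))
    if strategy == "alternate":
        # every other layer starting from the second, capped at n_frozen
        return list(range(1, n_layers, 2))[:n_frozen]
    raise ValueError(f"unknown strategy: {strategy}")


# e.g. an 8-layer encoder with 2 reservoir layers:
print(frozen_layer_indices("alternate", 8, 2))  # [1, 3]
print(frozen_layer_indices("mid", 8, 2))        # [3, 4]
print(frozen_layer_indices("top", 8, 2))        # [6, 7]
```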

Model | IWSLT-Dec2 (# layers / hours to 99% max / BLEU) | IWSLT-Dec6 (# layers / hours to 99% max / BLEU) | WMT-Dec1 (# layers / hours to 99% max / BLEU)
Transformer | 6 / 1.454 ± 0.06 / 34.24 ± 0.05 | 6 / 1.297 ± 0.03 / 34.69 ± 0.05 | 12 / 9.961 ± 0.053 / 24.27 ± 0.04
Transformer | 8 / 1.475 ± 0.09 / 34.32 ± 0.09 | 8 / 1.390 ± 0.02 / 34.75 ± 0.09 | 16 / 12.623 ± 0.072 / 24.35 ± 0.06
Transformer | 10 / 1.526 ± 0.04 / 34.25 ± 0.04 | 10 / 1.622 ± 0.05 / 34.64 ± 0.03 | 24 / 13.412 ± 0.837 / 24.49 ± 0.07
Transformer | 12 / 2.259 ± 0.07 / 34.24 ± 0.11 | 12 / 1.748 ± 0.01 / 34.66 ± 0.08 | 32 / 15.117 ± 0.232 / 24.56 ± 0.02
T Reservoir | 6 / 1.257 ± 0.04 / 34.05 ± 0.09 | 6 / 1.291 ± 0.03 / 34.51 ± 0.10 | 12 / 8.314 ± 0.062 / 24.15 ± 0.06
T Reservoir | 8 / 1.472 ± 0.06 / 34.47 ± 0.05 | 8 / 1.339 ± 0.03 / 34.80 ± 0.04 | 16 / 9.221 ± 0.073 / 24.41 ± 0.05
T Reservoir | 10 / 1.530 ± 0.03 / 34.36 ± 0.02 | 10 / 1.419 ± 0.04 / 34.72 ± 0.03 | 24 / 10.413 ± 0.580 / 24.56 ± 0.03
T Reservoir | 12 / 2.043 ± 0.05 / 34.53 ± 0.07 | 12 / 1.642 ± 0.02 / 34.87 ± 0.02 | 32 / 11.465 ± 0.227 / 24.49 ± 0.01
FFN Reservoir | 6 / 1.138 ± 0.03 / 34.10 ± 0.13 | 6 / 1.169 ± 0.02 / 34.71 ± 0.09 | 12 / 7.407 ± 0.087 / 24.33 ± 0.08
FFN Reservoir | 8 / 1.101 ± 0.07 / 34.32 ± 0.11 | 8 / 1.201 ± 0.03 / 34.79 ± 0.08 | 16 / 9.336 ± 0.036 / 24.42 ± 0.05
FFN Reservoir | 10 / 1.281 ± 0.01 / 34.36 ± 0.03 | 10 / 1.276 ± 0.03 / 34.63 ± 0.03 | 24 / 9.978 ± 0.546 / 24.91 ± 0.07
FFN Reservoir | 12 / 1.785 ± 0.03 / 34.42 ± 0.06 | 12 / 1.440 ± 0.01 / 34.87 ± 0.02 | 32 / 10.524 ± 0.341 / 24.96 ± 0.01
LayerDrop | 6 / 1.363 ± 0.05 / 34.58 ± 0.14 | 6 / 1.253 ± 0.01 / 34.42 ± 0.10 | 12 / 8.372 ± 0.059 / 24.17 ± 0.04
LayerDrop | 8 / 1.468 ± 0.03 / 34.50 ± 0.12 | 8 / 1.244 ± 0.04 / 34.44 ± 0.09 | 16 / 9.741 ± 0.043 / 23.93 ± 0.08
LayerDrop | 10 / 1.678 ± 0.04 / 34.52 ± 0.07 | 10 / 1.343 ± 0.04 / 33.83 ± 0.06 | 24 / 10.145 ± 0.628 / 24.07 ± 0.09
LayerDrop | 12 / 2.071 ± 0.02 / 33.45 ± 0.23 | 12 / 1.423 ± 0.02 / 33.97 ± 0.12 | 32 / 10.168 ± 0.329 / 23.81 ± 0.03

Table 7: Wall-clock training time (averaged over multiple runs) until reaching 99% of the maximum validation BLEU, for IWSLT/WMT across different model types and encoder depths.
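As a concrete reading of the "time until 95%/99% max" columns in Tables 6 and 7, the snippet below computes, for a toy validation curve, the first wall-clock time at which validation BLEU reaches a given fraction of its maximum. The helper and the toy numbers are ours, purely for illustration.

```python
import numpy as np

def hours_to_reach(times_h, bleu, frac=0.95):
    """Wall-clock hours until validation BLEU first reaches `frac` of its max.

    This mirrors the 'train time until 95%/99% max' quantity reported in
    Tables 6 and 7 (up to how often validation is run).
    """
    times_h = np.asarray(times_h, dtype=float)
    bleu = np.asarray(bleu, dtype=float)
    target = frac * bleu.max()
    idx = int(np.argmax(bleu >= target))   # first index meeting the target
    return times_h[idx]

# toy curve: reaches 95% of its max (0.95 * 35.0 = 33.25) at 1.5h
print(hours_to_reach([0.5, 1.0, 1.5, 2.0], [25.0, 31.0, 33.5, 35.0], 0.95))
```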

Figure 7: RoBERTa reservoir results: training plot (validation perplexity vs. training hours) for 12-layer RoBERTa (left), and validation PPL AUCC for varying numbers of updatable decoder layers (right).

Figure 6: RoBERTa reservoir results: pre-training wall-clock time versus downstream task performance for 12-layer RoBERTa, on MNLI-m (left) and SST-2 (right).

E Reservoir Results for Total Layers

Here we present the reservoir results plotted against the total number of layers (rather than the number of updatable layers) for IWSLT14, WMT16, enwik8 and RoBERTa fine-tuning in Figures 8, 9, 10 and 11, respectively. We show that the same results also hold when normal transformer blocks are replaced with reservoir blocks, at least for MT.

F Validation Plots

Here we present the validation plots for training an 8-layer encoder, 2-layer decoder model on IWSLT14, a 24-layer encoder, 1-layer decoder model on WMT16, a 48-layer decoder model on enwik8, and a 12-layer decoder model for RoBERTa, illustrating the detailed steps used to calculate the AUCC. It can be clearly observed that, given the configurations from Section 3.1, all models have converged. The area under the convergence curve therefore captures the training efficiency of the model (essentially time × performance) until convergence. Specifically, we set T sufficiently high for computing the AUCC: 4h for IWSLT, 20h for WMT, 30h for enwik8, and 60h for RoBERTa pretraining. From the training plots in Figure 12, we can see that each model has converged at that point. The reservoir models in Figure 12 have 2 layers frozen for IWSLT14, 8 layers frozen for enwik8, and 4 layers frozen for WMT16 and RoBERTa.
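To spell out the AUCC computation described above, here is a minimal illustration: given a validation curve of (wall-clock time, metric) pairs, clip it at the cutoff T and integrate. The function, the trapezoidal integration, the time-normalization, and the toy numbers are our own assumptions for illustration; the plots in this paper use the actual settings from Section 3.1 and may normalize differently.

```python
import numpy as np

def aucc(times_h, metric, T):
    """Time-normalized area under the convergence curve up to cutoff T (hours).

    For accuracy-like metrics (BLEU) a higher AUCC is better; for loss-like
    metrics (perplexity, bpc) a lower AUCC is better.
    """
    t = np.asarray(times_h, dtype=float)
    m = np.asarray(metric, dtype=float)
    keep = t <= T
    t, m = t[keep], m[keep]
    if t[-1] < T:                 # hold the last measured value until T
        t = np.append(t, T)
        m = np.append(m, m[-1])
    return np.trapz(m, t) / T

# toy example: validation BLEU measured every half hour, T = 4h (the IWSLT cutoff)
times = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
bleu  = [20.0, 28.0, 31.0, 33.0, 34.0, 34.5, 34.8, 34.9]
print(f"AUCC: {aucc(times, bleu, T=4.0):.2f}")
```

A model that converges faster accumulates more area under the curve within the same budget, which is why AUCC rewards both final quality and wall-clock efficiency.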

Figure 8: Validation BLEU AUCC and test BLEU for IWSLT (high is good), plotted against the total number of encoder layers. Comparison of regular transformer and reservoir transformer with FFN or Transformer reservoir layers added.

Figure 10: Validation BPC AUCC and test BPC on the enwik8 language modelling task (low is good), plotted against the total number of decoder layers. Comparison of regular and reservoir transformers for varying depths.

Figure 9: Validation BLEU AUCC and test BLEU for WMT (high is good), plotted against the total number of encoder layers. Comparison of regular transformer and reservoir transformer with FFN or Transformer reservoir layers added.

Figure 11: Downstream RoBERTa performance (validation accuracy) on SST-2 (left) and MultiNLI-matched (right), plotted against the total number of decoder layers; the comparison includes a frozen finetuned transformer baseline.

35 26 Transformer T Reservoir FFN Reservoir 24 is 4h for IWSLT, 20h for WMT, 30h for enwik8 30 22

25 20 and 60h for RoBERTa pretraning. From the train- 18 20

Validation BLEU Validation BLEU 16

14 15 Transformer ing plot in the appendix, we can see that each T Reservoir 12 FFN Reservoir 10 10 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0 model has converged at that point. The Reser- Training Hours (h) Training Hours (h) Validation curve for training on enwik8 5.0 20 Transformer Transformer T Reservoir T Reservoir 4.5 18 FFN Reservoir FFN Reservoir voir model in Figure 12 has 2 layers frozen for 16 4.0

14 3.5

12 IWSLT14, 8 layers frozen for enwik8, and 4 lay- 3.0 10

2.5 Validation PPL Validation BPC

8 2.0

6 ers frozen for WMT16 and RoBERTa. 1.5 4 1.0 0 5 10 15 20 25 30 35 40 0 12 24 36 48 60 Training Hours (h) Training Hours (h)

G Backskipping Figure 12: IWSLT with 2-layer decoder validation plot Figure 13 shows the BLUE curves for IWSLT (upper left). WMT with 24-layer decoder validation comparing regular vs reservoir vs backskipped plot (upper right). Enwik8 with 48-layer decoder val- idation plot (lower left). RoBERTa with 12-layer de- transformers, with the latter performing surpris- coder validation plot (lower right). ingly well.
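Backskipping aims to avoid the cost of the backward pass through reservoir layers. As a hedged sketch of one way such behaviour could be realized (our own reading, not necessarily the exact procedure used for the backskipped results), the custom autograd function below runs a frozen layer normally in the forward pass but passes the incoming gradient straight through in the backward pass:

```python
import torch

class SkipBackward(torch.autograd.Function):
    """Forward through `layer`, but treat it as the identity on the backward
    pass, so no gradient computation happens inside the frozen layer."""

    @staticmethod
    def forward(ctx, x, layer):
        with torch.no_grad():              # no autograd graph for the reservoir
            return layer(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Gradient w.r.t. x is passed straight through; `layer` gets none.
        # This assumes the layer preserves the shape of x, which holds for
        # transformer blocks.
        return grad_output, None


def backskip(layer, x):
    """Apply `layer` with a skipped backward pass."""
    return SkipBackward.apply(x, layer)


# Hypothetical usage inside an encoder forward pass:
#   for i, layer in enumerate(self.layers):
#       x = backskip(layer, x) if i in self.reservoir_ids else layer(x)
```

Since the reservoir weights are frozen anyway, skipping their backward pass only changes (approximates) the gradient signal reaching the trainable layers below.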


Figure 13: IWSLT comparison of the regular, reservoir and backskipped transformer architectures (encoder has 8 layers with 2 frozen, if any).