Reservoir Transformers


Sheng Shen†, Alexei Baevski‡, Ari S. Morcos‡, Kurt Keutzer†, Michael Auli‡, Douwe Kiela‡
†UC Berkeley; ‡Facebook AI Research
[email protected], [email protected]

arXiv:2012.15045v2 [cs.CL] 1 Jun 2021

Abstract

We demonstrate that transformers obtain impressive performance even when some of the layers are randomly initialized and never updated. Inspired by old and well-established ideas in machine learning, we explore a variety of non-linear "reservoir" layers interspersed with regular transformer layers, and show improvements in wall-clock compute time until convergence, as well as overall performance, on various machine translation and (masked) language modelling tasks.

1 Introduction

Transformers (Vaswani et al., 2017) have dominated natural language processing (NLP) in recent years, from large scale machine translation (Ott et al., 2018) to pre-trained (masked) language modeling (Devlin et al., 2018; Radford et al., 2018), and are becoming more popular in other fields as well, from reinforcement learning (Vinyals et al., 2019) to speech recognition (Baevski et al., 2019) and computer vision (Carion et al., 2020). Their success is enabled in part by ever increasing computational demands, which has naturally led to an increased interest in improving their efficiency. Scalability gains in transformers could facilitate bigger, deeper networks with longer contexts (Kitaev et al., 2020; Wang et al., 2020; Beltagy et al., 2020; Kaplan et al., 2020; Tay et al., 2020b). Conversely, improved efficiency could reduce environmental costs (Strubell et al., 2019) and hopefully help democratize the technology.

In this work, we explore a simple question: if some layers of the transformer are kept frozen, i.e., never updated after random initialization, can we match the performance of fully learned transformers, while being more efficient? Surprisingly, the answer is resoundingly yes; and what is more, we find that freezing layers may actually improve performance.

Beyond desirable efficiency gains, random layers are interesting for several additional reasons. Fixed randomly initialized networks (Gallicchio and Scardapane, 2020) converge to Gaussian processes in the limit of infinite width (Daniely et al., 2016), have intriguing interpretations in metric learning (Rosenfeld and Tsotsos, 2019; Giryes et al., 2016), and have been shown to provide excellent "priors" either for subsequent learning (Ulyanov et al., 2018) or pruning (Frankle and Carbin, 2018). Fixed layers allow for efficient low-cost hardware implementations (Schrauwen et al., 2007) and can be characterized using only a random number generator and its seed. This could facilitate distributed training and enables highly efficient deployment to edge devices, since it only requires transmission of a single number. The strong performance of networks with fixed layers also sheds new light on the inner workings of BERT (Devlin et al., 2018), and layer-wise interpretations of such models (Rogers et al., 2020; Tenney et al., 2019). It appears that "not all layers are created equal" (Zhang et al., 2019) is true to such an extent that some layers can simply remain random and fixed.

Random projections have a long history in machine learning. By Cover's theorem (Cover, 1965), any high-dimensional non-linear transformation is more likely to be linearly separable than its lower-or-equal-dimensional input space. By Johnson-Lindenstrauss (Johnson and Lindenstrauss, 1984), random projections distort Euclidean distances very little under mild assumptions, which is useful e.g. for dimensionality reduction and random indexing (Sahlgren, 2005). Fixed random layers in neural networks pre-date deep learning by far (Gamba et al., 1961; Baum, 1988).
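The Johnson-Lindenstrauss property cited above is easy to check numerically. The following self-contained sketch (our illustration, not code from the paper; the dimensions and point count are arbitrary choices) projects random 1000-dimensional points down to 200 dimensions with a fixed random Gaussian matrix and verifies that pairwise Euclidean distances are barely distorted:

```python
# Toy check of the Johnson-Lindenstrauss intuition: a fixed random Gaussian
# projection from 1000 to 200 dimensions roughly preserves pairwise
# Euclidean distances. All sizes here are illustrative, not from the paper.
import math
import random

random.seed(0)

d_in, d_out, n_points = 1000, 200, 10

# Random points and a fixed random projection, scaled by 1/sqrt(d_out)
# so that expected squared norms are preserved.
points = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(n_points)]
proj = [[random.gauss(0, 1) / math.sqrt(d_out) for _ in range(d_in)]
        for _ in range(d_out)]

def project(x):
    return [sum(r[i] * x[i] for i in range(d_in)) for r in proj]

def dist(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

low = [project(p) for p in points]

# Ratio of projected distance to original distance, for every pair of points.
ratios = [dist(low[i], low[j]) / dist(points[i], points[j])
          for i in range(n_points) for j in range(i + 1, n_points)]
print(min(ratios), max(ratios))  # all ratios cluster near 1
```

Note that the projection matrix here is never trained; like the fixed layers discussed above, it is fully determined by the random seed.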
Indeed, random kernel methods have long been influential in machine learning (Rahimi and Recht, 2008, 2009). One way to think of such layers is as "reservoirs" (Lukoševičius and Jaeger, 2009), where a highly non-linear high-dimensional black box representation is provided to a lightweight "readout" network, as in echo state networks (Jaeger, 2003) and liquid state machines (Maass et al., 2002). The benefit of such an approach is that the reservoir has fixed parameters and is computationally efficient, as it can be pre-computed and does not (necessarily) require backpropagation. In NLP, Wieting and Kiela (2019) showed that random sentence encoders present a strong baseline for text classification, with subsequent work showing applications in a variety of tasks from summarization to machine translation (Enguehard et al., 2019; Garg et al., 2020; Pilault et al., 2020).

To our knowledge, this work is the first to examine this phenomenon in transformers, and the first to recursively alternate reservoirs with subsequent transformer layers acting as readout functions. We introduce "reservoir transformers", wherein fixed random reservoir layers are interspersed with regular updateable transformer layers. The goal of this work is to put our understanding of transformer models on a more solid footing by providing empirical evidence of their capabilities even when some of their parameters are fixed. Our contributions are as follows:

• We introduce an area under the convergence curve metric for measuring performance-efficiency trade-offs, and show that replacing regular transformer layers with reservoir layers leads to improvements.

• We show that the addition of reservoir layers leads to improved test set generalization on a variety of tasks in a variety of settings.

• We show that pre-trained masked language modelling architectures like BERT and RoBERTa (Liu et al., 2019) can benefit from having some of their layers frozen, both during pre-training as well as when fine-tuning on downstream tasks.

• We experiment with different types of reservoir layers, including convolutional and recurrent neural network-based ones.

• We show empirical evidence that the backward pass can be skipped in its entirety by approximating upstream gradients using an approach we call backskipping, which can reduce the training compute further without sacrificing performance.

2 Approach

This paper is based on a very simple idea. Neural networks are trained via backpropagation, which involves consecutive steps of matrix addition and multiplication, i.e.,

    θ_{t+1} = θ_t − η ∂J/∂θ_t ;   ∂J/∂θ_t = (∂J/∂L_n)(∂L_n/∂L_{n−1}) ⋯ (∂L_0/∂x)

for some objective J, parameterization θ and learning rate η, with the gradient computed via the chain rule, where L_i is the i-th layer of the neural network and x is the input. Let L = Transformer(X) be a single layer in a Transformer network (Vaswani et al., 2017), i.e.,

    H = MultiHeadSelfAttn(LayerNorm(X)) + X
    L = FFN(LayerNorm(H)) + H

Now, during every "backward pass", we compute the Jacobian for parameters θ^L at layer L, which are used to update the parameters of L, θ_t^L, as well as to compute the next layer's Jacobian, thus back-propagating the gradients. In this work however, for some of the layers, we still backpropagate through them to compute gradients for earlier layers, but we never apply the parameter update. As a result, these layers stay fixed at their initialization, saving computational resources.

2.1 Background

Naturally, never updating some of the parameters is computationally more efficient, as some matrix addition operations can be skipped in the backward pass, but why is this not detrimental to the performance of the network?

In the early days of neural networks, the bottom layers were often kept fixed as "associators" (Block, 1962), or what Minsky and Papert (2017) called the Gamba perceptron (Gamba et al., 1961; Borsellino and Gamba, 1961). Fixed random networks (Baum, 1988; Schmidt et al., 1992; Pao et al., 1994) have been explored from many angles, including as "random kitchen sink" kernel machines (Rahimi and Recht, 2008, 2009), "extreme learning machines" (Huang et al., 2006) and reservoir computing (Jaeger, 2003; Maass et al., 2002; Lukoševičius and Jaeger, 2009). In reservoir computing, input data are represented through fixed random high-dimensional non-linear representations, called "reservoirs", which are followed by a regular (often but not necessarily linear) "readout" network to make the final classification decision.

The theoretical justification for these approaches lies in two well-known results in machine learning: Cover's theorem (Cover, 1965) on the separability of patterns states that high-dimensional non-linear transformations are more likely to be linearly separable; and the Johnson-Lindenstrauss lemma (Johnson and Lindenstrauss, 1984) shows that (most) random projections distort Euclidean distances very little.

Practically, random layers can be seen as a cheap way to increase network depth. There are interesting advantages to this approach. Fixed layers are known to have particularly low-cost hardware requirements and can be easily implemented on high-bandwidth FPGAs with low power consumption (Hadaeghi et al., 2017; Tanaka et al., 2019), or on optical devices (Hicke et al., 2013).

[…] reservoir computing, most of which builds on recurrent neural network architectures.

• CNN Reservoir: A fixed Convolutional Neural Network (LeCun et al., 1998) layer, specifically light dynamical convolution layers (Wu et al., 2019), which are known to be competitive with transformers in sequence-to-sequence tasks.

We find that all these approaches work well, to a certain extent. For clarity, we focus primarily on the first two reservoir layers, but include a broader comparison in Appendix A.

In each case, contrary to traditional reservoir computing, our reservoir layers are interspersed throughout a regular transformer network, or what we call a reservoir transformer. Since random projections are not learned and might introduce noise, subsequent normal transformer "readout" layers might be able to benefit from additional depth while allowing us to recover from any adverse effects of randomness. For example, previous work has shown that ResNets, with all of their parameters fixed except for the scale and shift parameters
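The core mechanism described in the approach above, backpropagating through a frozen layer while never applying its parameter update, can be sketched in miniature. The toy example below (our illustration, not the authors' implementation; the tiny scalar network, targets, and hyperparameters are invented for the demo) freezes a middle parameter w2 playing the role of a reservoir layer. Gradients still flow through it, so the earlier parameter w1 and the readout parameter w3 are trained, while w2 stays exactly at its initialization:

```python
# Minimal frozen-layer training sketch (our toy, not the paper's code).
# Model: y = w3 * tanh(w2 * (w1 * x)); w2 is frozen, w1 and w3 are trained.
import math
import random

random.seed(1)

w1, w2, w3 = 0.5, 1.7, -0.3   # w2 acts as the fixed "reservoir" parameter
w2_init = w2
lr = 0.05

# Toy regression targets the model can represent (e.g. with w1*w2 ≈ 2).
data = [(x, math.tanh(2.0 * x)) for x in [-1.0, -0.5, 0.2, 0.8, 1.0]]

def loss_on(dataset):
    return sum((w3 * math.tanh(w2 * (w1 * x)) - t) ** 2 for x, t in dataset)

start_loss = loss_on(data)
for _ in range(200):
    for x, t in data:
        h1 = w1 * x
        h2 = math.tanh(w2 * h1)
        y = w3 * h2
        dy = 2 * (y - t)
        # Gradient for the trainable readout layer.
        dw3 = dy * h2
        # Backpropagate THROUGH the frozen layer: its local Jacobian is
        # used to reach the earlier layer...
        dh1 = dy * w3 * w2 * (1 - h2 ** 2)
        # ...but no update is ever applied to w2 itself.
        dw1 = dh1 * x
        w1 -= lr * dw1
        w3 -= lr * dw3

end_loss = loss_on(data)
print(start_loss, end_loss, w2 == w2_init)  # loss drops; frozen w2 unchanged
```

In a deep-learning framework this corresponds to excluding the frozen layers' parameters from the optimizer (skipping their update step) while leaving them on the backward path, which is exactly the saving exploited here: the gradient still propagates, but the update arithmetic for those layers is skipped.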