
Université de Montréal

A deep learning theory for neural networks grounded in physics

by

Benjamin Scellier

Département d'informatique et de recherche opérationnelle
Faculté des arts et des sciences

Thesis presented for the degree of Philosophiæ Doctor (Ph.D.) in Computer Science

December 31, 2020

© Benjamin Scellier, 2020

Université de Montréal
Faculté des arts et des sciences

This thesis, entitled

A deep learning theory for neural networks grounded in physics

presented by

Benjamin Scellier

was evaluated by a jury composed of the following members:

Irina Rish (chair and rapporteur)

Yoshua Bengio (research advisor)

Pierre-Luc Bacon (jury member)

Yann Ollivier (external examiner)

(representative of the FESP dean)

Abstract

In the last decade, deep learning has become a major component of artificial intelligence, leading to a series of breakthroughs across a wide variety of domains. The workhorse of deep learning is the optimization of loss functions by stochastic gradient descent (SGD). Traditionally in deep learning, neural networks are differentiable mathematical functions, and the loss gradients required for SGD are computed with the backpropagation algorithm. However, the computer architectures on which these neural networks are implemented and trained suffer from speed and energy inefficiency issues, due to the separation of memory and processing in these architectures. To solve these problems, the field of neuromorphic computing aims at implementing neural networks on hardware architectures that merge memory and processing, just like brains do. In this thesis, we argue that building large, fast and efficient neural networks on neuromorphic architectures also requires rethinking the algorithms to implement and train them. We present an alternative mathematical framework, also compatible with SGD, which offers the possibility to design neural networks in substrates that directly exploit the laws of physics. Our framework applies to a very broad class of models, namely those whose state or dynamics are described by variational equations. This includes physical systems whose equilibrium state minimizes an energy function, and physical systems whose trajectory minimizes an action functional (principle of least action). We present a simple procedure to compute the loss gradients in such systems, called equilibrium propagation (EqProp), which requires solely locally available information for each trainable parameter. Since many models in physics and engineering can be described by variational principles, our framework has the potential to be applied to a broad variety of physical systems, whose applications extend to various fields of engineering, beyond neuromorphic computing.

Keywords: deep learning, physical learning, equilibrium propagation, energy-based model, variational principle, principle of least action, local learning rule, stochastic gradient descent, Hopfield networks, resistive networks, circuit theory, principle of minimum dissipated power, co-content, neuromorphic computing


Contents

Abstract
List of Tables
List of Figures
Abbreviations List
Acknowledgements

Chapter 1. Introduction
1.1. On Artificial Intelligence
1.1.1. Human Intelligence as a Benchmark
1.1.2. Machine Learning Basics
1.2. Neural Networks
1.2.1. Neuroscience Basics
1.2.2. Artificial Neural Networks
1.2.3. Energy-Based Models vs Differentiable Neural Networks
1.2.4. Stochastic Gradient Descent
1.2.5. Landscape of Loss Functions
1.2.6. Deep Learning Revolution
1.2.7. Graphics Processing Units
1.2.8. The Von Neumann Bottleneck
1.2.9. In-Memory Computing
1.2.10. Challenges of Analog Computing
1.3. A Deep Learning Theory for Neural Networks Grounded in Physics
1.3.1. Physical Systems as Deep Learning Models
1.3.2. Variational Principles of Physics as First Principles
1.3.3. Universality of Variational Principles in Physics
1.3.4. Rethinking the Notion of Computation
1.3.5. A Novel Differentiation Method Compatible with Variational Principles
1.4. Overview of the Manuscript and Link to Prior Works

Chapter 2. Equilibrium Propagation: A Learning Algorithm for Systems Described by Variational Equations
2.1. Stochastic Gradient Descent
2.2. Energy-Based Models
2.3. Gradient Formula
2.4. Equilibrium Propagation
2.5. Examples of Sum-Separable Energy-Based Models
2.6. Fundamental Lemma
2.7. Remarks

Chapter 3. Training Continuous Hopfield Networks with Equilibrium Propagation
3.1. Gradient Systems
3.1.1. Gradient Systems as Energy-Based Models
3.1.2. Training Gradient Systems with Equilibrium Propagation
3.1.3. Transient Dynamics
3.1.4. Recurrent Backpropagation
3.2. Continuous Hopfield Networks
3.2.1. Hopfield Energy
3.2.2. Training Continuous Hopfield Networks with Equilibrium Propagation
3.2.3. 'Backpropagation' of Error Signals
3.3. Numerical Experiments on MNIST
3.3.1. Implementation Details
3.3.2. Experimental Results
3.4. Contrastive Hebbian Learning (CHL)
3.4.1. Contrastive Hebbian Learning in the Continuous Hopfield Model
3.4.2. An Intuition Behind Contrastive Hebbian Learning
3.4.3. A for Contrastive Hebbian Learning

Chapter 4. Training Nonlinear Resistive Networks with Equilibrium Propagation
4.1. Nonlinear Resistive Networks as Analog Neural Networks
4.2. Nonlinear Resistive Networks are Energy-Based Models
4.2.1. Linear Resistance Networks
4.2.2. Two-Terminal Resistive Elements
4.2.3. Nonlinear Resistive Networks
4.3. Training Nonlinear Resistive Networks with Equilibrium Propagation
4.3.1. Setting
4.3.2. Training Procedure
4.3.3. On the Loss Gradient Estimates
4.4. Example of a Deep Analog Neural Network Architecture
4.4.1. Antiparallel Diodes
4.4.2. Bidirectional Amplifiers
4.4.3. Positive Weights
4.4.4. Current Sources
4.5. Numerical Simulations on MNIST

Chapter 5. Training Discrete-Time Neural Network Models with Equilibrium Propagation
5.1. Discrete-Time Dynamical Systems with Static Input
5.1.1. Primitive Function
5.1.2. Training Discrete-Time Dynamical Systems with Equilibrium Propagation
5.1.3. Recovering Gradient Systems
5.2. RNN Models with Static Input
5.2.1. Fully Connected Layers
5.2.2. Convolutional Layers
5.2.3. Squared Error
5.2.4. Cross-Entropy
5.3. Experiments on MNIST and CIFAR-10
5.3.1. Challenges with EqProp Training
5.4. Gradient Descending Dynamics (GDD)
5.4.1. Transient Dynamics
5.4.2. Backpropagation Through Time

Chapter 6. Extensions of Equilibrium Propagation
6.1. Equilibrium Propagation in Dynamical Systems with Time-Varying Inputs
6.1.1. Lagrangian-Based Models
6.1.2. Gradient Formula
6.1.3. Training Sum-Separable Lagrangian-Based Models
6.1.4. From Energy-Based to Lagrangian-Based Models
6.1.5. Lagrangian-Based Models Include Energy-Based Models
6.2. Equilibrium Propagation in Stochastic Systems
6.2.1. From Deterministic to Stochastic Systems
6.2.2. Gradient Formula
6.2.3. Langevin Dynamics
6.2.4. Equilibrium Propagation in Langevin Dynamics
6.3. Contrastive Meta-Learning
6.3.1. Meta-Learning and Few-Shot Learning
6.3.2. Contrastive Meta-Learning

Chapter 7. Conclusion
7.1. Implications for Neuromorphic Computing
7.2. Implications for Neuroscience
7.2.1. Variational Formulations of Neural Computation?
7.2.2. SGD Hypothesis of Learning
7.2.3. The Role of Evolution
7.3. Synergy Between Neuroscience and AI

References

Appendix A. Gradient Estimators

List of Tables

1 Experimental results of Scellier and Bengio [2017] on deep Hopfield networks trained on MNIST
2 Experimental results of Ernoult et al. [2019] on discrete-time neural network models trained on MNIST
3 Experimental results of Laborieux et al. [2021] on ConvNets trained on CIFAR-10


List of Figures

1 Picture of a car and the corresponding representation in computer language
2 Schema of a neuron
3 Schema of a synapse
4 Illustration of Theorem 3.1 on a toy example
5 A deep Hopfield network (DHN)
6 Illustration of the principle of minimum dissipated power in a linear resistance network
7 Example of a deep analog neural network
8 Illustration of the gradient-descending dynamics (GDD) property


Abbreviations List

BPTT Backpropagation Through Time

CHL Contrastive Hebbian Learning

CIFAR-10 A dataset of images of animals and objects (a standard benchmark in machine learning)

ConvNet Convolutional Network

DHN Deep Hopfield Network

EBM Energy-Based Model

EqProp Equilibrium Propagation

GDD Gradient Descending Dynamics, a property that relates the dynamics of EqProp to the loss gradients, in the discrete-time setting of Chapter 5

GPU Graphics Processing Unit

LBM Lagrangian-Based Model

MNIST A dataset of images of handwritten digits (a standard benchmark in machine learning)

RBP Recurrent Back-Propagation

SGD Stochastic Gradient Descent

To the memory of my mother


Acknowledgements

I had the privilege to do my doctoral studies at Mila and to conduct research in the blooming field of neural networks, on a topic that I am extremely passionate about. The end of my program marks the end of a chapter, and it is an opportunity for me to reflect on what I have learned and how I have grown thanks to the people who have gone through this journey with me. First of all, I want to thank my advisor for being a great mentor and for sharing with me his passion for neural network research and the 'science of intelligence' in general (both 'artificial' and 'natural'). I am thankful for the freedom he granted me in my work, and for his guidance, his contributions and great insights, as well as his support throughout my PhD program. More than just a scientific advisor, Yoshua is a friend and a truly inspiring person who constantly takes action for the common interests of society. I would like to thank all my collaborators for their enthusiasm and contributions. Research is often exciting, but can be discouraging sometimes; nevertheless, collaborating with them was always truly enjoyable. I would like to thank my friend Maxence Ernoult, whom I had the privilege to meet at a conference, after which we started an extremely fruitful collaboration, together with Axel Laborieux, Julie Grollier and Damien Querlioz. I would also like to thank Jack Kendall for collaborating together with his colleagues Ross Pantone and Kalpana Manickavasagam, as well as my friends and colleagues João Sacramento, Olexa Bilaniuk, Walter Senn, Anirudh Goyal, Jonathan Binas and Thomas Mesnard for our collaborations, and for all the great and inspiring scientific discussions. I thank Wulfram Gerstner, Walter Senn and Alexandre Thierry for inviting me to their research groups, as well as the organizers of the Barbados workshop on 'learning in the neocortex' – Blake Richards, Timothy Lillicrap, Konrad Körding and Denis Therien. I also thank the organizers of the Brains, Minds and Machines summer school, and the organizers of the Cellular, Computational and Cognitive Neuroscience summer school. These visits, workshops and summer schools were precious opportunities for me to further expand my knowledge in other scientific disciplines, and I am grateful for the friends and colleagues that I met during these events, for the inspiring conversations that we had and the moments that we shared together. I also thank the members of my thesis jury, Yann Ollivier, Pierre-Luc Bacon

and Irina Rish, for their time and their insightful comments on my manuscript, as well as Bertrand Maubert, João Sacramento and Nicolas Zucchet for valuable feedback. I would also like to thank all my friends in Montreal, Singapore, Switzerland and elsewhere, for all the wonderful moments that we have spent together, which brightened the journey of my doctoral studies. A special mention to the 'friends of Normanton' for all the memories and moments of craziness. I would also like to thank all of those who were present and gave me their invaluable support during the most difficult moments of my PhD program. Finally, I thank my family members for their unconditional support. I am thankful to my parents and brother, who always supported me and gave me the freedom to pursue things that I felt like doing. Last but not least, I would like to thank my girlfriend Mannie for her support, her patience, and her love. I would like to dedicate my thesis to my mother, who passed away during the course of my PhD program. She has been a wonderful mother, always giving me support in all possible ways she could. She always encouraged me in pursuing my dreams, and she followed me around the world, even in the most difficult moments of her illness. She was really brave, and she attracted people's sympathy so naturally by her generosity and care. She inspired me so much and I have learned so much through her. I am extremely grateful that I had a mother like her.

Chapter 1

Introduction

What is intelligence? Are there general principles from which every aspect of intelligence derives? If yes, can we discover these principles, formulate them in mathematical language, and use them to build machines that possess human-like intelligence? And if we managed to do so, would this teach us something about ourselves and the human mind? These are some of the fascinating questions at the interface of several disciplines of science, engineering and philosophy that motivated me to pursue a PhD in artificial intelligence.

In this introductory chapter, we start by motivating the brain-inspired approach to artificial intelligence (AI). We then review the most important principles of current AI systems. Then, we point out one of their weaknesses: the energy inefficiency of their current implementation in hardware. Finally, we present a novel mathematical framework which allows us to preserve the core principles of current AI systems, while suggesting a path to implement them in substrates that directly exploit physics to do the computations for us. In the long run, this mathematical framework may help us develop AI systems that are much larger, much faster and much more efficient than those that we use today.

1.1. On Artificial Intelligence

Computers have long been able to surpass humans at tasks that we like to think of as intellectually advanced. For example, in 1997, Deep Blue, a computer program created by IBM, beat the world champion Garry Kasparov at the game of chess [Campbell et al., 2002]. While this was an impressive achievement, the form of intelligence that we may want to assign to Deep Blue is very rudimentary. The strategy of Deep Blue consists in analyzing essentially every possible scenario to pick the move leading to the best possible outcome. The state of technology at the time made it possible, with enough computing power, to automate this brute-force strategy. Humans, on the other hand, are not able to analyze every possible scenario at the game of chess in a reasonable amount of time. Our brains haven't evolved to do that. Instead, we develop intuitions about what the most promising moves are. Developing the right intuitions is what makes the game of chess challenging for us.

Conversely, there are plenty of tasks that humans (and other animals) do so naturally and effortlessly that it can be hard to appreciate the difficulty of building machines to automate them. Consider for example the task of classifying images of cats and dogs. Although we now have computer programs that can classify images fairly reliably, it is only in the past ten years that we have seen impressive improvements in this area. No one is said to be 'intelligent' for being able to tell cats apart from dogs, given that a two-year-old can already do this. So why was it so difficult to design programs to classify images? Solving this task is in fact far more complex than it seems. When we see something, myriads of calculations are performed continuously and automatically in our brains. These calculations happen 'behind the scenes', unconsciously, until the concept of 'cat' or 'dog' pops up to our consciousness. We take for granted the fact that our brains do all these calculations for us, every second of every day. Perhaps one way to appreciate the difficulty that it represents for a programmer to write a program to classify images is to keep in mind that when our eyes see a cat, the computer program sees a bunch of numbers corresponding to pixel intensities (Fig. 1). The difficulty for the programmer thus resides in making sense of and handling these numbers in such a way that they produce the answer 'cat'. Similarly, the human brain is able to learn to hear and recognize sounds, to learn to process the sense of touch, etc. We can perceive the world around us, make sense of it and interact with it like no computer or machine can do today. Undoubtedly, humans (and other animals) are in many ways incredibly smarter than machines today.

Fig. 1. Picture of a car, and the corresponding representation in computer language. Each number represents the intensity of a pixel. The task of image classification consists in making sense of these numbers to produce the answer 'car'. Figure credits: Ng [2014].

1.1.1. Human Intelligence as a Benchmark

Artificial intelligence (AI) emerged as a research discipline in the 1950s from the idea that every aspect of human intelligence can in principle be discovered, understood, and built into machines. The idea of AI started after Alan Turing formalized the notion of computation and began to study how computers can solve problems on their own [Turing, 1950], but the term artificial intelligence was first coined by John McCarthy in 1956. Today, the term AI is usually used in a broader sense, to refer to the field of study concerned with designing computational systems to solve practical tasks that we want to automate, whether or not such tasks require some form of human-like intelligence, and whether or not such computational systems are inspired by the brain. Nevertheless, human intelligence is a natural benchmark for us to build 'intelligent' machines. This is an arbitrary choice, and by no means implies that we humans should be regarded as the 'perfect' or 'ultimate' form of intelligence. But, until we are able to build machines that can do all the incredible things that we humans can do, as easily and as effortlessly as we do them, it seems rather natural to take human intelligence as a benchmark. In the quest to build intelligent machines, Turing proposed that the goal would be reached when our machines can exhibit intelligent behaviours indistinguishable from those of humans.

1.1.2. Machine Learning Basics

Consider the task of classifying images of cats and dogs mentioned earlier. Say that each image is made of 1000 by 1000 pixels, each pixel being described by three numbers (in the RGB color representation). Thus, each image can be represented by a vector of three million numbers. The goal is to come up with a program which, given such a vector x as input, produces 'dog' or 'cat' as output, accordingly. Because there exist many very different vectors x associated to the concept of 'dog', there is no obvious, simple and reliable rule to recognize a dog. To solve this task, the program must combine a very large number of 'weak rules'. Early forms of AI consisted of explicit, manually crafted rules, e.g. based on formal logic. However, using this methodology to figure out all the weak rules that are necessary to correctly classify images is a really daunting task, given the complexity of real-world images. One of the key features of the brain, which these traditional programs did not have, is its ability to learn from experience and to adapt to the environment. Arthur Samuel introduced a new approach to AI, called machine learning (ML) [Samuel, 1959], that takes inspiration from how we learn. In the ML approach, instead of operating with predetermined (i.e. immutable) instructions, the program is made of flexible rules that depend on adjustable

parameters. As we modify the parameters, the program changes. The goal is then to tune these parameters so that the resulting program solves the task we want. To solve the task of image classification, the ML approach requires collecting lots of examples that specify the correct output (the label) for a given input (the image). Such a collection of examples is called a dataset. Then we use these examples to adjust the parameters of the ML program so that, for each input image, the program produces as output the label associated to that image. Such a procedure to adjust the parameters is called a learning algorithm: the ML program learns from examples to solve the problem. Once trained, the program obtained can then be used to predict outputs for new unseen inputs. The performance of the program is assessed on a separate set of examples called the test dataset.

In the setting of image classification, the data is such that the labels are provided together with the corresponding images. This setting, where the expected result is known in advance for the available data, is called supervised learning. This type of learning is currently the most widely used and successful approach to ML. Depending on the task that we want to solve, and the type and the amount of data that is available, there are two other main machine learning paradigms for training an ML program: unsupervised learning and reinforcement learning. Unsupervised learning refers to data for which no explicitly identified labels exist. Reinforcement learning refers to the case where no exact labels exist, but a scalar value is available (usually called 'reward') that provides some knowledge on whether a proposed output is good or bad.

1.2. Neural Networks

Artificial neural networks (ANNs) are a family of ML models inspired by the basic working mechanisms of the brain. In recent years, ANNs have had resounding success in AI, in areas as diverse as image recognition, image generation, text generation and machine language translation. We start this section by presenting the basic concepts of neuroscience that have inspired the design of ANNs. Then we present the key principles at the heart of these ANNs. Finally, we point out some weaknesses in the current implementation of these principles in hardware, which make these neural networks orders of magnitude less energy efficient than brains. Subsection 1.2.1 is inspired by Dehaene [2020, Chapter 5].

1.2.1. Neuroscience Basics

The foundations of modern neuroscience were laid by Santiago Ramon y Cajal, several decades before AI research started. Cajal was the first to observe the brain's micro-organisation with a microscope. He observed that the brain consists of disjoint nerve cells

(the neurons), not of a continuous network as the proponents of the reticular theory thought before him. Neurons have a very particular shape. Each neuron is composed of three main parts (Figure 2): a large 'tree' composed of thousands of branches (the dendrites¹), a cell body (also called the soma), and a long fiber which extends out of the cell body towards other neurons (the axon). A neuron collects information from other neurons through its dendritic tree. The messages collected in the dendrites converge to the cell body, where they are compiled. After compilation, the neuron sends a unique message, called action potential (or spike), which is carried along its axon away from the cell body. In turn, this message is delivered to other neurons.

Fig. 2. Schema of a neuron².

While neurons are distinct cells, they come into contact at certain points called synapses (Figure 3). Synapses are junction zones through which neurons communicate. Specifically, each synapse is the point of contact between the axon of a neuron (called the pre-synaptic neuron) and the dendrite of another neuron (called the post-synaptic neuron). The message traveling through the axon of the pre-synaptic neuron is electrical, but the synapse turns it into a chemical message. The axon terminal of the pre-synaptic neuron contains small pockets (the vesicles) filled with molecules (the neurotransmitters). When the electrical signal reaches the axon terminal, the vesicles open and the neurotransmitters flow into the small synaptic gap between the two neurons. The neurotransmitters then bind with the membrane of the post-synaptic neuron at specific points (the receptors). A neurotransmitter acts on a receptor like a key in a lock: it opens 'gates' (called channels) in the post-synaptic membrane. As a result, ions flow from the extra-cellular fluid through these channels and generate a current in the post-synaptic neuron. To sum up, the message coming from the pre-synaptic neuron went from electrical to chemical, back to electrical, and in the process, the message was transmitted to the post-synaptic neuron.

Each synapse is a chemical factory in which numerous elements can be modified: the number of vesicles and their size, the number of receptors and their efficacy, as well as the

¹ In Greek, the word dendron means tree.
² https://towardsdatascience.com/a-gentle-introduction-to-neural-networks-series-part-1-2b90b87795bc

size and the shape of the synapse itself. All these elements affect the strength with which the pre-synaptic electrical message is transmitted to the post-synaptic neuron. Synapses are constantly modified, and these modifications reflect what we learn.

Fig. 3. Schema of a synapse³.

The human brain is composed of around 100 billion ($10^{11}$) neurons, interconnected by a total of around a quadrillion ($10^{15}$) synapses. The brain is a huge parallel computer: in this incredibly complex machine, all the synapses work in parallel – like independent nanoprocessors – to process the messages sent between the neurons. Besides, synapses are modified in response to experience, and in turn these modifications alter our behaviours. Thus, synapses are both the computing units and the memory units of the brain. For every task we do, all our thoughts, memories and all our behaviours emerge from the neural activity generated by this machinery. One of the fundamental questions of neuroscience is that of figuring out the learning algorithms of the brain: what is the set of rules which translate the experiences we have into synaptic changes, and how do these synaptic changes modify our behaviour? Understanding the brain's learning algorithms is not only key to understanding the biological basis of intelligence, but would also unlock the development of truly intelligent machines.

1.2.2. Artificial Neural Networks

Artificial neural networks are ML models that draw inspiration from real brains. Artificial neurons imitate the functionality of biological neurons. These models are highly simplified: they keep some essential ideas from real neurons and synapses but they discard many details

³ https://thesalience.wordpress.com/neuroscience/the-chemical-synapse/chemical-synapses/

of their working mechanisms. The first neuron model was introduced by McCulloch and Pitts [1943], but the idea to use such artificial neurons in machine learning was proposed by Rosenblatt [1958] and Widrow and Hoff [1960]. Artificial neurons used today in deep learning are essentially unchanged and rely on the same basic algebra.

Each neuron $i$ is described by a single number $y_i$. This number can be thought of as the firing rate of neuron $i$, that is, the rate of spikes sent along its axon. Each synapse is also described by a single number, representing its strength. The strength of the synapse connecting pre-synaptic neuron $j$ to post-synaptic neuron $i$ is denoted $W_{ij}$. These artificial synapses can transmit signals with different efficacies depending on their strength. The neuron calculates a nonlinear function of the weighted sum of its inputs:
$$y_i = \sigma\Big( \sum_j W_{ij}\, y_j \Big). \qquad (1.1)$$
The pre-activation $x_i = \sum_j W_{ij}\, y_j$ is a weighted sum of the messages received from other neurons, weighted by the corresponding synaptic strengths; $x_i$ can be thought of as the membrane voltage of neuron $i$. $\sigma$ is a function called an activation function, which maps $x_i$ onto the firing rate $y_i$.

Such artificial neurons can be combined to form an artificial neural network (ANN). Each neuron in the network receives messages from other neurons (the $y_j$'s), compiles them ($x_i$), and sends in turn a message to other neurons ($y_i$). Thus, a network of interconnected neurons exploits the composition of many elementary operations to form more complex computations. The synaptic strengths (the $W_{ij}$'s), also called weights, play the role of adjustable parameters that parameterize this computation.

Deep learning refers to ANNs composed of multiple layers of neurons [LeCun et al., 2015, Goodfellow et al., 2016]. These deep neural networks were inspired by the structure of the visual cortex in the brain, with each layer corresponding to a different brain region. One of the core ideas of neural networks is that of distributed representations: the idea that the vector of neurons' states can represent abstract concepts, in opposition to other approaches to AI that use discrete symbols to represent concepts. In a deep network, each layer of neurons applies specialized operations and transformations to its inputs, with the intuition that each layer builds up more abstract concepts than the previous one [Bengio, 2009].
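To make Eq. 1.1 concrete, here is a minimal sketch (not from the thesis) of a tiny two-layer network computing an output from an input vector. The layer sizes, the random weights, the use of NumPy, and the choice of tanh as the activation function $\sigma$ are illustrative assumptions, not prescriptions of the manuscript.

```python
import numpy as np

def layer(W, y_prev):
    # Eq. 1.1 applied to a whole layer of neurons: y_i = sigma(sum_j W_ij * y_j),
    # with sigma = tanh as an example activation function
    return np.tanh(W @ y_prev)

rng = np.random.default_rng(0)
x = rng.normal(size=4)             # input vector (e.g. pixel intensities)
W1 = rng.normal(size=(8, 4))       # synaptic weights of the first layer
W2 = rng.normal(size=(2, 8))       # synaptic weights of the second layer
output = layer(W2, layer(W1, x))   # composition of elementary operations
print(output)
```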

1.2.3. Energy-Based Models vs Differentiable Neural Networks

Several families of neural networks emerged in the 1980s. One of these families is that of energy-based models, which includes the Hopfield network [Hopfield, 1982] and the Boltzmann machine [Ackley et al., 1985]. In these models, under the assumption that the synaptic

weights are symmetric (i.e. $W_{ij} = W_{ji}$ for every pair of neurons $i$ and $j$), the dynamics of the network converges to an equilibrium state, after iterating Eq. 1.1 a large number of times for every neuron $i$. Because of the large number of iterations required, these models tend to be slow. This is one of the reasons why these neural networks have been mostly abandoned today. However, by reinterpreting the equilibrium equation of energy-based models as a variational principle of physics, I believe that these models could be the basis of a new generation of fast, efficient and scalable neural networks grounded in physics. We will come back to this point later in the discussion (Section 1.3).

The family of neural networks that is at the heart of the on-going deep learning revolution is that of differentiable neural networks, which became popular thanks to the discovery of the backpropagation algorithm to train them [Rumelhart et al., 1988]. In such neural networks, each operation in the process of computation is differentiable. The earliest models of this kind were feedforward neural networks (e.g. the multi-layer perceptron), wherein the connections between the neurons do not form loops. Recurrent neural networks can also be cast into this category of differentiable neural networks, by unfolding the graph of their computations in time. Since their inception, differentiable neural networks have come a long way. Many novel architectures have been introduced, in particular: convolutional neural networks [Fukushima, 1980, LeCun et al., 1989], Long Short-Term Memory [Hochreiter and Schmidhuber, 1997, Graves, 2013], and attention mechanisms [Bahdanau et al., 2014, Vaswani et al., 2017].
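Returning to the energy-based models mentioned at the start of this subsection, here is a minimal sketch (not from the thesis) of the convergence to equilibrium obtained by iterating Eq. 1.1 with symmetric weights. The network size, the small weight scale (chosen so that the iteration is a contraction), the tanh nonlinearity and the extra external-input term b are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
W = rng.normal(scale=0.1, size=(n, n))
W = (W + W.T) / 2                  # symmetric weights: W_ij = W_ji
np.fill_diagonal(W, 0.0)           # no self-connections
b = rng.normal(size=n)             # fixed external input, so the fixed point is not trivially zero

y = np.zeros(n)
for step in range(1000):           # iterate Eq. 1.1 for every neuron until the state stops changing
    y_new = np.tanh(W @ y + b)
    if np.max(np.abs(y_new - y)) < 1e-10:
        break
    y = y_new
print(step, y)                     # equilibrium state reached after a number of iterations
```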

1.2.4. Stochastic Gradient Descent

The computations performed by a neural network are parameterized by its synaptic weights. The goal is to find weight values for which the computations of the neural network solve the task of interest. One essential idea of machine learning is to introduce a loss function, which provides a numerical measure of how good or bad the computations of the model are, with respect to the task that we want to solve. The goal is then to minimize the loss function with respect to the model weights. For example, in image classification, the computations of the model produce an output which represents a ‘guess’ for the class of that image, and the loss provides a graded measure of ‘wrongness’ between that guess and the actual image label. A smaller value of the loss function means that the model produces an output closer to the desired target. The loss function is minimal when the output is equal to the desired target. One of the most important ideas of deep learning today is stochastic gradient descent (SGD). Provided that the loss function is differentiable with respect to the network weights, we can use the gradient of the loss function to indicate the direction of the minimum of this function. SGD consists in taking examples from the training set one at a time, and adjusting the network weights iteratively in proportion to the negative of the gradient of the

loss function. At each iteration, the network performance (as measured by the loss value) slightly improves.

A key discovery that has greatly eased and accelerated deep learning research is the following. Given a computer program that computes a differentiable scalar function $f : \mathbb{R}^n \to \mathbb{R}$, it is possible to automatically transform the program into another program that computes the gradient operator $\nabla f : \mathbb{R}^n \to \mathbb{R}^n$. The gradient $\nabla f(\theta)$ can then be evaluated at any given point $\theta \in \mathbb{R}^n$, with a computational overhead that scales linearly with the complexity of the original program. This technique, known as reverse-mode automatic differentiation [Speelpenning, 1980], provides a general framework for 'backpropagating loss gradients' [Rumelhart et al., 1988] in any differentiable computational graph. In the last decade, dozens of deep learning frameworks and libraries have been developed, which exploit reverse-mode automatic differentiation to compute gradients in arbitrary differentiable computational graphs. This includes Theano [Bergstra et al., 2010] – a framework that was developed at Université de Montréal – and the more recent TensorFlow [Abadi et al., 2016] and PyTorch [Paszke et al., 2017] frameworks. The emergence of these deep learning frameworks has considerably accelerated deep learning research, by enabling researchers to quickly design neural network architectures and train them by SGD. Thanks to these frameworks, deep learning researchers can explore the space of differentiable computational graphs much more rapidly, as they seek novel and more effective neural architectures.
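As an illustration of reverse-mode automatic differentiation in one of these frameworks, here is a minimal sketch using PyTorch. The toy scalar function, the learning rate of 0.1 and the parameter dimension are arbitrary choices made for the example, not part of the thesis.

```python
import torch

theta = torch.randn(3, requires_grad=True)   # adjustable parameters

def f(theta):
    # an arbitrary differentiable scalar function f : R^n -> R, standing in for a loss
    return (theta ** 2).sum()

loss = f(theta)
loss.backward()                               # reverse-mode autodiff: fills theta.grad with the gradient of f at theta
with torch.no_grad():
    theta -= 0.1 * theta.grad                 # one gradient descent step on the parameters
    theta.grad.zero_()                        # clear the gradient before the next iteration
```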

1.2.5. Landscape of Loss Functions

It was not trivial to discover that deep neural networks can be trained at all. Until the seminal work of Hinton et al. [2006], the common belief was that neural networks with more than two layers were essentially impossible to train. In particular, because the landscape of the loss function associated to a deep network is typically highly non-convex, a common misconception was that gradient-descent methods would likely get stuck at bad local minima. In terms of generalization performance, the large over-parameterization of neural networks also went against general prescriptions from classical statistics and learning theory. One of the surprising discoveries of the deep learning revolution was that, in such highly non-convex and over-parameterized statistical models, provided that the loss landscape has an appropriate shape, SGD can solve complex tasks by finding excellent parameter values that generalize well to unseen examples.

Several elements contributed to unlocking the training of deep neural networks; among others: the discovery [Glorot et al., 2011] that the ReLU (rectified linear unit) activation function usually outperforms the sigmoid activation function, the discovery of better weight initialization schemes [Glorot and Bengio, 2010, Saxe et al., 2013, He et al., 2015], and the batch-normalization technique to systematically normalize signal amplitudes

at each layer of a network [Ioffe and Szegedy, 2015]. Besides, fundamental advances have come from the introduction of new network architectures such as the ones mentioned earlier, and from novel machine learning paradigms such as that of generative adversarial networks [Goodfellow et al., 2014]. All these techniques have the effect of modifying the landscape of the loss function, as well as the starting point (the parameters before training) in parameter space. Understanding how the landscape of the loss function can be appropriately shaped to ease optimization by SGD is an active area of research [Poggio et al., 2017, Arora, 2018].

1.2.6. Deep Learning Revolution

In recent years, deep neural networks have proved capable of solving very complex problems across a wide variety of domains. Today, they achieve state-of-the-art performance in image recognition [He et al., 2016], speech recognition [Hinton et al., 2012, Chan et al., 2016], machine translation [Vaswani et al., 2017], image-to-text [Sharma et al., 2018], text-to-speech [Oord et al., 2016], text generation [Brown et al., 2020], and synthesis of realistic-looking portraits [Karras et al., 2019], among many other applications. Neural networks have become better than humans at playing Atari games [Mnih et al., 2013], playing the game of Go [Silver et al., 2018] and playing Starcraft [Vinyals et al., 2019]. Perhaps what is most exciting is that, although these neural networks are designed to solve very different tasks and deal with different types of data, they are all trained using the same handful of basic principles. As we scale these neural networks, more advanced aspects of intelligence seem to emerge from this handful of principles.

While the pace of progress in neural network research is breathtaking, we should also emphasize that, without any question, neural networks are nowhere close to 'surpassing' humans. In their current form, they miss key elements of human intelligence. Whenever neural networks beat humans, they beat us at a very specific task and/or under very specific conditions. We may need new learning paradigms to move away from 'task-specific' neural networks towards 'multi-functional' and continually learning neural networks. Besides, as of today, neural networks cannot handle and combine abstract concepts nearly as flexibly as humans do. Making progress along these lines may require new breakthroughs, to give these neural networks the ability to develop a 'thought language' and a sense of causal reasoning, among others.

1.2.7. Graphics Processing Units

The on-going deep learning revolution also owes much to other technological developments. In the past decades, the amount of available data has greatly increased, and powerful multi-core parallel processors for general-purpose computing, such as Graphics Processing Units

(GPUs), have emerged [Owens et al., 2007]. Training large models on large datasets is part of the recipe to make a deep neural network solve a challenging task. In the 1980s, when deep neural networks were first conceived, the lack of data and computational power to train them made it practically unfeasible to demonstrate their effectiveness.

The last decade of neural network research seems to hint at a simple and straightforward strategy to further improve the performance of our AI systems: scaling. Using more memory and more compute to train larger models on larger datasets is the current trend to build state-of-the-art deep learning systems. The largest model built thus far, GPT-3 [Brown et al., 2020], has a capacity of 175 billion parameters. Training these neural networks requires very large amounts of computation. Standard practice today is to distribute the computations across more and more GPUs, to train larger and larger models. Still, even using thousands of GPUs working in parallel, training these neural networks can take months. For example, AlphaZero learnt to play the game of Go by playing 140 million games, which took two weeks on 5,000 processors. Moreover, training such models can cost millions of dollars just for electricity consumption, not to mention their ecological impact. Yet, even these large neural networks are only a tiny fraction of the size of the human brain. What causes such inefficiency, preventing us from building models of the size of the human brain?

1.2.8. The Von Neumann Bottleneck

While the neural networks used today take their overall strategy from the brain at the conceptual level, at the hardware implementation level they use little of the cleverness of nature. Our current processors, on which these neural networks are trained and run, operate in fundamentally different ways than brains. They rely on the von Neumann architecture. In this computer architecture, the memory unit, where information is stored, is separated from the processing unit, where calculations are done. A so-called bus moves information back and forth between these two units. Over the course of the history of computing, this architecture has become the norm, and today the von Neumann architecture is used in virtually all computer systems: laptops, smartphones, and all kinds of embedded systems. The GPUs massively used for neural network training today also rely on the von Neumann architecture, where memory is separated from computing. The brain, on the other hand, deeply merges memory and computing by using the same functional unit: the synapse. The human brain is composed of a quadrillion ($10^{15}$) synapses. In other words, the human brain has $10^{15}$ nanoprocessors working in parallel.

Training neural networks on von Neumann hardware as we do it today is extraordinarily energy inefficient in comparison with the way brains operate. The necessity to move the data back and forth between the memory and processing units in the von Neumann architecture

is energy intensive and creates considerable latency. This limitation is known as the von Neumann bottleneck. How inefficient is it compared to biological systems like the brain? The human brain is composed of $10^{11}$ neurons and consumes around 20 W to conduct all of its activities [Attwell and Laughlin, 2001]. In comparison, training a BERT model (a state-of-the-art natural language processing model) on a modern supercomputer requires 1,500 kWh [Strubell et al., 2019], which is the total amount of energy consumed by a brain in nine years. Besides, a GPU running real-time object detection with YOLO [Redmon et al., 2016], a network smaller than the brain by four orders of magnitude, consumes around 200 W.

This striking mismatch holds more broadly with biological systems in general. For example, Kempes et al. [2017] study the energy efficiency of 'cellular computation' in the process of biological translation. What is the amount of energy required (in ATP equivalents) by ribosomes to assemble amino acids into proteins? They point out that "the best supercomputers perform a bit operation at roughly $5.27 \times 10^{-13}$ J, [...] which is about five orders of magnitude less efficient than biological translation."

To sum up, our current neural networks are orders of magnitude less energy efficient than biological systems at processing information, and the von Neumann bottleneck is largely responsible for this inefficiency. While using more and more GPUs may increase throughput, and thereby speed up training and inference of neural networks, this strategy cannot improve energy efficiency.
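As a rough consistency check of the 'nine years' figure quoted above, using the 20 W estimate for the brain:
$$20\ \mathrm{W} \times 9\ \text{years} \;\approx\; 20 \times 24 \times 365 \times 9\ \mathrm{Wh} \;\approx\; 1.6 \times 10^{6}\ \mathrm{Wh} \;\approx\; 1{,}500\ \mathrm{kWh}.$$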

1.2.9. In-Memory Computing

In order to build massively parallel neural networks that are energy efficient and can scale to the size of the human brain, we need to fundamentally rethink the underlying computing hardware. We need to design neural networks so that computations are performed at the physical location of the synapses, where the strengths of the connections (the weights of the neural network) are stored and adjusted, just like in the brain. The concept of hardware that merges memory and computing is called in-memory computing (or in-memory processing), and the field tackling this problem is called neuromorphic computing. This field of research, started by Carver Mead in the 1980s [Mead, 1989], aims at mimicking brains at the hardware level, by building physical neurons and synapses onto a chip. The most common approach to in-memory computing today is to use programmable resistors as synapses. Programmable resistors, such as memristors [Chua, 1971], are resistors whose conductance can be changed (or 'programmed'). The weights of a neural network can be encoded in the conductances of such devices. In the last decade, important advances in nanotechnology were made, and a number of new technologies have emerged and have been studied as potentially promising programmable resistors [Burr et al., 2017, Xia and Yang, 2019].

Neuromorphic computing thus explores analog computations that fundamentally depart from the standard digital computing paradigm.

1.2.10. Challenges of Analog Computing

Analog processing differs from digital processing in important ways. Whereas digital circuits manipulate binary signals with reliably distinguishable on- and off-states, analog circuits manipulate real-valued currents and voltages that are subject to analog noise, which bounds the precision with which computation may be performed. More importantly, analog devices suffer from mismatches, i.e. small random variations in the physical characteristics of devices, which occur during their manufacturing. No two devices are exactly alike in their characteristics, and it is impossible to make a perfect clone of one. These variations result in behavioral differences between identically designed devices. Due to the accumulation of the mismatch errors from individual devices, it is very difficult to analytically predict the behavior of a large analog circuit.

A growing field of research in the neuromorphic literature attempts to perform in analog the operations that we normally do in software, so as to implement feedforward neural networks and the backpropagation algorithm efficiently. In this approach, the starting point is an equation of the kind of Eq. 1.1. Many of these operations are then performed in analog and combined to form the computations of a feedforward network. However, because of device mismatches, it is hard to perform such idealized operations, and as we combine many of them, the resulting computation may differ from the desired one. Either these idealized operations are performed with low precision, or we may spend a lot of energy trying to improve precision, e.g. by using analog-to-digital conversion.

Not coincidentally, the constraint of device nonidealities and device variability is shared with biology too. No two neurons are exactly the same. This realization demonstrates, in principle, that it is possible to train (biological) neural networks even in the presence of noise and imperfect 'devices'. It invites us to rethink the learning algorithm for neural networks (and the notion of computation altogether).

1.3. A Deep Learning Theory for Neural Networks Grounded in Physics

In this thesis, we propose an alternative theoretical framework for neural network inference and training, with potential implications for neuromorphic computing. Our theoretical framework preserves the key principles that power deep learning today, such as optimization by stochastic gradient descent (SGD), but we use variational formulations of the laws of physics as first principles, so as to directly implement neural networks in physics. We present two very broad classes of neural network models, called energy-based models and Lagrangian-based models, whose state or dynamics derive from variational principles. The learning algorithm, called equilibrium propagation (EqProp), makes it possible to estimate the gradients of arbitrary loss functions in such physical systems using solely locally available information for each parameter.

1.3.1. Physical Systems as Deep Learning Models

The general idea of the manuscript is the following. We consider a physical system composed of multiple parts whose characteristics and working mechanisms may be only partially known. The system has some 'adjustable parameters', some of them playing the role of 'inputs', and we may read or measure a 'response' on some other 'output' part of the system. We can think of this black-box system as performing computations and implementing a nonlinear input-to-output mapping function (which may be analytically unknown). We wish to tune the adjustable parameters of this system by stochastic gradient descent (SGD), as we normally do in deep learning. The question is: how can we compute or estimate the gradients in such a physical system, by relying on the physics of the system?

The main theoretical result of the thesis is that, for a large class of physical systems (those whose state or dynamics derive from a variational principle), there is a simple procedure to estimate the parameter gradients, which in many practical situations requires only locally available information for each parameter. This procedure, called equilibrium propagation (EqProp), preserves the key benefit of being compatible with SGD, while offering the possibility to directly exploit physics to implement and train neural networks.

1.3.2. Variational Principles of Physics as First Principles

Rather than Eq. 1.1, our starting point is a variational equation of the form
$$\frac{\partial E}{\partial s} = 0, \qquad (1.2)$$
where $E$ is a scalar function. If $s$ is the state of the system, then Eq. 1.2 is an equilibrium condition. In this case, we say that the system is an energy-based model (EBM) and we call $E$ the energy function.

In this thesis, we also introduce the concept of Lagrangian-based model (LBM). Variational equations exist not just to characterize equilibrium states, but also entire trajectories. Many physical systems are such that their trajectory derives from a principle of stationary action (e.g. a principle of least action). Denoting $s_t$ the state of the system at time $t$, this means that the (continuous-time) trajectory $s = \{s_t\}_{0 \le t \le T}$ over a time interval $[0, T]$ minimizes a functional of the form
$$S = \int_0^T L(s_t, \dot{s}_t)\, dt, \qquad (1.3)$$
where $\dot{s}_t$ is the time derivative of $s_t$, $L$ is a function called the Lagrangian function of the system, and $S$ is a scalar functional called the action. The stationarity of the action tells us that $\frac{\delta S}{\delta s} = 0$, which is another variational equation of the kind of Eq. 1.2. These systems, which we call Lagrangian-based models (LBMs), are suitable in particular in the setting with time-varying inputs and can thus play the role of 'recurrent neural networks'.

Equilibrium propagation (EqProp) makes it possible to compute gradients with respect to arbitrary loss functions in these EBMs and LBMs. Furthermore, if the energy function (resp. Lagrangian function) of the system has a property called sum-separability, meaning that it is the sum of the energies (resp. Lagrangians) of its parts, then computing the loss gradients with EqProp requires only information that is locally available for each parameter (i.e. the learning rule is local).
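For completeness (this is a standard result of the calculus of variations, not a result specific to this thesis), the stationarity condition $\frac{\delta S}{\delta s} = 0$ for the action of Eq. 1.3 can be written out explicitly as the Euler–Lagrange equation
$$\frac{\partial L}{\partial s}(s_t, \dot{s}_t) \;-\; \frac{d}{dt}\,\frac{\partial L}{\partial \dot{s}}(s_t, \dot{s}_t) \;=\; 0, \qquad 0 \le t \le T,$$
which is the trajectory-level analogue of the equilibrium condition of Eq. 1.2.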

1.3.3. Universality of Variational Principles in Physics

In the 1650s, Pierre de Fermat proposed the principle of least time, which states that, between two given points, light travels along the path that takes the least time. He showed that both the laws of reflection and refraction can be derived from this principle. Fermat's least-time principle is an instance of what we call more generally a variational principle. Today, the variational approach pervades much of modern physics and engineering [Lanczos, 1949], with applications not only in optics, but also in mechanics, electromagnetism, thermodynamics, etc. Even at a fundamental level of description, our universe seems to behave according to variational principles: for example, Einstein's equations of general relativity can be derived from the Einstein-Hilbert action, and in a sense, Feynman's path integral formulation of quantum mechanics can be seen as a generalized principle of least action [Feynman, 1942]. Thus, many physical systems qualify as energy-based models or Lagrangian-based models. This offers in principle a lot of options for implementing our proposed method on physical substrates. In this manuscript we will present one such option in detail: nonlinear resistive networks.

1.3.4. Rethinking the Notion of Computation

Interestingly, our theoretical framework invites us to rethink not only the von Neumann architecture on which our current processors rely, but also the notion of computation altogether.

Much of computer science deals with computation in the abstract, without worrying about physical implementation [Lee, 2017]. Our computers today rely on the computing paradigm introduced by Turing, where computers operate on digital data and carry out computations algorithmically, via step-by-step (discrete-time) processes. The von Neumann architecture was invented to implement these computations (i.e. to bridge the gap between physics and these abstract computations), but suffers from the speed and energy efficiency problems mentioned earlier (Section 1.2.8).

The theoretical framework presented in this manuscript can be seen as an alternative approach to computing that takes advantage of the ways by which Nature operates. We suggest a novel computing paradigm, which uses the variational formulations of the laws of physics as first principles. Together with the equilibrium propagation (EqProp) training procedure, our approach suggests a way to implement the core principles of deep learning by exploiting physics directly. As will become apparent in Section 4.3.2, the process of 'computations' in EqProp is very different from the step-by-step processes of Turing's conventional computing paradigm. Although we may call EqProp a learning 'algorithm', it is not an algorithm in the conventional sense (one that performs step-by-step computations).

1.3.5. A Novel Differentiation Method Compatible with Variational Principles

More technically, the main idea of the manuscript can be summarized by the following mathematical identity – a novel differentiation method compatible with variational principles. Consider two functions $E(\theta, s)$ and $C(s)$ of the two variables $\theta$ and $s$. We wish to compute the gradient $\frac{d}{d\theta} C(s_\theta)$, where $s_\theta$ is such that $\frac{\partial E}{\partial s}(\theta, s_\theta) = 0$. To do this, we introduce the function $F(\theta, \beta, s) = E(\theta, s) + \beta\, C(s)$, where $\beta$ is a scalar. For fixed $\theta$, we further define $s_\star^\beta$ by the relationship $\frac{\partial F}{\partial s}(\theta, \beta, s_\star^\beta) = 0$, for any $\beta$ in the neighborhood of $\beta = 0$. In particular, $s_\star^0 = s_\theta$. Then we have the identity
$$\frac{d}{d\theta} C(s_\theta) = \frac{d}{d\beta}\Big|_{\beta=0} \frac{\partial E}{\partial \theta}\big(\theta, s_\star^\beta\big), \qquad (1.4)$$
where $\frac{\partial E}{\partial \theta}$ denotes the partial derivative of the function $E$ with respect to its first argument. This result is developed and proved in Chapter 2.
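The identity of Eq. 1.4 can be checked numerically on a toy example. The following sketch (not from the thesis) uses a hypothetical scalar energy $E(\theta, s) = \frac{1}{2}(s - \theta)^2 + \frac{1}{4} s^4$ and cost $C(s) = \frac{1}{2}(s - y)^2$, finds the equilibria $s_\star^0$ and $s_\star^\beta$ by gradient descent on $F$, and compares the EqProp estimate (a finite difference over $\beta$) with a direct finite-difference estimate of $\frac{d}{d\theta} C(s_\theta)$; all numerical values are arbitrary.

```python
theta0, y = 0.7, 1.5             # hypothetical parameter value and target
beta, eps = 1e-4, 1e-5           # nudging strength and finite-difference step

def equilibrium(theta, beta, s=0.0, lr=0.1, steps=5000):
    # Minimize F(theta, beta, s) = E(theta, s) + beta*C(s) over s by gradient descent,
    # with E(theta, s) = 0.5*(s - theta)**2 + 0.25*s**4 and C(s) = 0.5*(s - y)**2.
    for _ in range(steps):
        dF_ds = (s - theta) + s**3 + beta * (s - y)
        s -= lr * dF_ds
    return s

def dE_dtheta(theta, s):
    return -(s - theta)          # partial derivative of E with respect to theta

# EqProp estimate of dC(s_theta)/dtheta: finite difference over beta of dE/dtheta (Eq. 1.4)
s_free = equilibrium(theta0, 0.0)
s_nudged = equilibrium(theta0, beta)
eqprop_grad = (dE_dtheta(theta0, s_nudged) - dE_dtheta(theta0, s_free)) / beta

# Ground truth: finite difference of the function theta -> C(s_theta)
def C_at(theta):
    s = equilibrium(theta, 0.0)
    return 0.5 * (s - y)**2

true_grad = (C_at(theta0 + eps) - C_at(theta0 - eps)) / (2 * eps)
print(eqprop_grad, true_grad)    # the two estimates should agree closely
```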

1.4. Overview of the Manuscript and Link to Prior Works

The manuscript is organized as follows.

• In Chapter 2, we present EqProp in its original formulation, as a learning algorithm to train energy-based models (EBMs). We show that, provided that the energy function of the system is sum-separable, the learning rule of EqProp is local. This corresponds to Section 3 and Appendix A of Scellier and Bengio [2017].
• In Chapter 3, we use EqProp to train a particular class of EBMs called gradient systems. This includes the continuous Hopfield network, a neural network model introduced by Hopfield in the 1980s and studied by both the neuroscience community and the neuromorphic community. In this setting, the learning rule of EqProp is a form of contrastive Hebbian learning. The first part of this chapter corresponds to the result established in Scellier and Bengio [2019]. The second part corresponds to Sections 2, 4, and 5 of Scellier and Bengio [2017].
• In Chapter 4, we show that a class of analog neural networks called nonlinear resistive networks are energy-based models: they possess an energy function called the co-content of the circuit, as a reformulation of Kirchhoff's laws. Furthermore, the co-content has the sum-separability property. Therefore, we can train these nonlinear resistive networks with EqProp using a local learning rule. This chapter corresponds to Kendall et al. [2020].
• In Chapter 5, we present a class of discrete-time neural network models trainable with EqProp, which is useful to accelerate computer simulations. This formulation, which uses notations closer to those used in conventional deep learning, is also better adapted to training more advanced network architectures such as convolutional networks. This chapter corresponds to Ernoult et al. [2019] and Laborieux et al. [2021].
• In Chapter 6, we present on-going developments. In particular, we introduce the concept of Lagrangian-based models, a wide class of machine learning models that can serve as recurrent neural networks and can be implemented directly in physics by exploiting the principle of stationary action. These Lagrangian-based models can also be trained with an EqProp-like training procedure. We also present an extension of EqProp to stochastic systems, which was introduced in Appendix C of Scellier and Bengio [2017]. Finally, we briefly present the contrastive meta-learning framework of Zucchet et al. [2021], which uses the EqProp technique to train the meta-parameters of a meta-learning model.


Chapter 2

Equilibrium Propagation: A Learning Algorithm for Systems Described by Variational Equations

Much of machine learning today is powered by stochastic gradient descent (SGD). The standard method to compute the loss gradients required at each iteration of SGD is the backpropagation (Backprop) algorithm. Equilibrium propagation (EqProp) is an alternative to Backprop for computing the loss gradients. The difference between EqProp and Backprop lies in the class of models that they apply to: while Backprop applies to differentiable neural networks, EqProp is broadly applicable to systems described by variational equations, i.e. systems whose state or dynamics is a stationary point of a scalar function or functional. Since many physical systems have this property [Lanczos, 1949], EqProp offers the prospect of implementing and training machine learning models which use the laws of physics at their core.

In this chapter, we present EqProp in its original formulation [Scellier and Bengio, 2017], as an algorithm to train energy-based models (EBMs). EBMs are systems whose equilibrium states are stationary points of a scalar function called the energy function. EBMs are suitable in particular when the input data is static. In most of the manuscript, we consider for simplicity of presentation the supervised learning setting with static input, e.g. the setting of image classification where the input is an image and the target is the category of that image. However, EqProp is applicable beyond this setting. In Section 6.1, we introduce the concept of Lagrangian-based models (LBMs) which, by definition, are physical systems whose dynamics derives from a principle of stationary action, and we show how EqProp can be applied to such systems. LBMs are suitable in the context of time-varying data, and can thus play the role of ‘recurrent neural networks’. We also present an extension of EqProp to stochastic systems (Section 6.2), and to the setting of meta-learning (Section 6.3).

The present chapter is organized as follows.
• In section 2.1, we present the stochastic gradient descent (SGD) algorithm, which is at the heart of current deep learning. We present SGD in the setting of supervised learning, which we will consider in most of the manuscript to illustrate the ideas of the equilibrium propagation training framework. We note however that SGD is also the workhorse of state-of-the-art unsupervised and reinforcement learning algorithms.
• In section 2.2, we define the notion of energy-based model (EBM) that we will use throughout the manuscript.
• In section 2.3, we present the general formula for computing the loss gradients in an EBM, and in section 2.4, we present the equilibrium propagation (EqProp) algorithm to estimate the loss gradients. Under the assumption that the energy function of the system satisfies a property called sum-separability, the learning rule for each parameter is local.
• In section 2.5, we give a few examples of models trainable with EqProp, which we will study in the next chapters. Besides the well-known Hopfield model, nonlinear resistive networks, flow networks and elastic networks are examples of sum-separable energy-based models and, as such, are trainable with EqProp using a local learning rule.
• In section 2.7, we discuss the general applicability of the framework presented here and the conditions under which EqProp is applicable.

2.1. Stochastic Gradient Descent

In most of the manuscript, we consider the supervised learning setting, e.g. the setting of image classification where the data consists of images together with the labels associated to these images. In this scenario, we want to build a system that is able, given an input x, to ‘predict’ the label y associated to x. To do this, we design a parametric system, meaning a system that depends on a set of adjustable parameters denoted θ. Given an input x, the system produces an output f(θ, x) which represents a ‘guess’ for the label of x. Thus, the system implements a mapping function f(θ, ·) from an input space (the space of x) to an output space (the space of y), parameterized by θ. The goal is to tune θ so that for most x of interest, the output f(θ, x) is close to the target y. The ‘closeness’ between f(θ, x) and y is measured using a scalar function C(f(θ, x), y) called the cost function. The overall performance of the system is measured by the expected cost R(θ) = E_{(x,y)}[C(f(θ, x), y)] over examples (x, y) from the data distribution of interest. The goal is then to minimize R(θ) with respect to θ.

In deep learning, the core idea and leading approach to tune the set of parameters θ is stochastic gradient descent (SGD). The first step consists in gathering a (large) dataset of examples D_train = {(x^(i), y^(i))}_{1≤i≤N}, called the training set, which specifies for each input x^(i) the correct output y^(i). Then, each step of the training process proceeds as follows. First, a sample (x, y) is drawn from the training set. Input x is presented to the system, which produces f(θ, x) as output. This output is compared with y to evaluate the loss

L(θ, x, y) = C(f(θ, x), y). (2.1)

Subsequently, the gradient ∂L/∂θ(θ, x, y) is computed or estimated using some procedure, and the parameters are updated proportionally to the loss gradient:
\[
\Delta\theta = -\eta \, \frac{\partial L}{\partial \theta}(\theta, x, y), \tag{2.2}
\]
where η is a step-size parameter called the learning rate. This process is repeated multiple times (often millions of times) until convergence (or until desired). Once trained, the performance of the system is evaluated on a separate set of previously unseen examples, called the test set and denoted D_test = {(x_test^(i), y_test^(i))}_{1≤i≤M}. The test loss is R̂_test(θ) = (1/M) Σ_{i=1}^M L(θ, x_test^(i), y_test^(i)).

Several variants of SGD have been proposed, which use adaptive learning rates to accelerate the optimization process. This includes the momentum method [Sutskever et al., 2013] and Adam [Kingma and Ba, 2014]. In some cases, these methods are not only faster, but also achieve better test performance than standard SGD. Besides, common practice is to average the loss gradients over mini-batches of data examples before updating the weights – a method sometimes called mini-batch gradient descent – but in this manuscript we consider for simplicity of presentation that training examples are processed one at a time.

The SGD algorithm described above powers nearly all of deep learning today. There are, however, two ingredients that we have not specified so far: the ‘system’ that implements the mapping function f(θ, x), and the ‘procedure’ to compute the loss gradient ∂L/∂θ(θ, x, y). In conventional deep learning, the ‘system’ is a differentiable neural network, and the loss gradients are computed with the backpropagation algorithm. In this chapter, we present an alternative framework for optimization by SGD, where the ‘system’ (i.e. the neural network) is an energy-based model, and the procedure to compute the loss gradients is called equilibrium propagation (EqProp).
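As an illustration of the generic SGD loop of Eqs. (2.1)-(2.2), here is a minimal sketch (not taken from the thesis) for a linear model f(θ, x) = θᵀx with the squared-error cost; the dataset, model and learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = np.array([2.0, -1.0, 0.5])
X = rng.normal(size=(1000, 3))                      # training inputs x^(i)
Y = X @ theta_true + 0.01 * rng.normal(size=1000)   # training targets y^(i)

theta = np.zeros(3)     # adjustable parameters
eta = 0.05              # learning rate

for step in range(5000):
    i = rng.integers(len(X))                        # draw one sample (x, y) from the training set
    x, y = X[i], Y[i]
    f = theta @ x                                   # output f(theta, x)
    grad = (f - y) * x                              # dL/dtheta for L = 0.5*(f - y)**2
    theta -= eta * grad                             # SGD update, Eq. (2.2)

print(theta)            # close to theta_true after training
```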

2.2. Energy-Based Models

There exist different definitions for the concept of energy-based model in the literature – see LeCun et al. [2006] for a tutorial. In this manuscript, we reserve the term to refer to the specific class of machine learning models described in this section.

In the context of supervised learning, an energy-based model (EBM) is specified by three variables: a parameter variable θ, an input variable x, and a state variable s. An essential ingredient of an EBM is the energy function, which is a scalar function E that specifies how the state s depends on the parameter θ and the input x. Given θ and x, the energy function associates to each conceivable configuration s a real number E(θ, x, s). Among all conceivable configurations, the effective configuration of the system is by definition a state s(θ, x) such that
\[
\frac{\partial E}{\partial s}\bigl(\theta, x, s(\theta, x)\bigr) = 0. \tag{2.3}
\]
We call s(θ, x) an equilibrium state of the system. The aim is to minimize the loss at equilibrium:
\[
L(\theta, x, y) = C\bigl(s(\theta, x), y\bigr). \tag{2.4}
\]
In this expression, s(θ, x) is the ‘prediction’ from the model, and plays the role of the ‘output’ f(θ, x) of the previous section. A conceptual difference between s(θ, x) and f(θ, x) is that, in conventional deep learning, f(θ, x) is usually thought of as the output layer of the model (i.e. the last layer of the neural network), whereas here s(θ, x) represents the entire state of the system. Another difference is that f(θ, x) is usually explicitly determined by θ and x through an analytical formula, whereas here s(θ, x) is implicitly specified through the variational equation of Eq. (2.3) and may not be expressible by an analytical formula in terms of θ and x. In particular, there exist in general several states s(θ, x) that satisfy Eq. (2.3). We further point out that s(θ, x) need not be a minimum of the energy function E; it may be a maximum or, more generally, any saddle point of E. We note that, just like the energy function E, the cost function C is defined for any conceivable configuration s, not just the equilibrium state s(θ, x). Although C(s, y) may depend on the entire state s, in practical situations that we will study in the next chapters, C(s, y) depends only on a subset of s that plays the role of ‘outputs’.

We also introduce another key concept: the concept of sum-separability. Let θ = (θ_1, . . . , θ_N) be the adjustable parameters of the system. For each θ_k, we denote {x, s}_k the information about (x, s) which is locally available to θ_k. We say that the energy function E is sum-separable if it is of the form

\[
E(\theta, x, s) = E_0(x, s) + \sum_{k=1}^{N} E_k\bigl(\theta_k, \{x, s\}_k\bigr), \tag{2.5}
\]
where E_0(x, s) is a term that is independent of the parameters to be adjusted, and E_k is a scalar function of θ_k and {x, s}_k, for each k ∈ {1, . . . , N}. Importantly, many physical systems are energy-based models, many of which have the sum-separability property; we give examples in section 2.5.

2.3. Gradient Formula

The central ingredient of the equilibrium propagation training method is the total energy function F, defined by F = E + β C, where β is a real-valued variable called the nudging factor.

The intuition here is that we augment the energy of the system by bringing an additional energy term β C. By varying β, the total energy F is modified, and so is the equilibrium state relative to F. Specifically, assuming that the functions E and C are continuously differentiable, there exists a continuous mapping β ↦ s_⋆^β such that s_⋆^0 = s(θ, x) and
\[
\frac{\partial E}{\partial s}\bigl(\theta, x, s_\star^\beta\bigr) + \beta \, \frac{\partial C}{\partial s}\bigl(s_\star^\beta, y\bigr) = 0 \tag{2.6}
\]
for any value of the nudging factor β.¹ Theorem 2.1 provides a formula to compute the loss gradients by varying the nudging factor β.

Theorem 2.1 (Gradient formula for energy-based models). The gradient of the loss is equal to
\[
\frac{\partial L}{\partial \theta}(\theta, x, y) = \left.\frac{d}{d\beta}\right|_{\beta=0} \frac{\partial E}{\partial \theta}\bigl(\theta, x, s_\star^\beta\bigr). \tag{2.7}
\]
Furthermore, if the energy function E is sum-separable, then the loss gradient for each parameter θ_k depends only on information that is locally available to θ_k:
\[
\frac{\partial L}{\partial \theta_k}(\theta, x, y) = \left.\frac{d}{d\beta}\right|_{\beta=0} \frac{\partial E_k}{\partial \theta_k}\bigl(\theta_k, \{x, s_\star^\beta\}_k\bigr). \tag{2.8}
\]

Proof. Eq. (2.7) is a consequence of Lemma 2.2 (Section 2.6) applied to the total energy function² F, defined for a fixed input-target pair (x, y) by F(θ, β, s) = E(θ, x, s) + β C(s, y), at the point β = 0. Eq. (2.8) is a consequence of Eq. (2.7) and the definition of sum-separability (Eq. (2.5)). □

2.4. Equilibrium Propagation

We can use Theorem 2.1 to derive a learning algorithm for energy-based models. Let us assume that the energy function is sum-separable. We can estimate the loss gradients using finite differences, for example with
\[
\widehat{\nabla}_{\theta_k}(\beta) = \frac{1}{\beta} \left( \frac{\partial E_k}{\partial \theta_k}\bigl(\theta_k, \{x, s_\star^\beta\}_k\bigr) - \frac{\partial E_k}{\partial \theta_k}\bigl(\theta_k, \{x, s_\star^0\}_k\bigr) \right) \tag{2.9}
\]
to approximate the right-hand side of Eq. (2.8). We arrive at the following two-phase training procedure to update the parameters in proportion to their loss gradients.

Free phase (inference). The nudging factor β is set to zero, and the system settles to an equilibrium state s_⋆^0, characterized by Eq. (2.3). We call s_⋆^0 the free state. For each parameter θ_k, the quantity ∂E_k/∂θ_k(θ_k, {x, s_⋆^0}_k) is measured locally and stored locally.

¹ We note that it is also possible to define s_⋆^β differently, by the relationship ∂E/∂s(θ, x, s_⋆^β) + β ∂C/∂s(s(θ, x), y) = 0, without changing the conclusions of Theorem 2.1. In Chapter 4 we will use this modified definition of s_⋆^β.
² With the modified definition of s_⋆^β, the total energy function to consider is F(θ, β, s) = E(θ, x, s) + β C(s(θ, x), y).

Nudged phase. The nudging factor β is set to a nonzero value (positive or negative), and the system settles to a new equilibrium state s_⋆^β, characterized by Eq. (2.6). We call s_⋆^β the nudged state. For each parameter θ_k, the quantity ∂E_k/∂θ_k(θ_k, {x, s_⋆^β}_k) is measured locally.

Update rule. Finally, each parameter θ_k is updated locally in proportion to its gradient as Δθ_k = −η ∇̂_{θ_k}(β), where η is a learning rate and ∇̂_{θ_k}(β) is the gradient estimator of Eq. (2.9).

The training scheme described above is natural because the free phase and the nudged phase can be related to the standard training procedure for neural networks (the backpropagation algorithm), in which there is an inference phase (forward pass) followed by a gradient computation phase (backward pass). However, due to the approximation of derivatives by finite differences, the gradient estimator prescribed by the above training scheme is biased.

As detailed in Appendix A, the mismatch between this gradient estimator (∇̂_{θ_k}(β)) and the true gradient (∂L/∂θ_k) is of the order O(β). As proposed in Laborieux et al. [2021], this bias can be reduced by means of a symmetric gradient estimator:
\[
\widehat{\nabla}^{\mathrm{sym}}_{\theta_k}(\beta) = \frac{1}{2\beta} \left( \frac{\partial E_k}{\partial \theta_k}\bigl(\theta_k, \{x, s_\star^\beta\}_k\bigr) - \frac{\partial E_k}{\partial \theta_k}\bigl(\theta_k, \{x, s_\star^{-\beta}\}_k\bigr) \right). \tag{2.10}
\]
To achieve this, we can modify the above training procedure to include two nudged phases: one with positive nudging (+β) and one with negative nudging (−β). The update rule for parameter θ_k is then Δθ_k = −η ∇̂^sym_{θ_k}(β). The mismatch between this symmetric gradient estimator and the true gradient is only of the order O(β²). We note that higher-order methods which use more point values of β are also possible, to further reduce the gradient estimator bias (e.g. with +2β, +β, −β and −2β).

We call such training procedures equilibrium propagation (EqProp), with the intuition that the equilibrium state s_⋆^β ‘propagates’ across the system as β is varied.
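The following sketch (illustrative, not from the thesis) implements the two-phase procedure on a tiny sum-separable EBM: a quadratic energy whose equilibrium states are found by gradient descent on the total energy, with the one-sided estimator of Eq. (2.9) and the symmetric estimator of Eq. (2.10) compared against the exact gradient. All model sizes, functions and step sizes below are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_out = 4, 3
W = rng.normal(scale=0.5, size=(n_out, n_in))    # adjustable parameters theta = W
x = rng.normal(size=n_in)                        # static input
y = rng.normal(size=n_out)                       # target

def dE_ds(W, x, s):        # dE/ds for E(W, x, s) = 0.5*||s||^2 - s^T W x
    return s - W @ x

def dC_ds(s, y):           # dC/ds for C(s, y) = 0.5*||s - y||^2
    return s - y

def relax(W, x, y, beta, s0, lr=0.1, steps=2000):
    """Settle to an equilibrium of the total energy F = E + beta*C by gradient descent on s."""
    s = s0.copy()
    for _ in range(steps):
        s -= lr * (dE_ds(W, x, s) + beta * dC_ds(s, y))
    return s

def dE_dW(s, x):           # dE/dW_ij = -s_i x_j (local: involves only the two ends of the 'synapse')
    return -np.outer(s, x)

beta = 0.01
s_free = relax(W, x, y, 0.0, np.zeros(n_out))         # free phase
s_plus = relax(W, x, y, +beta, s_free)                # nudged phase (+beta)
s_minus = relax(W, x, y, -beta, s_free)               # nudged phase (-beta)

grad_one_sided = (dE_dW(s_plus, x) - dE_dW(s_free, x)) / beta         # Eq. (2.9)
grad_symmetric = (dE_dW(s_plus, x) - dE_dW(s_minus, x)) / (2 * beta)  # Eq. (2.10)
grad_exact = np.outer(W @ x - y, x)                                   # dL/dW for this toy model

print(np.max(np.abs(grad_one_sided - grad_exact)))   # O(beta) error
print(np.max(np.abs(grad_symmetric - grad_exact)))   # O(beta^2) error
```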

2.5. Examples of Sum-Separable Energy-Based Models

We give here a few examples of sum-separable energy-based models. As a first example, the Hopfield model is useful to develop intuitions, as it is a well-known and well-studied model in both the machine learning literature and the neuroscience literature; the case of continuous Hopfield networks will be developed in Chapter 3. We then briefly present resistive networks, which will be developed in Chapter 4. Nonlinear resistive networks are potentially promising for the development of neuromorphic hardware, towards the goal of building fast and energy-efficient neural networks. We also briefly present two further instances of energy-based physical systems: flow networks and elastic networks. All these systems can be trained with EqProp.

Hopfield networks. In the Hopfield model [Hopfield, 1982] and its continuous version [Cohen and Grossberg, 1983, Hopfield, 1984], neurons are interconnected via bi-directional synapses. Each neuron i is characterised by a scalar s_i, and each synapse connecting neurons i and j is characterised by a number W_ij representing the synaptic strength. The energy function of the model, called the Hopfield energy, is of the form
\[
E(\theta, s) = -\sum_{i,j} W_{ij}\, s_i\, s_j \tag{2.11}
\]
(or a variant of Eq. 2.11). In this expression, the vector of neural states s = (s_1, s_2, . . . , s_N) represents the state variable of the system, and the vector of synaptic strengths θ = {W_ij}_{i,j} represents the parameter variable (the set of adjustable parameters). At inference, neurons stabilize to a minimum of the energy function, where the condition ∂E/∂s = 0 is met. Furthermore, the Hopfield energy is sum-separable, with each factor of the form E_ij(W_ij, s_i, s_j) = −W_ij s_i s_j. Since the energy gradients are equal to ∂E_ij/∂W_ij = −s_i s_j, the Hopfield model can be trained with EqProp using a sort of contrastive Hebbian learning rule (Chapter 3).
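As a concrete illustration (a sketch under assumed conventions, not code from the thesis), the Hopfield energy of Eq. (2.11) and its local gradients can be written in a few lines; note that ∂E/∂W_ij only involves the activities of the two neurons that the synapse connects.

```python
import numpy as np

def hopfield_energy(W, s):
    # E(theta, s) = - sum_{i,j} W_ij * s_i * s_j, with theta = W (assumed symmetric)
    return -s @ W @ s

def energy_grad_W(s):
    # dE/dW_ij = -s_i * s_j : each entry depends only on the two neurons of that synapse
    return -np.outer(s, s)

rng = np.random.default_rng(0)
N = 5
W = rng.normal(size=(N, N))
W = (W + W.T) / 2          # bidirectional, symmetric synapses
s = rng.normal(size=N)

print(hopfield_energy(W, s))
print(energy_grad_W(s))
```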

Resistive networks. A linear resistance network is an electrical circuit composed of nodes interconnected by linear resistors. Let N be the number of nodes in the circuit, and denote V = (V_1, . . . , V_N) the vector of node voltages. Since the power dissipated in a resistor of conductance g_ij is P_ij = g_ij (V_j − V_i)², where V_i and V_j are the terminal voltages of the resistor, the total power dissipated in the circuit is
\[
\mathcal{P}(\theta, V) = \sum_{i,j} g_{ij}\,(V_j - V_i)^2, \tag{2.12}
\]
where θ = {g_ij}_{i,j} is the set of conductances of the circuit, which plays the role of ‘adjustable parameters’. Notably, linear resistance networks satisfy the so-called principle of minimum dissipated power: if the voltages are imposed at a set of input nodes, then the circuit chooses the voltages at other nodes so as to minimize the total power dissipated (P). This implies in particular that ∂P/∂V_i = 0 for any floating node voltage V_i. Thus, linear resistance networks are energy-based models, with P playing the role of ‘energy function’. Furthermore, the function P has the sum-separability property, with each factor of the form P_ij(g_ij, V_i, V_j) = g_ij (V_i − V_j)², and each gradient equal to ∂P_ij/∂g_ij = (V_i − V_j)². Crucially, as we will see in Chapter 4, in circuits consisting of arbitrary resistive devices, there exists a generalization of the notion of power function P called the co-content [Millar, 1951]. Such circuits, called nonlinear resistive networks, can implement analog neural networks, using memristors (to implement the synaptic weights), diodes (to play the role of nonlinearities), voltage sources (to set the voltages of input nodes) and current sources (to inject loss gradients as currents during training).
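To illustrate the principle of minimum dissipated power numerically, here is a small sketch (illustrative assumptions throughout) that clamps the voltages of a few input nodes, finds the floating node voltages by minimizing the dissipated power P of Eq. (2.12), and reads off the local gradients ∂P/∂g_ij = (V_i − V_j)².

```python
import numpy as np

rng = np.random.default_rng(0)
N = 6                                      # number of nodes
G = rng.uniform(0.1, 1.0, size=(N, N))     # conductances g_ij (one resistor per pair, for simplicity)
G = np.triu(G, 1); G = G + G.T             # symmetric, no self-loops

clamped = np.array([0, 1])                 # input nodes with imposed voltages
V = np.zeros(N)
V[clamped] = [1.0, -1.0]

def dissipated_power(G, V):
    # P(theta, V) = sum_{i,j} g_ij (V_j - V_i)^2   (Eq. 2.12)
    return np.sum(G * (V[None, :] - V[:, None]) ** 2)

# Relax the floating node voltages by gradient descent on P (the physics does this for free)
free = np.setdiff1d(np.arange(N), clamped)
for _ in range(5000):
    dP_dV = 4 * (np.sum(G, axis=1) * V - G @ V)   # dP/dV_i = 4 * sum_j g_ij (V_i - V_j)
    V[free] -= 0.01 * dP_dV[free]

print(dissipated_power(G, V))
print((V[:, None] - V[None, :]) ** 2)      # local gradients dP/dg_ij = (V_i - V_j)^2
```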

Flow networks. The EqProp framework may also have implications in other areas of engineering, beyond neuromorphic computing. For example, Stern et al. [2020] study the case of flow networks, e.g. networks of nodes interconnected by pipes. This setting is analogous to the case of resistive networks described above. In a flow network, each node i is described by its pressure p_i, and each pipe connecting node i to node j is characterized by its conductance k_ij. The total dissipated power in the network, which is minimized, is P(θ, p) = Σ_{i,j} k_ij (p_j − p_i)², where θ = {k_ij} is the set of parameters to be adjusted, and p = {p_i} plays the role of the state variable of the system.

Elastic networks. Stern et al. [2020] also study the case of central-force spring networks. In this setting, we have a set of N nodes interconnected by linear springs. Each node i is characterized by its 3D position s_i. The elastic energy stored in the spring connecting node i to node j is E_ij = ½ k_ij (r_ij − ℓ_ij)², where k_ij is the spring constant, ℓ_ij is the spring's equilibrium length, and r_ij = ‖s_i − s_j‖ is the Euclidean distance between nodes i and j. Thus, the total elastic energy stored in the network, which is minimized, is given by
\[
E(\theta, r) = \sum_{i,j} \frac{1}{2}\, k_{ij}\,(r_{ij} - \ell_{ij})^2, \tag{2.13}
\]
where θ = {k_ij, ℓ_ij} is the set of adjustable parameters, and r = {r_ij} plays the role of the state variable. The energy gradients in this case are ∂E_ij/∂k_ij = ½ (r_ij − ℓ_ij)² and ∂E_ij/∂ℓ_ij = k_ij (ℓ_ij − r_ij).

2.6. Fundamental Lemma

In this section, we present the fundamental lemma of the equilibrium propagation framework, from which Theorem 2.1 derives.

Lemma 2.2 (Scellier and Bengio [2017]). Let F(θ, β, s) be a twice differentiable function of the three variables θ, β and s. For fixed θ and β, let s_θ^β be a point that satisfies the stationarity condition
\[
\frac{\partial F}{\partial s}\bigl(\theta, \beta, s_\theta^\beta\bigr) = 0, \tag{2.14}
\]
and suppose that ∂²F/∂s²(θ, β, s_θ^β) is invertible. Then, in the neighborhood of this point, we can define a continuously differentiable function (θ, β) ↦ s_θ^β such that Eq. 2.14 holds for any (θ, β) in this neighborhood. Furthermore, we have the following identity:
\[
\frac{d}{d\theta} \frac{\partial F}{\partial \beta}\bigl(\theta, \beta, s_\theta^\beta\bigr) = \frac{d}{d\beta} \frac{\partial F}{\partial \theta}\bigl(\theta, \beta, s_\theta^\beta\bigr). \tag{2.15}
\]

Proof of Lemma 2.2. The first statement follows from the implicit function theorem. It remains to prove Eq. 2.15. Let us consider F(θ, β, s_θ^β) as a function of (θ, β) (not only through F(θ, β, ·) but also through s_θ^β). Using the chain rule of differentiation and the stationarity condition of Eq. (2.14), we have
\[
\frac{d}{d\beta} F\bigl(\theta, \beta, s_\theta^\beta\bigr) = \frac{\partial F}{\partial \beta}\bigl(\theta, \beta, s_\theta^\beta\bigr) + \underbrace{\frac{\partial F}{\partial s}\bigl(\theta, \beta, s_\theta^\beta\bigr)}_{=\,0} \cdot \frac{\partial s_\theta^\beta}{\partial \beta}. \tag{2.16}
\]
Similarly, we have
\[
\frac{d}{d\theta} F\bigl(\theta, \beta, s_\theta^\beta\bigr) = \frac{\partial F}{\partial \theta}\bigl(\theta, \beta, s_\theta^\beta\bigr) + \underbrace{\frac{\partial F}{\partial s}\bigl(\theta, \beta, s_\theta^\beta\bigr)}_{=\,0} \cdot \frac{\partial s_\theta^\beta}{\partial \theta}. \tag{2.17}
\]
Combining these equations and using the symmetry of second derivatives, we get:
\[
\frac{d}{d\theta} \frac{\partial F}{\partial \beta}\bigl(\theta, \beta, s_\theta^\beta\bigr) = \frac{d}{d\theta} \frac{d}{d\beta} F\bigl(\theta, \beta, s_\theta^\beta\bigr) = \frac{d}{d\beta} \frac{d}{d\theta} F\bigl(\theta, \beta, s_\theta^\beta\bigr) = \frac{d}{d\beta} \frac{\partial F}{\partial \theta}\bigl(\theta, \beta, s_\theta^\beta\bigr). \tag{2.18}
\]
□

2.7. Remarks

Stationary points. In the static setting presented in this chapter, EqProp applies to any system whose equilibrium states satisfy the stationarity condition of Eq. 2.3 – what we have called an energy-based model (EBM). While early works [Scellier and Bengio, 2017, 2019] proposed to apply EqProp to EBMs whose equilibrium states are minima of the energy function (e.g. Hopfield networks), we stress here that the equilibrium states may more generally be any stationary points (saddle points or maxima) of the energy function. We note that the landscape of the energy function can contain in general exponentially many more stationary points than (local) minima. For instance, the recently introduced ‘modern Hopfield networks’ [Ramsauer et al., 2020] are EBMs in the sense of Eq. 2.3.

Infinite dimensions. In section 2.5, we have given examples of EBMs in which the state variable s has finitely many dimensions. We note however that s may also belong to an infinite-dimensional space (mathematically, a Banach space). In this case, the expression ∂F/∂s(θ, β, s_θ^β) · ∂s_θ^β/∂β in Lemma 2.2 must be thought of as the differential of the function F(θ, β, ·) at the point s_θ^β, applied to the vector ∂s_θ^β/∂β.

Variational principles of physics. In physics, many systems can be described by a variational equation of the form of Eq. 2.3 ; we have given examples in Section 2.5. In fact, the framework presented here transfers directly to time-varying physical systems (Chapter 6). In this case, s must be thought of as the trajectory of the system, E as a functional called action functional, and the stationary condition of Eq. 2.3 as a principle of stationary action.

Singularities. We have proved Theorem 2.1 using Lemma 2.2. Yet the formula of Lemma 2.2 assumes that the Hessian ∂²E/∂s²(θ, x, s(θ, x)) is invertible. While this assumption is likely to be valid at most iterations of training, it is also likely that, as θ evolves during training, θ goes through values where the Hessian of the energy is singular for some input x. At such points, it is not clear how the update rule of EqProp behaves. One branch of mathematics that studies these aspects is Catastrophe Theory. Although these aspects raise interesting questions, diving into these questions would take us far from the main thrust of this manuscript. In this manuscript, we pretend everything is differentiable.

Beyond supervised learning. Although we have focused on supervised learning, the framework presented in this chapter can be adapted to other machine learning paradigms. For example, we note that the formula of Theorem 2.1 can be directly transposed to compute the loss gradients with respect to input variables of the network:

\[
\frac{\partial L}{\partial x}(\theta, x, y) = \left.\frac{d}{d\beta}\right|_{\beta=0} \frac{\partial E}{\partial x}\bigl(\theta, x, s_\star^\beta\bigr). \tag{2.19}
\]
This formula may be useful in applications where one wants to do gradient descent in the input space, e.g. image synthesis. This may also be useful in the setting of generative adversarial networks [Goodfellow et al., 2014], in which we need to compute the loss gradients with respect to inputs of the discriminator network, to further propagate error signals in the generator network. The framework presented in this chapter may also be adapted to model-free reinforcement learning algorithms such as temporal difference (TD) learning (e.g. Q-learning). Finally, the EqProp training procedure has also been used in the meta-learning setting to train the meta-parameters of a model, a method called contrastive meta-learning [Zucchet et al., 2021]. We briefly present this framework in section 6.3.

Chapter 3

Training Continuous Hopfield Networks with Equilibrium Propagation

In the previous chapter, we have presented equilibrium propagation (EqProp) as a general learning algorithm for energy-based models. In this chapter, we use EqProp to train a class of energy-based models called gradient systems. In particular, we apply EqProp to continuous Hopfield networks, a class of neural networks that has inspired neuroscience and neuromorphic computing since the 1980s. The present chapter is essentially a compilation of Scellier and Bengio [2017, 2019], and is organized as follows.
• In section 3.1, we apply EqProp to a class of continuous-time dynamical systems called gradient systems. In a gradient system, the state dynamics descend the gradient of a scalar function (the energy function) and stabilise to a minimum of that energy function. Thus, in this setting, equilibrium states correspond to energy minima. We provide an analytical formula for the transient states of the system between the free state and the nudged state of the EqProp training process, and we link these transient states to the recurrent backpropagation algorithm of Almeida [1987] and Pineda [1987].
• In section 3.2, we apply EqProp to the continuous Hopfield model, an energy-based neural network model described by an energy function called the Hopfield energy. The gradient dynamics associated with the Hopfield energy yields the neural dynamics in Hopfield networks: neurons are seen as leaky integrator neurons, with the constraint that synapses are bidirectional and symmetric. In addition, the update rule of EqProp for each synapse is local (more specifically Hebbian).
• In section 3.3, we present numerical experiments on deep Hopfield networks trained with EqProp on the MNIST digit classification task.
• In section 3.4, we study the relationship between EqProp and the contrastive Hebbian learning algorithm of Movellan [1991].

3.1. Gradient Systems

In this section, we present a theoretical result which holds for arbitrary energy functions and cost functions. This section, which deals with the concepts of energy and cost functions in the abstract, is largely independent of the rest of this chapter. The reader who is eager to see how Hopfield networks can be trained with EqProp may skip this section and go straight to the next one.

3.1.1. Gradient Systems as Energy-Based Models

We have seen in Chapter 2 that EqProp is an algorithm to train systems that possess equilibrium states, i.e. states characterized by a variational equation of the form ∂E/∂s(θ, x, s_⋆) = 0, where E(θ, x, s) is a scalar function called the energy function. Recall that θ is the set of adjustable parameters of the system, and x is an input. We have called such systems energy-based models. The class of energy-based models that we study here is that of systems whose dynamics spontaneously minimizes the energy function E by following its gradient. In such a system, called a gradient system, the state follows the dynamics
\[
\frac{d s_t}{dt} = -\frac{\partial E}{\partial s}(\theta, x, s_t). \tag{3.1}
\]

Here s_t denotes the state of the system at time t. The energy of the system decreases until ds_t/dt = 0, and the equilibrium state s_⋆ reached after convergence of the dynamics is an energy minimum (either local or global). The function E is also sometimes called a Lyapunov

function for the dynamics of st. In this setting, equilibrium states are stable: if the state is slightly perturbed around equilibrium, the dynamics will tend to bring the system back to equilibrium. For this reason, such equilibrium states are also called ‘attractors’ or ‘retrieval states’, because the system’s dynamics can ‘retrieve’ them if they are only partially known. Thus, gradient systems can recover incomplete data, by storing ‘memories’ in their point attractors. In this manuscript, we are more specifically interested in the supervised learning problem, where the loss to optimize is of the form

\[
L = C(s_\star, y). \tag{3.2}
\]
After training is complete, the model can be used to ‘retrieve’ a label y associated to a given input x.

3.1.2. Training Gradient Systems with Equilibrium Propagation

In a gradient system, EqProp takes the following form. In the first phase, or free phase, the state of the system follows the gradient of the energy (Eq. 3.1). At the end of the first

phase the system is at equilibrium (s_⋆). In the second phase, or nudged phase, starting from the equilibrium state s_⋆, a term −β ∂C/∂s (where β > 0 is a hyperparameter called the nudging factor) is added to the dynamics of the state and acts as an external force nudging the system dynamics towards decreasing the cost C. Denoting s_t^β the state of the system at time t in the second phase (which depends on the value of the nudging factor β), the dynamics is defined as¹
\[
s_0^\beta = s_\star \quad \text{and} \quad \forall t \geq 0, \quad \frac{d s_t^\beta}{dt} = -\frac{\partial E}{\partial s}\bigl(\theta, x, s_t^\beta\bigr) - \beta\, \frac{\partial C}{\partial s}\bigl(s_t^\beta, y\bigr). \tag{3.3}
\]
The system eventually settles to a new equilibrium state s_⋆^β. Recall from Theorem 2.1 that the gradient of the loss L can be estimated based on the two equilibrium states s_⋆ and s_⋆^β. Specifically, in the limit β → 0,
\[
\lim_{\beta \to 0} \frac{1}{\beta} \left( \frac{\partial E}{\partial \theta}\bigl(\theta, x, s_\star^\beta\bigr) - \frac{\partial E}{\partial \theta}(\theta, x, s_\star) \right) = \frac{\partial L}{\partial \theta}. \tag{3.4}
\]
Furthermore, if the energy function has the sum-separability property (as defined by Eq. 2.5), then the learning rule for each parameter θ_k is local:
\[
\lim_{\beta \to 0} \frac{1}{\beta} \left( \frac{\partial E_k}{\partial \theta_k}\bigl(\theta_k, \{x, s_\star^\beta\}_k\bigr) - \frac{\partial E_k}{\partial \theta_k}\bigl(\theta_k, \{x, s_\star\}_k\bigr) \right) = \frac{\partial L}{\partial \theta_k}. \tag{3.5}
\]
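The following sketch (an illustrative toy, with all functions, sizes and step sizes assumed for the example, not code from the thesis) simulates both phases of EqProp in a gradient system by Euler integration of Eqs. (3.1) and (3.3), and compares the gradient estimate of Eq. (3.4) to a ground truth obtained by finite differences through the free-phase relaxation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
W = rng.normal(scale=0.2, size=(n, n)); W = (W + W.T) / 2   # symmetric weights (theta)
b = rng.normal(scale=0.3, size=n)                            # fixed input drive
y = rng.normal(size=n)                                       # target

sig = np.tanh
dsig = lambda s: 1.0 - np.tanh(s) ** 2

def dE_ds(W, s):   # E(W, s) = 0.5*||s||^2 - 0.5*sig(s)^T W sig(s) - b^T sig(s)
    return s - dsig(s) * (0.5 * (W + W.T) @ sig(s) + b)

def dE_dW(s):      # dE/dW_ij = -0.5 * sig(s_i) * sig(s_j)  (local)
    return -0.5 * np.outer(sig(s), sig(s))

def dC_ds(s):      # C(s, y) = 0.5*||s - y||^2
    return s - y

def relax(W, beta, s0, dt=0.05, steps=4000):
    s = s0.copy()
    for _ in range(steps):
        s -= dt * (dE_ds(W, s) + beta * dC_ds(s))   # Euler integration of Eq. (3.1)/(3.3)
    return s

beta = 1e-3
s_free = relax(W, 0.0, np.zeros(n))                       # free phase
s_nudged = relax(W, beta, s_free)                         # nudged phase, started from the free state
grad_eqprop = (dE_dW(s_nudged) - dE_dW(s_free)) / beta    # Eq. (3.4)

# Ground truth by finite differences on the loss L(W) = C(s_free(W), y)
def loss(W):
    s = relax(W, 0.0, np.zeros(n))
    return 0.5 * np.sum((s - y) ** 2)

eps = 1e-5
grad_fd = np.zeros_like(W)
for i in range(n):
    for j in range(n):
        Wp = W.copy(); Wp[i, j] += eps
        Wm = W.copy(); Wm[i, j] -= eps
        grad_fd[i, j] = (loss(Wp) - loss(Wm)) / (2 * eps)

print(np.max(np.abs(grad_eqprop - grad_fd)))   # small, up to O(beta) and integration error
```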

3.1.3. Transient Dynamics

Note that the learning rule of Eqs. 3.4-3.5 only depends on the equilibrium states s_⋆ and s_⋆^β, not on the specific trajectory that the system follows to reach them. Indeed, as we have seen in the previous chapter, EqProp applies to any energy-based model, not just gradient systems. But under the assumption of a gradient dynamics, we can say more about the transient states, when the system gradually moves from the free state (s_⋆) towards the nudged state (s_⋆^β): we show that the transient states (s_t^β for t ≥ 0) perform gradient computation with respect to a function called the projected cost function. Recall that in the free phase, the system follows the dynamics of Eq. 3.1. In particular, the state s_t at time t ≥ 0 depends not just on θ and x, but also on the initial state s_0 at time t = 0. Let us define the projected cost function

\[
L_t(\theta, s_0) = C(s_t), \tag{3.6}
\]
where we omit x and y for brevity of notation. L_t(θ, s_0) is the cost of the state projected a duration t in the future, when the system starts from s_0 and follows the dynamics of the free phase. Note that L_t depends on θ and s_0 (as well as x) implicitly through s_t. For fixed s_0, the process (L_t(θ, s_0))_{t≥0} represents the successive cost values taken by the state of the

¹ As discussed in the case of Eq. 2.6, we can also define ds_t^β/dt = −∂E/∂s(θ, x, s_t^β) − β ∂C/∂s(s_⋆, y) without changing the conclusions of the theoretical results.

system along the free dynamics when it starts from the initial state s_0. In particular, for t = 0, the projected cost is simply the cost of the initial state, i.e. L_0(θ, s_0) = C(s_0). As t → ∞, we have s_t → s_⋆ and therefore L_t(θ, s_0) → C(s_⋆) = L(θ), i.e. the projected cost converges to the loss at equilibrium.

The following result shows that the transient states of EqProp (s_t^β) can be expressed in terms of the projected cost function (L_t), when s_0 = s_⋆.

Theorem 3.1 (Scellier and Bengio [2019]). The following identities hold for any t ≥ 0:
\[
\lim_{\beta \to 0} \frac{1}{\beta} \left( \frac{\partial E}{\partial \theta}\bigl(\theta, s_t^\beta\bigr) - \frac{\partial E}{\partial \theta}(\theta, s_\star) \right) = \frac{\partial L_t}{\partial \theta}(\theta, s_\star), \tag{3.7}
\]
\[
\lim_{\beta \to 0} \frac{1}{\beta} \frac{d s_t^\beta}{dt} = -\frac{\partial L_t}{\partial s}(\theta, s_\star). \tag{3.8}
\]
Furthermore, if the energy function has the sum-separability property (as defined by Eq. 2.5), then
\[
\lim_{\beta \to 0} \frac{1}{\beta} \left( \frac{\partial E_k}{\partial \theta_k}\bigl(\theta_k, \{s_t^\beta\}_k\bigr) - \frac{\partial E_k}{\partial \theta_k}\bigl(\theta_k, \{s_\star\}_k\bigr) \right) = \frac{\partial L_t}{\partial \theta_k}(\theta, s_\star). \tag{3.9}
\]

Proof. Eq. 3.7-3.8 follow directly from Lemma 3.2 (next subsection). Eq. 3.9 follows from Eq. 3.7 and the definition of sum-separability. 

The left-hand side of Eq. 3.7 represents the gradient provided by EqProp if we substitute s_t^β for s_⋆^β in the gradient formula (Eq. 3.4). This corresponds to a truncated version of EqProp, where the second phase (nudged phase) is halted before convergence to the nudged equilibrium state. Eq. 3.7 provides an analytical formula for this truncated gradient in terms of the projected cost function, when s_0 = s_⋆.

The left-hand side of Eq. 3.8 is the temporal derivative of s_t^β rescaled by a factor 1/β. In essence, Eq. 3.8 shows that, in the second phase of EqProp (nudged phase), the temporal derivative of the state codes for gradient information (namely the gradients of the projected cost function, when s_0 = s_⋆).

3.1.4. Recurrent Backpropagation

In this section, we prove Theorem 3.1. In doing so, we also establish a link between EqProp and the recurrent backpropagation algorithm of Almeida [1987] and Pineda [1987], which we briefly present below.

First, let us introduce the temporal processes (S_t, Θ_t) and (S̃_t, Θ̃_t) defined by
\[
\forall t \geq 0, \quad S_t = \frac{\partial L_t}{\partial s}(\theta, s_\star), \quad \Theta_t = \frac{\partial L_t}{\partial \theta}(\theta, s_\star), \tag{3.10}
\]
and
\[
\forall t \geq 0, \quad \widetilde{S}_t = -\lim_{\beta \to 0} \frac{1}{\beta} \frac{d s_t^\beta}{dt}, \quad \widetilde{\Theta}_t = \lim_{\beta \to 0} \frac{1}{\beta} \left( \frac{\partial E}{\partial \theta}\bigl(\theta, s_t^\beta\bigr) - \frac{\partial E}{\partial \theta}(\theta, s_\star) \right). \tag{3.11}
\]

Fig. 4. Illustration of Theorem 3.1 on a toy example. Dashed lines (in black) represent five randomly chosen coordinates of S̃_t (left) and five randomly chosen coordinates of Θ̃_t (right). Solid colored lines represent the corresponding coordinates in S_t (left) and in Θ_t (right). The processes S_t, S̃_t, Θ_t and Θ̃_t are defined by Eqs. 3.10-3.11. The figure was produced by modifying the code² of Ernoult et al. [2019].

The processes S_t and S̃_t take values in the state space (the space of the state variable s). The processes Θ_t and Θ̃_t take values in the parameter space (the space of the parameter variable θ). Using these notations, Theorem 3.1 states that S_t = S̃_t and Θ_t = Θ̃_t for every t ≥ 0. These identities are a direct consequence of the following lemma.

Lemma 3.2. Both the processes (S_t, Θ_t) and (S̃_t, Θ̃_t) are solutions of the same (linear) differential equation:
\[
S_0 = \frac{\partial C}{\partial s}(s_\star), \tag{3.12}
\]
\[
\Theta_0 = 0, \tag{3.13}
\]
\[
\frac{d}{dt} S_t = -\frac{\partial^2 E}{\partial s^2}(\theta, s_\star) \cdot S_t, \tag{3.14}
\]
\[
\frac{d}{dt} \Theta_t = -\frac{\partial^2 E}{\partial \theta\, \partial s}(\theta, s_\star) \cdot S_t. \tag{3.15}
\]

By uniqueness of the solution, the processes (S_t, Θ_t) and (S̃_t, Θ̃_t) are equal.

We refer to Scellier and Bengio [2019] for a proof of this result. Since the differential

equation of Lemma 3.2 is linear with constant coefficients, we can express St and Θt using

² https://github.com/ernoult/updatesEPgradientsBPTT

closed-form formulas. Specifically, S_t = exp(−t ∂²E/∂s²(θ, s_⋆)) · ∂C/∂s(s_⋆) and Θ_t = −∂²E/∂θ∂s(θ, s_⋆) · [∂²E/∂s²(θ, s_⋆)]⁻¹ · [Id − exp(−t ∂²E/∂s²(θ, s_⋆))] · ∂C/∂s(s_⋆).

Lemma 3.2 also suggests an alternative procedure to compute the parameter gradients of the loss L numerically. This procedure, known as Recurrent Backpropagation (RBP), was introduced independently by Almeida [1987] and Pineda [1987]. Specifically, RBP consists

of the following two phases. The first phase is the same as the free phase of EqProp: s_t follows the free dynamics (Eq. 3.1) and relaxes to the equilibrium state s_⋆. The state s_⋆ is necessary for evaluating ∂²E/∂s²(θ, s_⋆) and ∂²E/∂θ∂s(θ, s_⋆), which the second phase requires. In the second phase, S_t and Θ_t are computed iteratively for increasing values of t using Eqs. 3.12-3.15. Finally, Θ_t provides the desired loss gradient in the limit t → ∞. To see this, we first note that Lemma 3.2 tells us that the vector Θ_t computed by this procedure is equal to Θ̃_t for any t ≥ 0. Then, by definition, Θ̃_t = lim_{β→0} (1/β)(∂E/∂θ(θ, s_t^β) − ∂E/∂θ(θ, s_⋆)). It follows that, as t → ∞, we have Θ̃_t → lim_{β→0} (1/β)(∂E/∂θ(θ, s_⋆^β) − ∂E/∂θ(θ, s_⋆)) = ∂L/∂θ.

An important benefit of EqProp over RBP is that EqProp requires only one kind of dynamics for both phases of training, whereas RBP requires a special computational circuit in the second phase for computing the gradients. The original RBP algorithm was described for a general state-to-state dynamics. Here, we have presented RBP in the particular case of gradient dynamics. We refer to LeCun et al. [1988] for a more general derivation of RBP based on the adjoint method.
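As a numerical sanity check of the RBP procedure (an illustrative sketch with an assumed quadratic energy, not code from the thesis), the second-phase ODE of Lemma 3.2 can be integrated with Euler steps and the resulting Θ_t compared against the analytic loss gradient:

```python
import numpy as np

# Assumed toy energy: E(theta, s) = 0.5 s^T A s - theta^T s, cost C(s) = 0.5 ||s - y||^2.
# Free state: s_star = A^{-1} theta, and the loss gradient is dL/dtheta = A^{-1}(s_star - y).
rng = np.random.default_rng(0)
n = 5
B = rng.normal(size=(n, n))
A = B @ B.T + n * np.eye(n)             # d2E/ds2, symmetric positive definite
theta = rng.normal(size=n)
y = rng.normal(size=n)

s_star = np.linalg.solve(A, theta)       # first phase of RBP: relax to the free state

# Second phase of RBP (Eqs. 3.12-3.15): here d2E/(dtheta ds) = -Id, so dTheta/dt = S_t.
dt, T = 0.01, 5000
S = s_star - y                           # S_0 = dC/ds(s_star)
Theta = np.zeros(n)                      # Theta_0 = 0
for _ in range(T):
    Theta += dt * S
    S -= dt * (A @ S)                    # dS/dt = -(d2E/ds2) S

grad_true = np.linalg.solve(A, s_star - y)
print(np.max(np.abs(Theta - grad_true)))   # small once t = T*dt is large enough
```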

3.2. Continuous Hopfield Networks

In the previous section we have presented a theoretical result which is generic and involves the energy function E and cost function C in their abstract form. In this section, we study EqProp in the context of a neural network model called the continuous Hopfield model [Hopfield, 1984].

3.2.1. Hopfield Energy

A Hopfield network is a neural network with the following characteristics. The state of

a neuron i is described by a scalar si, loosely representing its membrane voltage. The state of a synapse connecting neuron i to neuron j is described by a real number Wij representing its efficacy (or ‘strength’). The notation σ(si) is further used to denote the firing rate of neuron i. The function σ is called activation function ; it takes a real number as input, and returns a real number as output. Using the formalism of the previous section, the state of the system is the vector s = (s1, s2, . . . , sN ) where N is the number of neurons in the

network, and the set of parameters to be adjusted is θ = {Wij}ij. As we will see shortly, one biologically unrealistic requirement of the Hopfield model is that synapses are assumed to

be bidirectional and symmetric: the synapse connecting i to j shares the same weight value

as the synapse connecting j to i, i.e. Wij = Wji.

Hopfield Energy. Hopfield [1984] introduced the following energy function³:
\[
E(\theta, s) = \frac{1}{2} \sum_i s_i^2 - \sum_{i<j} W_{ij}\, \sigma(s_i)\, \sigma(s_j). \tag{3.16}
\]
Thus, the gradient dynamics for neuron s_i with respect to the Hopfield energy is given by the formula
\[
\frac{d s_i}{dt} = \sigma'(s_i) \sum_{j \neq i} W_{ij}\, \sigma(s_j) - s_i. \tag{3.17}
\]
This dynamics is reminiscent of the leaky integrator neuron model, a simplified neuron model commonly used in neuroscience. The main difference with the standard leaky-integrator neuron model is the fact that synaptic weights are constrained to be bidirectional and symmetric, a biologically unrealistic constraint often referred to as the weight transport problem. Another difference is the presence of the term σ'(s_i), which modulates the total input to neuron i.
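The step from Eq. (3.16) to Eq. (3.17) (taking ds_i/dt = −∂E/∂s_i, with symmetric weights and no self-connections) can be checked numerically; the short sketch below does so with finite differences, under illustrative choices of σ and W.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4
W = rng.normal(size=(N, N)); W = (W + W.T) / 2; np.fill_diagonal(W, 0.0)  # symmetric, no self-connections
s = rng.normal(size=N)
sigma = np.tanh
dsigma = lambda u: 1.0 - np.tanh(u) ** 2

def E(s):
    # Hopfield energy of Eq. (3.16): 0.5 * sum_i s_i^2 - sum_{i<j} W_ij sigma(s_i) sigma(s_j)
    pair = sum(W[i, j] * sigma(s[i]) * sigma(s[j]) for i in range(N) for j in range(i + 1, N))
    return 0.5 * np.sum(s ** 2) - pair

# Right-hand side of Eq. (3.17): sigma'(s_i) * sum_{j != i} W_ij sigma(s_j) - s_i
rhs = dsigma(s) * (W @ sigma(s)) - s

# Numerical check that Eq. (3.17) equals -dE/ds_i (central finite differences)
eps = 1e-6
for i in range(N):
    e = np.zeros(N); e[i] = eps
    dE_dsi = (E(s + e) - E(s - e)) / (2 * eps)
    assert abs(rhs[i] + dE_dsi) < 1e-6
print("Eq. (3.17) matches the negative gradient of the Hopfield energy (3.16).")
```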

Squared Error. In the supervised setting that we study here, a set of neurons are input neurons, denoted x, and are always clamped to their input values. Among the ‘free’ neurons (s), a subset of them are output neurons (denoted o), meaning that they represent the network’s output. The network’s prediction is the state of output neurons at equilibrium

(denoted o_⋆). We call all other neurons the hidden neurons and denote them h. Thus, the state of the network is s = (h, o). The cost function considered here is the squared error
\[
C(s, y) = \frac{1}{2}\, \lVert o - y \rVert^2, \tag{3.18}
\]
which measures the discrepancy between the state of output neurons (o) and their target values (y).

Total Energy. One of the novelties of EqProp with respect to prior learning algorithms for energy-based models is the total energy function F , which takes the form F = E + β C, where β is a real-valued scalar (the nudging factor). The function C not only represents the cost to minimize, but also contributes to the total energy of the system by acting like an external potential for the output neurons (o). Thus, the total energy F is the sum of two potential energies: an ‘internal potential’ (E) that models the interactions within the network, and an ‘external potential’ (β C) that models how the targets influence the output

neurons. The resulting gradient dynamics ds_t/dt = −∂E/∂s − β ∂C/∂s consists of two ‘forces’ which

³ The energy function of Eq. 3.16 is in fact the one proposed by Bengio et al. [2017]. The energy function introduced by Hopfield is slightly different, but this technical detail is not essential for our purpose.

act on the temporal derivative of s_t. The ‘internal force’ (induced by E) is that of a leaky integrator neuron (Eq. 3.17). The ‘external force’ (induced by β C) on s = (h, o) takes the form:
\[
-\beta\, \frac{\partial C}{\partial h} = 0 \quad \text{and} \quad -\beta\, \frac{\partial C}{\partial o} = \beta\, (y - o). \tag{3.19}
\]
This external force acts on output neurons only: it can pull them (if β ≥ 0) towards their target values (y), or repel them (if β ≤ 0). The nudging factor β controls the strength of this interaction between output neurons and targets. In particular, when β = 0, the output neurons are not sensitive to the targets.

Fig. 5. A deep Hopfield network (DHN). Input x is clamped. Neurons s include “hidden layers” h1 and h2, and “output layer” o (the layer where the prediction is read). Target y has the same dimension as output o. Connections between neurons are bidirectional and have symmetric weights. Such a network can be trained with EqProp. In the nudged phase (second phase of training), the nudging factor β scales the “external force” β(y − o) that attracts output neurons (o) towards their target values (y).

3.2.2. Training Continuous Hopfield Networks with Equilibrium Propagation

Consider a deep Hopfield network (DHN) of the kind depicted in Figure 5. For each training example (x, y) in the dataset, EqProp training proceeds as follows.

Free Phase. At inference, inputs x are clamped, and both the hidden neurons (h¹ and h²) and output neurons (o) evolve freely, following the gradient of the energy. The hidden and output neurons subsequently stabilize to an energy minimum, called the free state and denoted s_⋆ = (h_⋆¹, h_⋆², o_⋆). The state of output neurons at equilibrium (o_⋆) plays the role of prediction for the model.

Nudged Phase. After relaxation to the free state s_⋆, the target y is observed, and the nudging factor β takes on a positive value, gradually driving the state of output neurons (o) towards y. Since the external force only acts on the output neurons, the hidden layers (h¹ and h²) are initially at equilibrium at the beginning of the nudged phase. The perturbation introduced at output neurons gradually propagates backwards along the layers of the network, until the system settles to a new equilibrium state (s_⋆^β).

Proposition 3.3 (Scellier and Bengio [2017]). Denote s_i^0 and s_i^β the free state and nudged state of neuron i, respectively. Then, we have the following formula to estimate the gradient of the loss L = ½ ‖o_⋆ − y‖²:
\[
\lim_{\beta \to 0} \frac{1}{\beta} \Bigl( \sigma\bigl(s_i^\beta\bigr)\, \sigma\bigl(s_j^\beta\bigr) - \sigma\bigl(s_i^0\bigr)\, \sigma\bigl(s_j^0\bigr) \Bigr) = -\frac{\partial L}{\partial W_{ij}}. \tag{3.20}
\]

Proof. This is a direct consequence of the main Theorem 2.1, applied to the Hopfield energy function (Eq. 3.16) and the squared error cost function (Eq. 3.18). Notice that the Hopfield energy has the sum-separability property (as defined by Eq. 2.5), with each factor

of the form Eij(Wij, si, sj) = −Wijσ(si)σ(sj). 

Proposition 3.3 suggests for each synapse W_ij the update rule
\[
\Delta W_{ij} = \frac{\eta}{\beta} \Bigl( \sigma\bigl(s_i^\beta\bigr)\, \sigma\bigl(s_j^\beta\bigr) - \sigma\bigl(s_i^0\bigr)\, \sigma\bigl(s_j^0\bigr) \Bigr), \tag{3.21}
\]
where η is a learning rate. This learning rule is a form of contrastive Hebbian learning (CHL), with a Hebbian term at one equilibrium state and an anti-Hebbian term at the other equilibrium state. We will discuss in Section 3.4 the relationship between EqProp and the CHL algorithm of Movellan [1991].
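In vectorized form, the update of Eq. (3.21) for a whole weight matrix is a difference of two outer products of firing rates, one per phase; the sketch below is illustrative (the activation function, layer sizes and states are assumptions made for the example).

```python
import numpy as np

def eqprop_weight_update(s_pre_free, s_post_free, s_pre_nudged, s_post_nudged, eta, beta):
    """Contrastive Hebbian update of Eq. (3.21) for all synapses between two layers at once.

    s_pre_*, s_post_*: pre- and post-synaptic neuron states in the free / nudged phases.
    """
    sigma = np.tanh  # assumed activation function
    hebbian = np.outer(sigma(s_post_nudged), sigma(s_pre_nudged))    # at the nudged state
    anti_hebbian = np.outer(sigma(s_post_free), sigma(s_pre_free))   # at the free state
    return (eta / beta) * (hebbian - anti_hebbian)

# Example with made-up states for a 3-unit pre layer and 2-unit post layer
rng = np.random.default_rng(0)
dW = eqprop_weight_update(rng.normal(size=3), rng.normal(size=2),
                          rng.normal(size=3), rng.normal(size=2), eta=0.1, beta=0.5)
print(dW.shape)   # (2, 3): one update per synapse W_ij
```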

3.2.3. ‘Backpropagation’ of Error Signals

It is interesting to note that EqProp is similar in spirit to the backpropagation algorithm [Rumelhart et al., 1988]. The free phase of EqProp, which corresponds to inference, plays the role of the forward pass in a feedforward net. The nudged phase of EqProp is similar to the backward pass of backpropagation, in that the target output is revealed and it involves the propagation of loss gradient signals. This analogy is even more apparent in a layered network like the one depicted in Fig. 5: in the nudged phase of EqProp, error signals (back-)propagate across the layers of the network, from output neurons to input neurons. Theorem 3.1 gives a more quantitative description of how gradient computation is performed in the nudged phase, with the temporal derivatives of neural activity carrying gradient signals. Thus, like backprop, the learning process in EqProp is driven by an error signal; but unlike backprop, neural computation in EqProp corresponds to both inference and error back-propagation.

The idea that error signals in neural networks can be encoded in the temporal derivatives of neural activity was also explored by Hinton and McClelland [1988], Movellan [1991], O'Reilly [1996], and has been recently formulated as a hypothesis for neuroscience [Lillicrap et al., 2020]. Because error signals are propagated in the network via the neural dynamics, synaptic plasticity can be driven directly by the dynamics of the neurons. Indeed, the global update of EqProp (Eq. 3.21) is equal to the temporal integration of infinitesimal updates
\[
dW_{ij} = \frac{\eta}{\beta}\, d\bigl(\sigma(s_i)\, \sigma(s_j)\bigr) \tag{3.22}
\]
over the nudged phase, when the neurons gradually move from their free state (s_⋆) to their nudged state (s_⋆^β). This suggests an alternative method to implement the global weight update: in the first phase, when the neurons relax to the free state, no synaptic update occurs (ΔW_ij = 0); in the second phase, the real-time update of Eq. 3.22 is performed while the neurons evolve from their free state to their nudged state. This idea is formalized and tested numerically in Ernoult et al. [2020].

From a biological perspective, perhaps the most unrealistic assumption in this model of credit assignment is the requirement of symmetric weights. This constraint can be relaxed at the cost of computing a biased gradient [Scellier et al., 2018, Ernoult et al., 2020, Laborieux et al., 2021, Tristany et al., 2020].

3.3. Numerical Experiments on MNIST

In this section, we present the experimental results of Scellier and Bengio [2017]. In these simulations, we train deep Hopfield networks of the kind depicted in Fig. 5. Our networks have no skip-layer connections and no lateral connections⁴. We recall that these Hopfield networks, unlike feedforward networks, are recurrently connected, with bidirectional and symmetric connections (i.e. the synapse from neuron i to neuron j shares the same weight value as the synapse from neuron j to neuron i).

We train these Hopfield networks on the MNIST digits classification task [LeCun et al., 1998]. The MNIST dataset (the ‘modified’ version of the National Institute of Standards and Technology dataset) of handwritten digits is composed of 60,000 training examples and 10,000 test examples. Each example x in the dataset is a 28 × 28 gray-scaled image and comes with a label y ∈ {0, 1, . . . , 9} indicating the digit that the image represents. Given an input x, the network's prediction ŷ is the index of the output neuron (among the 10 output

⁴ We stress that the models trainable by EqProp are not limited to the chain-like architecture of Fig. 5. Other works have studied the effect of adding skip-layer connections [Gammell et al., 2020] and introducing sparsity [Tristany et al., 2020].

neurons) whose activity at equilibrium is maximal, that is
\[
\hat{y} = \underset{i \in \{0, 1, \ldots, 9\}}{\arg\max}\; o_{\star, i}. \tag{3.23}
\]
The network is optimized by stochastic gradient descent (SGD). The process to perform one training iteration on a sample of the training set (i.e. to compute the corresponding gradient and to take one step of SGD) is the one described in section 3.2.2. For efficiency of the experiments, we use minibatches of 20 training examples.

3.3.1. Implementation Details

The hyperparameters chosen for each model are shown in Table 1. The code is available⁵.

Architecture. We train deep Hopfield networks with 1, 2 and 3 hidden layers. The input layer consists of 28 × 28 = 784 neurons. The hidden layers consist of 500 hidden neurons each. The output layer consists of 10 output neurons.

Weight initialization. The weights of the network are initialized⁶ according to the Glorot-Bengio initialization scheme [Glorot and Bengio, 2010], i.e. each weight matrix is initialized by drawing i.i.d. samples uniformly at random in the range [L, U], where L = −√6 / √(n_i + n_{i+1}) and U = √6 / √(n_i + n_{i+1}), with n_i the fan-in and n_{i+1} the fan-out of the weight matrix.

Implementation of the neural dynamics. Recall that, for a fixed input-target pair (x, y), the total energy is F (θ, β, s) = E(θ, x, s) + β C(s, y). We implement the gradient

dynamics ds_t/dt = −∂F/∂s(θ, β, s_t) using the Euler scheme, meaning that we discretize time into short time lapses of duration ε and iteratively update the state of the network (hidden and output neurons) according to
\[
s_{t+1} = s_t - \epsilon\, \frac{\partial F}{\partial s}(\theta, \beta, s_t). \tag{3.24}
\]
This process can be thought of as one step of gradient descent (in the state space) on the total energy F, with learning rate ε. In practice we find that it is necessary to restrict the space for each state variable (i.e. each neuron) to a bounded interval; we choose the interval [0, 1]. This amounts to using the modified version of the Euler scheme:
\[
s_{t+1} = \min\Bigl( \max\Bigl( 0,\; s_t - \epsilon\, \frac{\partial F}{\partial s}(\theta, \beta, s_t) \Bigr),\; 1 \Bigr). \tag{3.25}
\]
We choose ε = 0.5 in the simulations. The number of iterations in the free phase is denoted T. The number of iterations in the nudged phase is denoted K.
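The clipped Euler update of Eq. (3.25) is straightforward to implement; below is a minimal sketch of the relaxation for a one-hidden-layer deep Hopfield network (the activation function, weight ranges and hyperparameter values are illustrative assumptions, and only the free phase is run).

```python
import numpy as np

rho = lambda u: np.clip(u, 0.0, 1.0)                       # assumed activation ("hard sigmoid")
drho = lambda u: ((u >= 0.0) & (u <= 1.0)).astype(float)   # its derivative

def relax(x, y, W1, W2, beta, h, o, eps=0.5, steps=100):
    """Clipped Euler scheme of Eq. (3.25) on the total energy F = E + beta*C
    for a DHN with layers x (clamped) - h - o and weights W1 (h<-x), W2 (o<-h)."""
    for _ in range(steps):
        dF_dh = h - drho(h) * (W1 @ rho(x) + W2.T @ rho(o))
        dF_do = o - drho(o) * (W2 @ rho(h)) + beta * (o - y)
        h = np.clip(h - eps * dF_dh, 0.0, 1.0)              # Eq. (3.25)
        o = np.clip(o - eps * dF_do, 0.0, 1.0)
    return h, o

# Illustrative sizes: 784 inputs, 500 hidden, 10 outputs
rng = np.random.default_rng(0)
W1 = rng.uniform(-0.05, 0.05, size=(500, 784))
W2 = rng.uniform(-0.05, 0.05, size=(10, 500))
x = rng.uniform(0, 1, size=784)                             # one (fake) input image
y = np.eye(10)[3]                                           # one-hot target

h0, o0 = np.zeros(500), np.zeros(10)
h_free, o_free = relax(x, y, W1, W2, beta=0.0, h=h0, o=o0, steps=100)   # free phase (T steps)
print(o_free)                                               # the prediction is the argmax of o_free
```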

⁵ https://github.com/bscellier/Towards-a-Biologically-Plausible-Backprop
⁶ Little is known about how to initialise the weights of recurrent neural networks with static input. More exploration is needed to find appropriate initialisation schemes for such networks.

Number of iterations in the free phase (T). We find experimentally that for the network to be successfully trained, it is necessary that the equilibrium state be reached with very high precision in the free phase (otherwise the gradient estimate of EqProp is unreliable). As a consequence, we require a large number of iterations (denoted T) to reach this equilibrium state. Moreover, we find that T grows fast as the number of layers increases (see Table 1). Nevertheless, we will see in Chapter 5 that we can experimentally cut down the number of iterations by a factor of five by rewriting the free phase dynamics differently. Importantly, we stress that the large number of time steps required in the free phase is only a concern for computer simulations; we will see in Chapter 4 that inference can potentially be extremely fast if performed appropriately on analog hardware (by using the physics of the circuit, rather than numerical optimization on conventional computers).

Number of iterations in the nudged phase (K). During the second phase of training, we find experimentally that full relaxation to the nudged equilibrium state is not necessary. This observation is also partly justified by Theorem 3.1, which gives an explicit formula for the ‘truncated gradient’ provided by EqProp when the nudged phase is halted before convergence. As a heuristic, we choose K (the number of iterations in the nudged phase) proportional to the number of layers, so that the ‘error signals’ are able to propagate from output neurons back to input neurons.

Nudging factor (β). In spite of its intrinsic bias (Lemma A.1), we find that the one-sided gradient estimator performs well on MNIST (as also observed by Ernoult et al. [2019]). We choose β = 1 in the experiments. Although it is not crucial, we find that the test accuracy is slightly improved by choosing the sign of β at random in the nudged phase of each training iteration (with probability p(β = 1) = 1/2 and p(β = −1) = 1/2). Randomizing β indeed has the effect of cancelling on average the O(β)-error term of the one-sided gradient estimator. While this is not necessary on MNIST, we will see in Chapter 5 that on a more complex task such as CIFAR-10, unbiasing the gradient estimator is necessary, and that the symmetric nudging estimator (Eq. 2.10) further helps stabilize training and improve test accuracy.

Learning rates. We find experimentally that we need different learning rates for the weight matrices of different layers. We choose these learning rates heuristically as follows. Denote by h⁰, h¹, · · · , h^N the layers of the network (where h⁰ = x and h^N = o), and by W_k the weight matrix between the layers h^{k−1} and h^k. We choose the learning rate α_k for W_k proportionally to ‖W_k‖ / E[‖∇_{W_k}‖], where E[‖∇_{W_k}‖] represents the norm of the EqProp gradient for layer W_k, averaged over training examples.

3.3.2. Experimental Results

Table 1 (top) presents the experimental results of Scellier and Bengio [2017]. These experiments aim at demonstrating that the EqProp training scheme is able to perfectly (over)fit the training dataset, i.e. to get the error rate on the training set down to 0.00%. To achieve this, we use the following trick to reach the equilibrium state of the first phase more easily: at each epoch of training, for each example in the training set, we store the corresponding equilibrium state (i.e. the state of the hidden and output neurons at the end of the free phase), and we use this configuration as a starting point for the next free phase relaxation on that example. This method, which is similar to the PCD (Persistent Contrastive Divergence) algorithm for sampling from the equilibrium distribution of the Boltzmann machine [Tieleman, 2008], makes it possible to speed up the first phase and reach the equilibrium state with higher precision. However, this technique hurts generalization performance. Table 1 (bottom) shows the experimental results of Ernoult et al. [2019], which do not use this technique: during training, for each training example in the dataset, the state of the network is initialized to zero at the beginning of each free phase relaxation. The resulting test error rate is lower, though the number of iterations required in the free phase to converge to equilibrium is larger.

Model     cached   Test er.   Train er.   T     K    ε     β     Epochs   α1       α2       α3       α4
DHN-1h    Y        ∼2.5 %     0.00 %      20    4    0.5   1.0   25       0.1      0.05
DHN-2h    Y        ∼2.3 %     0.00 %      100   6    0.5   1.0   60       0.4      0.1      0.01
DHN-3h    Y        ∼2.7 %     0.00 %      500   8    0.5   1.0   150      0.128    0.032    0.008    0.002
DHN-1h    N        2.06 %     0.13 %      100   12   0.2   0.5   30       0.1      0.05
DHN-2h    N        2.01 %     0.11 %      500   40   0.2   0.8   50       0.4      0.1      0.01

Table 1. “DHN-#h” stands for Deep Hopfield Network with # hidden layers. ‘cached’ refers to whether or not the equilibrium states are cached and reused as a starting point at the next free phase relaxation. T is the number of iterations in the free phase. K is the number of iterations in the nudged phase. ε is the step size for the dynamics of the state variable s. β is the value of the nudging factor in the nudged phase. α_k is the learning rate for updating the parameters in layer k. Top. Experimental results of Scellier and Bengio [2017] with the caching trick. Test error rates and train error rates are reported on single trials. Bottom. Experimental results of Ernoult et al. [2019] without the caching trick. Test error rates and train error rates are averaged over five trials.

Since these early experiments, thanks to new insights, new ideas and more perseverance, new results have been obtained which improve on these in terms of simulation speed, test accuracy, and complexity of the task solved. We present these more recent experimental results in Chapter 5. In addition, we stress that the real potential of EqProp is more likely to shine on neuromorphic substrates (Chapter 4), rather than on digital computers.

3.4. Contrastive Hebbian Learning (CHL)

In the setting of continuous Hopfield networks studied in this chapter, EqProp is similar to the generalized recirculation algorithm (GeneRec) [O'Reilly, 1996]. The main novelty of EqProp with respect to GeneRec is the formalism based on the concepts of nudging factor (β) and total energy function (F), which makes it possible to formulate a general framework for training energy-based models (Chapter 2) and Lagrangian-based models (Chapter 6), applicable not just to the continuous Hopfield model, but also to many more network models, including nonlinear resistive networks (Chapter 4) and convolutional networks (Chapter 5).

EqProp is also similar in spirit to the contrastive Hebbian learning algorithm (CHL), which we present in this section. The CHL algorithm was originally introduced in the case of the Boltzmann machine [Ackley et al., 1985] and then extended to the case of the continuous Hopfield network [Movellan, 1991, Baldi and Pineda, 1991]. We note that Boltzmann machines may be trained with EqProp, via the stochastic version presented in Section 6.2.

3.4.1. Contrastive Hebbian Learning in the Continuous Hopfield Model

Like EqProp, the CHL algorithm proceeds in two phases, the first of which is a free phase. But unlike EqProp, it uses a clamped phase as the second phase for training, instead of a nudged phase. Bringing this modification to the EqProp training procedure described in section 3.2.2, we arrive at the following algorithm, proposed by Movellan [1991].

Free phase. As in EqProp, the first phase is a free phase (also called 'negative phase'): inputs x are clamped, and both the hidden and output neurons evolve freely, following the gradient of the energy function. The hidden and output neurons stabilize to an energy minimum called the free state and denoted $(h_\star^-, o_\star^-)$. We write $s^- = (x, h_\star^-, o_\star^-)$. At the free state, every synapse undergoes an anti-Hebbian update. That is, for any synapse $W_{ij}$ (connecting neuron $i$ to neuron $j$), we perform the weight update $\Delta W_{ij} = -\eta\, \sigma(s_i^-)\, \sigma(s_j^-)$.

Clamped phase. The second phase is a 'clamped phase' (also called 'positive phase'): not only are the inputs clamped, but the outputs are now also clamped to their target value $y$. The hidden neurons evolve freely and stabilize to another energy minimum $h_\star^+$. We write $s^+ = (x, h_\star^+, y)$ and call this configuration the clamped state. At the clamped state, every synapse undergoes a Hebbian update. That is, for any synapse $W_{ij}$ (connecting neuron $i$ to neuron $j$), we perform the weight update $\Delta W_{ij} = +\eta\, \sigma(s_i^+)\, \sigma(s_j^+)$.

Global update. Putting the weight updates of the free phase and clamped phase together, we get the global update of the CHL algorithm:

$$\Delta W_{ij} = \eta\, \Big( \sigma(s_i^+)\, \sigma(s_j^+) - \sigma(s_i^-)\, \sigma(s_j^-) \Big). \tag{3.26}$$
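As an illustration, the CHL update of Eq. 3.26 for a full weight matrix can be written in a few lines of NumPy. The hard-sigmoid used for σ and the vector shapes are assumptions made for this sketch.

```python
import numpy as np

def chl_weight_update(W, s_free, s_clamped, eta, sigma=lambda u: np.clip(u, 0.0, 1.0)):
    """Contrastive Hebbian update (Eq. 3.26) applied to a full weight matrix W.

    s_free, s_clamped: vectors of neuron states at the free and clamped fixed points.
    The update is Hebbian at the clamped state and anti-Hebbian at the free state.
    """
    rho_plus = sigma(s_clamped)
    rho_minus = sigma(s_free)
    dW = eta * (np.outer(rho_plus, rho_plus) - np.outer(rho_minus, rho_minus))
    return W + dW
```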

3.4.2. An Intuition Behind Contrastive Hebbian Learning

Both CHL and EqProp have the desirable property that learning stops when the network correctly predicts the target. Specifically, in CHL, when the equilibrium state of the free phase (the free state) matches the equilibrium state of the clamped phase (the clamped state), the two terms of the weight update (Eq. 3.26) cancel out, thus yielding an effective weight update of zero. In other words, if the network already provides the correct output, then no learning occurs.

It is instructive to verify that EqProp preserves this property, even in its general formulation (Chapter 2). Suppose that the equilibrium state $s_\star^0$ corresponding to an input $x$ provides the correct answer ($y$), i.e. suppose that $s_\star^0$ is a minimum of the function $s \mapsto C(s, y)$. This implies that $\frac{\partial C}{\partial s}(s_\star^0, y) = 0$. Using the fact that $\frac{\partial E}{\partial s}(\theta, x, s_\star^0) = 0$ by definition of $s_\star^0$, we get $\frac{\partial E}{\partial s}(\theta, x, s_\star^0) + \beta\, \frac{\partial C}{\partial s}(s_\star^0, y) = 0$ for any value of $\beta$. This implies that $s_\star^\beta = s_\star^0$ for any $\beta$, by definition of $s_\star^\beta$. As a consequence, the two terms in the learning rule of EqProp cancel out. We note that this property remains true in the case of the symmetric difference estimator (Eq. 2.10).

3.4.3. A Loss Function for Contrastive Hebbian Learning

The global learning rule of the CHL algorithm rewrites in terms of the energy function (the Hopfield energy) as
$$\Delta\theta = \eta \left( -\frac{\partial E}{\partial \theta}\big(\theta, x, h_\star^+, y\big) + \frac{\partial E}{\partial \theta}\big(\theta, x, h_\star^-, o_\star^-\big) \right). \tag{3.27}$$

Here $(x, h_\star^+, y)$ is the clamped state, and $(x, h_\star^-, o_\star^-)$ is the free state. In this form, the CHL update rule prescribes decreasing the energy value of the clamped state and increasing the energy value of the free state. Since low-energy configurations correspond to preferred states of the model under the gradient dynamics, the CHL update rule thus increases the likelihood that the model produces the correct output ($y$), and decreases the likelihood that it again generates the same output ($o_\star^-$).

Proposition 3.4 (Movellan [1991]). The CHL update rule (Eq. 3.27) is equal to
$$\Delta\theta = -\eta\, \frac{\partial \mathcal{L}^{\rm CHL}}{\partial \theta}(\theta, x, y), \tag{3.28}$$

where $\mathcal{L}^{\rm CHL}$ is the loss defined by

$$\mathcal{L}^{\rm CHL}(\theta, x, y) = E\big(\theta, x, h_\star^+, y\big) - E\big(\theta, x, h_\star^-, o_\star^-\big). \tag{3.29}$$
The loss $\mathcal{L}^{\rm CHL}$ has the problem that the two phases of the CHL algorithm may stabilize in different modes of the energy function. Movellan [1991] points out that when this happens, the weight update is inconsistent and learning usually deteriorates. Similarly, Baldi and Pineda [1991] note abrupt discontinuities due to basin-hopping phenomena.

EqProp solves this problem by optimizing the loss $\mathcal{L} = C(s_\star, y)$, whose gradient can be estimated using nudged states $s_\star^\beta$ that are infinitesimal continuous deformations of the free state $s_\star$, and are thus in the same 'mode' of the energy landscape.

Chapter 4

Training Nonlinear Resistive Networks with Equilibrium Propagation

In the previous chapter, we have discussed the bio-realism of EqProp in the setting of Hopfield networks. Learning in this context is achieved using solely leaky integrator neurons (in both phases of training) and a local (Hebbian) weight update. These bio-realistic features are of interest not only for neuroscience, but also for neuromorphic computing, towards the goal of building fully analog neural networks supporting on-chip learning. Recently, several works have proposed analog implementations of EqProp in the context of Hopfield networks [Zoppo et al., 2020, Foroushani et al., 2020, Ji and Gross, 2020] and spiking variants [O’Connor et al., 2019, Martin et al., 2020]. Here we investigate a different approach to implement EqProp on neuromorphic chips. We emphasize that EqProp is not limited to the Hopfield model and the gradient systems of

Chapter 3, but more broadly applies to any system whose equilibrium state $s_\star$ is a solution of a variational equation $\frac{\partial E}{\partial s}(s_\star) = 0$, where $E(s)$ is a scalar function – what we have called an energy-based model (EBM) in Chapter 2. Importantly, many physical systems can be described by variational principles, as a reformulation of the physical laws characterizing their state. This suggests a path to build highly efficient energy-based models grounded in physics, with EqProp as a learning algorithm for training.

In this chapter, we exploit the fact that a broad class of analog neural networks called nonlinear resistive networks can be described by such a variational principle. Nonlinear resistive networks are electrical circuits consisting of nodes interconnected by (linear or nonlinear) resistive elements. These circuits can serve as analog neural networks, in which the weights to be adjusted are implemented by the conductances of programmable resistive devices such as memristors [Chua, 1971], and the nonlinear transfer functions (or 'activation functions') are implemented by nonlinear components such as diodes. The 'energy function' in these nonlinear resistive networks is a quantity called the co-content [Millar, 1951] or total pseudo-power [Johnson, 2010] of the circuit, and its existence can be derived directly from Kirchhoff's laws. Moreover, this energy function has the sum-separability property: the total pseudo-power of the circuit is the sum of the pseudo-powers of its individual elements. As a consequence, we can train these analog networks with EqProp, and the update rule for each conductance, which follows the gradient of the loss, is local. Specifically, we show mathematically that the gradient with respect to a conductance can be estimated using solely the voltage drop across the corresponding resistor. This theoretical result provides a principled method to train end-to-end analog neural networks by stochastic gradient descent, thus suggesting a path towards the development of ultra-fast, compact and low-power learning-capable neural networks.

The present chapter, which is essentially a rewriting of Kendall et al. [2020], is articulated as follows.
• In section 4.1, we briefly present a class of analog neural networks called nonlinear resistive networks, as well as the concept of programmable resistors that play the role of synapses.
• In section 4.2, we show that these nonlinear resistive networks are energy-based models: at inference, the configuration of node voltages chosen by the circuit corresponds to the minimum of a mathematical function (the energy function) called the co-content (or total pseudo-power) of the circuit, as a consequence of Kirchhoff's laws (Lemma 4.1). This suggests an implementation of energy-based neural networks grounded in electrical circuit theory, which also bridges the conceptual gap between energy functions (at a mathematical level¹) and physical energies² (at a hardware level).
• In section 4.3, we show how these nonlinear resistive networks can be trained with EqProp, and we derive the formula for updating the conductances (the synaptic weights) in proportion to their loss gradients, using solely the voltage drops across the corresponding resistive devices (Theorem 4.2).
• In section 4.4, as a proof of concept of what is possible with this neuromorphic hardware methodology, we propose an analog network architecture inspired by the deep Hopfield network, which alternates linear and nonlinear processing stages (Fig. 7).
• In section 4.5, we present numerical simulations on the MNIST dataset, using a SPICE-based framework to simulate the circuit's dynamics.

By explicitly decoupling the training procedure (EqProp in Section 4.3) from the specific neural network architecture presented (Section 4.4), we stress that this optimization method is applicable to any resistive network architecture, not just the one of Section 4.4. This modular approach thus offers the possibility to explore the design space of analog network architectures trainable with EqProp, in essentially the same way as deep learning researchers explore the design space of differentiable neural networks trainable with backpropagation.

¹In an energy-based model, the energy function is a mathematical abstraction of the model, not a physical energy.
²Specifically, the power dissipated in resistive devices.

4.1. Nonlinear Resistive Networks as Analog Neural Networks

Nonlinear resistive networks are electrical circuits consisting of arbitrary two-terminal resistive elements – see Muthuswamy and Banerjee [2018, Chapter 3] for an introduction. We can use such circuits to build neural networks. In the supervised learning scenario, we use a subset of the nodes of the circuit as input nodes, and another subset of the nodes as output nodes. We use voltage sources to impose the voltages at the input nodes: after the circuit has settled to steady state, the voltages of the output nodes indicate the 'prediction'. The circuit thus implements an input-to-output mapping function, with the node voltages representing the state of the network. This mapping function can be nonlinear if we include nonlinear resistive elements such as diodes in the circuit, and the conductance values of the resistors can be thought of as parameterizing this mapping function.

A programmable resistor is a resistor whose conductance can be changed (or 'programmed'), and thus can play the role of a 'weight' to be adjusted. Programmable resistors can therefore implement the synapses of a neural network. In the last decade, many technologies have emerged and been studied as candidate programmable resistors. We refer to Burr et al. [2017] and Xia and Yang [2019] for reviews of existing technologies, their working mechanisms, and how they are used for neuromorphic computing. For convenience, in most of this chapter we will think of programmable resistors as ideally tunable, which is a convenient concept to formalize mathematically the goal of learning in nonlinear resistive networks. However, this is an idealized and unrealistic assumption: in practice, far from being ideally tunable, these programmable resistive devices currently present important challenges for the coming decade of research to solve. We refer to Chang et al. [2017] for an analysis of these challenges. In this manuscript, we will not discuss how the programming of a conductance can be done and implemented in hardware. We note that nonlinear resistive networks have been studied as neural network models since the 1980s [Hutchinson et al., 1988, Harris et al., 1989].

4.2. Nonlinear Resistive Networks are Energy-Based Models

In this section we show that, in a nonlinear resistive network, the steady state of the circuit imposed by Kirchhoff's laws is a stationary point of a function called the co-content, or total pseudo-power (Lemma 4.1). Thus, nonlinear resistive networks are energy-based models whose energy function is the total pseudo-power.

Furthermore, the total pseudo-power has the sum-separability property, being by definition the sum of the pseudo-powers of its individual components. We first present in section 4.2.1 the case of linear resistance networks. Although this model is functionally not very useful as a neural network model, studying it is helpful to gain an understanding of the working mechanisms of analog neural networks: it reveals the limits of linear resistances and the need to introduce nonlinear elements (section 4.2.2). In section 4.2.3, we derive the general result for nonlinear resistive networks.

4.2.1. Linear Resistance Networks

A linear resistance network is an electrical circuit whose nodes are linked pairwise by linear resistors, i.e. resistors that satisfy Ohm’s law. We recall that, in a linear resistor,

Ohm's law states that $I_{ij} = g_{ij}(V_i - V_j)$, where $I_{ij}$ is the current through the resistor, $g_{ij}$ is its conductance ($g_{ij} = 1/R_{ij}$, where $R_{ij}$ is the resistance), and $V_i$ and $V_j$ are its terminal voltages. Consider the following question: we impose the voltages at a set of input nodes, and we want to know the voltages at the other nodes of the circuit. We can answer this question by writing Ohm's law in every branch and Kirchhoff's current law at every node, and by solving the resulting set of equations for all node voltages and all branch currents. But there is a more elegant way to characterize the steady state of the circuit. Kirchhoff's current law gives $\sum_j I_{ij} = 0$ for every node $i$. Combined with Ohm's law, we get $\sum_j g_{ij}(V_i - V_j) = 0$. Now note that the left-hand side of this expression is equal to $\frac{1}{2}\frac{\partial \mathcal{P}}{\partial V_i}$, where $\mathcal{P}(V_1, V_2, \ldots, V_N)$ is the functional defined by
$$\mathcal{P}(V_1, V_2, \ldots, V_N) = \sum_{i<j} g_{ij}\, (V_j - V_i)^2. \tag{4.1}$$
Hence, the configuration of node voltages that is physically realized is a stationary point of the functional $\mathcal{P}(V_1, V_2, \ldots, V_N)$. Therefore, linear resistance networks are energy-based models, with the configuration of node voltages $V = (V_1, V_2, \ldots, V_N)$ playing the role of state variable, and the functional $\mathcal{P}(V_1, V_2, \ldots, V_N)$ playing the role of energy function.

The functional $\mathcal{P}(V_1, V_2, \ldots, V_N)$ is called the power functional, because it represents the total power dissipated in the circuit, with $g_{ij}\,(V_j - V_i)^2$ being the power dissipated in the resistor connecting node $i$ to node $j$. Since $\mathcal{P}$ is convex, the steady state of the circuit is not just a stationary point of $\mathcal{P}$, but also its global minimum. This well-known result of circuit theory is called the principle of minimum dissipated power: if we impose the voltages at a set of input nodes, the circuit will choose the voltages at the other nodes so as to minimize the total power dissipated in the resistors (Fig. 6).

However, linear resistance networks are not very useful as neural network models since they cannot implement nonlinear operations. Rewriting Kirchhoff's current law at node $i$, we get $V_i = \frac{\sum_j g_{ij} V_j}{\sum_j g_{ij}}$. This operation resembles the usual multiply-accumulate operation of artificial neurons in conventional deep learning, but with the notable difference that there is no nonlinear activation function. Another difference is the presence of the factor $G_i = \sum_j g_{ij}$ in the denominator, which replaces the usual weighted sum by a weighted mean: each floating node voltage $V_i$ is a weighted mean of the voltages of its neighbors. From this analysis, it appears that nonlinear elements such as diodes are necessary to perform nonlinear operations. In the rest of this section, we generalize the result of this subsection to the setting of nonlinear resistive networks.
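The weighted-mean fixed point and the principle of minimum dissipated power are easy to check numerically. The sketch below builds a toy linear resistance network with randomly chosen conductances (an illustrative example, not one taken from the text), relaxes the floating nodes by the weighted-mean iteration, and verifies Kirchhoff's current law at the resulting steady state.

```python
import numpy as np

# Toy linear resistance network: symmetric conductance matrix g (g[i,j] = g[j,i], zero diagonal).
rng = np.random.default_rng(0)
N = 6
g = rng.uniform(0.1, 1.0, size=(N, N))
g = np.triu(g, 1)
g = g + g.T                                          # symmetric, no self-loops

inputs = np.array([0, 1])                            # indices of input (clamped) nodes
floating = np.array([2, 3, 4, 5])                    # indices of floating nodes
V = np.zeros(N)
V[inputs] = [0.0, 1.0]                               # imposed input voltages (V0 = 0, V1 = U = 1)

# Fixed-point iteration: each floating node voltage is the conductance-weighted
# mean of its neighbours' voltages, V_i = (sum_j g_ij V_j) / (sum_j g_ij).
for _ in range(1000):
    V[floating] = (g[floating] @ V) / g[floating].sum(axis=1)

# Check Kirchhoff's current law at the floating nodes: sum_j g_ij (V_i - V_j) = 0.
kcl = (g * (V[:, None] - V[None, :])).sum(axis=1)
print("KCL residual at floating nodes:", np.abs(kcl[floating]).max())

# The same steady state minimizes the power functional P(V) = sum_{i<j} g_ij (V_j - V_i)^2.
P = 0.5 * np.sum(g * (V[:, None] - V[None, :]) ** 2)  # 0.5 corrects for counting each pair twice
print("Dissipated power at the steady state:", P)
```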

Fig. 6. Principle of minimum dissipated power. In a linear resistance network, if we impose the voltages at a set of input nodes ($V_0 = 0$ and $V_1 = U$ here), the voltages at the other nodes ($V_2$, $V_3$, $V_4$ and $V_5$ here) are such that the total power dissipated in the resistors is minimized. A generalization of this result to nonlinear resistive networks exists (Lemma 4.1).

4.2.2. Two-Terminal Resistive Elements

In this subsection, we follow the method of Johnson [2010] to generalize the notion of 'power dissipated in a linear resistor' to arbitrary two-terminal resistive elements.

Current-voltage characteristic. Consider a two-terminal resistive element with terminals $i$ and $j$, characterised by a well-defined and continuous current-voltage characteristic $\gamma_{ij}$. The function $\gamma_{ij}$ takes as input the voltage drop $\Delta V_{ij} = V_i - V_j$ across the component and returns the current $I_{ij} = \gamma_{ij}(\Delta V_{ij})$ moving from node $i$ to node $j$ in response to $\Delta V_{ij}$. Since the current flowing from $i$ to $j$ is the negative of the current flowing from $j$ to $i$, we have by definition:
$$\forall i, j, \qquad \gamma_{ij}(\Delta V_{ij}) = -\gamma_{ji}(\Delta V_{ji}), \tag{4.2}$$
where $\Delta V_{ji} = -\Delta V_{ij}$.

For example, the current-voltage characteristic of a linear resistor of conductance $g_{ij}$ linking node $i$ to node $j$ is, by Ohm's law, $I_{ij} = g_{ij}\, \Delta V_{ij}$. By definition of $\gamma_{ij}$, this implies that
$$\gamma_{ij}(\Delta V_{ij}) = g_{ij}\, \Delta V_{ij}. \tag{4.3}$$

Pseudo-power. For each two-terminal element with current-voltage characteristic $I_{ij} = \gamma_{ij}(\Delta V_{ij})$, we define $p_{ij}(\Delta V_{ij})$ as the primitive function of $\gamma_{ij}(\Delta V_{ij})$ that vanishes at 0, i.e.
$$p_{ij}(\Delta V_{ij}) = \int_0^{\Delta V_{ij}} \gamma_{ij}(v)\, dv. \tag{4.4}$$

The quantity $p_{ij}(\Delta V_{ij})$ has the physical dimensions of power, being a product of a voltage and a current. We call $p_{ij}(\Delta V_{ij})$ the pseudo-power along the branch from $i$ to $j$, following the terminology of Johnson [2010]. Note that as a consequence of Eq. 4.2 we have
$$\forall i, j, \qquad p_{ij}(\Delta V_{ij}) = p_{ji}(\Delta V_{ji}), \tag{4.5}$$
i.e. the pseudo-power from $i$ to $j$ is equal to the pseudo-power from $j$ to $i$. We call this property the pseudo-power symmetry.

For example, in the case of a linear resistor of conductance $g_{ij}$ linking node $i$ to node $j$, the pseudo-power corresponding to the current-voltage characteristic of Eq. 4.3 is:
$$p_{ij}(\Delta V_{ij}) = \frac{1}{2}\, g_{ij}\, \Delta V_{ij}^2. \tag{4.6}$$
In this case, the pseudo-power is half the physical power dissipated in the resistor.
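Eq. 4.4 is straightforward to evaluate numerically for any characteristic γ. The sketch below computes the pseudo-power of a linear resistor (recovering Eq. 4.6) and of an idealized diode with an assumed Shockley-like characteristic; the diode parameters are purely illustrative.

```python
import numpy as np

def pseudo_power(gamma, delta_v, num=10000):
    """Pseudo-power p(dV) = integral of gamma(v) dv from 0 to dV (Eq. 4.4), by the trapezoid rule."""
    v = np.linspace(0.0, delta_v, num)
    dv = v[1] - v[0]
    return np.sum(0.5 * (gamma(v[:-1]) + gamma(v[1:])) * dv)

# Linear resistor of conductance g: gamma(dV) = g * dV, so p(dV) = g * dV**2 / 2 (Eq. 4.6).
g = 0.5
linear = lambda v: g * v
print(pseudo_power(linear, 0.8), 0.5 * g * 0.8 ** 2)   # both approximately 0.16

# Idealized diode (Shockley-like characteristic, illustrative parameters):
# gamma(dV) = I_s * (exp(dV / V_T) - 1).  Its pseudo-power has no such simple closed
# form, but Eq. 4.4 still defines it and Lemma 4.1 below still applies.
I_s, V_T = 1e-6, 0.025
diode = lambda v: I_s * (np.exp(v / V_T) - 1.0)
print(pseudo_power(diode, 0.4))
```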

4.2.3. Nonlinear Resistive Networks

A nonlinear resistive network is a circuit consisting of interconnected two-terminal resistive elements. We number the nodes of the circuit $i = 1, 2, \ldots, N$.

Configuration. We call a vector of voltage values V = (V1,V2,...,VN ) a configuration. Importantly, a configuration can be any vector of voltage values, even those that are not compatible with Kirchhoff’s current law (KCL).

Total pseudo-power (also called co-content). Recall the definition of the pseudo-power of a two-terminal element (Eq. 4.4). We define the total pseudo-power of a configuration $V = (V_1, V_2, \ldots, V_N)$ as the sum of the pseudo-powers along all branches:
$$\mathcal{P}(V_1, \ldots, V_N) = \sum_{i<j} p_{ij}(V_i - V_j). \tag{4.7}$$

We note that the pseudo-power symmetry (Eq. 4.5) guarantees that this definition does not depend on node ordering. In the case of a linear resistance network, the total pseudo-power of the circuit is half the power functional of Eq. 4.1.

We stress that P is a mathematical function defined on any configuration V1,V2,...,VN , even those that are not compatible with KCL.

Steady state. We denote $V_1^\star, V_2^\star, \ldots, V_N^\star$ the configuration of node voltages imposed by Kirchhoff's current law (KCL), and we call $V^\star = (V_1^\star, V_2^\star, \ldots, V_N^\star)$ the steady state of the circuit. Specifically, for every (internal or output) floating node $i$, KCL implies $\sum_{j=1}^N I_{ij} = 0$, which rewrites as
$$\sum_{j=1}^{N} \gamma_{ij}\big(V_i^\star - V_j^\star\big) = 0. \tag{4.8}$$
The following result, known since Millar [1951], shows that the circuit is an energy-based model, whose energy function is the total pseudo-power.

Lemma 4.1. The steady state of the circuit, denoted $(V_1^\star, V_2^\star, \ldots, V_N^\star)$, is a stationary point of the total pseudo-power³: for every floating node $i$, we have
$$\frac{\partial \mathcal{P}}{\partial V_i}\big(V_1^\star, V_2^\star, \ldots, V_N^\star\big) = 0. \tag{4.9}$$

Proof of Lemma 4.1. We use the definition of the total pseudo-power (Eq. 4.7), the pseudo-power symmetry (Eq. 4.5), the definition of the pseudo-power (Eq. 4.4) and the fact that the steady state of the circuit satisfies Kirchhoff's current law (Eq. 4.8). For every floating node $i$ we have:
$$\frac{\partial \mathcal{P}}{\partial V_i}\big(V_1^\star, V_2^\star, \ldots, V_N^\star\big) = \sum_j \frac{\partial p_{ij}}{\partial V_i}\big(V_i^\star - V_j^\star\big) = \sum_j \gamma_{ij}\big(V_i^\star - V_j^\star\big) = 0. \tag{4.10}$$ □

Equipped with this result, we can now derive a procedure to train nonlinear resistive networks with EqProp.

³With further assumptions on the current-voltage characteristics $\gamma_{ij}$, Christianson and Erickson [2007], as well as Johnson [2010], show that the function $\mathcal{P}$ is convex, so that the steady state is the global minimum of $\mathcal{P}$. However, in the context of EqProp, all one needs is the first-order condition, i.e. the fact that the steady state is a stationary point of $\mathcal{P}$, not necessarily a minimum.

4.3. Training Nonlinear Resistive Networks with Equilibrium Propagation

4.3.1. Supervised Learning Setting

In the supervised learning setting, a subset of the nodes of the circuit are input nodes, at which input voltages (denoted $X$) are sourced. All other nodes – the internal nodes and output nodes – are left floating: after the voltages of the input nodes have been set, the voltages of the internal and output nodes settle to their steady state. The output nodes, denoted $\hat{Y}$, represent the readout of the system, i.e. the model prediction. The architecture and the components of the circuit determine the $X \mapsto \hat{Y}$ mapping function. Specifically, the conductances of the programmable resistors, denoted $\theta$, parameterize this mapping function. That is, $\hat{Y}$ can be written as a function of $X$ and $\theta$ in the form $\hat{Y}(\theta, X)$. Training such a circuit consists in adjusting the values of the conductances ($\theta$) so that the voltages of the output nodes ($\hat{Y}$) approach the target voltages ($Y$). Formally, we cast the goal of training as an optimization problem in which the loss to be optimized (corresponding to an input-target pair $(X, Y)$) is of the form:
$$\mathcal{L}(\theta, X, Y) = C\big(\hat{Y}(\theta, X), Y\big). \tag{4.11}$$

We have seen that nonlinear resistive networks are energy-based models (Lemma 4.1) and that the energy function (the total pseudo-power) is sum-separable by definition (Eq. 4.7). This enables us to use EqProp in such analog neural networks to compute the gradient of the loss. Theorem 4.2 below provides a formula for computing the loss gradient with respect to a conductance using solely the voltage drop across the corresponding resistor.

4.3.2. Training Procedure

Given an input X and associated target Y, EqProp proceeds in the following two phases.

Free phase. At inference, input voltages are sourced at the input nodes ($X$), while all other nodes of the circuit (the internal nodes and output nodes) are left floating. All internal and output node voltages are stored⁴. In particular, the voltages of the output nodes ($\hat{Y}$), corresponding to the prediction, are compared with the target ($Y$) to compute the loss $\mathcal{L} = C(\hat{Y}, Y)$.

⁴On practical neuromorphic hardware, this can be achieved using a capacitor or sample-and-hold amplifier (SHA) circuit, for instance. We note that we only need one SHA per node (neuron), not per synapse. We will not discuss these aspects of implementation here.

Nudged phase. For each output node $\hat{Y}_k$, a current $I_k = -\beta\, \frac{\partial C}{\partial \hat{Y}_k}$ is sourced at $\hat{Y}_k$, where $\beta$ is a positive or negative scaling factor (the nudging factor). All internal node voltages and output node voltages are measured anew.

Theorem 4.2 (Kendall et al. [2020]). Consider a two-terminal component whose terminals are $i$ and $j$. Denote $\Delta V_{ij}^0$ the voltage drop across this two-terminal component in the free phase (when no current is sourced at output nodes), and $\Delta V_{ij}^\beta$ the voltage drop in the nudged phase (when a current $I_k = -\beta\, \frac{\partial C}{\partial \hat{Y}_k}$ is sourced at each output node $\hat{Y}_k$). Let $w_{ij}$ denote an adjustable parameter of this component, and $p_{ij}$ its pseudo-power (which depends on $w_{ij}$). Then, the gradient of the loss $\mathcal{L} = C(\hat{Y}, Y)$ with respect to $w_{ij}$ can be estimated as
$$\frac{\partial \mathcal{L}}{\partial w_{ij}} = \lim_{\beta \to 0} \frac{1}{\beta} \left( \frac{\partial p_{ij}\big(\Delta V_{ij}^\beta\big)}{\partial w_{ij}} - \frac{\partial p_{ij}\big(\Delta V_{ij}^0\big)}{\partial w_{ij}} \right). \tag{4.12}$$
In particular, if the component is a linear resistor of conductance $g_{ij}$, then the loss gradient with respect to $g_{ij}$ can be estimated as
$$\frac{\partial \mathcal{L}}{\partial g_{ij}} = \lim_{\beta \to 0} \frac{1}{2\beta} \left( \big(\Delta V_{ij}^\beta\big)^2 - \big(\Delta V_{ij}^0\big)^2 \right). \tag{4.13}$$

Proof. For simplicity, we have stated Theorem 4.2 in the case where the cost function $C(\hat{Y}, Y)$ depends only on the output node voltages ($\hat{Y}$). But this result can be directly generalized to the case of a cost function $C(V, Y)$ that depends on any node voltages ($V$), not just output node voltages. In this case, in the nudged phase of EqProp, currents $I_k = -\beta\, \frac{\partial C}{\partial V_k}$ must be sourced at every node $V_k$ (not just at output nodes).

Let $\theta$ denote the vector of adjustable parameters (e.g. the conductances), $X$ the voltages of the input nodes, and $V$ the voltages of the floating nodes (which include the internal nodes and output nodes). Further let $\mathcal{P}(\theta, X, V)$ denote the total pseudo-power of the circuit in the free phase. By Lemma 4.1, the steady state $V_\star$ of the free phase is such that $\frac{\partial \mathcal{P}}{\partial V}(\theta, X, V_\star) = 0$. In the nudged phase, when a current $I_k = -\beta\, \frac{\partial C}{\partial V_k}(V_\star, Y)$ is sourced at every floating node $V_k$, Kirchhoff's current law at the steady state $V_\star^\beta$ implies that $\frac{\partial \mathcal{P}}{\partial V}\big(\theta, X, V_\star^\beta\big) + \beta\, \frac{\partial C}{\partial V}(V_\star, Y) = 0$. Furthermore, the total pseudo-power (Eq. 4.7) has the sum-separability property: an adjustable parameter $w_{ij}$ of a component whose terminals are $i$ and $j$ contributes to $\mathcal{P}(\theta, X, V)$ only through the pseudo-power $p_{ij}(V_i - V_j)$ of that component. Therefore, Eq. 4.12 follows from the main Theorem 2.1.

In the case of a linear resistor, the adjustable parameter is $w_{ij} = g_{ij}$ and the pseudo-power is given by Eq. 4.6. Thus, Eq. 4.13 follows from Eq. 4.12 and the fact that $\frac{\partial p_{ij}}{\partial g_{ij}}(\Delta V_{ij}) = \frac{1}{2}(\Delta V_{ij})^2$. □

As explained in the general setting (Section 2.4), it is possible to reduce the bias and the variance of the gradient estimator by performing two nudged phases: one with a positive nudging ($+\beta$) and one with a negative nudging ($-\beta$). Although the framework we have presented here is deterministic, we note that analog circuits in practice are affected by noise. In section 6.2 we present a stochastic version of EqProp which can model such forms of noise and incorporate effects of thermodynamics.
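Given the voltage drops measured in the free and nudged phases, the conductance gradient of Eq. 4.13 (and its two-phase symmetric variant) reduces to a couple of arithmetic operations per resistor. The helper below is a sketch under that assumption; the variable names are illustrative.

```python
def conductance_gradient(dV_free, dV_plus, dV_minus, beta):
    """Loss-gradient estimates for a linear resistor (Eq. 4.13), from voltage drops only.

    dV_free:  voltage drop across the resistor at the free steady state,
    dV_plus:  voltage drop at the nudged steady state with nudging factor +beta,
    dV_minus: voltage drop at the nudged steady state with nudging factor -beta.
    """
    one_sided = (dV_plus ** 2 - dV_free ** 2) / (2.0 * beta)
    symmetric = (dV_plus ** 2 - dV_minus ** 2) / (4.0 * beta)  # two nudged phases, lower bias and variance
    return one_sided, symmetric

# A stochastic-gradient-descent step on the conductance would then be, e.g.:
# g_ij = g_ij - lr * symmetric
```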

4.3.3. On the Loss Gradient Estimates

Computing the sign of the gradients. Theorem 4.2 provides a formula for computing the gradient for a given device, assuming that the pseudo-power gradient $\frac{\partial p_{ij}}{\partial w_{ij}}$ of this device is known, and that its terminal voltages can be measured, stored⁵ and retrieved with arbitrary precision. In practice, however, these conditions are too stringent.

A piece of good news is that there is empirical evidence that training neural networks by stochastic gradient descent (SGD) works well even if, for each weight, only the sign of the weight gradient is known. At each step of such a training procedure, the weight update for $\theta_k$ takes the form $\Delta\theta_k = -\eta\, \mathrm{sign}\!\left(\frac{\partial \mathcal{L}}{\partial \theta_k}\right)$. The effectiveness of this optimization method has been shown empirically in the context of differentiable neural networks trained with backpropagation [Bernstein et al., 2018].

In the context of nonlinear resistive networks trained with EqProp, if we aim to get the correct sign of the gradient (rather than its exact value) for a given resistor, precise knowledge of the voltage values at the terminals is not necessary. As a corollary of Theorem 4.2,

the sign of the gradient can be obtained by comparing $|\Delta V_{ij}^0|$ and $|\Delta V_{ij}^\beta|$, i.e. the absolute values of the voltages across the resistor in the free phase and the nudged phase⁶. This means that, if we aim to compute the sign of the gradient, we only need to perform a 'compare' operation reliably.
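A sign-based update along these lines might look as follows; the learning rate and the way the comparison is wired are assumptions of this sketch, not a prescription from the text.

```python
import numpy as np

def sign_based_update(g, dV_free, dV_nudged, lr):
    """Update a conductance using only the sign of its gradient (cf. Eq. 4.13, assuming beta > 0):
    the sign of dL/dg is the sign of |dV_nudged| - |dV_free|, so a single reliable
    'compare' operation per resistor suffices."""
    grad_sign = np.sign(np.abs(dV_nudged) - np.abs(dV_free))
    return g - lr * grad_sign
```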

Robustness to characteristics variability. A large body of work aims at implementing the backpropagation algorithm in analog hardware [Burr et al., 2017, Xia and Yang, 2019]. However, the weight gradients computed by backpropagation are sensitive to the variability of analog device characteristics. This is because the mathematical derivation of the backpropagation algorithm relies on a global coordination of elementary operations: if any of the elementary

⁵The node voltages must be measured and stored at the end of the first phase, since they are no longer physically available after the second phase, at the moment of the weight update. We can achieve this with a sample-and-hold amplifier circuit.

⁶Equivalently, we can compare $I_{ij}^0$ and $I_{ij}^\beta$, i.e. the currents through the resistor in the free phase and the nudged phase.

operations of the algorithm is inaccurate, then the computed gradients are inaccurate (i.e. biased). Although there is no experimental evidence for this fact yet, there are reasons to believe that the gradient estimates of EqProp are more robust to device mismatches than the gradients of Backprop. The reason is that, in EqProp, the same circuit is used in both phases of training. Intuitively, any device mismatch will affect the steady states of both phases (free phase and nudged phase), and since the gradient estimate depends on the difference between the measurements of the two phases, the effects of the mismatch will cancel out. More precisely, Theorem 4.2 tells us that the quality of the gradient estimate for a given device does not depend on the characteristics of other devices in the circuit.

We note that this argument holds only for the computation/estimation of the weight gradients. In EqProp as in Backprop, the challenge of the weight update asymmetry of programmable resistors remains.

4.4. Example of a Deep Analog Neural Network Architecture

The theory of Section 4.3 applies to any nonlinear resistive network. In this section, as an example of what is possible with this general method, we present the neural network architecture proposed by Kendall et al. [2020], inspired by the deep Hopfield network model (Figure 7). It is composed of multiple layers, alternating linear and nonlinear processing stages. The linear transformations are performed by crossbar arrays of programmable resistors, which play the role of the weight matrices that parameterize the transformations. The nonlinear transfer function is implemented using a pair of diodes, followed by a linear amplifier. These crossbar arrays of programmable resistors and these nonlinear transfer functions are alternated to form a deep network.

4.4.1. Antiparallel Diodes

We propose to implement the neuron nonlinearities (or ‘activation functions’) as shunt conductances. To do this, we place two diodes antiparallel between the neuron’s node and ground. Each diode is placed in series with a voltage source, used to shift the bounds of the activation function. The diodes ensure that the neuron’s voltage remains bounded even as its input current grows large, because for any additional input current, one of the diodes turns on and sinks the extra current to ground.

Fig. 7. Deep analog neural network with three input nodes ($X_1$, $X_2$ and $X_3$), two layers of three hidden neurons each ($H_1$, $H_2$, $H_3$, and $H_4$, $H_5$, $H_6$) and three output nodes ($\hat{Y}_1$, $\hat{Y}_2$ and $\hat{Y}_3$). Blue branches and red branches represent neurons and synapses, respectively. Each synapse is a programmable resistor, whose conductance represents a parameter to be adjusted. Each neuron is formed of a nonlinear transfer function and a bidirectional amplifier. The nonlinear transfer function is implemented by a pair of antiparallel diodes (in series with voltage sources), which produces a sigmoidal voltage response (section 4.4.1). The bidirectional amplifier consists of a current-controlled current source (CCCS, shown in brown) and a voltage-controlled voltage source (VCVS, shown in black), allowing signals to propagate in both directions without a decay in amplitude (section 4.4.2). Output nodes are linked to current sources (shown in green) which serve to inject loss gradient signals during training (section 4.4.4). Equilibrium Propagation (EqProp) allows one to compute the gradient of a loss $\mathcal{L} = C(\hat{Y}, Y)$, where $Y$ is the desired target (section 4.3). In the free phase (inference), input voltages are sourced at the input nodes and the current sources are set to zero current. In the nudged phase (training), for each output node $\hat{Y}_k$ the corresponding current source is set to $I_k = -\beta\, \frac{\partial C}{\partial \hat{Y}_k}$, where $\beta$ is a scaling factor called the nudging factor (a hyperparameter). The update rule to adjust the conductances of the programmable resistors is local (Theorem 4.2).

4.4.2. Bidirectional Amplifiers

In a circuit composed only of resistors and diodes, voltages decay through the resistive layers. This vanishing-signal effect can be explained by the fact that currents always flow from high electric potential to low electric potential. Thus, extremal voltage values are necessarily reached at the input nodes, whose voltages are set.

To counter signal decay, one option is to use voltage-controlled voltage sources (VCVS) to amplify the voltages of hidden neurons in the forward direction. Current-controlled current sources (CCCS) can also be used to amplify currents in the backward direction, to better propagate error signals in the nudged phase. We call such a combination of a forward-directed VCVS and a backward-directed CCCS a 'bidirectional amplifier'.

4.4.3. Positive Weights

Unlike conventional neural networks trained in software, whose weights are free to take either positive or negative values, one constraint of analog neural networks is that the conductances of the programmable resistors (which represent the weights) are positive. Several approaches have been proposed in the literature to overcome this structural constraint. One approach consists in decomposing each weight as the difference of two (positive) conductances [Wang et al., 2019]. Another approach is to shift the mean of the weight matrix by a constant factor [Hu et al., 2016]. A third approach, proposed here, consists in doubling the number of input nodes and duplicating the input values, with one of the two copies inverted (negated).

We also double the number of output nodes so that, in a classification task with $K$ classes, the network has two output nodes for each class $k$, denoted $\hat{Y}_k^+$ and $\hat{Y}_k^-$, with $\hat{Y}_k^+ - \hat{Y}_k^-$ representing the score assigned to class $k$. The prediction of the model is then

$$\hat{Y}_{\rm pred} = \underset{1 \le k \le K}{\arg\max}\; \big( \hat{Y}_k^+ - \hat{Y}_k^- \big). \tag{4.14}$$

We optimize the loss associated with the squared error cost function, i.e. $\mathcal{L} = C(V_\star, Y)$, where the target vector $Y = (Y_1, Y_2, \ldots, Y_K)$ is the one-hot code of the class label, and

$$C(\hat{Y}, Y) = \frac{1}{2} \sum_{k=1}^{K} \big( \hat{Y}_k^+ - \hat{Y}_k^- - Y_k \big)^2. \tag{4.15}$$

4.4.4. Current Sources

The nudged phase requires injecting currents $I_k^+$ and $I_k^-$ at the output nodes $\hat{Y}_k^+$ and $\hat{Y}_k^-$. These currents must be proportional to the gradients of the cost with respect to the output node voltages $\hat{Y}_k^+$ and $\hat{Y}_k^-$, i.e.

$$I_k^+ = -\beta\, \frac{\partial C}{\partial \hat{Y}_k^+} = \beta \big( Y_k + \hat{Y}_k^- - \hat{Y}_k^+ \big), \qquad I_k^- = -\beta\, \frac{\partial C}{\partial \hat{Y}_k^-} = \beta \big( \hat{Y}_k^+ - \hat{Y}_k^- - Y_k \big), \tag{4.16}$$
where the nudging factor $\beta$ has the physical dimensions of a conductance. We can inject these currents in the nudged phase using current sources. In the free phase, these current sources are set to zero current and do not influence the voltages of the output nodes, acting like open circuits.
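The input duplication, the paired-output readout (Eq. 4.14), the squared-error cost (Eq. 4.15) and the nudging currents (Eq. 4.16) can be summarized in a few NumPy helpers. This is a software-level sketch of the quantities involved, not a description of the circuit itself; function names are illustrative.

```python
import numpy as np

def duplicate_inputs(x):
    """Double the input nodes: present both x and -x, so that positive conductances
    can realize effectively negative weights (one of the approaches described above)."""
    return np.concatenate([x, -x])

def predict(y_plus, y_minus):
    """Class scores are differences of paired output-node voltages (Eq. 4.14)."""
    return np.argmax(y_plus - y_minus)

def squared_error_cost(y_plus, y_minus, y_onehot):
    """Squared-error cost on the paired outputs (Eq. 4.15)."""
    return 0.5 * np.sum((y_plus - y_minus - y_onehot) ** 2)

def nudging_currents(y_plus, y_minus, y_onehot, beta):
    """Currents injected at the paired output nodes in the nudged phase (Eq. 4.16)."""
    i_plus = beta * (y_onehot + y_minus - y_plus)
    i_minus = beta * (y_plus - y_minus - y_onehot)
    return i_plus, i_minus
```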

4.5. Numerical Simulations on MNIST

Kendall et al. [2020] present simulations on the MNIST digit classification task, performed using the high-performance SPICE-class parallel circuit simulator Spectre [Cadence Design Systems, Inc., 2020]. SPICE (Simulation Program with Integrated Circuit Emphasis) is a framework for realistic simulations of circuit dynamics [Vogt et al., 2020]. Specifically, SPICE is used in the simulations to perform the free phase and the nudged phase of the EqProp training process. The other operations are performed in Python: this includes weight initialization (before training starts), calculating the loss and the gradient currents (between the free phase and the nudged phase), weight gradient calculation (at the end of the nudged phase), and performing the weight updates (resistances are updated in software). We refer to Kendall et al. [2020] for full details of the implementation and simulation results.

Simulations are performed on a small network with a single hidden layer of 100 neurons. Training is stopped after 10 epochs, when the SPICE network achieves a test error rate of 3.43%. For comparison, LeCun et al. [1998] report results with different kinds of linear classifiers and logistic regression models (corresponding to different pre-processing methods), all with test error above 7%, which is significantly worse than the SPICE network. This demonstrates that the SPICE network benefits from the nonlinearities offered by the diodes.

Chapter 5

Training Discrete-Time Neural Network Models with Equilibrium Propagation

In the previous chapters, we have presented EqProp in its general formulation (Chapter 2), and we have applied it to gradient systems (such as the continuous Hopfield model, Chapter 3) and to physical systems that can be described by a variational principle (such as nonlinear resistive networks, Chapter 4). Although EqProp is a potentially promising tool for training neuromorphic hardware, developing such hardware is still in the future. Whereas in the previous two chapters we have mostly focused on neuroscience and neuromorphic considerations, it is also essential to demonstrate the potential of EqProp to solve practical tasks. In this chapter, we focus on the scalability of EqProp in software, to demonstrate its usefulness as a learning strategy.

When simulated on digital computers, the models presented in the previous chapters are very slow and require long inference times to converge to equilibrium. More importantly, these models have thus far not been proved to scale to tasks harder than MNIST. In this chapter, we present a class of models trainable with EqProp, specifically aimed at accelerating simulations in software and at scaling EqProp training to larger models and more challenging tasks. As a consequence of this change of perspective, some of the techniques introduced in this chapter can be viewed as a step backward from biorealism and neuromorphic considerations (e.g. the use of shared weights in the convolutional network models). However, the introduction of such techniques allows us to broaden the scope of EqProp and to benchmark it against more advanced models of deep learning.

The present chapter, which is essentially a compilation and rewriting of Ernoult et al. [2019] and Laborieux et al. [2021], is organized as follows.
• In Section 5.1, we present a discrete-time formulation of EqProp, which allows training neural network models closer to those used in conventional deep learning.
• In Section 5.2, we present discrete-time neural network models trainable with EqProp, including a fully-connected model (close in spirit to the Hopfield model) and a convolutional one. In contrast with previous chapters, where we have only considered the squared error as a cost function, we present here a method to optimize the cross-entropy loss commonly used for classification tasks.
• In Section 5.3, we present the experimental results of Ernoult et al. [2019] and Laborieux et al. [2021]. Compared to the experiments of Chapter 3, discrete-time models reduce the computational cost of inference and make it possible to scale EqProp to deeper architectures and more challenging tasks. In particular, a ConvNet model trained with EqProp achieves an 11.68% test error rate on CIFAR-10. Furthermore, these experiments highlight the importance of reducing the bias and variance of the loss gradient estimators on complex tasks. We also discuss some challenges to overcome in order to unlock the scaling of EqProp to larger models and harder tasks, as well as some promising avenues towards this goal.
• In Section 5.4, we present a theoretical result linking the transient states in the second phase of EqProp to the partial derivatives of the loss to optimize (Theorem 5.1 and Fig. 8). This property, which we call the gradient descending dynamics (GDD) property, is useful in practice as it prescribes a criterion to decide when the dynamics of the first phase of training has converged to equilibrium.

5.1. Discrete-Time Dynamical Systems with Static Input

In this section, we apply EqProp to a class of discrete-time dynamical systems, as proposed by Ernoult et al. [2019].

5.1.1. Primitive Function

Consider an energy function of the form
$$E(\theta, x, s) = \frac{1}{2}\, \|s\|^2 - \Phi(\theta, x, s), \tag{5.1}$$
where $\Phi$ is a scalar function that we will choose later. With this choice of energy function, the equilibrium condition $\frac{\partial E}{\partial s}(\theta, x, s_\star) = 0$ of Eq. 2.3 rewrites as a fixed point condition:
$$s_\star = \frac{\partial \Phi}{\partial s}(\theta, x, s_\star). \tag{5.2}$$

Assuming that the function $s \mapsto \frac{\partial \Phi}{\partial s}(\theta, x, s)$ is contracting, by the contraction mapping theorem, the sequence of states $s_1, s_2, s_3, \ldots$ defined by
$$s_{t+1} = \frac{\partial \Phi}{\partial s}(\theta, x, s_t) \tag{5.3}$$

converges to $s_\star$. This dynamical system can be viewed as a recurrent neural network (RNN) with static input $x$ (meaning that the same input $x$ is fed to the RNN at each time step) and transition function $F = \frac{\partial \Phi}{\partial s}$. Because $\Phi$ is a primitive function of the transition function $F$, we call $\Phi$ the primitive function of the system. In light of Eq. 5.2, in this chapter we will call $s_\star$ a fixed point (rather than an equilibrium state).

The question of necessary and sufficient conditions on $\Phi$ for the dynamics of Eq. 5.3 to converge to a fixed point is out of the scope of the present manuscript. We refer to Scarselli et al. [2009], where conditions on the transition function are discussed.

5.1.2. Training Discrete-Time Dynamical Systems with Equilibrium Propagation

Recall that we want to optimize a loss of the form

$$\mathcal{L} = C(s_\star, y), \tag{5.4}$$
where $C(s, y)$ is a scalar function called the cost function, defined for any state $s$. In the discrete-time setting, EqProp takes the following form.

Free Phase. In the free phase, the dynamics of Eq. 5.3 is run for $T$ time steps, until the sequence of states $s_1, s_2, s_3, \ldots, s_T$ has converged. At the end of the free phase, the network is at the free fixed point $s_\star$ characterized by Eq. 5.2, i.e. $s_T = s_\star$.

Nudged Phase. In the nudged phase, starting from the free fixed point $s_\star$, an additional term $-\beta\, \frac{\partial C}{\partial s}$ is introduced in the dynamics of the neurons, where $\beta$ is a positive or negative scalar called the nudging factor. This term acts as an external force nudging the system dynamics towards decreasing the cost function $C$. Denoting $s_0^\beta, s_1^\beta, s_2^\beta, \ldots$ the sequence of states in the second phase (which depends on the value of $\beta$), we have
$$s_0^\beta = s_\star \qquad \text{and} \qquad \forall t \ge 0, \quad s_{t+1}^\beta = \frac{\partial \Phi}{\partial s}\big(\theta, x, s_t^\beta\big) - \beta\, \frac{\partial C}{\partial s}\big(s_t^\beta, y\big). \tag{5.5}$$
The network eventually settles to a new fixed point $s_\star^\beta$, called the nudged fixed point.

Update Rule. In this context, the formula for estimating the loss gradients using the two fixed points $s_\star$ and $s_\star^\beta$ takes the form
$$\lim_{\beta \to 0} \frac{1}{\beta} \left( \frac{\partial \Phi}{\partial \theta}\big(\theta, x, s_\star^\beta\big) - \frac{\partial \Phi}{\partial \theta}(\theta, x, s_\star) \right) = -\frac{\partial \mathcal{L}}{\partial \theta}. \tag{5.6}$$
Furthermore, if the primitive function $\Phi$ has the sum-separability property, i.e. if it is of the form $\Phi(\theta, x, s) = \Phi_0(x, s) + \sum_{k=1}^{N} \Phi_k(\theta_k, x, s)$ where $\theta = (\theta_1, \theta_2, \ldots, \theta_N)$, then
$$\lim_{\beta \to 0} \frac{1}{\beta} \left( \frac{\partial \Phi_k}{\partial \theta_k}\big(\theta_k, x, s_\star^\beta\big) - \frac{\partial \Phi_k}{\partial \theta_k}(\theta_k, x, s_\star) \right) = -\frac{\partial \mathcal{L}}{\partial \theta_k}. \tag{5.7}$$
Eq. 5.6 follows directly from Theorem 2.1 and the definition of $\Phi$ in terms of $E$ (Eq. 5.1). Eq. 5.7 follows from Eq. 5.6 and the definition of sum-separability.

In the discrete-time setting, as in the other settings, we can reduce the bias and the variance of the gradient estimate by using a symmetrized gradient estimator (see Eq. 2.10). This requires two nudged phases: one with a positive nudging ($+\beta$) and one with a negative nudging ($-\beta$).
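The discrete-time procedure above can be prototyped compactly by letting automatic differentiation evaluate ∂Φ/∂s and ∂Φ/∂θ. The sketch below uses PyTorch for this purpose; it is an illustrative reimplementation under the stated conventions (hard-sigmoid clipping of the state, one-sided estimator), not the reference code of Ernoult et al. [2019]. It assumes phi(theta, x, s) is a differentiable scalar expression and theta is a list of tensors with requires_grad=True.

```python
import torch

def free_phase(phi, theta, x, s, T):
    # Free phase (Eq. 5.3): iterate s_{t+1} = dPhi/ds, clipped to [0, 1]
    # (a hard-sigmoid, as used in practice to keep the state bounded).
    for _ in range(T):
        s = s.detach().requires_grad_()
        s = torch.autograd.grad(phi(theta, x, s), s)[0].clamp(0.0, 1.0)
    return s.detach()

def nudged_phase(phi, cost, theta, x, y, s, K, beta):
    # Nudged phase (Eq. 5.5): iterate s_{t+1} = dPhi/ds - beta * dC/ds.
    for _ in range(K):
        s = s.detach().requires_grad_()
        total = phi(theta, x, s) - beta * cost(s, y)
        s = torch.autograd.grad(total, s)[0].clamp(0.0, 1.0)
    return s.detach()

def eqprop_update(phi, cost, theta, x, y, s0, T, K, beta, lr):
    s_free = free_phase(phi, theta, x, s0, T)
    s_nudged = nudged_phase(phi, cost, theta, x, y, s_free, K, beta)
    # One-sided estimator (Eq. 5.6): (1/beta) * (dPhi/dtheta(s_nudged) - dPhi/dtheta(s_free))
    # approaches -dL/dtheta as beta -> 0, hence the '+=' below is a gradient-descent step.
    g_free = torch.autograd.grad(phi(theta, x, s_free), theta)
    g_nudged = torch.autograd.grad(phi(theta, x, s_nudged), theta)
    with torch.no_grad():
        for p, gf, gn in zip(theta, g_free, g_nudged):
            p += lr * (gn - gf) / beta
    return s_free
```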

5.1.3. Recovering Gradient Systems

We note that if we choose a primitive function $\Phi$ of the form $\Phi(\theta, x, s) = \frac{1}{2}\|s\|^2 - \epsilon\, \widetilde{E}(\theta, x, s)$, where $\epsilon$ is a positive hyperparameter and $\widetilde{E}$ is a scalar function, then the dynamics of Eq. 5.3 rewrites as $s_{t+1} = s_t - \epsilon\, \frac{\partial \widetilde{E}}{\partial s}(\theta, x, s_t)$. This is the Euler scheme with discretization step $\epsilon$ of the gradient dynamics $\frac{d}{dt} s_t = -\frac{\partial \widetilde{E}}{\partial s}(\theta, x, s_t)$, which was used in the simulations of Chapter 3. In this sense, the setting of gradient systems of Chapter 3 can be seen as a particular case of the discrete-time formulation presented in this chapter.

5.2. RNN Models with Static Input

The algorithm presented in the previous section is generic and holds for an arbitrary primitive function $\Phi$ and cost function $C$. In this section, we present two models corresponding to different choices of primitive function. The first model is a vanilla RNN with static input and symmetric weights (a variant of the Hopfield model). The second model is a convolutional RNN model. We also propose different choices of cost function: whereas in Chapter 3 and Chapter 4 we have only considered the squared error between outputs and targets, here we also present an implementation of the cross-entropy cost function.

5.2.1. Fully Connected Layers

To implement a fully connected layer, we consider the following primitive function:

$$\Phi_k^{\rm fc}\big(w_k, h^{k-1}, h^k\big) = (h^k)^\top \cdot w_k \cdot h^{k-1}. \tag{5.8}$$

In this expression, $h^{k-1}$ and $h^k$ are two consecutive layers of neurons, and $w_k$ is a weight matrix of size $\dim(h^k) \times \dim(h^{k-1})$ connecting $h^{k-1}$ to $h^k$. We note that $\Phi_k^{\rm fc}$ is closely related to the Hopfield energy of Eq. 3.16.

By stacking several of these fully connected layers, we can form a network of multiple layers of the kind considered in the previous chapters (e.g. as depicted in Fig. 5). The corresponding primitive function is obtained by summing the primitive functions of the individual pairs of layers. For example, consider $\Phi = \cdots + \Phi_k^{\rm fc} + \Phi_{k+1}^{\rm fc} + \cdots$, i.e.

$$\Phi = \cdots + (h^k)^\top \cdot w_k \cdot h^{k-1} + (h^{k+1})^\top \cdot w_{k+1} \cdot h^k + \cdots. \tag{5.9}$$

For this choice of primitive function, we have $\frac{\partial \Phi}{\partial h^k} = w_k \cdot h^{k-1} + w_{k+1}^\top \cdot h^{k+1}$. In practice, we find that it is necessary for the values of the state variable to be bounded. For this reason, we apply an activation function $\sigma$ and arrive at the following dynamics in the free phase, which is a discrete-time variant of the dynamics of the Hopfield network studied in Chapter 3:

$$h_{t+1}^k = \sigma\big( w_k \cdot h_t^{k-1} + w_{k+1}^\top \cdot h_t^{k+1} \big). \tag{5.10}$$
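For the fully connected model, the free-phase dynamics of Eq. 5.10 can be written explicitly, without automatic differentiation. The sketch below assumes a simple list-of-layers representation and a hard-sigmoid activation; names are illustrative.

```python
import numpy as np

def hard_sigmoid(u):
    return np.clip(u, 0.0, 1.0)

def free_step(layers, weights):
    """One free-phase step of the fully connected model (Eq. 5.10).

    layers[0] is the (clamped) input x; layers[1:] are the hidden/output layers.
    weights[k] has shape (dim(layers[k+1]), dim(layers[k])) and connects layer k to layer k+1.
    """
    new_layers = [layers[0]]                               # the input stays clamped
    for k in range(1, len(layers)):
        drive = weights[k - 1] @ layers[k - 1]             # bottom-up term   w_k . h^{k-1}
        if k < len(layers) - 1:
            drive = drive + weights[k].T @ layers[k + 1]   # top-down term    w_{k+1}^T . h^{k+1}
        new_layers.append(hard_sigmoid(drive))
    return new_layers
```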

5.2.2. Convolutional Layers

Ernoult et al. [2019] propose to implement a convolutional layer with the following primitive function:
$$\Phi_k^{\rm conv}\big(w_k, h^{k-1}, h^k\big) = h^k \bullet \mathcal{P}\big(w_k \star h^{k-1}\big). \tag{5.11}$$

In this expression, $w_k$ is the kernel (convolutional weights) of that layer, $\star$ is the convolution operator, $\mathcal{P}$ is a pooling operation, and $\bullet$ is the canonical scalar product for pairs of tensors with the same dimension. In particular, in this expression, $h^k$ and $\mathcal{P}(w_k \star h^{k-1})$ are tensors of the same size. This implementation is similar to the one proposed by Lee et al. [2009] in the context of restricted Boltzmann machines.

By stacking several of these convolutional layers we can form a deep ConvNet (specifically, a recurrent convolutional network with static input and symmetric weights). Consider a primitive function of the form $\Phi = \cdots + \Phi_k^{\rm conv} + \Phi_{k+1}^{\rm conv} + \cdots$, i.e.

$$\Phi = \cdots + h^k \bullet \mathcal{P}\big(w_k \star h^{k-1}\big) + h^{k+1} \bullet \mathcal{P}\big(w_{k+1} \star h^k\big) + \cdots \tag{5.12}$$

We have $\frac{\partial \Phi}{\partial h^k} = \mathcal{P}\big(w_k \star h^{k-1}\big) + \tilde{w}_{k+1} \star \mathcal{P}^{-1}\big(h^{k+1}\big)$, where $\mathcal{P}^{-1}$ is an 'inverse pooling' operation, and $\tilde{w}_k$ denotes the flipped kernel, which forms the transpose convolution. We refer to Ernoult et al. [2019], where these operations are defined in detail. After restricting the space of the state variables by using the hard-sigmoid activation function $\sigma$ to clip the states, we obtain the following dynamics for layer $h^k$:

$$h_{t+1}^k = \sigma\Big( \mathcal{P}\big(w_k \star h_t^{k-1}\big) + \tilde{w}_{k+1} \star \mathcal{P}^{-1}\big(h_t^{k+1}\big) \Big). \tag{5.13}$$

We can also combine convolutional layers, followed by fully connected layers, to form a more practical deep ConvNet. Denoting $N^{\rm conv}$ and $N^{\rm fc}$ the numbers of convolutional and fully connected layers, the total number of layers is $N^{\rm tot} = N^{\rm conv} + N^{\rm fc}$ and the primitive function is
$$\Phi(\theta, x, s) = \sum_{k=1}^{N^{\rm conv}} \Phi_k^{\rm conv}\big(w_k, h^{k-1}, h^k\big) + \sum_{k=N^{\rm conv}+1}^{N^{\rm tot}} \Phi_k^{\rm fc}\big(w_k, h^{k-1}, h^k\big), \tag{5.14}$$
where the set of parameters is $\theta = \{w_k\}_{1 \le k \le N^{\rm tot}}$, the input is $x = h^0$, and the state variable is $s = \{h^k\}_{1 \le k \le N^{\rm tot}}$.
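A convolutional primitive function in the spirit of Eq. 5.11 can be expressed directly in PyTorch, with autograd recovering the layer dynamics. For a single pair of layers, differentiating Φ with respect to h^k yields the bottom-up term of Eq. 5.13; the top-down term (transpose convolution and inverse pooling) arises from the next pair's primitive function in a full stack. The kernel sizes, padding and pooling choices below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def phi_conv(w, h_prev, h):
    """Phi_conv(w, h^{k-1}, h^k) = h^k . Pool(w * h^{k-1})   (Eq. 5.11, for one pair of layers)."""
    pre = F.max_pool2d(F.conv2d(h_prev, w, padding=1), kernel_size=2)
    return (h * pre).sum()

# Example: the bottom-up part of the layer dynamics h <- sigma(dPhi/dh) recovered by autograd.
h_prev = torch.rand(1, 3, 32, 32)                  # previous layer (e.g. the input image)
w = torch.randn(16, 3, 3, 3, requires_grad=True)   # convolutional kernel
h = torch.rand(1, 16, 16, 16, requires_grad=True)  # current layer (after 2x2 pooling)
grad_h, = torch.autograd.grad(phi_conv(w, h_prev, h), h)
h_next = grad_h.clamp(0.0, 1.0)                    # hard-sigmoid clipping of the state
```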

5.2.3. Squared Error

We have already studied in Chapters 3 and 4 the case where we optimize the loss associated with the squared error cost function. In this setting, the state variable of the network is of the form $s = (h, o)$, where $h$ represents the hidden neurons and $o$ the output neurons, and the cost function is
$$C(o, y) = \frac{1}{2}\, \|o - y\|^2. \tag{5.15}$$
The nudged phase dynamics of the hidden neurons and output neurons read, in this context:
$$h_{t+1}^\beta = \frac{\partial \Phi}{\partial h}\big(\theta, x, h_t^\beta, o_t^\beta\big), \qquad o_{t+1}^\beta = \frac{\partial \Phi}{\partial o}\big(\theta, x, h_t^\beta, o_t^\beta\big) + \beta\, \big(y - o_t^\beta\big). \tag{5.16}$$

5.2.4. Cross-Entropy

Laborieux et al. [2021] present a method to implement the output layer of the neural network as a softmax output, which can be used in conjunction with the cross-entropy loss. In this setting, the states of the output neurons ($o$) are not part of the state variable ($s$), but

are instead viewed as a readout, which is a function of s and of a weight matrix wout of size dim(y) × dim(s). Specifically, the state of output neurons at time step t is defined by the formula:

ot = softmax(wout · st). (5.17) Denoting M = dim(y) the number of categories in the classification task of interest, the cross-entropy cost function associated with the softmax output is then:

$$C(s, y, w_{\rm out}) = -\sum_{i=1}^{M} y_i \log\big(\mathrm{softmax}_i(w_{\rm out} \cdot s)\big). \tag{5.18}$$
Using the fact that $\frac{\partial C}{\partial s}(s, y, w_{\rm out}) = w_{\rm out}^\top \cdot \big(\mathrm{softmax}(w_{\rm out} \cdot s) - y\big)$, the nudged phase dynamics corresponding to the cross-entropy cost function read
$$s_{t+1}^\beta = \frac{\partial \Phi}{\partial s}\big(\theta, x, s_t^\beta\big) + \beta\, w_{\rm out}^\top \cdot \big(y - o_t^\beta\big), \tag{5.19}$$

where $o_t^\beta = \mathrm{softmax}\big(w_{\rm out} \cdot s_t^\beta\big)$. Note that in this context the loss $\mathcal{L} = C(s_\star, y, w_{\rm out})$ also depends on the parameter $w_{\rm out}$. The loss gradient with respect to $w_{\rm out}$ is given by
$$\frac{\partial \mathcal{L}}{\partial w_{\rm out}} = -\big(y - o_\star\big) \cdot s_\star^\top, \tag{5.20}$$
where $o_\star = \mathrm{softmax}(w_{\rm out} \cdot s_\star)$.

In practice, the state variable is of the form $s = (h^1, h^2, \ldots, h^N)$, where $h^1, h^2, \ldots, h^N$ are the hidden layers of the network, and $w_{\rm out}$ connects only the last hidden layer $h^N$ (not all hidden layers) to the output layer $o$. The weight matrix $w_{\rm out}$ has size $\dim(y) \times \dim(h^N)$ in this case.
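The softmax readout and its nudging term can be sketched in a few NumPy helpers; they assume the last hidden layer is given as a flat vector, and the names are illustrative.

```python
import numpy as np

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

def readout(w_out, s_last):
    """Softmax readout (Eq. 5.17), computed from the last hidden layer."""
    return softmax(w_out @ s_last)

def nudging_drive(w_out, s_last, y_onehot, beta):
    """Extra drive added to the last hidden layer in the nudged phase (Eq. 5.19):
    beta * w_out^T (y - o_t)."""
    return beta * (w_out.T @ (y_onehot - readout(w_out, s_last)))

def w_out_gradient(w_out, s_last_free, y_onehot):
    """Cross-entropy gradient with respect to w_out at the free fixed point (Eq. 5.20)."""
    o = readout(w_out, s_last_free)
    return np.outer(o - y_onehot, s_last_free)
```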

5.3. Experiments on MNIST and CIFAR-10

In this section, we present the experimental results of Ernoult et al. [2019] and Laborieux et al. [2021], on the MNIST (Table 2) and CIFAR-10 (Table 3) classification tasks, respectively. The CIFAR-10 dataset [Krizhevsky et al., 2009] consists of 60,000 colour images of 32 × 32 pixels. These images are split into 10 classes (each corresponding to an object or animal), with 6,000 images per class. The training set consists of 50,000 images and the test set of 10,000 images.

Experiments are performed on different network architectures (composed of multiple fully-connected and/or convolutional layers), using different cost functions (either the squared error or the cross-entropy loss) and different loss gradient estimators. Using the notations of this chapter, the one-sided gradient estimator $\hat{\nabla}_\theta(\beta)$ and the symmetric gradient estimator $\hat{\nabla}_\theta^{\rm sym}(\beta)$ presented in Chapter 2 take the form
$$\hat{\nabla}_\theta(\beta) = \frac{1}{\beta} \left( \frac{\partial \Phi}{\partial \theta}\big(\theta, x, s_\star^\beta\big) - \frac{\partial \Phi}{\partial \theta}(\theta, x, s_\star) \right), \tag{5.21}$$
$$\hat{\nabla}_\theta^{\rm sym}(\beta) = \frac{1}{2\beta} \left( \frac{\partial \Phi}{\partial \theta}\big(\theta, x, s_\star^\beta\big) - \frac{\partial \Phi}{\partial \theta}\big(\theta, x, s_\star^{-\beta}\big) \right). \tag{5.22}$$
We refer to Ernoult et al. [2019] and Laborieux et al. [2021] for the implementation and simulation details. Finally, since the models considered here are RNNs, we can also train them with the more conventional backpropagation through time (BPTT) algorithm¹, and use BPTT as a benchmark for EqProp.

Table 2 compares the performance on MNIST of the discrete-time models presented in this chapter (FC-#h and ConvNet) with the continuous-time Hopfield networks of Chapter 3 (DHN-#h). No degradation of accuracy is observed when using discrete-time rather than

¹In this case, BPTT is used on RNNs of a very specific kind. The RNN models considered here have a transition function of the form $F = \frac{\partial \Phi}{\partial s}$, a static input $x$ at each time step, and a single target $y$ at the final time step.

Model     Loss            Estimator   EqProp Test (%)   EqProp Train (%)   T     K    Epochs   BPTT Test (%)   BPTT Train (%)
DHN-1h    Squared Error   One-sided   2.06              0.13               100   12   30       2.11            0.46
DHN-2h    Squared Error   One-sided   2.01              0.11               500   40   50       2.02            0.29
FC-1h     Squared Error   One-sided   2.00              0.20               30    10   30       2.00            0.55
FC-2h     Squared Error   One-sided   1.95              0.14               100   20   50       2.09            0.37
FC-3h     Squared Error   One-sided   2.01              0.10               180   20   100      2.30            0.32
ConvNet   Squared Error   One-sided   1.02              0.54               200   10   40       0.88            0.12

Table 2. Experimental results of Ernoult et al. [2019] on MNIST. EqProp is benchmarked against BPTT. 'DHN' stands for the 'deep Hopfield networks' of Chapter 3. 'FC' means 'fully connected', and '-#h' stands for the number of hidden layers. The test error rates and training error rates (in %) are averaged over five trials. T is the number of iterations in the first phase. K is the number of iterations in the second phase. All these results are obtained with the squared error and the one-sided gradient estimator.

Model     Loss            Estimator     EqProp Test (%)   EqProp Train (%)   T     K    Epochs   BPTT Test (%)   BPTT Train (%)
ConvNet   Squared Error   One-sided     86.64             84.90              250   30   120
ConvNet   Squared Error   Random sign   12.61*            8.64*              250   30   120      11.10           3.69
ConvNet   Squared Error   Symmetric     12.45             7.83               250   30   120
ConvNet   Cross-Ent.      Symmetric     11.68             4.98               250   25   120      11.12           2.19

Table 3. Experimental results of Laborieux et al. [2021] on CIFAR-10. EqProp is benchmarked against BPTT. The test error rates and training error rates (in %) are averaged over five trials. T is the number of iterations in the first phase. K is the number of iterations in the second phase. The 'one-sided' and 'symmetric' gradient estimators refer to Eq. 5.21 and Eq. 5.22, respectively. 'Random sign' refers to the one-sided estimator with β chosen positive or negative with equal probability. *In the simulations with random β, the training process collapsed in one trial out of five, leading to a performance similar to the one-sided estimator; the test and train error means reported here include only the four successful trials.

continuous-time networks, although the former require far fewer time steps in the first phase of training ($T$). The lowest test error rate (∼ 1%) is achieved with the ConvNet model.

Table 3 shows the performance of a ConvNet model on CIFAR-10, for different gradient estimators and different loss functions. Unlike in the MNIST experiments, the one-sided gradient estimator with a nudging factor ($\beta$) of constant sign works poorly on CIFAR-10: training is unstable and the network is unable to fit the training data (84.90% train error). The bias of the one-sided gradient estimator can be reduced on average by choosing the sign of $\beta$ at random in the second phase: with this technique, Laborieux et al. [2021] report that training proceeded well in four runs out of five, yielding a mean test error of 12.61%, but training collapsed in the last run in a way similar to the one-sided gradient estimator with constant sign. The symmetric difference estimator reduces not only the bias but also the variance, and stabilizes the training process consistently across runs (12.45% test error). Finally, the best test error rate, 11.68%, is obtained with the cross-entropy loss, and approaches the performance of BPTT with less than 0.6% degradation in accuracy.

5.3.1. Challenges with EqProp Training

The theoretical guarantee that EqProp can approximate with arbitrary precision the gradient of arbitrary loss functions for a very broad class of models (energy-based models) suggests that EqProp could eventually train large networks on challenging tasks, as was proved feasible in the last decade with other deep learning training methods (e.g. backpropagation) relying on stochastic gradient descent. Nevertheless, EqProp training on current processors (GPUs) presents several challenges.

One difficulty encountered with EqProp training is that the gradient formula (Eq. 5.6) requires that $\frac{\partial \Phi}{\partial \theta}$ be measured exactly at the fixed points, whereas in many situations these fixed points are only approached up to a certain precision. Empirically, we observe that for learning to work, the fixed point of the first phase of training must be approximated with very high accuracy; otherwise the gradient estimate is of poor quality and fails to optimize the loss function. This implies that the equations of the first phase of training need to be iterated for a large number of time steps, until convergence to the fixed point. Table 2 and Table 3 show that hundreds of iterations are required for the networks to converge, even though these networks consist of just a few layers.

Various methods have been investigated to accelerate convergence, none of which has proved really satisfying so far. Scellier and Bengio [2016] propose a method based on variational inference, in which the state variables are split into two groups (specifically the layers of odd indices and the layers of even indices): at each iteration, one group of state variables remains fixed, while the other group is updated by solving for the stationarity condition. Bengio et al. [2016] give a sufficient condition so that initialization of the network with a forward pass provides sensible initial states for inference; the condition is that any two successive layers must form a 'good auto-encoder'. O'Connor et al. [2018] use a side network to learn these initial states for inference (in the main network).

One promising avenue to solve the problem of long inference times is offered by the recent work of Ramsauer et al. [2020], which shows that for a certain class of modern Hopfield networks, equilibrium states are reached in exactly one step. This idea could considerably accelerate simulations in software and demonstrate EqProp training on more advanced architectures and harder tasks. We emphasize, however, that the difficulty of long inference times is specific to numerical simulations (i.e. simulations on digital computers), and may not be a problem for neuromorphic hardware, where energy minimization is performed by the physics of the system (Chapter 4).

A second difficulty with EqProp training is due to the saturation of neurons. All experiments so far have found that for EqProp training to be effective, the neurons’ states need to be clipped to a closed interval, typically $[0, 1]$. This is achieved in most experiments by applying the hard-sigmoid activation function $\sigma(s) = \min(\max(0, s), 1)$ after each iteration during inference. Due to this technique however, many neurons ‘saturate’, i.e. they have a value of exactly 0 or 1 at equilibrium. In the second phase of training, due to these saturated neurons, error signals have difficulty propagating from the output neurons across the network when the nudging factor $\beta$ is small. To mitigate this problem, in most experiments $\beta$ is chosen large enough so as to amplify and better propagate error signals along the layers, at the cost of degrading the quality of the gradient estimate. To counter this problem, O’Connor et al. [2018] suggest using a modified activation function which includes a leak term, namely $\sigma_{\rm mod}(s) = \sigma(s) + 0.01\, s$. Another avenue to further reduce the saturation effect is to search for weight initialization schemes specifically meant for the kind of network models trained with EqProp. The weight initialization schemes that dominate deep learning today have been designed to fit feedforward nets [He et al., 2015] and RNNs [Saxe et al., 2013] trained with automatic differentiation. Finding appropriate weight initialization schemes for the kind of bidirectional networks studied in our context is a largely unexplored area of research. The third difficulty with EqProp training is hyperparameter tuning, due to the high sensitivity of the training process to some of the hyperparameters. Initial learning rates, for example, need to be tuned layer-wise. In addition to the usual hyperparameters (architecture, learning rates, ...), EqProp requires tuning some additional hyperparameters: the number of iterations in the free phase (T), the number of iterations in the nudged phase (K), the value of the nudging factor (β), ... In the next section, we present a theoretical result called the GDD property that can help accelerate hyperparameter search. As we will see, the GDD property provides a criterion to decide whether the fixed point of the first phase has been reached or not.

5.4. Gradient Descending Dynamics (GDD)

The gradient formula of Eq. 5.6 depends only on the fixed points $s_\star$ and $s_\star^\beta$, not on the specific trajectory that the network follows to reach them. But similarly to the real-time setting of Chapter 3, assuming the dynamics of Eq. 5.5 when the neurons gradually move from their free fixed point values ($s_\star$) towards their nudged fixed point values ($s_\star^\beta$), we can show that the transient states of the network ($s_t^\beta$ for $t \geq 0$) perform step-by-step gradient computation.

5.4.1. Transient Dynamics

First, note that the gradient of EqProp (Eq. 5.6), which is equal to the gradient of the loss in the limit $\beta \to 0$, can be decomposed as a telescoping sum:
$$\frac{1}{\beta}\left(\frac{\partial \Phi}{\partial \theta}\left(\theta, x, s_\star^\beta\right) - \frac{\partial \Phi}{\partial \theta}(\theta, x, s_\star)\right) = \sum_{t=0}^{\infty} \frac{1}{\beta}\left(\frac{\partial \Phi}{\partial \theta}\left(\theta, x, s_{t+1}^\beta\right) - \frac{\partial \Phi}{\partial \theta}\left(\theta, x, s_t^\beta\right)\right). \tag{5.23}$$
Second, we rewrite the dynamics of the free phase (Eq. 5.3) in the form
$$s_{t+1} = \frac{\partial \Phi}{\partial s}(\theta_{t+1}, x, s_t), \tag{5.24}$$
where $\theta_t$ denotes the parameter of the model at time step $t$, the value $\theta$ being shared across all time steps. We consider the loss after T time steps:

LT = C (sT , y) . (5.25)

$L_T$ is what we have called the projected cost function in the setting of real-time dynamics (Eq. 3.6). Rewriting the free phase dynamics this way allows us to define the partial derivative $\frac{\partial L_T}{\partial \theta_t}$ as the sensitivity of the loss $L_T$ with respect to $\theta_t$, when $\theta_1, \ldots, \theta_{t-1}, \theta_{t+1}, \ldots, \theta_T$ remain fixed (set to the value $\theta$). With these notations, the full gradient of the loss can be decomposed as
$$\frac{\partial L_T}{\partial \theta} = \frac{\partial L_T}{\partial \theta_1} + \frac{\partial L_T}{\partial \theta_2} + \cdots + \frac{\partial L_T}{\partial \theta_T}. \tag{5.26}$$
The following result links the right-hand sides of Eq. 5.23 and Eq. 5.26 term by term.

Theorem 5.1 (Ernoult et al. [2019]). Let $s_0, s_1, \ldots, s_T$ be the sequence of states in the free phase. Suppose that the sequence has converged to the fixed point $s_\star$ after $T - K$ time steps for some $K \geq 0$, i.e. that $s_\star = s_T = s_{T-1} = \cdots = s_{T-K}$. Then, the following identities hold at any time $t = 0, 1, \ldots, K - 1$ in the nudged phase:
$$\lim_{\beta \to 0} \frac{1}{\beta}\left(\frac{\partial \Phi}{\partial \theta}\left(\theta, x, s_{t+1}^\beta\right) - \frac{\partial \Phi}{\partial \theta}\left(\theta, x, s_t^\beta\right)\right) = -\frac{\partial L_T}{\partial \theta_{T-t}}, \tag{5.27}$$
$$\lim_{\beta \to 0} \frac{1}{\beta}\left(s_{t+1}^\beta - s_t^\beta\right) = -\frac{\partial L_T}{\partial s_{T-t}}. \tag{5.28}$$
We refer to Ernoult et al. [2019] for a proof. Theorem 5.1 relates neural computation to gradient computation, and is as such a discrete-time variant of Theorem 3.1. In essence, Theorem 5.1 shows that in the nudged phase of EqProp, the temporal variations in neural activity and incremental weight updates represent loss gradients. Since the sequence of states in the nudged phase satisfies $s_{t+1}^\beta = s_t^\beta - \beta\, \frac{\partial L_T}{\partial s_{T-t}} + o(\beta)$ as $\beta \to 0$, descending the gradients of the loss $L_T$, we call this property the gradient descending dynamics (GDD) property. As mentioned in section 5.3.1, one of the challenges with EqProp training comes from the empirical observation that learning is successful only if we are exactly at the fixed point at the end of the first phase, although in practice we use numerical methods to approximate

this fixed point. In particular, we need a criterion to ‘decide’ when the fixed point has been reached with high enough accuracy. Theorem 5.1 provides such a criterion: a necessary condition for the fixed point of the first phase to be reached is that the identities of Eqs. 5.27–5.28 hold.

5.4.2. Backpropagation Through Time

On the one hand, we can define the neural and weight increments of EqProp as follows, which we can compute in the second phase of EqProp:
$$\Delta_\theta^{\rm EP}(\beta, t) = \frac{1}{\beta}\left(\frac{\partial \Phi}{\partial \theta}\left(\theta, x, s_{t+1}^\beta\right) - \frac{\partial \Phi}{\partial \theta}\left(\theta, x, s_t^\beta\right)\right), \tag{5.29}$$
$$\Delta_s^{\rm EP}(\beta, t) = \frac{1}{\beta}\left(s_{t+1}^\beta - s_t^\beta\right). \tag{5.30}$$
On the other hand, the loss gradients $\frac{\partial L_T}{\partial s_{T-t}}$ and $\frac{\partial L_T}{\partial \theta_{T-t}}$ appearing on the right-hand sides of Eqs. 5.27–5.28 can be computed by automatic differentiation. Specifically, these loss gradients are the ‘partial derivatives’ computed by the backpropagation through time (BPTT) algorithm. Here BPTT is applied in the very specific setting of an RNN with transition function $F = \frac{\partial \Phi}{\partial s}$, with static input $x$ at each time step, and with target $y$ at the final time step. In particular, there is no time-dependence in the data. We denote these partial derivatives computed by BPTT:

$$\nabla_\theta^{\rm BPTT}(t) = \frac{\partial L_T}{\partial \theta_{T-t}}, \tag{5.31}$$

$$\nabla_s^{\rm BPTT}(t) = \frac{\partial L_T}{\partial s_{T-t}}. \tag{5.32}$$
Using these notations, the GDD property (Theorem 5.1) states that under the condition that $s_\star = s_T = s_{T-1} = \cdots = s_{T-K}$, we have for every $t = 0, 1, \ldots, K-1$ that $\lim_{\beta \to 0} \Delta_s^{\rm EP}(\beta, t) = -\nabla_s^{\rm BPTT}(t)$ and $\lim_{\beta \to 0} \Delta_\theta^{\rm EP}(\beta, t) = -\nabla_\theta^{\rm BPTT}(t)$. The GDD property is illustrated in Figure 8. We note that the GDD property also implies that the gradient computed by ‘truncated EqProp’ (i.e. EqProp where the second phase is halted before convergence to the second fixed point) corresponds to truncated BPTT.
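This correspondence can be checked numerically on a toy model by comparing the neural increments of the nudged phase with the state gradients computed by automatic differentiation through the free phase. The PyTorch sketch below does this for a small, hypothetical primitive function Φ (a quadratic form in σ(s), chosen so that the free dynamics is a contraction); the network size, the choice of Φ, the number of iterations and the nudging factor are illustrative assumptions, not the settings used in the experiments reported above.

```python
import torch

torch.set_default_dtype(torch.float64)
torch.manual_seed(0)
n, T, K, beta = 10, 100, 5, 1e-3   # hypothetical size, phase lengths and nudging factor

# Toy primitive function Phi(W, A; x, s) = 1/2 sig(s)^T W sig(s) + sig(s)^T (A x),
# chosen so that the induced map s -> dPhi/ds(s) is a contraction for small weights.
W = 0.1 * torch.randn(n, n)
W = (W + W.T) / 2                  # symmetric lateral weights
A = 0.1 * torch.randn(n, n)        # input weights
x = torch.rand(n)                  # static input
y = torch.rand(n)                  # target

def step_free(s):
    # Free dynamics s_{t+1} = dPhi/ds(theta, x, s_t), written out analytically.
    sig = torch.sigmoid(s)
    return sig * (1 - sig) * (W @ sig + A @ x)

# --- Free phase, recorded so that the BPTT gradients can be read off with autograd.
states = [torch.zeros(n, requires_grad=True)]
for _ in range(T):
    s_next = step_free(states[-1])
    s_next.retain_grad()
    states.append(s_next)
loss = 0.5 * ((states[-1] - y) ** 2).sum()       # L_T = C(s_T, y)
loss.backward()

# --- Nudged phase of EqProp, started from the free fixed point s_T.
s = states[-1].detach()
for t in range(K):
    s_next = step_free(s) - beta * (s - y)       # dynamics of the total function Phi - beta*C
    delta_s = (s_next - s) / beta                # neural increment Delta_s^EP(beta, t)
    grad_bptt = states[T - t].grad               # dL_T/ds_{T-t}, computed by BPTT
    print(t, torch.max(torch.abs(delta_s + grad_bptt)).item())  # ~0 up to O(beta) (Eq. 5.28)
    s = s_next
```

The same comparison can be carried out between the weight increments of Eq. 5.29 and the parameter gradients of Eq. 5.31.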

Fig. 8. Illustration of the gradient-descending dynamics (GDD) property (Theorem 5.1). Top left: free phase; the final state $s_T$ is the fixed point $s_\star$. Bottom left: backpropagation through time (BPTT), in the very specific setting of an RNN with static input $x$. Bottom right: nudged phase of EqProp; the starting state of the nudged phase is the final state of the free phase, i.e. the fixed point $s_\star$. Theorem 5.1 gives a step-by-step correspondence between the neural increments $\Delta_s^{\rm EP}(t)$ in the nudged phase of EqProp and the gradients $\nabla_s^{\rm BPTT}(t)$ of BPTT. Corresponding computations in EqProp and BPTT at timestep $t = 0$ (resp. $t = 1, 2, 3$) are colored in blue (resp. red, yellow, green). Forward-time computation in EqProp corresponds to backward-time computation in BPTT.


Chapter 6

Extensions of Equilibrium Propagation

In this chapter, we present research directions for the development of the equilibrium propagation framework. In section 6.1, we present a general framework for training dynamical systems with time-varying inputs, which exploits the principle of least action. In section 6.2, we adapt the EqProp framework to the setting of stochastic systems. In section 6.3, we briefly present the contrastive meta-learning framework of Zucchet et al. [2021], where they use the EqProp method to train the meta-parameters of a meta-learning model.

6.1. Equilibrium Propagation in Dynamical Systems with Time-Varying Inputs

In Chapter 2, we have derived the EqProp training procedure for a class of models called energy-based models (EBMs). A key element of the theory is the fact that, in EBMs, equilibrium states are characterized by variational equations. In this section, we show that the EqProp training strategy can be applied to other situations where variational equations appear. The equations of motion of many physical systems can also be characterized by variational equations: their trajectory can be derived from a principle of stationary action (e.g. a principle of least action). In such systems, the quantity that is stationary is not the energy function (as in an EBM), but the action functional, which is by definition the time integral of the Lagrangian function. Such systems, which we call Lagrangian-based models (LBMs), can play the role of machine learning models with time-varying inputs. This idea was first proposed by Baldi and Pineda [1991] in the context of the contrastive learning framework.

6.1.1. Lagrangian-Based Models

A Lagrangian-based model (LBM) is specified by a set of adjustable parameters, denoted

$\theta$, a time-varying input, and a state variable. We write $x_t$ for the input value at time $t$, and $s_t$ for the state of the model at time $t$. We study the evolution of the system over a time interval $[0, T]$, and we write $x$ and $s$ for the entire input and state trajectories over this time interval. The model is further described in terms of a functional $S$ which, given a parameter value $\theta$ and an input trajectory $x$, associates to each trajectory $s$ the real number
$$S(\theta, x, s) = \int_0^T L(\theta, x_t, s_t, \dot{s}_t)\, dt, \tag{6.1}$$

where $L(\theta, x_t, s_t, \dot{s}_t)$ is a scalar function of the parameters ($\theta$), the external input ($x_t$), the state of the system ($s_t$) as well as its time derivative ($\dot{s}_t$). The function $L$ is called the Lagrangian function of the system, and $S$ is called the action functional. The action functional is defined for any conceivable trajectory $s$, but, among all conceivable trajectories, the effective trajectory of the system (subject to $\theta$ and $x$), denoted $s(\theta, x)$, satisfies by definition

$\frac{d}{d\epsilon}\big|_{\epsilon=0}\, S(\theta, x, s(\theta, x) + \epsilon\, s) = 0$ for any trajectory $s$ that satisfies the boundary conditions $s_0 = 0$ and $s_T = 0$. We write for short
$$\frac{\delta S}{\delta s}(\theta, x, s(\theta, x)) = 0. \tag{6.2}$$
Intuitively, $\frac{\delta S}{\delta s}$ can be thought of as the variation of $S$ associated to a small variation $\delta s$ around the trajectory $s(\theta, x)$. Mathematically, $\frac{\delta S}{\delta s}(\theta, x, s(\theta, x))$ represents the differential of the function $S(\theta, x, \cdot)$ at the point $s(\theta, x)$. We say that the effective trajectory is stationary with respect to the action functional, and that the dynamics of the system derives from a principle of stationary action. Since the action functional ($S$) is defined in terms of the Lagrangian of the system ($L$), we call such a time-varying system a Lagrangian-based model. The loss to be minimized is an integral of cost values of the states along the effective trajectory:
$$L_0^T(\theta, x, y) = \int_0^T c_t(s_t(\theta, x), y_t)\, dt. \tag{6.3}$$
In this expression, $y_t$ is the desired target at time $t$, $y$ is the corresponding target trajectory, $s_t(\theta, x)$ is the state at time $t$ along the effective trajectory $s(\theta, x)$, and $c_t(s_t, y_t)$ is a scalar function (the cost function at time $t$). Similarly to the setting of EBMs, the concept of sum-separability is useful in LBMs. Let

$\theta = (\theta_1, \ldots, \theta_N)$ be the adjustable parameters of the system. Let $\{x_t, s_t, \dot{s}_t\}_k$ denote the information about $(x_t, s_t, \dot{s}_t)$ at time $t$ which is locally available to parameter $\theta_k$. We say that the Lagrangian $L$ is sum-separable if it is of the form
$$L(\theta, x_t, s_t, \dot{s}_t) = L_0(x_t, s_t, \dot{s}_t) + \sum_{k=1}^N L_k(\theta_k, \{x_t, s_t, \dot{s}_t\}_k), \tag{6.4}$$
where $L_0(x_t, s_t, \dot{s}_t)$ is a term that is independent of the parameters to be adjusted, and $L_k$ is a scalar function of $\theta_k$ and $\{x_t, s_t, \dot{s}_t\}_k$ for each $k \in \{1, \ldots, N\}$.

6.1.2. Gradient Formula

Similarly to the static case (Section 2.3), we introduce the total action functional
$$S^\beta(s) = \int_0^T \left(L(\theta, x_t, s_t, \dot{s}_t) + \beta\, c_t(s_t, y_t)\right) dt, \tag{6.5}$$
defined for any value of the nudging factor $\beta$ (for fixed $\theta$, $x$ and $y$). Intuitively, by varying $\beta$, the action functional $S^\beta$ is modified, and so is the stationary solution of the action, i.e. the effective trajectory of the system. Specifically, let us denote $s^\beta$ the trajectory characterized by the stationarity condition $\frac{\delta S^\beta}{\delta s}(s^\beta) = 0$. Note in particular that for $\beta = 0$ we have $s^0 = s(\theta, x)$.

Theorem 6.1 (Gradient formula for Lagrangian-based models). The gradient of the loss can be computed using the following formula:

$$\frac{\partial L_0^T}{\partial \theta}(\theta, x, y) = \frac{d}{d\beta}\bigg|_{\beta=0} \int_0^T \frac{\partial L}{\partial \theta}\left(\theta, x_t, s_t^\beta, \dot{s}_t^\beta\right) dt. \tag{6.6}$$
Furthermore, if the Lagrangian function $L$ is sum-separable, then the gradient for each parameter $\theta_k$ depends only on information that is locally available to $\theta_k$:
$$\frac{\partial L_0^T}{\partial \theta_k}(\theta, x, y) = \frac{d}{d\beta}\bigg|_{\beta=0} \int_0^T \frac{\partial L_k}{\partial \theta_k}\left(\theta_k, \{x_t, s_t^\beta, \dot{s}_t^\beta\}_k\right) dt. \tag{6.7}$$

Proof of Theorem 6.1. We derive Theorem 6.1 as a corollary of Theorem 2.1. Recall that the action functional is by definition $S(\theta, x, s) = \int_0^T L(\theta, x_t, s_t, \dot{s}_t)\, dt$ and that the effective trajectory $s(\theta, x)$ satisfies the stationarity condition $\frac{\delta S}{\delta s}(\theta, x, s(\theta, x)) = 0$. We can define a cost functional $C$ on any conceivable trajectory $s$ by the formula $C(s, y) = \int_0^T c_t(s_t, y_t)\, dt$. The loss $L_0^T$ then rewrites $L_0^T(\theta, x, y) = C(s(\theta, x), y)$, the total action functional rewrites $S^\beta(s) = S(\theta, x, s) + \beta\, C(s, y)$, and the nudged trajectory $s^\beta$ satisfies the stationarity condition $\frac{\delta S}{\delta s}(\theta, x, s^\beta) + \beta\, \frac{\delta C}{\delta s}(s^\beta, y) = 0$. Using these notations, the first formula to be proved (Eq. 6.6) rewrites

$$\frac{\partial L_0^T}{\partial \theta}(\theta, x, y) = \frac{d}{d\beta}\bigg|_{\beta=0} \frac{\partial S}{\partial \theta}\left(\theta, x, s^\beta\right), \tag{6.8}$$
which is exactly the first formula of Theorem 2.1. Finally, the second formula to be proved (Eq. 6.7) is a direct consequence of Eq. 6.6 and the definition of sum-separability (Eq. 6.4). □

6.1.3. Training Sum-Separable Lagrangian-Based Models

Theorem 6.1 suggests the following EqProp-like training procedure for Lagrangian-based models, to update the parameters in proportion to their loss gradients. Let us assume that the Lagrangian function has the sum-separability property.

Free phase (inference). Set the system in some initial state $(s_0, \dot{s}_0)$ at time $t = 0$, and set the nudging factor $\beta$ to zero. Play the input trajectory $x$ over the time interval $[0, T]$, and let the system follow the trajectory $s^0$ (i.e. the effective trajectory characterized by Eq. 6.2). We call $s^0$ the free trajectory. For each parameter $\theta_k$, the quantity $\frac{\partial L_k}{\partial \theta_k}(\theta_k, \{x_t, s_t^0, \dot{s}_t^0\}_k)$ is measured and integrated from $t = 0$ to $t = T$, and the result is stored locally.

Nudged phase. Set the system in the same initial state $(s_0, \dot{s}_0)$ as in the free phase, and set now the nudging factor $\beta$ to some positive or negative (nonzero) value. Play again the input trajectory $x$ over the time interval $[0, T]$, as well as the target trajectory $y$, and let the system follow the trajectory $s^\beta$ (i.e. the effective trajectory that is stationary with respect to $S^\beta$). For each parameter $\theta_k$, the quantity $\frac{\partial L_k}{\partial \theta_k}(\theta_k, \{x_t, s_t^\beta, \dot{s}_t^\beta\}_k)$ is measured and integrated from $t = 0$ to $t = T$.

Update rule. Finally, each parameter $\theta_k$ is updated locally in proportion to its gradient, i.e. $\Delta \theta_k = -\eta\, \hat{\nabla}_{\theta_k}(\beta)$, where $\eta$ is a learning rate, and
$$\hat{\nabla}_{\theta_k}(\beta) = \frac{1}{\beta}\left(\int_0^T \frac{\partial L_k}{\partial \theta_k}\left(\theta_k, \{x_t, s_t^\beta, \dot{s}_t^\beta\}_k\right) dt - \int_0^T \frac{\partial L_k}{\partial \theta_k}\left(\theta_k, \{x_t, s_t^0, \dot{s}_t^0\}_k\right) dt\right). \tag{6.9}$$
As in the static setting (Chapter 2), it is possible to reduce the bias and the variance of the gradient estimator by using the symmetrized version
$$\hat{\nabla}_{\theta_k}^{\rm sym}(\beta) = \frac{1}{2\beta}\left(\int_0^T \frac{\partial L_k}{\partial \theta_k}\left(\theta_k, \{x_t, s_t^\beta, \dot{s}_t^\beta\}_k\right) dt - \int_0^T \frac{\partial L_k}{\partial \theta_k}\left(\theta_k, \{x_t, s_t^{-\beta}, \dot{s}_t^{-\beta}\}_k\right) dt\right). \tag{6.10}$$
This requires two nudged phases: one with a positive nudging ($+\beta$) and one with a negative nudging ($-\beta$). Although the EqProp training method for Lagrangian-based models requires running the input trajectory twice (in the free phase and in the nudged phase), we stress that we do not need to store the past states of the system, unlike the backpropagation through time (BPTT) algorithm used to train conventional recurrent neural networks.
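To give a concrete picture of the two phases, the following NumPy sketch simulates the procedure for a hypothetical one-dimensional Lagrangian $L(k, x_t, s_t, \dot s_t) = \tfrac{1}{2}\dot s_t^2 - \tfrac{1}{2}k (s_t - x_t)^2$ with a single adjustable parameter $k$ and squared-error cost $c_t(s_t, y_t) = \tfrac{1}{2}(s_t - y_t)^2$. The input and target trajectories, the integration scheme and the hyperparameters are illustrative assumptions; the sketch only illustrates the structure of the estimator of Eq. 6.9, with the effective trajectories obtained by integrating the Euler–Lagrange equation from a fixed initial state as in the procedure above.

```python
import numpy as np

def trajectory(k, beta, x, y, dt):
    # Integrate the Euler-Lagrange equation of the total Lagrangian L + beta*c,
    # s'' = -k (s - x_t) + beta (s - y_t), from the fixed initial state (s_0, sdot_0) = (0, 0).
    s, sdot, traj = 0.0, 0.0, []
    for xt, yt in zip(x, y):
        sddot = -k * (s - xt) + beta * (s - yt)
        sdot += dt * sddot        # semi-implicit Euler step
        s += dt * sdot
        traj.append(s)
    return np.array(traj)

# Hypothetical input and target trajectories over [0, T].
dt = 0.01
t = np.arange(0.0, 5.0, dt)
x, y = np.sin(t), np.sin(t - 0.3)
k, beta, lr = 1.0, 0.1, 0.05      # parameter, nudging factor, learning rate

s_free = trajectory(k, 0.0, x, y, dt)      # free phase (beta = 0)
s_nudged = trajectory(k, beta, x, y, dt)   # nudged phase (beta > 0)

# dL/dk = -1/2 (s_t - x_t)^2, integrated along each trajectory (Eq. 6.9).
int_free = np.sum(-0.5 * (s_free - x) ** 2) * dt
int_nudged = np.sum(-0.5 * (s_nudged - x) ** 2) * dt
grad_hat = (int_nudged - int_free) / beta

delta_k = -lr * grad_hat                   # local update rule for the parameter k
print(grad_hat, delta_k)
```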

6.1.4. From Energy-Based to Lagrangian-Based Models

Conceptually, we have the following correspondence between the static setting (energy- based models) and the time-varying setting (Lagrangian-based models).

• The concept of configuration ($s$) is replaced by that of trajectory ($s$). A trajectory $s$ is a function from the time interval $[0, T]$ to the space of configurations, which assigns

to each time t ∈ [0,T ] a configuration st. • The concept of energy function (E) is replaced by that of action functional (S). Whereas an energy function E assigns a real number E(s) to each configuration s, an action functional S assigns a real number S(s) to each trajectory s.

• The concept of equilibrium state (denoted $s(\theta, x)$ or $s_\star$) is replaced by that of effective trajectory (denoted $s(\theta, x)$). Whereas an equilibrium state is characterized by the stationarity of the energy ($\frac{\partial E}{\partial s} = 0$), an effective trajectory is characterized by the stationarity of the action ($\frac{\delta S}{\delta s} = 0$).

6.1.5. Lagrangian-Based Models Include Energy-Based Models

Consider a Lagrangian-based model whose Lagrangian function does not depend on $\dot{s}_t$, i.e. $L$ is of the form
$$L(\theta, x_t, s_t, \dot{s}_t) = E(\theta, x_t, s_t). \tag{6.11}$$

Further suppose that the input signal $x$ is static, i.e. $x_t = x$ for any $t$. Denote $s_\star$ the equilibrium state characterized by $\frac{\partial E}{\partial s}(\theta, x, s_\star) = 0$. Then the trajectory $s$ constantly equal to $s_\star$ (i.e. such that $s_t = s_\star$ for all $t$) is a stationary solution of the action functional
$$S(\theta, x, s) = \int_0^T E(\theta, x, s_t)\, dt. \tag{6.12}$$
Indeed, for any variation $\delta s$ around $s$, we have $\delta S = \int_0^T \delta E(\theta, x, s_t)\, dt = \int_0^T \frac{\partial E}{\partial s}(\theta, x, s_\star) \cdot \delta s_t\, dt = 0$. In this sense, energy-based models are special instances of Lagrangian-based models. Furthermore, assuming that the target signal $y$ and the cost function $c_t$ are also static (i.e. $y_t = y$ and $c_t = c$ at any time $t$), then the loss is equal to $L_0^T = \int_0^T c(s_\star, y)\, dt$, which is the loss of the static setting (up to a constant factor $T$). In this case, the EqProp learning algorithm for Lagrangian-based models boils down to the EqProp learning algorithm for energy-based models (up to a constant factor $T$).

6.2. Equilibrium Propagation in Stochastic Systems

Unlike neural networks trained on digital computers, which can reliably process information in a deterministic way, physical systems (including analog circuits and biological networks) are subject to noise. In this section we present an extension of the equilibrium propagation framework to stochastic systems, which allows us to take such forms of noise into account, and may therefore be useful both from the neuromorphic and neuroscience points of view.

We note that the question of whether the brain is stochastic or deterministic is controversial. However, even if the brain were deterministic, the precise trajectory of the neural activity is likely to be fundamentally unpredictable (i.e. chaotic) and thus easier to study statistically. In this case, the brain can still be usefully modelled with probability distributions (using probability theory or ergodic theory).

6.2.1. From Deterministic to Stochastic Systems

In the stochastic setting, when presented with an input $x$, instead of an equilibrium state $s_\star$, the model defines a probability distribution $p_\star(s)$ over the space of possible configurations $s$. Thus, rather than a stationarity condition of the form $\frac{\partial E}{\partial s}(\theta, x, s_\star) = 0$, we now have an equilibrium distribution $p_\star(s)$ such that
$$p_\star(s) = \frac{e^{-E(\theta, x, s)}}{Z_\star}, \qquad \text{with } Z_\star = \int e^{-E(\theta, x, s)}\, ds. \tag{6.13}$$

The probability distribution defined by $p_\star(s)$ is called the Boltzmann distribution (or Gibbs distribution), and the normalizing constant $Z_\star$ is called the partition function. In this setting, the loss that we want to minimize is the expected cost over the equilibrium distribution
$$L^{\rm sto}(\theta, x, y) = \mathbb{E}_{s \sim p_\star(s)}\left[C(s, y)\right]. \tag{6.14}$$
We note that $L^{\rm sto}$ depends on $\theta$ and $x$ through the equilibrium distribution $p_\star(s)$.

6.2.2. Gradient Formula

As in the deterministic framework, the stochastic version of equilibrium propagation makes use of the total energy function $E(\theta, x, s) + \beta\, C(s, y)$. The notion of nudged equilibrium state ($s_\star^\beta$) is replaced accordingly by a nudged equilibrium distribution $p_\star^\beta(s)$, which is the Boltzmann distribution associated to the total energy function, i.e.
$$p_\star^\beta(s) = \frac{e^{-E(\theta, x, s) - \beta\, C(s, y)}}{Z_\star^\beta}, \qquad \text{with } Z_\star^\beta = \int e^{-E(\theta, x, s) - \beta\, C(s, y)}\, ds. \tag{6.15}$$
The following theorem extends Theorem 2.1 to stochastic systems.

Theorem 6.2 (Scellier and Bengio [2017]). The gradient of the objective function with respect to $\theta$ is equal to
$$\frac{\partial L^{\rm sto}}{\partial \theta}(\theta, x, y) = \frac{d}{d\beta}\bigg|_{\beta=0} \mathbb{E}_{s \sim p_\star^\beta(s)}\left[\frac{\partial E}{\partial \theta}(\theta, x, s)\right]. \tag{6.16}$$
Furthermore, if the energy function is sum-separable (in the sense of Eq. 2.5), then
$$\frac{\partial L^{\rm sto}}{\partial \theta_k}(\theta, x, y) = \frac{d}{d\beta}\bigg|_{\beta=0} \mathbb{E}_{s \sim p_\star^\beta(s)}\left[\frac{\partial E_k}{\partial \theta_k}(\theta_k, \{x, s\}_k)\right]. \tag{6.17}$$

Proof of Theorem 6.2. Recall that the total energy function is by definition $F(\theta, \beta, s) = E(\theta, x, s) + \beta\, C(s, y)$, where the notations $x$ and $y$ are dropped for simplicity (since they do not play any role in the proof). We also (re)define $Z_\theta^\beta = \int e^{-F(\theta, \beta, s)}\, ds$, the partition function, as well as $p_\theta^\beta(s) = \frac{e^{-F(\theta, \beta, s)}}{Z_\theta^\beta}$, the corresponding Boltzmann distribution. Recalling the definition of the loss $L^{\rm sto}$ (Eq. 6.14), and using the fact that $\frac{\partial F}{\partial \beta} = C$ and that $\frac{\partial F}{\partial \theta} = \frac{\partial E}{\partial \theta}$, the formula to show (Eq. 6.16) is a particular case of the following formula, evaluated at the point $\beta = 0$:
$$\frac{d}{d\theta}\, \mathbb{E}_{s \sim p_\theta^\beta(s)}\left[\frac{\partial F}{\partial \beta}(\theta, \beta, s)\right] = \frac{d}{d\beta}\, \mathbb{E}_{s \sim p_\theta^\beta(s)}\left[\frac{\partial F}{\partial \theta}(\theta, \beta, s)\right]. \tag{6.18}$$
Therefore, in order to prove Theorem 6.2, it is sufficient to prove Eq. 6.18. We do this in two steps. First, the cross-derivatives of the log-partition function $\ln Z_\theta^\beta$ are equal:

$$\frac{d}{d\theta}\frac{d}{d\beta}\left(-\ln Z_\theta^\beta\right) = \frac{d}{d\beta}\frac{d}{d\theta}\left(-\ln Z_\theta^\beta\right). \tag{6.19}$$
Second, we have
$$\frac{d}{d\beta}\left(-\ln Z_\theta^\beta\right) = \mathbb{E}_{s \sim p_\theta^\beta(s)}\left[\frac{\partial F}{\partial \beta}(\theta, \beta, s)\right], \tag{6.20}$$
and similarly
$$\frac{d}{d\theta}\left(-\ln Z_\theta^\beta\right) = \mathbb{E}_{s \sim p_\theta^\beta(s)}\left[\frac{\partial F}{\partial \theta}(\theta, \beta, s)\right]. \tag{6.21}$$
Plugging Eq. 6.20 and Eq. 6.21 in Eq. 6.19, we get Eq. 6.18. Hence the result. □
We note that Eq. 6.18 is a stochastic variant of the fundamental lemma of EqProp (Lemma 2.2). The quantity $-\ln Z_\theta^\beta$ is called the free energy of the system.
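For intuition, Theorem 6.2 can be checked numerically when the state space is small enough to enumerate, so that the Boltzmann distributions and the expectations are computed exactly (the integral over $s$ becomes a sum). The sketch below does this for a hypothetical two-neuron binary model; the energy function, the cost function and the parameter values are illustrative assumptions, and the two derivatives of Eq. 6.16 are approximated by finite differences.

```python
import itertools
import numpy as np

def energy(theta, s):
    # Hypothetical energy of a two-neuron binary model; theta is a coupling weight.
    return -theta * s[0] * s[1] - 0.3 * s[0] + 0.2 * s[1]

def dE_dtheta(s):
    return -s[0] * s[1]

def cost(s):
    # Hypothetical cost: we would like the state s = (1, 0).
    return (s[0] - 1.0) ** 2 + s[1] ** 2

states = list(itertools.product([0.0, 1.0], repeat=2))

def boltzmann(theta, beta):
    # Exact Boltzmann distribution of the total energy E + beta*C (Eq. 6.15).
    w = np.array([np.exp(-energy(theta, s) - beta * cost(s)) for s in states])
    return w / w.sum()

def loss(theta):
    # L_sto(theta) = E_{s ~ p_star}[C(s)]  (Eq. 6.14).
    p = boltzmann(theta, 0.0)
    return sum(p_s * cost(s) for p_s, s in zip(p, states))

theta, beta, h = 0.5, 1e-3, 1e-5

# Left-hand side of Eq. 6.16: dL_sto/dtheta by central finite differences.
grad_exact = (loss(theta + h) - loss(theta - h)) / (2 * h)

# Right-hand side of Eq. 6.16: d/dbeta of E_{p_beta}[dE/dtheta], again by central differences.
def expected_dE(beta_):
    p = boltzmann(theta, beta_)
    return sum(p_s * dE_dtheta(s) for p_s, s in zip(p, states))

grad_eqprop = (expected_dE(beta) - expected_dE(-beta)) / (2 * beta)

print(grad_exact, grad_eqprop)   # the two numbers should agree closely
```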

6.2.3. Langevin Dynamics

The prototypical dynamical system to sample from the equilibrium distribution $p_\star(s)$ (Eq. 6.13) is the Langevin dynamics, which we describe here. Recall from Chapter 3 the

gradient dynamics $\frac{ds_t}{dt} = -\frac{\partial E}{\partial s}(\theta, x, s_t)$. To go from this (deterministic) gradient dynamics to the (stochastic) Langevin dynamics, we add a new term (a Brownian term) which models a form of noise:
$$dS_t = -\frac{\partial E}{\partial s}(\theta, x, S_t)\, dt + \sqrt{2}\, dB_t. \tag{6.22}$$
In this expression, $B_t$ is a mathematical object called a Brownian motion. Instead of defining $B_t$ formally, we give here an intuitive definition. Intuitively, each increment $dB_t$ (between time $t$ and time $t + dt$) can be thought of as a normal random variable with mean 0 and variance $dt$, which is "independent of past increments". By following this noisy form of gradient descent with respect to the energy function $E$, the state of the system ($S_t$) settles to the Boltzmann distribution. This can be proved using the Kolmogorov forward equation (a.k.a. Fokker-Planck equation) for diffusion processes.
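In a numerical simulation, the Brownian increment over a small time step $\Delta t$ is simply a Gaussian sample of variance $\Delta t$. A minimal Euler–Maruyama discretization of Eq. 6.22 might look as follows; the quadratic energy and the step size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
dt, n_steps = 1e-3, 100_000

def grad_E(s):
    # Gradient of a hypothetical quadratic energy E(s) = 1/2 (s - 1)^2.
    return s - 1.0

s = 0.0
samples = []
for _ in range(n_steps):
    dB = np.sqrt(dt) * rng.standard_normal()      # Brownian increment: N(0, dt)
    s = s - grad_E(s) * dt + np.sqrt(2.0) * dB    # Euler-Maruyama step of Eq. 6.22
    samples.append(s)

# For this energy the Boltzmann distribution is Gaussian with mean 1 and variance 1,
# which the empirical statistics of the chain approximately reproduce.
print(np.mean(samples), np.var(samples))
```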

Here we have chosen the constant $\sqrt{2}$ in the Langevin dynamics, so that the ‘temperature’ of the system is 1. More generally, if the Brownian motion is scaled by a factor $\sigma(\theta, x)$, i.e.

if the dynamics is of the form $dS_t = -\frac{\partial E}{\partial s}(\theta, x, S_t)\, dt + \sigma(\theta, x) \cdot dB_t$, then the exponent in the Boltzmann distribution needs to be rescaled by a factor $\frac{1}{2}\sigma^2(\theta, x)$. We call this modified equilibrium distribution the Boltzmann distribution with temperature $T = \frac{1}{2}\sigma^2(\theta, x)$. We note that if $\sigma(\theta, x) = \sqrt{2}$ then $T = 1$.

6.2.4. Equilibrium Propagation in Langevin Dynamics

In the setting of Langevin dynamics, EqProp takes the following form.

Free phase (inference). In the free phase, the network is shown an input $x$ and the state of the system follows the Langevin dynamics of Eq. 6.22. ‘Free samples’ are drawn from the equilibrium distribution $p_\star(s) \propto e^{-E(\theta, x, s)}$.

Nudged phase. In the nudged phase, a term $-\beta\, \frac{\partial C}{\partial s}$ is added to the dynamics of Eq. 6.22, where $\beta$ is a scalar hyperparameter (the nudging factor). Denoting $S_t^\beta$ the state of the network at time $t$ in the nudged phase, the dynamics reads:
$$dS_t^\beta = \left(-\frac{\partial E}{\partial s}\left(\theta, x, S_t^\beta\right) - \beta\, \frac{\partial C}{\partial s}\left(S_t^\beta, y\right)\right) dt + \sqrt{2}\, dB_t. \tag{6.23}$$

Here for readability we use the same notation $B_t$ for the Brownian motion of the nudged phase, but it should be understood that this is a new Brownian motion, independent of the one used in the free phase. ‘Nudged samples’ are drawn from the nudged distribution $p_\star^\beta(s) \propto e^{-E(\theta, x, s) - \beta\, C(s, y)}$.

Gradient estimate. Finally, the gradient of the loss $L^{\rm sto}$ of Eq. 6.14 can be approximated using the samples from the free and nudged distributions to estimate:
$$\hat{\nabla}_\theta(\beta) = \frac{1}{\beta}\left(\mathbb{E}_{s \sim p_\star^\beta(s)}\left[\frac{\partial E}{\partial \theta}(\theta, x, s)\right] - \mathbb{E}_{s \sim p_\star(s)}\left[\frac{\partial E}{\partial \theta}(\theta, x, s)\right]\right). \tag{6.24}$$
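A minimal simulation of this two-phase scheme might look as follows. The energy function (a single scalar ‘weight’ coupling a fixed input to one neuron), the step sizes and the sample sizes are illustrative assumptions; in practice the expectations in Eq. 6.24 are estimated from finitely many (correlated) samples of the two Langevin chains.

```python
import numpy as np

rng = np.random.default_rng(1)
theta, x, y, beta = 0.5, 1.0, 2.0, 0.1
dt, n_burn, n_samples = 0.05, 2_000, 200_000

def dE_ds(s):      # dE/ds for the toy energy E = 1/2 (s - theta*x)^2
    return s - theta * x

def dE_dtheta(s):  # dE/dtheta for the same energy
    return -(s - theta * x) * x

def dC_ds(s):      # dC/ds for the cost C = 1/2 (s - y)^2
    return s - y

def langevin_mean_dE_dtheta(nudging):
    # Run a Langevin chain (Eq. 6.22 / Eq. 6.23) and average dE/dtheta over its samples.
    s, acc = 0.0, 0.0
    for i in range(n_burn + n_samples):
        drift = -dE_ds(s) - nudging * dC_ds(s)
        s = s + drift * dt + np.sqrt(2 * dt) * rng.standard_normal()
        if i >= n_burn:
            acc += dE_dtheta(s)
    return acc / n_samples

grad_hat = (langevin_mean_dE_dtheta(beta) - langevin_mean_dE_dtheta(0.0)) / beta  # Eq. 6.24
print(grad_hat, (theta * x - y) * x)   # estimate vs. exact gradient for this Gaussian toy model
```

The two printed numbers should roughly agree; the finite nudging factor and the finite number of samples both introduce error, and using the symmetric estimator over $\pm\beta$ would remove the leading bias, as in the deterministic setting.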

6.3. Contrastive Meta-Learning

Recently, Zucchet et al. [2021] introduced the contrastive meta-learning framework, where they propose to train the meta-parameters of a meta-learning model using the EqProp method. In this section, we briefly present the setting of meta-learning and show how Zucchet et al. [2021] derive the contrastive meta-learning rule.

6.3.1. Meta-Learning and Few-Shot Learning

Meta learning, or learning to learn, is a broad field that encompasses hyperparameter optimization, few-shot learning, and many other use cases. Here, for concreteness, we present the setting of few-shot learning. In the setting of few-shot learning, the aim is to build a system that is able to learn (or ‘adapt’ to) a given task T when only very limited data is available for that task. The system should be able to do so for a variety of tasks coming from a distribution of tasks p(T ). In this setting, the system has two types of parameters: a meta-parameter θ which is shared across all tasks, and a task-specific parameter φ which can be adapted to a given task. In the adaptation phase, the task-specific parameter φ adapts to some task T using a training set Dtrain corresponding to that task. The resulting value of the task-specific parameter after this adaptation phase is denoted φ(θ, Dtrain), which depends on both θ and Dtrain.

The performance of the resulting $\phi(\theta, D_{\rm train})$ is then evaluated on a test set $D_{\rm test}$ from the same task $\mathcal{T}$. This performance is denoted $L(\phi(\theta, D_{\rm train}), D_{\rm test})$, where $L$ is a loss function. The goal of meta-learning is then to find the value of the meta-parameter $\theta$ that minimizes the expected loss $R(\theta) = \mathbb{E}_{(D_{\rm train}, D_{\rm test})}\left[L(\phi(\theta, D_{\rm train}), D_{\rm test})\right]$ over pairs of training/test sets $(D_{\rm train}, D_{\rm test})$ coming from the distribution of tasks $p(\mathcal{T})$. In other words, the goal is to find $\theta$ that generalizes well across tasks from the distribution $p(\mathcal{T})$.

6.3.2. Contrastive Meta-Learning

The idea of the contrastive meta-learning framework of Zucchet et al. [2021] is the following. In the adaptation phase, the task-specific parameter $\phi$ minimizes an inner loss $L^{\rm in}$, so that
$$\phi(\theta, D_{\rm train}) = \arg\min_\phi L^{\rm in}(\theta, D_{\rm train}, \phi). \tag{6.25}$$
In the setting of regularization learning for example, the inner loss is of the form $L^{\rm in}(\theta, D_{\rm train}, \phi) = L(\phi, D_{\rm train}) + R(\theta, \phi)$, where $L$ is the same loss as the one used on the test set, and $R(\theta, \phi)$ is a regularization term. Exploiting the fact that, at the end of the adaptation phase, the task-specific parameter $\phi(\theta, D_{\rm train})$ satisfies the ‘equilibrium condition’ $\frac{\partial L^{\rm in}}{\partial \phi}(\theta, D_{\rm train}, \phi(\theta, D_{\rm train})) = 0$, Zucchet et al. [2021] then propose to use the EqProp method to compute the gradients (with respect to $\theta$) of the meta-loss

$$L_{\rm meta} = L^{\rm out}(\phi(\theta, D_{\rm train}), D_{\rm test}), \tag{6.26}$$
where $L^{\rm out}$ is a so-called outer loss, e.g. $L^{\rm out} = L$ in regularization learning. To this end, they introduce the ‘total loss’

$$L^{\rm total}(\theta, \beta, \phi) = L^{\rm in}(\theta, D_{\rm train}, \phi) + \beta\, L^{\rm out}(\phi, D_{\rm test}), \tag{6.27}$$

101 where β is a scalar parameter (the nudging factor). Then they consider

$$\phi_\star^\beta = \arg\min_\phi L^{\rm total}(\theta, \beta, \phi), \tag{6.28}$$
defined for any $\beta$, and they exploit the formula

$$\frac{\partial L_{\rm meta}}{\partial \theta} = \frac{d}{d\beta}\bigg|_{\beta=0} \frac{\partial L^{\rm in}}{\partial \theta}\left(\theta, D_{\rm train}, \phi_\star^\beta\right), \tag{6.29}$$
which is a reformulation of Theorem 2.1. More generally, the contrastive meta-learning method applies to any functions $L^{\rm in}$ and $L^{\rm out}$, and any bilevel optimization problem where the aim is to optimize $L_{\rm meta}(\theta) = L^{\rm out}(\theta, \phi(\theta))$ with respect to $\theta$, under the constraint that $\frac{\partial L^{\rm in}}{\partial \phi}(\theta, \phi(\theta)) = 0$.
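As an illustration of Eq. 6.29, the following sketch estimates the gradient of the meta-loss with respect to a single meta-parameter, here a hypothetical L2-regularization strength in a ridge-regression adaptation phase. The synthetic data, the closed-form inner optimizer and the finite nudging factor are illustrative assumptions; a symmetric difference over $\pm\beta$ could again be used to reduce the bias of the estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 20, 20, 5
A_tr, A_te = rng.standard_normal((n_train, d)), rng.standard_normal((n_test, d))
w_true = rng.standard_normal(d)
b_tr = A_tr @ w_true + 0.1 * rng.standard_normal(n_train)
b_te = A_te @ w_true + 0.1 * rng.standard_normal(n_test)

def adapt(theta, beta):
    # phi minimizing L_total = L_in + beta * L_out, where
    #   L_in(theta, phi) = 1/2 ||A_tr phi - b_tr||^2 + 1/2 exp(theta) ||phi||^2
    #   L_out(phi)       = 1/2 ||A_te phi - b_te||^2
    # (closed-form ridge solution used as the inner optimizer).
    H = A_tr.T @ A_tr + np.exp(theta) * np.eye(d) + beta * (A_te.T @ A_te)
    g = A_tr.T @ b_tr + beta * (A_te.T @ b_te)
    return np.linalg.solve(H, g)

def dLin_dtheta(theta, phi):
    # Partial derivative of the inner loss with respect to the meta-parameter theta.
    return 0.5 * np.exp(theta) * np.sum(phi ** 2)

def meta_loss(theta):
    phi = adapt(theta, 0.0)
    return 0.5 * np.sum((A_te @ phi - b_te) ** 2)

theta, beta = 0.0, 1e-3

# Contrastive estimate of dL_meta/dtheta (finite-beta version of Eq. 6.29).
phi_free, phi_nudged = adapt(theta, 0.0), adapt(theta, beta)
grad_contrastive = (dLin_dtheta(theta, phi_nudged) - dLin_dtheta(theta, phi_free)) / beta

# Reference value by central finite differences on the meta-loss itself.
h = 1e-5
grad_fd = (meta_loss(theta + h) - meta_loss(theta - h)) / (2 * h)

print(grad_contrastive, grad_fd)   # the two estimates should agree closely
```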

Chapter 7

Conclusion

In this thesis, we have presented a mathematical framework that applies to systems that are described by variational equations, while maintaining the benefits of backpropagation. This framework may have implications both for neuromorphic computing and for neuroscience.

7.1. Implications for Neuromorphic Computing

Current deep learning research is grounded on a very general and powerful mathematical principle: automatic differentiation for backpropagating error gradients in differentiable neural networks. This mathematical principle is at the heart of all deep learning libraries (TensorFlow, PyTorch, Theano, etc.). The emergence of such software libraries has greatly eased deep learning research and fostered the large scale development of parallel processors for deep learning (e.g. GPUs and TPUs). However, these processors are power inefficient by orders of magnitude, if we take the brain as a benchmark. The rapid increase in energy consumption raises concerns as the use of deep learning systems in society keeps growing [Strubell et al., 2019]. At a more abstract level of description, the backpropagation algorithm of conventional deep learning makes it possible to train neural networks by stochastic gradient descent (SGD). In this thesis, we have presented a mathematical framework which preserves the key benefits of SGD, but opens a path for implementation on neuromorphic processors which directly exploit physics and the in-memory computing concept to perform the desired computations. Building neuromorphic systems that can match the performance of current deep learning systems is still in the future, but the potential speedup and power reduction is extremely appealing. This would also allow us to scale neural networks to sizes beyond the reach of current GPU-based deep learning models. Besides, by mimicking the working mechanisms of the brain more closely, such neuromorphic systems could also inform neuroscience.

7.2. Implications for Neuroscience

How do the biophysical mechanisms of neural computation give rise to intelligence? Ultimately, if we want to explain how our thoughts, memories and behaviours emerge from neural activities, we need a mathematical theory. Here, we explain how the mathematical framework presented in this thesis may help for this purpose.

7.2.1. Variational Formulations of Neural Computation?

A number of ideas at the core of today’s deep learning systems draw inspiration from the brain. However, these deep neural networks are not biologically realistic in detail. In particular, the neuron models may look overly simplistic from a neurophysiological point of view. In these models, the state of a neuron is described by a single number, which can be thought of as its firing rate. A real neuron on the other hand, like any other biological cell, is an extraordinarily complex piece of machinery, composed of a very large quantity of proteins interacting in complex ways. Because of this complexity, the hope to ever come up with a mathematical theory of the brain may seem vain. This complexity should not discourage us, however. One key point is that not all details of neurobiology may be relevant to explain the fundamental working mechanisms of the brain that give rise to emerging properties such as memory and learning. Hertz [1991] puts it in these words: "Just as most of the details of the separate parts of a large ship are unimportant in understanding the behaviour of the ship (e.g. that it floats or transport cargo), so many details of single nerve cells may be unimportant in understanding the collective behaviour of a network of cells". Which biophysical characteristics of neural computation are essential to explain how information is processed in brains, and which can be abstracted away? While current deep learning systems use rate models (i.e. neuron models relying on the neuron’s firing rate), a simple but more realistic neuron model is the leaky integrate-and-fire (LIF) model, which accounts for the spikes (a.k.a. action potentials) and the electrical activity of neurons at each point in time. A more elaborate model is the Hodgkin-Huxley model of action potentials, which takes into account ion channels to describe how spikes are initiated. At a more detailed level, real neurons have a spatial layout, and each part of the neuron has its own voltage value and ion concentration values. In recent years, more realistic neuron models that include spikes [Zenke and Ganguli, 2017, Payeur et al., 2020] and multiple compartments [Bengio et al., 2016, Guerguiev et al., 2017, Sacramento et al., 2018, Richards and Lillicrap, 2019, Payeur et al., 2020] have been proposed for deep learning. Can we figure out which elements of neurobiology are essential to explain the mechanisms underlying intelligence, abstracting out those that are not necessary to understand these mechanisms?

In this thesis, we have presented a mathematical theory which applies to a broad class of systems whose state or dynamics is the solution of a variational equation. Given the predominance of variational principles in physics, a question arises: can neural dynamics in the brain be derived from variational principles too? We note that various variational principles for neuroscience modelling have been proposed [Friston, 2010, Betti et al., 2019, Dold et al., 2019, Kendall, 2021].

7.2.2. SGD Hypothesis of Learning

Today, the neural networks of conventional deep learning are trained by stochastic gradient descent (SGD), using the backpropagation algorithm to compute the loss gradients. The backpropagation algorithm is not biologically realistic as it requires that neurons emit two quite different types of signals: an activation signal in the forward pass, and a signed gradient signal in the backward pass. Real neurons on the other hand communicate with only one sort of signal: spikes. Worse, the backpropagation through time (BPTT) algorithm used in recurrent networks requires storing past hidden states of the neurons. Although these deep neural networks are not biologically realistic in detail, they have proved to be valuable not just for AI applications, but also as models for neuroscience. In recent years, deep learning models have been used for neuroscience modelling of the visual and auditory cortex. Deep neural networks have been found to outperform other biologically plausible models at matching neural representations in the visual cortex [Mante et al., 2013, Cadieu et al., 2014, Kriegeskorte, 2015, Sussillo et al., 2015, Yamins and DiCarlo, 2016, Pandarinath et al., 2018] and at predicting auditory neuron responses [Kell et al., 2018]. Because SGD-optimized neural networks are state-of-the-art at solving a variety of tasks in AI, and also state-of-the-art models at predicting neocortical representations, a hypothesis emerges: the cortex may possess general-purpose learning algorithms that implement SGD. More generally, a broader view emerges: the fundamental principles of current deep learning systems may provide a useful theoretical framework for gaining insight into the principles of neural computation [Richards et al., 2019]. While the backpropagation algorithm is not biologically realistic, a more reasonable hypothesis is that the brain uses a different mechanism to compute the loss gradients required to perform SGD. A long-standing idea is that the loss gradients may be encoded in the difference of neural activities to drive synaptic changes [Hinton and McClelland, 1988, Lillicrap et al., 2020]. If variational principles for neural dynamics exist, and if their corresponding energy function or Lagrangian function has the sum-separability property, then EqProp would suggest a learning mechanism involving local learning rules and compatible with optimization by SGD. Whereas in the setting of energy-based models, EqProp suggests that gradients are encoded in the difference of neural activities (as hypothesized by Lillicrap et al. [2020]), in

the Lagrangian-based setting, EqProp suggests that gradients are encoded in the difference of neural trajectories. We note that the SGD hypothesis of learning also raises several questions. First, what is the loss function that is optimized? Unlike in conventional machine learning, there are likely a variety of such loss functions, which may vary across brain areas and time [Marblestone et al., 2016]. Second, SGD dynamics depend on the metric that we choose for the space of synaptic weights [Surace et al., 2020]. Also, while the SGD hypothesis is reasonable for the function of the cortex, other components of the brain such as the hippocampus may use different learning algorithms.

7.2.3. The Role of Evolution

In this manuscript we have emphasized the importance of learning. The ability for individuals to learn within their lifetime is indeed an essential component of intelligence. But learning is not the only key to human and animal intelligence. Far from being a blank slate, at birth, the brain is pre-wired and structured. This structure provides us straight from birth with innate intuitions, abilities, and mechanisms which make us predisposed to learn much more quickly [Dehaene, 2020, Chapters 3 and 4]. These innate structures and mechanisms have arisen through evolution. Machine learning models account for these innate aspects of intelligence using inductive biases (or priors). Traditionally, these inductive biases are manually crafted. However, given the complexity of the brain, one may wonder whether one will ever manage to reverse-engineer the inductive biases of the brain ‘by hand’. Evolution by natural selection can be regarded as another optimization process where, loosely speaking, the ‘adjustable parameters’ are the genes, and the ‘objective’ that is maximized is the fitness of the individual. The human genome has around $3 \times 10^9$ ‘parameters’ (base pairs). Just like moving from manually crafted computations (in classical AI) to learned computations (in machine learning) proved extremely fruitful both for AI and neuroscience modelling, one may benefit from ‘evolving’ inductive biases, by mimicking the process of evolution in some way. One branch of machine learning which is relevant for addressing questions related to the optimization process carried out by evolution is meta-learning (Section 6.3). A related path, proposed by Zador [2019], is to reverse-engineer the program encoded in the genome which wires up the brain during embryonic development. We note that the learning rules and loss functions of the brain have also arisen through evolution and are possibly much more complex than in the traditional view of machine learning (as we have formulated it in this manuscript).

7.3. Synergy Between Neuroscience and AI

Is a mathematical theory of the brain all we need to understand the brain? Or do we need to build brain-like machines to claim that we understand it? This question depends of course on what we mean by ‘understanding’; it is one of the fundamental questions of philosophy of science. In many fields of science, we have mathematical models of objects that we cannot build (for example, we have physics models of the Sun, but we cannot build one). Although a theory is all we need in principle to explain the measurements of experimentally accessible variables, it seems also clear that, if we can build a brain, or simulate one, our ‘understanding’ of the brain will further improve, and the underlying theory will become more plausible. Can we simulate a brain in software? In the introductory chapter, we have argued that with current digital hardware this strategy would at best be extremely slow and power hungry, and more likely just unfeasible. Just like it is impossible for statistical physicists to simulate in software the internal dynamics of a fluid composed of $10^{23}$ particles, simulating a brain composed of $10^{11}$ neurons and $10^{15}$ synapses (and many more proteins) seems unfeasible. In these respects, the development of appropriate neuromorphic systems will eventually be necessary to emulate a brain. More likely, by making it possible to run and train neural networks with more elaborate neural dynamics that more closely mimic those of real neurons, the development of neuromorphic hardware will help us come up with new hypotheses about the working mechanisms of the brain. As we build more brain-like AI systems, and as the performance of these AI systems improves, we can formulate new mathematical theories of the brain. Just like the rise of deep learning as a leading approach to AI has eased the flow of information between different fields of AI (computer vision, speech recognition, natural language processing, etc.), we can expect that the development of neuromorphic systems together with mathematical frameworks to train them will ease the flow of information between AI and neuroscience too. The problem of intelligence is thus both a problem for the natural sciences and for engineering. It counts among the greatest scientific problems, together with the problem of the origin of life, the problem of the origin of the universe, and many others. One specificity of the problem of intelligence is that, as we make progress towards solving this problem, we can use the knowledge that we acquire to build machines that can help us solve other scientific problems more easily. For example, most recently, a program called AlphaFold 2 promises to help us discover the 3D structure of proteins much more rapidly than prior methods, which is key to understanding most biological mechanisms in living organisms.


References

M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. Tensorflow: A system for large-scale machine learning. In 12th {USENIX} symposium on operating systems design and implementation ({OSDI} 16), pages 265–283, 2016. D. H. Ackley, G. E. Hinton, and T. J. Sejnowski. A learning algorithm for boltzmann machines. Cognitive science, 9(1):147–169, 1985. L. B. Almeida. A learning rule for asynchronous with feedback in a combinatorial environment. In Proceedings, 1st First International Conference on Neural Networks, volume 2, pages 609–618. IEEE, 1987. S. Arora. Toward theoretical understanding of deep learning. URL https://www. cs. prince- ton. edu/courses/archive/fall18/cos597G/lecnotes/lecture3. pdf, 2018. D. Attwell and S. B. Laughlin. An energy budget for signaling in the grey matter of the brain. Journal of Cerebral Blood Flow & Metabolism, 21(10):1133–1145, 2001. D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014. P. Baldi and F. Pineda. Contrastive learning and neural oscillations. Neural Computation, 3(4):526–545, 1991. Y. Bengio. Learning deep architectures for AI. Now Publishers Inc, 2009. Y. Bengio, B. Scellier, O. Bilaniuk, J. Sacramento, and W. Senn. Feedforward initialization for fast inference of deep generative networks is biologically plausible. arXiv preprint arXiv:1606.01651, 2016. Y. Bengio, T. Mesnard, A. Fischer, S. Zhang, and Y. Wu. Stdp-compatible approximation of backpropagation in an energy-based model. Neural computation, 29(3):555–577, 2017. J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio. Theano: a cpu and gpu math expression compiler. In Proceedings of the Python for scientific computing conference (SciPy), volume 4, pages 1–7. Austin, TX, 2010. J. Bernstein, Y.-X. Wang, K. Azizzadenesheli, and A. Anandkumar. signsgd: Compressed optimisation for non-convex problems. arXiv preprint arXiv:1802.04434, 2018. A. Betti, M. Gori, and S. Melacci. Cognitive action laws: the case of visual features. IEEE transactions on neural networks and learning systems, 31(3):938–949, 2019. T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020. G. W. Burr, R. M. Shelby, A. Sebastian, S. Kim, S. Kim, S. Sidler, K. Virwani, M. Ishii, P. Narayanan, A. Fumarola, et al. Neuromorphic computing using non-volatile memory. Advances in Physics: X, 2(1):89–124, 2017. Cadence Design Systems, Inc. Spectre circuit simulator reference, version 19.1, Jan 2020. C. F. Cadieu, H. Hong, D. L. Yamins, N. Pinto, D. Ardila, E. A. Solomon, N. J. Majaj, and J. J. DiCarlo. Deep neural networks rival the representation of primate it cortex for core visual object recognition. PLoS Comput Biol, 10(12):e1003963, 2014. M. Campbell, A. J. Hoane Jr, and F.-h. Hsu. Deep blue. Artificial intelligence, 134(1-2): 57–83, 2002. W. Chan, N. Jaitly, Q. Le, and O. Vinyals. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In 2016 IEEE International Con- ference on Acoustics, Speech and Signal Processing (ICASSP), pages 4960–4964. IEEE, 2016. C.-C. Chang, P.-C. Chen, T. Chou, I.-T. Wang, B. Hudec, C.-C. Chang, C.-M. Tsai, T.-S. Chang, and T.-H. Hou. 
Mitigating asymmetric nonlinear weight update effects in hardware neural network based on analog resistive synapse. IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 8(1):116–124, 2017. K. Christianson and L. Erickson. The dirichlet problem on directed networks. 2007. URL https://sites.math.washington.edu/~reu/papers/2007/KariLindsay/dirichlet. pdf. Accessed: 2020-05-31. L. Chua. -the missing circuit element. IEEE Transactions on circuit theory, 18 (5):507–519, 1971. M. A. Cohen and S. Grossberg. Absolute stability of global pattern formation and parallel memory storage by competitive neural networks. IEEE transactions on systems, man, and cybernetics, (5):815–826, 1983. S. Dehaene. How We Learn: Why Brains Learn Better Than Any Machine... for Now. Penguin, 2020. D. Dold, A. F. Kungl, J. Sacramento, M. A. Petrovici, K. Schindler, J. Binas, Y. Bengio, and W. Senn. Lagrangian dynamics of dendritic microcircuits enables real-time backprop- agation of errors. target, 100(1):2, 2019. M. Ernoult, J. Grollier, D. Querlioz, Y. Bengio, and B. Scellier. Updates of equilibrium prop match gradients of backprop through time in an rnn with static input. In Advances in Neural Information Processing Systems, pages 7079–7089, 2019.

110 M. Ernoult, J. Grollier, D. Querlioz, Y. Bengio, and B. Scellier. Equilibrium propagation with continual weight updates. arXiv preprint arXiv:2005.04168, 2020. R. P. Feynman. The principle of least action in quantum mechanics. In Feynman’s Thesis—A New Approach To Quantum Theory, pages 1–69. World Scientific, 1942. A. N. Foroushani, H. Assaf, F. H. Noshahr, Y. Savaria, and M. Sawan. Analog circuits to accelerate the relaxation process in the equilibrium propagation algorithm. In 2020 IEEE International Symposium on Circuits and Systems (ISCAS), pages 1–5. IEEE, 2020. K. Friston. The free-energy principle: a unified brain theory? Nature reviews neuroscience, 11(2):127–138, 2010. K. Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of unaffected by shift in position. Biological cybernetics, 36(4):193–202, 1980. J. Gammell, S. W. Nam, and A. N. McCaughan. Layer-skipping connections facilitate training of layered networks using equilibrium propagation. In International Conference on Neuromorphic Systems 2020, pages 1–4, 2020. X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256, 2010. X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 315–323, 2011. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014. I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio. Deep learning, volume 1. MIT press Cambridge, 2016. A. Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013. J. Guerguiev, T. P. Lillicrap, and B. A. Richards. Towards deep learning with segregated dendrites. ELife, 6:e22901, 2017. J. Harris, C. Koch, J. Luo, and J. Wyatt. Resistive fuses: Analog hardware for detecting discontinuities in early vision. In Analog VLSI implementation of neural systems, pages 27–55. Springer, 1989. K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human- level performance on classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015. K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages

111 770–778, 2016. J. A. Hertz. Introduction to the theory of neural computation. CRC Press, 1991. G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012. G. E. Hinton and J. L. McClelland. Learning representations by recirculation. In Neural information processing systems, pages 358–366, 1988. G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural computation, 18(7):1527–1554, 2006. S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8): 1735–1780, 1997. J. J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the national academy of sciences, 79(8):2554–2558, 1982. J. J. Hopfield. Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the national academy of sciences, 81(10):3088– 3092, 1984. M. Hu, J. P. Strachan, Z. Li, E. M. Grafals, N. Davila, C. Graves, S. Lam, N. Ge, J. J. Yang, and R. S. Williams. Dot-product engine for neuromorphic computing: Programming 1t1m crossbar to accelerate matrix-vector multiplication. In 2016 53nd ACM/EDAC/IEEE Design Automation Conference (DAC), pages 1–6. IEEE, 2016. J. Hutchinson, C. Koch, J. Luo, and C. Mead. Computing motion using analog and binary resistive networks. Computer, 21(3):52–63, 1988. S. Ioffe and C. Szegedy. : Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015. Z. Ji and W. Gross. Towards efficient on-chip learning using equilibrium propagation. In 2020 IEEE International Symposium on Circuits and Systems (ISCAS), pages 1–5. IEEE, 2020. W. Johnson. Nonlinear electrical networks, 2010. URL https://sites.math.washington. edu/~reu/papers/2017/willjohnson/directed-networks.pdf. Accessed: 2020-05-31. T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adver- sarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4401–4410, 2019. A. J. Kell, D. L. Yamins, E. N. Shook, S. V. Norman-Haignere, and J. H. McDermott. A task- optimized neural network replicates human auditory behavior, predicts brain responses, and reveals a cortical processing hierarchy. Neuron, 98(3):630–644, 2018. C. P. Kempes, D. Wolpert, Z. Cohen, and J. Pérez-Mercader. The thermodynamic efficiency of computations made in cells across the range of life. Philosophical Transactions of the



Appendix A

Gradient Estimators

Recall from Theorem 2.1 that the loss gradient is equal to

$$\frac{\partial \mathcal{L}}{\partial \theta}(\theta, x, y) = \left.\frac{d}{d\beta}\right|_{\beta=0} \frac{\partial E}{\partial \theta}\left(\theta, x, s_\star^\beta\right). \tag{A.1}$$

The one-sided gradient estimator is defined as

$$\widehat{\nabla}_\theta(\beta) = \frac{1}{\beta}\left(\frac{\partial E}{\partial \theta}\left(\theta, x, s_\star^\beta\right) - \frac{\partial E}{\partial \theta}\left(\theta, x, s_\star^0\right)\right), \tag{A.2}$$

and the symmetric gradient estimator is defined as

$$\widehat{\nabla}_\theta^{\mathrm{sym}}(\beta) = \frac{1}{2\beta}\left(\frac{\partial E}{\partial \theta}\left(\theta, x, s_\star^\beta\right) - \frac{\partial E}{\partial \theta}\left(\theta, x, s_\star^{-\beta}\right)\right). \tag{A.3}$$

Lemma A.1. Let $\theta$, $x$ and $y$ be fixed. Assuming that the function $\beta \mapsto \frac{\partial E}{\partial \theta}(\theta, x, s_\star^\beta)$ is three times differentiable, we have, as $\beta \to 0$:

$$\widehat{\nabla}_\theta(\beta) = \frac{\partial \mathcal{L}}{\partial \theta}(\theta, x, y) + \frac{A}{2}\,\beta + O(\beta^2), \tag{A.4}$$

$$\widehat{\nabla}_\theta^{\mathrm{sym}}(\beta) = \frac{\partial \mathcal{L}}{\partial \theta}(\theta, x, y) + O(\beta^2), \tag{A.5}$$

where $A = \left.\frac{d^2}{d\beta^2}\right|_{\beta=0} \frac{\partial E}{\partial \theta}(\theta, x, s_\star^\beta)$ is a constant (independent of $\beta$, but dependent on $\theta$, $x$ and $y$).

Lemma A.1 shows that the one-sided estimator $\widehat{\nabla}_\theta(\beta)$ possesses a first-order error term in $\beta$, which the symmetric estimator $\widehat{\nabla}_\theta^{\mathrm{sym}}(\beta)$ eliminates.

Proof. Define $f(\beta) = \frac{\partial E}{\partial \theta}(\theta, x, s_\star^\beta)$, and note that $f'(0) = \frac{\partial \mathcal{L}}{\partial \theta}(\theta, x, y)$ by Theorem 2.1, and that $f''(0) = \left.\frac{d^2}{d\beta^2}\right|_{\beta=0} \frac{\partial E}{\partial \theta}(\theta, x, s_\star^\beta) = A$. As $\beta \to 0$, we have the Taylor expansion $f(\beta) = f(0) + \beta f'(0) + \frac{\beta^2}{2} f''(0) + O(\beta^3)$. With these notations, the one-sided estimator reads $\widehat{\nabla}_\theta(\beta) = \frac{1}{\beta}\left(f(\beta) - f(0)\right) = f'(0) + \frac{\beta}{2} f''(0) + O(\beta^2)$, and the symmetric estimator, which is the mean of $\widehat{\nabla}_\theta(\beta)$ and $\widehat{\nabla}_\theta(-\beta)$, reads $\widehat{\nabla}_\theta^{\mathrm{sym}}(\beta) = f'(0) + O(\beta^2)$. □
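To make Lemma A.1 concrete, here is a minimal numerical sketch on a hypothetical toy model (not taken from the thesis): a single scalar parameter $\theta$ and a single scalar state $s$, with energy $E(\theta, s) = \frac{1}{2}(s - \theta)^2$ and cost $C(s, y) = \frac{1}{2}(s - y)^2$. For this choice the nudged equilibrium $s_\star^\beta = (\theta + \beta y)/(1 + \beta)$ and the exact loss gradient $\partial \mathcal{L}/\partial \theta = \theta - y$ are available in closed form, so the biases of the one-sided estimator (A.2) and of the symmetric estimator (A.3) can be measured directly. All function names below are illustrative.

```python
# Minimal numerical illustration of Lemma A.1 on a hypothetical toy model:
#   energy  E(theta, s) = (s - theta)^2 / 2,   cost  C(s, y) = (s - y)^2 / 2,
#   nudged total energy  F_beta = E + beta * C.
# Minimizing F_beta over s gives the closed-form equilibrium
#   s_star(beta) = (theta + beta * y) / (1 + beta),
# and the exact loss gradient is dL/dtheta = theta - y.

def s_star(theta, y, beta):
    """Equilibrium state of the nudged energy F_beta (closed form for this toy model)."""
    return (theta + beta * y) / (1.0 + beta)

def dE_dtheta(theta, s):
    """Partial derivative of the energy E with respect to theta: dE/dtheta = theta - s."""
    return theta - s

def one_sided_estimator(theta, y, beta):
    """One-sided estimator of Eq. (A.2); its bias is O(beta)."""
    return (dE_dtheta(theta, s_star(theta, y, beta))
            - dE_dtheta(theta, s_star(theta, y, 0.0))) / beta

def symmetric_estimator(theta, y, beta):
    """Symmetric estimator of Eq. (A.3); its bias is O(beta^2)."""
    return (dE_dtheta(theta, s_star(theta, y, beta))
            - dE_dtheta(theta, s_star(theta, y, -beta))) / (2.0 * beta)

if __name__ == "__main__":
    theta, y = 1.5, 0.3
    true_grad = theta - y  # exact dL/dtheta for this toy model
    for beta in (0.5, 0.1, 0.01):
        err_one = abs(one_sided_estimator(theta, y, beta) - true_grad)
        err_sym = abs(symmetric_estimator(theta, y, beta) - true_grad)
        print(f"beta={beta:5.2f}   one-sided error={err_one:.2e}   symmetric error={err_sym:.2e}")
```

Running this sketch, the one-sided error shrinks roughly linearly in $\beta$ while the symmetric error shrinks roughly quadratically, in agreement with (A.4) and (A.5).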