
A STUDY

OF

THE MATHEMATICS OF DEEP LEARNING

by Anirbit Mukherjee

A dissertation submitted to Johns Hopkins University in conformity with the requirements for the degree of Doctor of Philosophy.

Baltimore, Maryland

July, 2020

© 2020 Anirbit Mukherjee

All Rights Reserved

This thesis is dedicated to my mother Dr. Suranjana Sur (“Mame”).

Long before I had started learning arithmetic in school, my mother got me a geometry box and taught me how

to construct various angles using a compass. Years before I had started formally studying any of the natural

sciences, she created artificial clouds inside the home and taught me about condensation and designed experiments with plants to teach me how they do respiration. This thesis began with the scientific knowledge

that my mother imparted to me regularly, right from when I was a kid.


Thesis Abstract

“Deep Learning”/“Deep Neural Nets” is a technological marvel that is now increasingly deployed at the cutting edge of artificial intelligence tasks. This ongoing revolution can be said to have been ignited by the iconic 2012 paper from the University of Toronto titled “ImageNet Classification with Deep Convolutional Neural Networks” by Alex Krizhevsky, Ilya Sutskever and Geoffrey E. Hinton.

This paper showed that deep nets can be used to classify images into meaningful categories with almost human-like accuracies! As of 2020 this approach continues to produce unprecedented performance for an ever widening variety of novel purposes, ranging from playing chess to self-driving cars to experimental astrophysics and high-energy physics. But this newfound astonishing success of deep neural nets in the last few years has hinged on an enormous amount of heuristics, and it has turned out to be extremely challenging to explain these methods in a mathematically rigorous way. In this thesis we take several steps towards building strong theoretical foundations for these new paradigms of deep-learning.

Our proofs here can be broadly grouped into three categories:

• Understanding Neural Function Spaces. We show new circuit complexity theorems for deep neural functions over real and Boolean inputs and prove classification theorems about these function spaces, which in turn lead to exact algorithms for empirical risk minimization for depth-2 ReLU nets. We also motivate a measure of complexity of neural functions and leverage techniques from polytope geometry to constructively establish the existence of high-complexity neural functions.

• Understanding Deep Learning Algorithms. We give fast iterative stochastic algorithms which can learn near-optimal approximations of the true parameters of a ReLU gate in the realizable setting. (There are improved versions of this result available in our papers Mukherjee and Muthukumar, 2020b; Mukherjee and Muthukumar, 2020a, which are not included in the thesis.) We also establish the first ever (a) mathematical control on the behaviour of noisy gradient descent on a ReLU gate and (b) proofs of convergence of stochastic and deterministic versions of the widely used adaptive gradient deep-learning algorithms, RMSProp and ADAM. This study also includes a first-of-its-kind detailed empirical study of the hyper-parameter values and neural net architectures for which these modern algorithms have a significant advantage over classical acceleration-based methods.

• Understanding The Risk Of (Stochastic) Neural Nets. We push forward the emergent technology of PAC-Bayesian bounds for the risk of stochastic neural nets to get bounds which are not only empirically smaller than contemporary theories but also demonstrate smaller rates of growth w.r.t. increase in width and depth of the net in experimental tests. These critically depend on our novel theorems proving noise resilience of nets. This work also includes an experimental investigation of the geometric properties of the path in weight space that is traced out by the net during training. This leads us to uncover certain seemingly uniform and surprising geometric properties of this process which can potentially be leveraged into better bounds in the future.

“Study hard what interests you the most in the most undisciplined, irreverent, and original manner possible.”

- Richard Feynman

Acknowledgements

Sometime in Fall 2015, I landed at Prof. Amitabh Basu’s office as a jobless, penniless and homeless grad student looking for a fresh start in life, while on leave from my previous physics grad school and with my immigration documents in a complete mess. My JHU story started in the middle of me giving up on academics as a whole. Only a couple of weeks before I had that out-of-the-blue meeting with my adviser Prof. Amitabh Basu, I was essentially putting up in a room in the attic of my friend

Dr. Pooja Tyagi and her husband Dr. Lars Grant in the suburbs of Urbana-Champaign - because my lease there had run out and without a job I couldn’t have rented a new place.

A year before this fortuitous meeting with Prof. Amitabh Basu, I was a jobless guy sharing a room with a bunch of strangers in a seedy hotel in a dark alley of Bangalore, India. I was roaming around the city trying to meet scientists there. (Long back during my undergrad I had once landed in Bangalore at the “National Centre for Biological Sciences (NCBS)” looking to attend a conference on evolutionary biology - while I was considering doing that for research!) It was during these random travels in summer 2014 that I landed a meeting with the legendary mathematician Prof. Nikhil Srivastava

(now at UC Berkeley) who was then at Microsoft Research, Bangalore. It was Prof. Nikhil Srivastava who, over lunch, gave me the final push to switch paths away from physics and start doing mathematics full-time. Most critically, Prof. Nikhil Srivastava got me a TA position with Prof. Alexandra Kolla in the Computer Science department of UIUC. (I do not think that the CS department of UIUC had ever before hired a physics student as a TA! On many days I would read up multiple chapters of the book by Cormen, Leiserson, Rivest and Stein to be able to teach the undergrad class the next day!)

This arrangement with Prof. Alexandra Kolla gave me a funding mechanism which I used to take up every advanced course in theoretical CS that was on offer at UIUC, and via participating in the research meetings with her I got my first glimpse of what “applied mathematics” looks like! Before this I had never seen anything about algorithms, and invariably I bombed at the mid-semester exam of the extremely advanced graduate algorithms course of UIUC that I had taken up. But with enormous effort I managed to pass that course and mustered up the courage to register for the graduate complexity theory course that was being taught by Prof. Mahesh Vishwanathan. That was indeed a turning point, given how much I enjoyed the subject compared to almost everything that I had studied before this. Prof. Mahesh kept encouraging me and he kept me going through thick and thin. I shall forever be indebted to Prof. Mahesh Vishwanathan for agreeing to write letters of recommendation for this complete outsider that I was, and for so critically helping me restart a career outside physics!

I stumbled into deep-learning when I randomly went to NYC to attend a conference on it, while not even knowing what a neural net is. The first time I listened to the talks, particularly by

Prof. Anna Choromanska (NYU), on what the theoretical puzzles here are, I was somehow very strongly reminded of the nature of questions in String Theory that I so wanted to pursue while in physics.

It was only much later that I came to know of the startling fact that neural nets can arguably be said to have been majorly pioneered in Johns Hopkins University itself, thanks to the series of historic papers that were co-authored by the living legend Terrence J. Sejnowski between 1983-1986, when he was at the Johns Hopkins University: Analyzing Cooperative Computation (with Geoffrey E. Hinton), Optimal Perceptual Inference (with Geoffrey E. Hinton), Boltzmann Machines: Constraint Satisfaction Networks That Learn (with Geoffrey E. Hinton and David H. Ackley) and A Learning Algorithm for Boltzmann Machines (with Geoffrey E. Hinton and David H. Ackley). It was definitely comforting to be able to connect to this interesting piece of history - that much of the foundational ideas about neural nets were essentially invented in JHU - something that I have never found mentioned enough in my neighbourhood.

Possibly the single-most interesting idea in all of physics that had always captivated my imagination was the story of Large-N Matrix Models: a beautiful mathematical construction which is prima facie a Quantum Field Theory, but mysteriously enough it has time and again been shown to encode critical features of also being a gravitational theory, and it has also been known to be a framework for addressing various deep questions in algebraic geometry. I so much wanted to pursue this subject in the physics department and yet that opportunity never arrived. Maybe deep down one of the things that got me motivated about deep-learning is its superficial resemblance to Matrix Models in physics - and I do hope that someday we will be able to figure out a robust connection between the two. On many days in grad-school, I woke up in the morning feeling depressed about not being able to do this research. It didn’t help that during my Ph.D. the storm of Jackiw-Teitelboim gravity started blowing over physics: yet another exciting avatar of Matrix Models. It took a lot of psychological effort on my part to not disband all my ongoing thesis work and dive into this development, which was clearly looking to be so much more exciting! On my table, I have always had open a copy of


Marcos Marino’s lectures on this topic - hoping that someday I could finally get to thinking about these beautiful questions.

Long back I used to research in Quantum Field Theory with Prof. Shiraz Minwalla, and in retrospect I don’t think I had been fully conscious of the “magic” of how he works and thinks. His ability to intuit the essential characteristics of the answer, bypassing the apparent complexity of the equations, is legendary. It was only during my Ph.D., when I started to think of research “24 × 7”, that I finally began to realize all the intangible things that I had actually learnt from him - which went a lot beyond the mechanics of the homotopy based classification of instantons and solitons that he had taught me as an undergrad. The deeper I went into research, the more I realized how profoundly Prof. Shiraz Minwalla had moulded my psyche and approach towards research.

In this respect I need to also mention the profound influence that Prof. S. Ramanan had on me as an undergrad. I can trace it back to say that my interest in research-level mathematics essentially started with his evening classes, whereby he taught a bunch of us undergrads (at the Chennai Mathematical

Institute) from his renowned book “Global Calculus”. In retrospect it seems incredible that he was able to devise a way to teach us beginning undergrads about sheaf theory and Dolbeault cohomology.

Prof. S. Ramanan instilled in me a sense of adventure and courage about doing mathematics which continues to shape how I think!

On almost half the days I found it difficult to motivate myself to come to work. But every day I put up this fight against depression to get myself to the institute. Probably being able to systematize this was one of the biggest struggles I “won” during my Ph.D. - that I figured out a mental gymnastics routine to be able to do my usual 10 AM-10 PM schedule of work every day despite an almost persistent feeling of sadness. This struggle against a perpetually imminent emotional collapse was far harder than the struggle with any mathematics question that I tried during my grad-school.

I often feel that maybe the only talent I have is my ability to get a certain amount of mathematics done despite these drowning feelings. The situation was essentially caused and continuously exacerbated by my increasing levels of academic loneliness and absence of shared interests with people around.

Another factor that has been constantly gnawing at me from inside is that my current path has been causing a fast erosion of my previous geometric studies since nothing I do now is related to the kind of mathematics that I was pursuing before starting grad-school, like studying Riemann surfaces. I am hoping that sometime in future I shall be able to connect back to that path!

I have never really liked going to classes and sitting through courses. Hence it became a huge struggle to maintain my scientific enthusiasm against the tide of the usual rigmarole of grad-school that seemed to be heavily weighing me down. I guess I do not have any of the socially expected characteristics of being an “academic”, and the process of the Ph.D. has left me entirely conflicted between this realization and my intrinsic energy to relentlessly pursue the research questions that interest me. I do not yet know whether I can be a researcher or whether, if at all anywhere, I fit into the panoply of academia.

It is worth noting here how much my career was affected by Prof. Mauro Maggioni’s course titled “High-Dimensional Approximation, Probability, and Statistical Learning”. It was definitely a turning point in my journey. I had come into grad school with neither any knowledge of nor any inclination towards anything in probability theory or statistics. I had almost never found anything interesting about these subjects, but as it turns out, it’s impossible to do deep-learning theory without them.

This course by Prof. Mauro Maggioni immensely helped me see probability theory through a more familiar lens of geometry. This course gave me a psychological bridge between a form of thinking which was natural to me and the new terrain of probability theory which I had to get comfortable with to make progress with deep-learning. Prof. Mauro Maggioni introduced me to this book by

Roman Vershynin, “High-Dimensional Probability”, and that eventually became a pivot for all of my graduate studies. This book was a starting point from which I launched into reading other similar writings by Philippe Rigollet and Ramon van Handel that are getting increasingly read by contemporary researchers in this field. I went on to do all the courses that Prof. Mauro Maggioni ever offered!

In this context I need to express my immense gratitude towards Prof. Laurent Younes, not only for his careful reading of my thesis but also for checking some of my results even during the course of my studies. His readings have led to at least two major changes in my papers, and no words can express how grateful I am to him for these two significant contributions that he has made to my research.

The last two and a half years were essentially just me sitting in some corner of the institute scribbling away at questions which I could not get anyone I knew motivated about. I tried meeting anyone who I thought might have some overlap with the stuff I was getting excited about, but I essentially failed to convince almost everyone about the merits of my directions. This complete emotional failure eventually became the defining characteristic of my Ph.D. journey. It was a situation of complete despair and I never found a solution to it, except to learn to accept the desolation. I am very far from being a “scientist in the attic” and I think I thrive the most in collaborative atmospheres, while brainstorming freely with others for hours. Given this natural emotional disposition of mine, the process of getting a Ph.D. seemed to be happening in an atmosphere entirely antithetical to whatever feels natural or comfortable to me. Research seems to firstly be a challenge of survival in solitary confinement. Almost every day I have doubted if I could have continued doing this for even one more day unless something dramatically changed about this situation.

I am enormously grateful to two of my grad student peers, Akshay Rangamani and Ramchandran

Muthukumar, who at various points for short periods of time had a lot of regular meetings with me and wrote papers with me. The process of brainstorming with them was one of the few things that kept me stable through those periods. Likewise Soham De (then a grad student at the University of

Maryland, now at DeepMind, UK) and co-grad student Pushpendre Rastogi were two other incredible peers from the experimental deep-learning community who spent an enormous amount of time chasing down ideas with me. In the process they taught me a whole lot of things about how deep-nets behave - things I could have only learnt by poring through experimental data with them! There were very many others in JHU, discussions with whom have contributed to my thinking: Enayat

Ullah, Prof. Raman Arora, Prof. Trac Tran, Prof. Rene Vidal, Joseph Paat, Prof. James Spall, Prof.

Daniel Robinson and Prof. Jeremias Sulam.

Almost every month I have thought of changing my research direction to something more mainstream on the campus, in the hope that doing so would get me more human interactions and discussions.

But as much as I desperately struggled to find collaborations and brainstorming sessions with others on the campus, I couldn’t ignore the fact that I was deeply attracted to the questions of deep-learning theory. The spectacular progress the subject was making in the world outside was what kept me inspired, and yet it was this same attraction which made my loneliness entirely untenable. I personally feel that it was almost a miracle that I could continue doing some research despite the tremendous emotional strain I found myself in. For reasons not entirely clear to me, in this process I often came to appreciate the famous quote by Albert Camus, “The only way to deal with an unfree world is to become so absolutely free that your very existence is an act of rebellion.”

It is hard to overestimate how much social media like Twitter and Facebook have contributed to deep-learning theory. These platforms have phenomenally impacted the subject, not just in terms of actually funding such research but also via the volume of high-quality discussions on these topics that happen between the experts there. Often my only way of getting to know what the most exciting directions in the community were was via keeping track of the discussions on the Twitter threads between the luminaries in the field! This is essentially how I found my post-doc position with the renowned mathematician in the field, Prof. Weijie Su at Wharton, UPenn!

It was also through such social networks that I became aware of Prof. Dan Roy’s works and thanks to his kindness I landed an internship with him at the Vector Institute of Artificial Intelligence at


UToronto. That was an eye-opening experience which not only got me the last chapter of this thesis but also got me introduced to a high-voltage learning theory atmosphere at Vector - something quite unlike whatever I had ever experienced before! But maybe more than that, the couple of months I spent attending Prof. Dan Roy’s group meetings got me introduced to the exciting subject of “Stochastic Differential Equations (SDEs)”. It was a gust of fresh air, which in many ways fundamentally altered my research thinking. As far as I can foresee, the interface between SDEs and deep-learning is going to be a focal point of my works in the coming months! In this context I need to express my immense gratitude to Prof. Avanti Athreya in the department, who patiently answered many of my questions about SDEs and thus further propelled my interest in this field.

It was also through social media that I came to interact with seniors in the field like Prof. Animashree Anandkumar (Caltech) and Prof. Lalitha Sankar (ASU). I shall be immensely grateful for the amount of encouragement that Prof. Lalitha Sankar provided me in my journey through the last couple of years. Many days were lit up by virtue of her kind words. Also an equal amount of thanks is due to Kenji Kawaguchi (then a grad student at MIT and now a faculty at NUS), Prof. Chinmay Hegde

(NYU) and Prof. Rajarshi Mukherjee (Harvard), who have had extensive brainstorming sessions with me on various deep-learning questions, and I hope that some of the ideas from these meetings will see the light of day in the near future. It is also worth noting how much Prof. Kazi Rajibul

Islam at UWaterloo inspired me to write an introductory article on deep-learning targeted towards undergrads and high-school students. He read and edited various drafts that I wrote for this purpose and eventually that became the chapter in my thesis that has been titled, “An Informal Introduction to Deep-Learning”. Prof. Kazi Rajibul Islam also arranged to get it translated into Bangla by Dr.

Amiya K. Maji at Purdue University. They have in turn helped disseminate the translation to a wider audience via their web-portal Bigyan.org.in.

Despite these sporadic sparks of inspiration from the outside world, eventually it became very hard to not think of graduate school as a “zero-sum game” - as understood colloquially. The emotional burnout and other losses in life seemed to compound so massively that ultimately it became hard for me to see any bigger point in the process than just wanting this degree, which happens to be a qualification cut-off for certain jobs I would like to be able to get someday. A whole lot of thanks is due to co-grad student Aarushi Goel, who helped on many days by volunteering to play badminton with me or by just sitting around listening to me. This emotionally excruciating process of going through grad school has taken a lot of toll on my family members, who have stood through thick and thin with me: my mother Dr. Suranjana Sur, my maternal grand-mother Mira Sur, my sister

Subhalakshmi, my aunt Debanjana Bhattacharyya, my cousin sister Debadrita Bhattacharyya and my uncle Dibyendu Bhattacharyya.


As I end my Ph.D. journey I feel an increasing need to go back to my roots to rediscover the things that really got me started into doing science at all. I began my undergrad studies reading Richard

Feynman’s famous lectures in physics, and as I submit this thesis I find increased resonance in these words of his - which I had probably not begun to appreciate until now,

“We find that the statements of science are not of what is true and what is not true, but statements of what is known to different degrees of certainty. Every one of the concepts of science is on a scale graduated somewhere between, but at neither end of, absolute falsity or absolute truth.”

- Richard Feynman

A special thanks is due to Zachary Lubberts and Sayan Chakraborty for helping me set up this LaTeX template

for the thesis!

Contents

Thesis Abstract
Acknowledgements
An Informal Introduction to Deep Learning

1 A Summary of the Results in This Thesis
1.1 Defining Deep Neural Nets
1.1.1 Notation for a special class of Autoencoders
1.2 Understanding the space of neural functions
1.3 Landscape of neural nets and deep-learning algorithms
1.4 Estimating the risk function of neural nets

2 Exploring the Space of Neural Functions
2.1 Introduction
2.1.1 Notation and Definitions
2.2 Exact characterization of function class represented by ReLU DNNs
2.3 Benefits of Depth
2.3.1 Circuit lower bounds for $\mathbb{R} \to \mathbb{R}$ ReLU DNNs
2.3.2 A continuum of hard functions for $\mathbb{R}^n \to \mathbb{R}$ for $n \geq 2$
2.4 Training 2-layer $\mathbb{R}^n \to \mathbb{R}$ ReLU DNNs to global optimality
2.4.1 Discussion on the complexity of solving ERM on deep-nets
2.5 Understanding neural functions over Boolean inputs
2.5.1 Statement and discussion of our results over Boolean inputs
2.6 Lower bounds for LTF-of-ReLU against the Andreev function (Proof of Theorem 2.5.4)
2.A Expressing piecewise linear functions using ReLU DNNs
2.B Proof of Proposition 2.5.2
2.C Simulating an LTF gate by a ReLU gate
2.D PARITY on $k$ bits can be implemented by a $O(k)$-sized Sum-of-ReLU circuit
2.E Proof of Theorem 2.5.5 (Proving smallness of the sign-rank of LTF-of-(ReLU)$^{d-1}$ with weight restrictions only on the bottom most layer)

3 Provable Training of a ReLU gate
3.1 A review of provable neural training
3.2 A summary of our results
3.3 Almost distribution free learning of a ReLU gate
3.4 Dynamics of noise assisted gradient descent on a single ReLU gate
3.5 GLM-Tron converges on certain Lipschitz gates with no symmetry assumption on the data
3.6 Conclusion
3.A Proof of Lemma 3.5.1
3.B Proof of Theorem 3.5.2
3.C Proof of Theorem 3.5.3
3.D Reviewing a variant of the Azuma-Hoeffding Inequality
3.E A recursion estimate

4 Sparse Coding and Autoencoders
4.1 Introduction
4.1.1 A motivating experiment on MNIST using TensorFlow
4.2 Introducing the neural architecture and the distributional assumptions
4.3 Main Results
4.3.1 Recovery of the support of the sparse code by a layer of ReLUs
4.3.2 Asymptotic Criticality of the Autoencoder around $A^*$
4.4 A Layer of ReLU Gates can Recover the Support of the Sparse Code (Proof of Theorem 4.3.1)
4.5 Criticality of a neighborhood of $A^*$ (Proof of Theorem 4.3.2)
4.5.1 Simplifying the proxy gradient of the autoencoder under the sparse-coding generative model - to get explicit forms of the coefficients $\alpha$, $\beta$ and $e$ as required towards proving Lemma 4.5.2
4.6 Simulations
4.7 Conclusion
4.A The proxy gradient is a good approximation of the true expectation of the gradient (Proof of Lemma 5.1)
4.B The asymptotics of the coefficients of the gradient of the squared loss (Proof of Lemma 5.2)
4.B.1 Estimating the $m_2$ dependent parts of the derivative
4.B.2 Estimating the $m_1$ dependent parts of the derivative
4.B.3 About $\alpha_i - \beta_i$

5 Understanding Adaptive Gradient Algorithms
5.1 Introduction
5.1.1 A summary of our contributions
5.1.2 Comparison with concurrent proofs in literature
5.2 Pseudocodes
5.3 Sufficient conditions for convergence to criticality for stochastic RMSProp
5.4 Sufficient conditions for convergence to criticality for non-convex deterministic adaptive gradient algorithms
5.5 The Experimental setup
5.6 Experimental Results
5.6.1 RMSProp and ADAM are sensitive to choice of $\xi$
5.6.2 Tracking $\lambda_{\min}$(Hessian) of the loss function
5.6.3 Comparing performance in the full-batch setting
5.6.4 Corroborating the full-batch behaviors in the mini-batch setting
5.6.5 Image Classification on Convolutional Neural Nets
5.7 Proofs of convergence of (stochastic) RMSProp and ADAM
5.7.1 Fast convergence of stochastic RMSProp with “Over Parameterization” (Proof of Theorem 5.3.1)
5.7.2 Proving ADAM (Proof of Theorem 5.4.3)
5.8 Conclusion
5.A Proving stochastic RMSProp (Proof of Theorem 5.3.3)
5.B Proving deterministic RMSProp - the version with standard speed (Proof of Theorem 5.4.1)
5.C Proving deterministic RMSProp - the version with no added shift (Proof of Theorem 5.4.2)
5.D Hyperparameter Tuning
5.E Effect of the $\xi$ parameter on adaptive gradient algorithms
5.F Additional Experiments
5.F.1 Additional full-batch experiments on $22 \times 22$ sized images
5.F.2 Are the full-batch results consistent across different input dimensions?
Input images of size $17 \times 17$
Input images of size $12 \times 12$
5.F.3 Additional mini-batch experiments on $22 \times 22$ sized images

6 PAC-Bayesian Risk Bounds for Neural Nets
6.1 Introduction
6.1.1 A summary of our contributions
6.1.2 Reformulation of the PAC-Bayesian risk bound on neural nets from Neyshabur et al., 2017
6.2 A noise resilience guarantee for a certain class of neural nets
6.3 Our PAC-Bayesian risk bound on neural nets
6.4 Experimental comparison of our bounds with Neyshabur et al., 2017
6.4.1 CIFAR-10 Experiments
6.4.2 Synthetic Data Experiments
6.5 Experimental observations about the geometry of the neural net’s training path in weight space
6.6 Proof of Theorem 6.2.1
6.7 Proof of Theorem 6.3.1
6.8 Conclusion
6.A The KL upperbounds
6.B Data-Dependent Priors
6.C Proof of Theorem 6.1.2
6.D Proof of Theorem 6.1.3
6.E The $\epsilon-\gamma$ lowerbound scatter plots from the experiments
6.F KDE of the angular deviation during training on the synthetic dataset

Bibliography

Curriculum Vitae

List of Tables

4.6.1 Average gradient norm for points that are columnwise $\frac{\delta}{2}$ away from $A^*$. For each $h$ and $p$ we report $\left( \mathbb{E}\left[ \left\| \frac{\partial L}{\partial W_i} \right\| \right], h^{p-1} \right)$. We note that the gradient norm and $h^{p-1}$ are of the same order, and for any fixed $p$ the gradient norm is decreasing with $h$, as expected from Theorem 4.3.2.

4.6.2 Fraction of initial columnwise distance covered by the gradient descent procedure.

5.6.1 VGG-9 on CIFAR-10.

List of Figures

1.1 The circuit representation of a depth 2, width 15 autoencoder mapping $\mathbb{R}^4 \ni y \mapsto \hat{y} \in \mathbb{R}^4$.

2.1 Top: $h_{a^1}$ with $a^1 \in \Delta^2_1$, with 3 pieces in the range $[0, 1]$. Middle: $h_{a^2}$ with $a^2 \in \Delta^1_1$, with 2 pieces in the range $[0, 1]$. Bottom: $H_{a^1, a^2} = h_{a^2} \circ h_{a^1}$ with $2 \cdot 3 = 6$ pieces in the range $[0, 1]$. The dotted line in the bottom panel corresponds to the function in the top panel. It shows that for every piece of the dotted graph, there is a full copy of the graph in the middle panel.

2.2 We fix the $a$ vectors for a two hidden layer $\mathbb{R} \to \mathbb{R}$ hard function as $a^1 = a^2 = (\frac{1}{2}) \in \Delta^1_1$. Left: A specific hard function induced by the $\ell_1$ norm: ZONOTOPE$^2_{2,2,2}[a^1, a^2, b^1, b^2]$ where $b^1 = (0, 1)$ and $b^2 = (1, 0)$. Note that in this case the function can be seen as a composition of $H_{a^1, a^2}$ with the $\ell_1$-norm $N_{\ell_1}(x) := \|x\|_1 = \gamma_{Z((0,1),(1,0))}$. Middle: A typical hard function ZONOTOPE$^2_{2,2,4}[a^1, a^2, c^1, c^2, c^3, c^4]$ with generators $c^1 = (\frac{1}{4}, \frac{1}{2})$, $c^2 = (-\frac{1}{2}, 0)$, $c^3 = (0, -\frac{1}{4})$ and $c^4 = (-\frac{1}{4}, -\frac{1}{4})$. Note how increasing the number of zonotope generators makes the function more complex. Right: A harder function from the ZONOTOPE$^2_{3,2,4}$ family with the same set of generators $c^1, c^2, c^3, c^4$ but one more hidden layer ($k = 3$). Note how increasing the depth makes the function more complex. (For illustrative purposes we plot only the part of the function which lies above zero.)

2.A.1 A 2-layer ReLU DNN computing $\max\{x_1, x_2\} = \frac{x_1 + x_2}{2} + \frac{|x_1 - x_2|}{2}$.

2.A.2 The number of pieces increasing after activation. If the blue function is $f$, then the red function $g = \max\{0, f + b\}$ has at most twice the number of pieces as $f$ for any bias $b \in \mathbb{R}$.

4.6.1 Loss function plot for $h = 256$, $n = 50$.

4.6.2 Loss function plot for $h = 4096$, $n = 50$.

5.6.1 Optimally tuned parameters for different $\xi$ values. 1 hidden layer network of 1000 nodes; Left: Loss on training set; Middle: Loss on test set; Right: Gradient norm on training set.

5.6.2 Tracking the smallest eigenvalue of the Hessian on a 1 hidden layer network of size 300. Left: Minimum Hessian eigenvalue. Right: Gradient norm on training set.

5.6.3 Full-batch experiments on a 3 hidden layer network with 1000 nodes in each layer; Left: Loss on training set; Middle: Loss on test set; Right: Gradient norm on training set.

5.6.4 Mini-batch experiments on a network with 5 hidden layers of 1000 nodes each; Left: Loss on training set; Middle: Loss on test set; Right: Gradient norm on training set.

5.6.5 Mini-batch image classification experiments with CIFAR-10 using VGG-9.

5.E.1 Fixed parameters with changing $\xi$ values. 1 hidden layer network of 1000 nodes.

5.F.1 Loss on training set; Input image size $22 \times 22$.

5.F.2 Loss on test set; Input image size $22 \times 22$.

5.F.3 Norm of gradient on training set; Input image size $22 \times 22$.

5.F.4 Full-batch experiments with input image size $17 \times 17$.

5.F.5 Full-batch experiments with input image size $12 \times 12$.

5.F.6 Experiments on various networks with mini-batch size 100 on the full MNIST dataset with input image size $22 \times 22$. First row shows the loss on the full training set, middle row shows the loss on the test set, and bottom row shows the norm of the gradient on the training set.

6.3.1 Starting from the weight vector $B$ we get the trained weight vector $A$. $\Theta$ is the angle to which the angle of deflection $\measuredangle(A, B)$ has been discretized.

6.4.1 In the above figures we plot the risk bounds (on the y-axis) predicted by Theorem 6.3.1 and Theorem 6.1.3 for trained nets at different depths (the x-axis). We can see the comparative advantage across depths in favour of OUR bound over NBS when tested on CIFAR-10, while the width of the net is 100 for the figure on the left and 400 for the figure on the right.

6.4.2 In the above figures we plot the risk bounds (on the y-axis) predicted by Theorem 6.3.1 and Theorem 6.1.3 for trained nets at different depths (the x-axis). In particular here we compare the two theories when the synthetic data generation model is sampling in $n = 20$ dimensions with the cluster separation parameter $a = 6$ in the left figure and $a = 10$ in the right figure.

6.5.1 Gaussian kernel density estimate on 10 trials of the experiment (for every depth $d$ and width 100) measuring the angular deviation of the weight vector under training.

6.5.2 Initial parameter norm $\|B\|_2$ vs final parameter norm $\|A\|_2$ with increasing depths (at width 100) on the CIFAR-10 dataset. 10 trials are displayed for each architecture.

6.5.3 Scatter plot of $\|A\|_2$ versus $\|B\|_2$ for neural networks of depths 2, 3, ..., 8, for cluster separation $a = 2$ in ambient dimension $n = 20$. For each depth we run 10 trials with different random initializations and mini-batch sequence in the SGD.

6.E.1 Scatter plots of the lowerbounds on $\epsilon$ and $\gamma$ (as given in definition 25) while varying the depth of the net being trained on CIFAR-10 (10 trials/seeds for each).

6.E.2 Scatter plots of the lowerbounds on $\epsilon$ and $\gamma$ (as given in definition 25) for varying depth $d$ nets trained on the synthetic dataset for different cluster separation parameters $a$ (10 trials/seeds for each).

6.F.1 Kernel Density Estimates of the Angular Deviations $\theta$ between the initial and final networks for the different net depths $d$ and cluster separation parameters $a$. The Y-axis shows the probability density function of the Gaussian kernel density estimate.

An Informal Introduction to Deep Learning

We who speak Bengali owe an infinite debt to the legendary Satyajit Ray and it goes well beyond him having given us the timeless movies that he created. Growing up in a typical Bengali household full of books (on almost every conceivable subject!), maybe for many of us our first idea of “artificial intelligence” can be traced to the friendly humanoids like Robu and Bidhushekhar which were created by Professor Shonku in the famous series of stories penned by Satyajit Ray. In retrospect it was indeed lucky that Professor Shonku happened in our lives much before we encountered the more ominous view of robots as was made famous by Isaac Asimov’s “Three Laws of Robotics”. Of course even today in 2020 we are still nowhere close to what was imagined in Satyajit Ray’s fiction, but something has dramatically changed in the last 5-6 years. In this chapter we will try to get a feel of this ongoing revolution while keeping the technical aspects light enough to be accessible within the scope of high-school science. We should note at the very outset that opinions remain widely divided about how we should perceive this recent upsurge, and much of what we present here is obviously heavily coloured by the technical parts of the thesis that will follow this chapter.

Many of the readers have probably heard of the recent spectacular successes of “machines” called AlphaZero in being able to play games like chess at unprecedented levels of proficiency. These successes have revealed structures and possible strategies about the game of chess which had never been seen before! But these forms of artificial intelligence (unfortunately!) do not look like Professor Shonku’s robots. It turns out that anthropomorphism isn’t of any particular advantage if we limit our notions of intelligence to such abilities as are required to play difficult strategy games like chess or poker, or to create new paintings which mimic the style of Vincent van Gogh, or to fluently translate between multiple languages. In the last couple of years these have suddenly become possible to do in an automated way because of our newfound ability to computationally leverage the power of what are called “Deep Neural Networks” or DNNs or “neural nets” or “neural networks” or sometimes just “nets”. The myriad of ways in which we can “train” a DNN to perform human-like tasks are collectively called “Deep Learning”.


It can be somewhat tedious to install on one’s home computer the (freely available) software packages like TensorFlow or PyTorch and get a hands-on feel for the advanced applications of neural nets that were mentioned above. The developers of these packages continue to make progress towards making the installation processes increasingly easy so that more people can put this evolving technology to use. Such efforts have led to the creation of platforms like “Google Colab”, where one can write code to run small neural nets without having to install the full packages. For immediate motivation, let’s see this incredibly beautiful (and mind-bogglingly surprising!) demonstration that is easily available on this website, https://thispersondoesnotexist.com/. Every time we refresh this page we will be shown a seemingly human photograph (which sometimes might have minor defects), just that this photograph is completely artificially generated by a neural network! In a sense this person is purely the net’s imagination and he/she does not actually exist! So how did the net manage to “draw” such realistic human faces? This mechanism is still highly ill-understood, and our best efforts at making sense of it involve the branch of mathematics called “Optimal Transport”. This is the same field of research for which Cedric Villani got the Fields Medal in 2010. This esoteric mathematical idea of optimal transport has mysterious ramifications in the world of neural nets and we have possibly only barely scratched the surface of this interface.

Though applications like the one described above about artificial generation of human-like faces are the cutting-edge of applied research in neural nets, these are not the commonly used tests for theory.

There are more standardized artificial intelligence tasks on which we have decades of benchmarks of performance, and new techniques are often compared on those. One such task is classifying images into meaningful categories when the neural net (or in general any candidate “machine”) is input a high-dimensional vector representing the image. For comparison, recall that it’s at about 9 months of age that a human baby first starts being able to match daily life objects to their photographs. But the nets are no match for babies! Babies can recognize a banana the next time even after having seen just a single banana once. Unfortunately our best nets still need to see a lot of bananas before learning to categorize one correctly when shown a new one! This human-machine gap is deeply mysterious and an emerging direction of research.

There are two common datasets of images which are used for this test, namely the “CIFAR” (Canadian Institute For Advanced Research) database and the MNIST (“Modified National Institute of Standards and Technology”) database. The CIFAR dataset was created in 2009 by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. It contains tens of thousands of low resolution images grouped into categories like birds, aeroplanes, cars etc. The task of the trained machine is to correctly predict the category when a randomly picked image from this set is input to the machine. In the figure below we have shown a sample of the MNIST dataset, which contains images of hand-written digits from 0 to 9, and the task of the trained machine is to recognize the number correctly when shown a randomly picked handwritten digit from the set. This was introduced and explored in the seminal paper from 1998 called “Gradient-Based Learning Applied to Document Recognition”, written by some of the biggest stalwarts in the field: Y. LeCun, L. Bottou, Y. Bengio and P. Haffner. And even today we continue to use MNIST as a baseline for testing theory about classification tasks.

A small part of the famous MNIST database

It’s worth pointing out that CIFAR is widely considered to be a much more difficult test than MNIST. There are fundamental questions about being able to mathematically justify this difference in difficulty, and theory of this kind is still not fully developed. This brings us to a deep mystery that we hardly understand: when is an artificial intelligence task easy and when is it difficult! Research is only beginning in this direction.

Now that we have seen some cutting-edge applications and methods of testing artificial intelligence let us focus on understanding the specific implementation of DNNs that we are interested in.

DNNs have existed in some form or the other since the 1958 work by a psychologist at Cornell University named Frank Rosenblatt, who then called his idea the “Perceptron”. Many might say that our current way of thinking about neural nets comes from the famous 1986 paper titled “Learning representations by back-propagating errors” by David Rumelhart, Ronald Williams and Geoffrey Hinton. Geoffrey E. Hinton is often credited as the pioneer of deep learning and, interestingly, Hinton too did his undergraduate studies in psychology! It’s worth noting that despite the ideas having been around for decades, until very recently we could never actually get nets to do anything surprising in practice. This so-called “A.I. winter” was finally ended in part by the recent dramatic developments in computer hardware. The ongoing artificial intelligence revolution can be said to have been ignited by the iconic 2012 paper from the University of Toronto titled “ImageNet Classification with Deep Convolutional Neural Networks” by Alex Krizhevsky, Ilya Sutskever and Geoffrey E. Hinton. This showed that deep nets can be used to classify images into meaningful categories with almost human-like accuracies! Now let’s try to understand the precise mathematical description of a DNN!


DNNs are what could be called “mathematical circuits”. These can be thought of as a certain peculiar

class of functions which are defined via diagrams which look like circuits. In school we get very

familiar with the physics of electrical circuits - which carry electrical potential. Mathematical circuits

are similar but instead their imagined wires carry algebraic instructions to multiply or add numbers

to the input to the wire. We know that electrical circuits can be augmented to do more useful things by

embedding inside them “non-linear” components like diodes, capacitors and inductors. These are called non-linear because the voltage drop across them is not a linear function of the current passing

through them. Similarly these DNNs have embedded inside them gates each of which is designed to

implement a certain “activation function” which is typically a non-linear function mapping the real

line to itself. The following diagram represents an example of such a single gate and is thus one of

the most elementary possible examples of a neural net.

The above neural gate will be said to use $f : \mathbb{R} \to \mathbb{R}$ as the activation function, to create a map/function which takes as input any 3-dimensional vector $(x_1, x_2, x_3)$ and gives as output the real number $f(w_1 x_1 + w_2 x_2 + w_3 x_3)$. We think of the activation function at the gate, $f$, as getting as input the linear sum $\sum_{i=1}^{3} w_i x_i$. The three real parameters above, $w_1$, $w_2$ and $w_3$, are called the ‘weights’. Usually there are many wires coming out of the gate (instead of the single Y in the above) and in that case the gate is defined to pass on the same value to all of them.

As of today almost all implementations of DNNs use the “Rectified Linear Unit (ReLU)”

$$\mathrm{ReLU} : \mathbb{R} \to \mathbb{R}, \qquad x \mapsto \max\{0, x\}$$
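To make the single-gate picture concrete, here is a minimal sketch of such a gate in Python (assuming only NumPy); the particular weight and input values below are illustrative placeholders, not anything prescribed by the text.

```python
import numpy as np

def relu(z):
    # ReLU : R -> R, z |-> max{0, z}
    return np.maximum(0.0, z)

def gate(x, w):
    # A single neural gate: the wires deliver the linear sum <w, x>
    # to the gate, which then applies the activation function.
    return relu(np.dot(w, x))

# Illustrative weights and a 3-dimensional input (values are arbitrary).
w = np.array([1.0, -2.0, 0.5])
x = np.array([3.0, 1.0, 2.0])
print(gate(x, w))  # relu(1*3 - 2*1 + 0.5*2) = relu(2.0) = 2.0
```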

To develop more intuition let’s use the building blocks above to construct a net with a few more gates which actually computes a familiar useful function.

In the above we see an example of a “1-DNN” i.e. a DNN with one layer of gates indicated in blue.

The above neural network would be said to be of size 4 since it has 4 activation gates. Let’s assume

xxiv An Informal Introduction to Deep Learning

[Figure: a circuit with inputs $x_1, x_2$ feeding one layer of 4 ReLU gates, whose outputs are combined by an output gate computing $\frac{x_1 + x_2}{2} + \frac{|x_1 - x_2|}{2}$]

that the above circuit is computing the $\mathbb{R}^2 \to \mathbb{R}$ function given as $(x_1, x_2) \mapsto \max\{x_1, x_2\}$. But how do we convince ourselves that this is indeed what is happening? We can start by realizing that the top most blue activation gate is getting as input the number $x_1 + x_2$, and this can be inferred from the weights on its two incoming edges (recall how the single gate was defined to operate in the diagram on the previous page). Further, one can read off from the weight on the outgoing edge of this top most ReLU gate that it is passing on to the red output gate the number $\frac{1}{2} \max\{0, x_1 + x_2\}$. Thus if we carefully follow the computations happening on each edge and gate then we can conclude that the mathematical circuit/neural net above is indeed computing the maximum of its two inputs.

Let’s check a specific example: say $x_1 = 2$ and $x_2 = 5$. Then the blue ReLU gates are getting $7, -7, 3$ and $-3$ as inputs respectively, from the top most gate to the bottom most. These gates are then passing on to the red output gate the numbers $\frac{7}{2}$, $\frac{1}{2}\max\{0, -7\} = 0$, $\frac{1}{2}\max\{0, 3\} = \frac{3}{2}$ and $\frac{1}{2}\max\{0, -3\} = 0$. Finally the red output gate is adding these up to give as output $\frac{7}{2} + \frac{3}{2} = 5$, which is indeed $\max\{2, 5\}$.
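For readers who want to check this with code, the following sketch (assuming NumPy) encodes the 4-gate circuit just described. Since the original figure is garbled in extraction, the exact signs below are a plausible reconstruction: the hidden weights are transcribed from the gate inputs in the worked example, and the output weights (including one $-\frac{1}{2}$) are chosen so that the identity $\max\{x_1, x_2\} = \frac{x_1+x_2}{2} + \frac{|x_1-x_2|}{2}$ holds for all inputs.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# One layer of 4 ReLU gates computing max{x1, x2} via
# max{x1, x2} = (x1 + x2)/2 + |x1 - x2|/2.
W1 = np.array([[ 1.0,  1.0],   # gate 1 receives  x1 + x2
               [-1.0, -1.0],   # gate 2 receives -x1 - x2
               [-1.0,  1.0],   # gate 3 receives  x2 - x1
               [ 1.0, -1.0]])  # gate 4 receives  x1 - x2
w2 = np.array([0.5, -0.5, 0.5, 0.5])  # weights into the output gate

def net_max(x1, x2):
    hidden = relu(W1 @ np.array([x1, x2]))
    return w2 @ hidden

print(net_max(2.0, 5.0))    # 5.0, as in the worked example
print(net_max(-4.0, -1.0))  # -1.0
```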

Later in the thesis, in Chapter 2, we will prove that to compute the maximum of $n$ numbers at most $\log(n)$ layers of activation are sufficient. In that same chapter we will study another useful class of neural functions which have been very important in theory building. For some real numbers $w > 0$ and $a > 0$, consider the function $f(x) = \max\left\{0, \frac{1}{w} - \frac{1}{w^2}|x - a|\right\}$. It’s easy to see that this is zero everywhere except on the interval $[a - w, a + w]$, where it rises up as a triangle peaked at $x = a$. Now that we have seen the max function example above, one can try to solve the fun puzzle of writing down a neural net with a single layer of activations which can represent this “triangle wave” function (one possible solution is sketched below). With some more tricks, we will see in Chapter 2 that one can create nets which represent a wave form with multiple triangles. Interestingly, it’s still unclear as to what is the appropriate analogue of these waves in high dimensions!
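Below is one possible solution to the puzzle, sketched in NumPy, so skip this block if you would rather work it out yourself. It is not the unique answer: it uses the standard decomposition of a triangle into three ReLU gates whose “kinks” sit at $a - w$, $a$ and $a + w$, and the final check compares the net against the direct formula.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def triangle_direct(x, w, a):
    # f(x) = max{0, 1/w - |x - a| / w^2}
    return np.maximum(0.0, 1.0 / w - np.abs(x - a) / w**2)

def triangle_net(x, w, a):
    # A single layer of 3 ReLU gates whose biases place their kinks
    # at a - w, a and a + w.
    return (relu(x - (a - w)) - 2.0 * relu(x - a) + relu(x - (a + w))) / w**2

w, a = 2.0, 3.0
xs = np.linspace(a - 2 * w, a + 2 * w, 101)
print(np.allclose(triangle_direct(xs, w, a), triangle_net(xs, w, a)))  # True
```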

These nets which compute the maximum of their input numbers and the triangle waves above often form the building blocks of how we think about the more complicated functions that the nets can compute. In general questions about the representation power of a specific circuit/network design


can be extremely difficult to answer, and in some cases the answers have required the use of very sophisticated mathematics, as we will see in Chapter 2. At the core of trying to explain the mind-boggling cutting-edge experiments cited earlier, like https://thispersondoesnotexist.com/, there lie in effect more advanced forms of these kinds of questions about the function space of nets.

To see more advanced ideas we need to think more generally in terms of diagrams or “architectures”

: an example of which is given below. (For readers familiar with graph theory one can imagine the

underlying diagram to be that of a directed acyclic graph where all edges are pointing to the right.)

Unlike the previous two examples, in the above circuit no “weights” have been assigned to the edges

of the above graph. So one should think of this diagram as representing the entire set of all the

$\mathbb{R}^4 \to \mathbb{R}^3$ functions which can be computed by the above architecture for a *fixed* choice of “activation functions” (like the ReLU defined above) at each of the blue nodes and for all possible values of weights/real-numbers that can be assigned to the edges. The 4 yellow nodes are where a 4-dimensional input vector will go in as input and the 3 orange nodes are where the 3-dimensional output vector will come out. Unlike the previous diagrams where every edge carried a single real number,

in general every edge can be assigned two real numbers/weights/parameters say (a, b). It is to be

understood as specifying that when that edge gets a real number say x as input on its left end then

it will give the number ax + b as the output on its right end. Thus the total number of parameters

specifying a neural function can be at most twice the number of edges. Some of the largest nets in operation today (called “AmoebaNet-D”) have 600 million parameters. For comparison recall that the human brain has about 60-80 billion neurons and each of them has about $10^4$ synaptic connections to other neurons. This might motivate one to say that these nets which hope to model intelligence

are still quite small in comparison to the human brain!

The above diagram would be said to represent a class of neural functions with 5 “layers” of activations, the blue nodes. We recall that the function we had seen earlier, $f(x_1, x_2) = \max\{x_1, x_2\}$, was such that it could be represented using just 1 layer of activation. In this context it is worth pointing out that we still don’t know whether, say, the maximum of 5 numbers can be computed using 2 layers of gates or whether it necessarily needs 3 such layers of gates!

Now that we have started to think in terms of architectures, we end this chapter by pointing out a very crucial unresolved issue here: to understand, of all the functions that can be well approximated by a chosen architecture, how many of them show rapid oscillations - imagine the triangle-wave that we saw earlier but with a large (but finite) number of triangles.

This was just the tip of the iceberg and there is a lot more to this story. Let’s get a quick glimpse of some of that! In deep-learning we will often assume that the actual artificial intelligence task that one desires to accomplish can be reformulated as trying to minimize some real valued function which is often called the “loss function”. Our ever increasing experience is that with enough ingenuity one can often write the correct loss function - one which will capture the original question as a function mapping the space of functions of a neural architecture (and available data) to non-negative real numbers. And as one might expect, we try to minimize the loss function over the space of weights of the net (and hence over the space of functions represented by the given architecture) by approximately moving along the local gradients of the loss function. Thus it is immensely critical that we choose the right architecture - and this is currently almost a form of art! Research is only beginning in this direction of finding systematic methods for making a good choice of architecture. Even after an architecture has been chosen, we are faced with the massive question of actually doing this search through its space of functions/weights of the net to find the minimum of the loss.
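To make the “moving along the local gradients” idea concrete, here is a minimal caricature (assuming only NumPy) of gradient descent on the squared loss of a single ReLU gate with synthetically generated data; all the specific values below (the true weights w_star, data size, learning rate, iteration count) are illustrative placeholders, not anything prescribed by the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Realizable" synthetic data: labels generated by a ReLU gate with true
# weights w_star, so a loss of zero is achievable in principle.
w_star = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(200, 3))
y = np.maximum(0.0, X @ w_star)

def loss(w):
    # The empirical squared loss that we are trying to minimize.
    return np.mean((np.maximum(0.0, X @ w) - y) ** 2)

w = rng.normal(size=3)   # a random starting point in weight space
eta = 0.1                # an illustrative learning rate
for _ in range(500):
    pre = X @ w
    err = np.maximum(0.0, pre) - y
    grad = 2.0 * ((err * (pre > 0)) @ X) / len(X)  # gradient of the loss at w
    w -= eta * grad      # take a small step against the local gradient

print(loss(w))           # should now be close to 0, with w close to w_star
```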

In the description above we have so far hidden an immense complication which we now necessarily need to confront - that in actual practice information about this loss function is often only partially known! In general this is a very complicated question about searching for an optimal function in a function space while being guided only by a crude estimate of the true optimality criteria. This brings us to the vast field of “stochastic optimization” - and we will see many provable avatars of this in this thesis. We hope that the appetite of the reader has been adequately whetted, and at this point they might want to look at some of the recently released, freely available, beautiful introductory books on deep-learning which give a magnificent overview of this exciting new subject: https://www.deeplearningbook.org/ and d2l.ai.


Chapter 1

A Summary of the Results in This Thesis

Deep learning has brought about a paradigm shift in our quest for general artificial intelligence (LeCun, Bengio, and Hinton, 2015). Powered by concurrent technological advances, neural nets have in recent times beaten all previous benchmarks in playing hard strategy games like chess and Go (Silver et al., 2017; Silver et al., 2018), and have also radically pushed forward the technology towards self-driving cars (Fridman et al., 2017). But on the other hand, the methods employed to make deep learning practical remain highly mysterious and challenging to prove guarantees about. During my Ph.D. I have been extremely passionate about figuring out mathematically rigorous ways to understand deep-learning. We begin the summary of the results obtained by first setting up the mathematical notation needed to talk about nets.

1.1 Defining Deep Neural Nets

The crucial component that goes into defining a neural net is the “activation function”, often denoted as σ. Historically the σ that was in vogue at the beginning of the subject was the “sigmoid function”,

$\mathbb{R} \ni x \mapsto \sigma(x) = \frac{1}{1 + e^{-\lambda x}}$ for some $\lambda > 0$. But for almost all applications of neural nets today it seems that the most widely used activation function is the “Rectified Linear Unit (ReLU)”,

$$\mathrm{ReLU} : \mathbb{R} \to \mathbb{R}, \qquad x \mapsto \max\{0, x\}$$

In standard practice the notion of ReLU is overloaded to denote the following function operating entrywise,

$$\mathrm{ReLU} : \mathbb{R}^n \ni x \mapsto (\max\{0, x_1\}, \max\{0, x_2\}, \ldots, \max\{0, x_n\}) \in \mathbb{R}^n$$


Definition 1. [ReLU DNNs] Given $k, w_0, w_1, w_2, \ldots, w_k, w_{k+1} \in \mathbb{N}$, one defines a depth $k+1$ “ReLU Deep Neural Net (DNN)” as the following function,

$$\mathbb{R}^{w_0} \ni x \mapsto f(x) = \left( A_{k+1} \circ \mathrm{ReLU} \circ A_k \circ \cdots \circ A_2 \circ \mathrm{ReLU} \circ A_1 \right)(x) \in \mathbb{R}^{w_{k+1}} \qquad (1.1)$$

where $A_i : \mathbb{R}^{w_{i-1}} \to \mathbb{R}^{w_i}$ for $i = 1, \ldots, k+1$ is a set of $k+1$ affine transformations. The positive integers $w_1, \ldots, w_k$ are said to specify the widths of the hidden layers or layers of activation. The number $\max\{w_1, \ldots, w_k\}$ is called the width of this ReLU DNN. The size of the ReLU DNN is defined to be the number of univariate activation gates used, which can easily be seen to be $w_1 + w_2 + \ldots + w_k$.

Such a ReLU DNN is sometimes also called a $(k+1)$-layer ReLU DNN, and is said to have $k$ hidden layers. The number of layers, or depth, of the net can be seen to be the length of the shortest path from the input to the output of the directed acyclic graph that naturally represents such a neural network - of which we have already seen examples in the previous, informal, chapter and of which we see another in Figure 1.1 given below.
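As a sanity check of Definition 1, the following is a minimal NumPy transcription of equation (1.1); the choice of widths and the random affine maps below are placeholders, not anything prescribed by the thesis.

```python
import numpy as np

def relu(z):
    # The entrywise ReLU on R^n, as in the overloaded notation above.
    return np.maximum(0.0, z)

def relu_dnn(x, affines):
    # affines = [(W_1, b_1), ..., (W_{k+1}, b_{k+1})] encodes the affine maps
    # A_i(x) = W_i x + b_i of equation (1.1); ReLU is applied after every
    # affine map except the last one.
    for W, b in affines[:-1]:
        x = relu(W @ x + b)
    W, b = affines[-1]
    return W @ x + b

# A net with k = 2 hidden layers and widths (w_0, w_1, w_2, w_3) = (4, 7, 5, 2),
# so its size is w_1 + w_2 = 12 and its width is max{7, 5} = 7.
rng = np.random.default_rng(0)
widths = [4, 7, 5, 2]
affines = [(rng.normal(size=(m, n)), rng.normal(size=m))
           for n, m in zip(widths[:-1], widths[1:])]
print(relu_dnn(rng.normal(size=4), affines).shape)  # (2,)
```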

For any $(m, n) \in \mathbb{N}$, let $\mathbb{A}^n_m$ and $\mathbb{L}^n_m$ denote the class of affine and linear transformations from $\mathbb{R}^m \to \mathbb{R}^n$, respectively. Thus we introduce a compact notation for the class of width-specified DNNs as follows,

Definition 2. We denote the class of $\mathbb{R}^{w_0} \to \mathbb{R}^{w_{k+1}}$ ReLU DNNs with $k$ hidden layers of widths $\{w_i\}_{i=1}^k$ by $\mathcal{F}_{\{w_i\}_{i=0}^{k+1}}$, i.e.
$$\mathcal{F}_{\{w_i\}_{i=0}^{k+1}} := \{ A_{k+1} \circ \mathrm{ReLU} \circ A_k \circ \cdots \circ A_2 \circ \mathrm{ReLU} \circ A_1 \mid A_i \in \mathcal{A}^{w_i}_{w_{i-1}} \ \forall i \in \{1, \dots, k+1\} \} \quad (1.2)$$

Corresponding to any affine transformation $A_i$ above we will typically decompose its action via a linear transformation ("weight matrix") $W_i$ and a vector $b_i$ s.t. $x \mapsto A_i(x) = W_i x + b_i$. This is particularly helpful in setting up the notation for a particular class of neural nets, "Autoencoders", that we shall often consider in this thesis and which we specify below.

1.1.1 Notation for a special class of Autoencoders

Let $y \in \mathbb{R}^n$ be the input vector to the autoencoder, $\{W_i\}_{i=1,\dots,\ell}$ denote the weight matrices of the net and $\{b_i\}_{i=1,\dots,2\ell}$ be the bias vectors. Then the output $\hat{y} \in \mathbb{R}^n$ of the autoencoder (mapping $\mathbb{R}^n \to \mathbb{R}^n$) is defined as,


$$\hat{y} = W_1^\top \sigma(\dots \sigma(W_{\ell-1}^\top \sigma(W_\ell^\top a + b_{\ell+1}) + b_{\ell+2}) \dots) + b_{2\ell}$$

where

$$a = \sigma(W_\ell \sigma(\dots \sigma(W_2 \sigma(W_1 y + b_1) + b_2) \dots) + b_\ell)$$

This defines an autoencoder with $2\ell - 1$ hidden layers using the $\ell$ weight matrices and the $2\ell$ bias vectors defined above. The particular symmetry that has been imposed among the layers leading up to $a$ and those that act on $a$ is what leads to this arrangement being called "weight tied". Such autoencoders are a fairly standard setup that have been used in previous work (Arpit et al., 2015; Baldi, 2012; Kuchaiev and Ginsburg, 2017; Vincent et al., 2010).

A special case of the above that we shall focus on is when $\ell = 1$, i.e. it is a weight-tied autoencoder of depth 2 with $b_2 = 0$. We shall use $W = W_1$ and $b_1 = \epsilon \in \mathbb{R}^h$, where $h$ is the width of the net and the number of activation units used. Denoting the output of the hidden layer of activations as $r \in \mathbb{R}^h$, we have for this case,

$$\hat{y} = W^\top r \quad \text{where} \quad r = \mathrm{ReLU}(Wy - \epsilon) \quad (1.3)$$

We shall define the columns of $W^\top$ (rows of $W$) as $\{W_i\}_{i=1}^h$. A pictorial representation of such a depth 2 autoencoder is as given in Figure 1.1.

FIGURE 1.1: The above is the circuit representation of a depth 2, width 15 autoencoder mapping $\mathbb{R}^4 \ni y \mapsto \hat{y} \in \mathbb{R}^4$.
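As a quick numerical illustration of equation 1.3, the following sketch computes the forward pass of this depth 2 weight-tied autoencoder; the dimensions and the random choices of $W$ and $\epsilon$ below are arbitrary demonstration values.

```python
import numpy as np

def depth2_autoencoder(y, W, eps):
    """The weight-tied depth 2 autoencoder of equation (1.3):
    r = ReLU(W y - eps) and y_hat = W^T r."""
    r = np.maximum(0.0, W @ y - eps)   # hidden layer of h ReLU activations
    return W.T @ r                     # decoder reuses the same W (weight tying)

rng = np.random.default_rng(0)
n, h = 4, 15                           # input dimension and width, as in Figure 1.1
W = rng.standard_normal((h, n))
eps = 0.1 * np.ones(h)                 # the common bias vector epsilon
print(depth2_autoencoder(rng.standard_normal(n), W, eps))
```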


"Deep learning" refers to a suite of computational techniques that have been developed recently for training DNNs. It started with the works of Hinton, Osindero, and Teh, 2006 (deep belief networks) and Salakhutdinov and Hinton, 2009 (deep Boltzmann machines), which gave empirical evidence that if deep architectures are initialized properly (for instance, using unsupervised pre-training), then we can find good solutions in a reasonable amount of runtime. This work was soon followed by a series of early successes of deep learning at significantly improving state-of-the-art AI systems in speech recognition, image classification and natural language processing based on deep neural nets (Hinton et al., 2012; Dahl, Sainath, and Hinton, 2013; Krizhevsky, Sutskever, and Hinton, 2012; Le, 2013; Sutskever, Vinyals, and Le, 2014). While there is less evidence now that pre-training actually helps, several other solutions have since been put forth to address the issue of efficiently training DNNs. These include heuristics such as dropouts (Srivastava et al., 2014), but also alternate deep architectures such as convolutional neural networks (Sermanet et al., 2014).

One of the fascinating aspects of trying to build a theory for deep-learning is that if we view this project through the lens of optimization theory, then its setup is essentially opposite to how the theory of optimization is studied in standard textbooks and courses. Typically one starts off with a well-defined optimization problem (like Conic Programming) and then one studies the properties of its optima and the algorithmic aspects of solving it. But deep-learning has developed entirely on the sturdy shoulders of thousands of highly innovative experimenters who have caused this artificial intelligence revolution by developing a vast array of mysterious heuristics which work to get the neural net to perform tasks which would be "human like". To give an obvious example: there is no unambiguous way to quantify the fact that state-of-the-art GAN outputs look like realistic images, but this is exactly the criterion we would want to use to judge whether a GAN has been trained well or not! As a subject deep-learning is predominantly defined by wildly successful algorithmic heuristics, and often it's entirely unclear how the obviously wonderful performance of the trained net can be described as finding good solutions of some optimization problem! For some of the most exotic applications of neural nets, debates continue about what the right optimization problem is whose solution would correctly capture the success of the net.

But if we can agree about the “loss function (say ℓ)” to be used then at least for the most ordinary use cases the challenge of modern deep learning can be abstracted out as a particularly hard case of usual “learning theory”. In such benign situations we can focus on wanting to solve the following function optimization/“risk minimization” question,

$$\min_{N \in \mathcal{N}} \ \mathbb{E}_{z \sim \mathcal{D}}[\ell(N, z)] \quad (1.4)$$

where $\ell$ is some lower-bounded non-negative function, members of $\mathcal{N}$ are continuous piecewise linear functions representable by some chosen neural net architecture, and we only have sample access to the distribution $\mathcal{D}$. This reduces to the "empirical risk minimization" question when this $\mathcal{D}$ is a uniform distribution on a finite set of points. In the light of the previous discussion, the research results presented in this thesis can be seen to be focused on the following 3 critical aspects: (a) understanding mathematical properties of the neural function spaces on which this risk minimization is being attempted (Section 1.2), (b) proving guarantees about algorithms which can be used to approximately solve this question of neural risk minimization (Section 1.3) and (c) understanding the structure of the nets which are solutions to such risk minimization questions (Section 1.4).

1.2 Understanding the space of neural functions

In Chapter 2 we provide 3 main kinds of insights about the nature of neural functions. Firstly, we extend the recently published results in Telgarsky, 2016a to show that for every $k \in \mathbb{Z}^+$ there exists a continuum of hard functions which require $O(k^3)$ size to represent at depth $1 + k^2$ but will require $\Omega(k^k)$ (super-exponential in depth) size to approximate at depth $1 + k$. We also show that a kind of polytope, called a "zonotope", has a natural relationship to neural nets, i.e. ReLU nets can represent the gauge function of zonotopes, and this in turn gives us an explicit construction of a continuum of ReLU functions with the largest number of affine pieces for large classes of architectures.

Secondly, we were intrigued by the question of finding non-trivial upper bounds on the run-time of algorithms which can find the exact global minima of empirical risk. We show how a collection of convex programming subroutines can be used to get algorithms for exact empirical risk minimization on depth 2 nets which run in poly(data) time at a fixed depth. Such faster-than-brute-force exact optimization algorithms remain unknown for higher depths. (Recently there has been a very interesting complexity theoretic paper from Berkeley, Manurangsi and Reichman, 2018, which builds further on this algorithm of ours.)

Lastly, we also investigate depth hierarchy theorems for ReLU nets (ending in a "Linear Threshold Function" (LTF) gate which maps $\mathbb{R} \ni y \mapsto -1 + 2 \cdot \mathbf{1}_{y \ge 0} \in \mathbb{R}$) trying to compute Boolean functions. Many of the key results in this direction were achieved by extending to ReLU nets a method of random restrictions recently developed by Daniel Kane and Ryan Williams. This line of investigation has thrown up a lot of puzzling open questions about whether or not ReLU nets are more efficient at representing Boolean functions than usual Boolean circuits.


1.3 Landscape of neural nets and deep-learning algorithms

This theme can be said to be the mainstay of this thesis, and it spans 3 chapters.

In Chapter 3 we give 2 kinds of insights about training a ReLU gate. Firstly we give a very simple iterative stochastic algorithm to recover the underlying parameter $w_*$ of the ReLU gate when realizable data allowed to be sampled online is of the form $(x, \max\{0, w_*^\top x\})$. Compared to all previous such attempts, the distributional condition we use is very mild, essentially just capturing the intuition that enough of our samples have to be such that $w_*^\top x > 0$.
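The precise algorithm and its guarantees are given in Chapter 3. Purely as an illustration of this style of online update, here is a sketch of a GLM-tron-like stochastic rule on realizable ReLU data; the Gaussian sampling distribution and the step size below are our assumptions for the demo and are not the conditions used in the actual theorem.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
w_star = rng.standard_normal(n)            # unknown true parameter of the ReLU gate

def sample():
    # Online access to realizable data (x, max{0, <w_star, x>})
    x = rng.standard_normal(n)
    return x, max(0.0, float(w_star @ x))

w = np.zeros(n)                            # current estimate
eta = 0.05                                 # step size (illustrative choice)
for _ in range(20000):
    x, y = sample()
    y_hat = max(0.0, float(w @ x))
    w += eta * (y - y_hat) * x             # GLM-tron-style stochastic update
# In this toy Gaussian setting the error shrinks to roughly the order of eta.
print(np.linalg.norm(w - w_star))
```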

Secondly we give an argument which establishes a first-of-its-kind mathematical control on the behaviour of gradient descent (with deliberate injection of noise) on the squared loss function of a single ReLU gate. It is to be noted that this argument doesn't need any distributional assumption beyond realizability of the labels, and thus it makes us optimistic that this is a potentially interesting step towards explaining the success of this ubiquitously used heuristic. The key idea here is that of "coupling", which shows that from the iterates of noise-injected gradient descent on the squared loss of a ReLU gate one can create a discrete super-martingale.

In Chapter 4 we focus on autoencoders and make progress towards explaining their success. We were particularly inspired by the experimental works of Brendan Frey and Alireza Makhzani. We checked that an off-the-shelf RMSProp algorithm can actually very easily do reasonably good autoencoding on MNIST, even at depth 2. This piqued our interest to understand this better, and we analyzed the landscape of the autoencoder under the usual sparse-coding generative model. Via a very elaborate analysis we are able to estimate the value of the gradient of the squared loss on depth 2 autoencoders whose input/output dimension is the same as that of the observed vectors in sparse-coding and whose width is the same as the sparse-code dimension.

This intricate analysis leads to the insight that the norm of this gradient decreases in a small neighbourhood of the original (unknown) dictionary as the sparse-code dimension increases. Such a proof of asymptotic criticality around the dictionary takes a step towards explaining why neural nets should be able to do dictionary learning. Works like Nguyen, Wong, and Hegde, 2019 have recently built on top of our analysis framework to show trainability proofs for autoencoders.

In Chapter 5 we focus on understanding the specific adaptive gradient algorithms, RMSProp and ADAM, which are implemented widely across almost all deep-learning tasks and are known to be the state-of-the-art in almost every application. We give the first ever proofs that (deterministic) RMSProp and (deterministic) ADAM converge to criticality for smooth objectives without the assumption of convexity. We also motivate a class of first-order moment-constrained oracles in the presence of which we can show the first ever proof of convergence of stochastic RMSProp with no convexity assumptions and at the same speed as SGD on convex functions.

We emphasize that this is particularly exciting in the context of recent results (Reddi, Kale, and Kumar, 2018) which have shown that under the same setting of constant hyperparameter values, ADAM used as an online optimizer cannot always get asymptotically zero average regret. We also show in extensive experiments on VGG-9 running on CIFAR-10, and across various sizes of autoencoders running on MNIST, that ADAM's performance gets a consistent and curious boost (and thus it outperforms its competitors) when its $\beta_1$ (the parameter that controls the influence of the history of gradients on the current update) is pushed closer to 1 than its usual settings.
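For reference, here is a minimal sketch of one standard ADAM update in the usual notation; the hyperparameter names follow common usage, the toy objective is our demonstration choice, and the $\beta_1$ pushed towards 1 is the regime flagged in the experiments above.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.05, beta1=0.99, beta2=0.999, eps=1e-8):
    """One ADAM update. beta1 controls how strongly the history of gradients
    (the first moment estimate m) influences the current step; Chapter 5's
    experiments favour beta1 closer to 1 than the usual default of 0.9."""
    m = beta1 * m + (1 - beta1) * grad          # first moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias corrections
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy run: minimize f(w) = ||w||^2 / 2, whose gradient at w is w itself.
w, m, v = np.ones(3), np.zeros(3), np.zeros(3)
for t in range(1, 501):
    w, m, v = adam_step(w, w, m, v, t)
print(w)  # ends up close to the minimizer at the origin
```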

1.4 Estimating the risk function of neural nets

The long-standing open question in deep-learning is to be able to theoretically explain when neural nets which are massively over-parameterized happen to be high-quality solutions of the risk minimization problem defined in (1.4) - even when they fit the training data arbitrarily accurately. In recent times it has been increasingly realized that good risk bounds possibly necessarily need to depend on the training algorithm as well as the training data. The currently available methods to bound the risk function have been beautifully reviewed in Audibert and Bousquet, 2007. Here the authors have clubbed the techniques into primarily four categories: (1) "Supremum Bounds" (like generic chaining, Dudley integral, Rademacher complexity), (2) "Variance Localized Bounds", (3) "Data-Dependent Bounds" and (4) "Algorithm Dependent Complexity". The last category includes PAC-Bayes bounds, which have risen to prominence in recent times and are the crux of our most recently completed work described in Chapter 6.

Rademacher complexity based bounds like Golowich, Rakhlin, and Shamir, 2018 and Bartlett, Foster, and Telgarsky, 2017 fail to give non-vacuous bounds when evaluated on the gigantic neural nets used in practice. In the PAC-Bayesian framework we slightly move away from trying to bound the risk and instead try to bound an instance of "stochastic risk", which can be thought of as allowing the neural net's weights/parameters to be noisy. This can be argued to be the most natural quantity to bound, given that all successful neural training algorithms are stochastic and hence the trained net obtained from them is essentially a sample from a distribution on the neural function space induced by the training algorithm. By this shift in viewpoint, recent works like Dziugaite and Roy, 2017 and Zhou et al., 2018b have shown for the very first time that PAC-Bayesian bounds can give non-trivial risk bounds for practical neural nets. But the above bounds are "computational" in the sense that obtaining them requires an algorithmic search over a certain parametric space of distributions.


These experiments strongly motivate our current work seeking rigorous theoretical exploration of the power of PAC-Bayesian technology in explaining the learning ability of neural nets.

Previous PAC-Bayes bounds have used data-dependent priors on the geometric mean of the spectral norms of the layer matrices to try to track the distance in the parameter space of the trained net from a fixed point in that space. In our work the first key idea we initiate is to track the distance of the trained net from its initialization by looking at two independent quantities: (a) a non-compact parameter, the change from initialization of the norm of the vector of weights of the net (i.e. the sum of Frobenius norms of the layer matrices for a net without bias weights), and (b) a compact part, the angular deflection of this vector of weights from initialization to the end of training. In this work we instantiate an elaborate mechanism of putting a two-indexed grid of priors which can simultaneously be sensitive to both the above properties of the neural net training process.

Our second key idea is to realize that in the PAC-Bayesian framework one can leverage more out of the angle parameter by also simultaneously training a cluster of nets which are initialized close to the original net. Because of this use of clusters, compared to previous bounds our dependency on the distance from initialization is not only more intricate, but we are also able to be more sensitive to the average-case behaviour.

Compared to previous theories in this direction (Neyshabur et al., 2017), we build into the formalism a larger number of data-dependent (and hence tunable) parameters. As a consequence we get a risk bound on nets which is empirically not only seen to be tighter than Neyshabur et al., 2017 but also has better/lower "rates" of dependency on the neural architectural parameters like depth and width.

We emphasize that the aforesaid ability to leverage the use of clusters of nets in tandem is critically hinged on us being able to prove methods of creating multi-parameter families of mixtures of Gaussian distributions such that the given neural function remains stable when its weights are perturbed by noise sampled from these distributions. There are potentially far-reaching implications of such theorems because of the intricate arguments made in recent times which motivate why finding provable compression algorithms for any class of nets is tied to being able to prove the existence of noise distributions to which this same class is resilient.

We go on to demonstrate two kinds of insights in our experiments. Over synthetic data and standard tests like CIFAR-10 we show that our bound performs consistently better than existing PAC-Bayesian bounds. Next we show in the experiments that the two parameters mentioned above have a lot more structure than what theory is currently capable of leveraging. We observe in our experiments that the $\ell_2$-norm of the weight vector described above always undergoes a slight dilation during the training.


We also demonstrate that the angular deflection is predominantly determined by the underlying data-set/data-distribution and is only very slightly affected by the architecture of the net.

Current wisdom in the field suggests that observations like the above about the systematic behaviours of neural net training can potentially be leveraged into increasingly creative risk bounds for nets. Thus these experiments pave the way for our continuing exploration of even better bounds which can eventually lead to principled methods of choosing the right net to use for a given artificial intelligence task at hand.


Chapter 2

Exploring the Space of Neural Functions

2.1 Introduction

Neural networks with a single hidden layer of finite size can approximate any continuous function on a compact subset of $\mathbb{R}^n$ arbitrarily well. This universal approximation result was first given for the sigmoid activation function in Cybenko, 1989, and later generalized by Hornik to an arbitrary bounded and non-constant activation function (Hornik, 1991) (and in turn it applies to ReLU nets as well). Furthermore, neural networks ending in an LTF gate have finite VC dimension (depending polynomially on the number of edges in the network), and therefore are PAC (Probably Approximately Correct) learnable using a sample of size that is polynomial in the size of the networks (Anthony and Bartlett, 1999). However, neural network based methods were shown to be computationally hard to learn (Anthony and Bartlett, 1999) and had mixed empirical success. Consequently, DNNs fell out of favor by the late 90s.

In this chapter, we formally study deep neural networks with rectified linear units; we refer to these deep architectures as ReLU DNNs. Our work is inspired by recent attempts to understand the reason behind the successes of deep learning, both in terms of the structure of the functions represented by DNNs (Telgarsky, 2015; Telgarsky, 2016b; Kane and Williams, 2015; Shamir, 2016), as well as efforts which have tried to better understand the non-convex nature of the training problem of DNNs (Kawaguchi, 2016; Haeffele and Vidal, 2015). Our investigation of the function space represented by ReLU DNNs also takes inspiration from the classical theory of circuit complexity; we refer the reader to Arora and Barak, 2009; Shpilka and Yehudayoff, 2010; Jukna, 2012; Saptharishi, 2014; Allender, 1998 for various surveys of this deep and fascinating field. In particular, our gap results are inspired by results like the ones by Håstad (Hastad, 1986), Razborov (Razborov, 1987) and Smolensky (Smolensky, 1987) which show a strict separation of complexity classes. We make progress towards similar statements with deep neural nets with ReLU activation.


2.1.1 Notation and Definitions

Definition 3. [Piecewise linear functions] We say a function $f : \mathbb{R}^n \to \mathbb{R}$ is continuous piecewise linear (PWL) if there exists a finite set of polyhedra whose union is $\mathbb{R}^n$, and $f$ is affine linear over each polyhedron (note that the definition automatically implies continuity of the function because the affine regions are closed and cover $\mathbb{R}^n$, and affine functions are continuous). The number of pieces of $f$ is the number of maximal connected subsets of $\mathbb{R}^n$ over which $f$ is affine linear (which is finite).

Many of our important statements will be phrased in terms of the following simplex.

Definition 4. Let $M > 0$ be any positive real number and $p \ge 1$ be any natural number. Define the following set:
$$\Delta^p_M := \{x \in \mathbb{R}^p : 0 < x_1 < x_2 < \dots < x_p < M\}.$$

2.2 Exact characterization of function class represented by ReLU DNNs

One of the main advantages of DNNs is their representational ability. In this section, we give an exact characterization of the functions representable by ReLU DNNs. Moreover, we show how structural properties of ReLU DNNs, specifically their depth and width, affect their expressive power. It is clear from the definition that any function from $\mathbb{R}^n \to \mathbb{R}$ represented by a ReLU DNN is a continuous piecewise linear (PWL) function. In what follows, we show that the converse is also true, that is, any PWL function is representable by a ReLU DNN. In particular, the following theorem establishes a one-to-one correspondence between the class of ReLU DNNs and PWL functions.

Theorem 2.2.1. Every $\mathbb{R}^n \to \mathbb{R}$ ReLU DNN represents a piecewise linear function, and every piecewise linear function $\mathbb{R}^n \to \mathbb{R}$ can be represented by a ReLU DNN with depth at most $\lceil \log_2(n+1) \rceil + 1$.

Proof Sketch: It is clear that any function represented by a ReLU DNN is a PWL function. To see the converse, we first note that any PWL function can be represented as a linear combination of piecewise linear convex functions. More formally, by Theorem 1 in (Wang and Sun, 2005), for every piecewise linear function $f : \mathbb{R}^n \to \mathbb{R}$, there exists a finite set of affine linear functions $\ell_1, \dots, \ell_k$ and subsets $S_1, \dots, S_p \subseteq \{1, \dots, k\}$ (not necessarily disjoint), where each $S_i$ is of cardinality at most $n + 1$, such


that
$$f = \sum_{j=1}^{p} s_j \left( \max_{i \in S_j} \ell_i \right), \quad (2.1)$$
where $s_j \in \{-1, +1\}$ for all $j = 1, \dots, p$. Since a function of the form $\max_{i \in S_j} \ell_i$ is a piecewise linear convex function with at most $n + 1$ pieces (because $|S_j| \le n + 1$), Equation (2.1) says that any continuous piecewise linear function (not necessarily convex) can be obtained as a linear combination

of piecewise linear convex functions, each of which has at most $n + 1$ affine pieces. Furthermore, Lemmas 2.A.2, 2.A.3 and 2.A.4 in the Appendix show that composition, addition, and pointwise maximum of PWL functions are also representable by ReLU DNNs. In particular, in Lemma 2.A.4 we note that $\max\{x, y\} = \frac{x+y}{2} + \frac{|x-y|}{2}$ is implementable by a two layer ReLU network, and we use this construction in an inductive manner to show that the maximum of $n + 1$ numbers can be computed using a ReLU DNN with depth at most $\lceil \log_2(n+1) \rceil$.
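The two-layer construction referred to above is easy to check numerically; the following small sketch (our illustration) uses the identity $|z| = \mathrm{ReLU}(z) + \mathrm{ReLU}(-z)$, so that $\max\{x, y\}$ is one hidden layer of ReLU gates followed by an affine combination.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def max_via_relu(x, y):
    # |x - y| = ReLU(x - y) + ReLU(y - x): two ReLU gates on affine inputs;
    # then max{x, y} = (x + y)/2 + |x - y|/2 is an affine readout.
    abs_diff = relu(x - y) + relu(y - x)
    return (x + y) / 2 + abs_diff / 2

x, y = np.random.default_rng(0).standard_normal(2)
assert np.isclose(max_via_relu(x, y), max(x, y))
```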

While Theorem 2.2.1 gives an upper bound on the depth of the networks needed to represent all continuous piecewise linear functions on Rn, it does not give any tight bounds on the size of the networks that are needed to represent a given piecewise linear function. For n = 1, we give tight bounds on size as follows:

Theorem 2.2.2. Given any piecewise linear function $f : \mathbb{R} \to \mathbb{R}$ with $p$ pieces, there exists a 2-layer DNN with at most $p$ nodes that can represent $f$. Moreover, any 2-layer DNN that represents $f$ has size at least $p - 1$.

Finally, the main result of this section follows from Theorem 2.2.1 and the well-known facts that the piecewise linear functions are dense in the family of compactly supported continuous functions, and the family of compactly supported continuous functions is dense in $L^q(\mathbb{R}^n)$ (Royden and Fitzpatrick, 2010). Recall that $L^q(\mathbb{R}^n)$ is the space of Lebesgue integrable functions $f$ such that $\int |f|^q \, d\mu < \infty$, where $\mu$ is the Lebesgue measure on $\mathbb{R}^n$ (see Royden and Fitzpatrick, 2010).

Theorem 2.2.3. Every function in $L^q(\mathbb{R}^n)$, $1 \le q \le \infty$, can be arbitrarily well-approximated in the $L^q$ norm (which for a function $f$ is given by $\|f\|_q = (\int |f|^q)^{1/q}$) by a ReLU DNN function with at most $\lceil \log_2(n+1) \rceil$ hidden layers. Moreover, for $n = 1$, any such $L^q$ function can be arbitrarily well-approximated by a 2-layer DNN, with tight bounds on the size of such a DNN in terms of the approximation.


Proofs of Theorems 2.2.2 and 2.2.3 are provided in Appendix 2.A. We would like to remark that a weaker version of Theorem 2.2.1 was observed in Goodfellow et al., 2013, Proposition 4.1 (with no bound on the depth), along with a universal approximation theorem (Goodfellow et al., 2013, Theorem 4.3) similar to Theorem 2.2.3. The authors of Goodfellow et al., 2013 also used a previous result of Wang (Wang, 2004) for obtaining their result. In a subsequent work, Boris Hanin (Hanin, 2017) has, among other things, found width and depth upper bounds for ReLU net representation of positive PWL functions on $[0,1]^n$. The width upper bound is $n + 3$ for general positive PWL functions and $n + 1$ for convex positive PWL functions. For convex positive PWL functions his depth upper bound is sharp if we disallow dead ReLUs.

2.3 Benefits of Depth

The success of deep learning has been largely attributed to the depth of the networks, i.e. the number of successive affine transformations followed by nonlinearities, which is shown to be extracting hierarchical features from the data. In contrast, traditional frameworks including support vector machines, generalized linear models, and kernel machines can be seen as instances of shallow networks, where a linear transformation acts on a single layer of nonlinear feature extraction. In this section, we explore the importance of depth in ReLU DNNs. In particular, in Section 2.3.1, we provide a smoothly parametrized family of $\mathbb{R} \to \mathbb{R}$ "hard" functions representable by ReLU DNNs, which require exponentially larger size for a shallower network to represent. Furthermore, in Section 2.3.2, we construct a continuum of $\mathbb{R}^n \to \mathbb{R}$ "hard" functions representable by ReLU DNNs, which to the best of our knowledge is the first explicit construction of ReLU DNN functions whose number of affine pieces grows exponentially with the input dimension.

2.3.1 Circuit lower bounds for $\mathbb{R} \to \mathbb{R}$ ReLU DNNs

In this section, we are only concerned with $\mathbb{R} \to \mathbb{R}$ ReLU DNNs, i.e. both input and output dimensions are equal to one. The following theorem shows the depth-size trade-off in this setting.

Theorem 2.3.1. For every pair of natural numbers $k \ge 1$, $w \ge 2$, there exists a family of "hard" functions representable by an $\mathbb{R} \to \mathbb{R}$ $(k+1)$-layer ReLU DNN of width $w$ such that if any of them is also representable by a $(k'+1)$-layer ReLU DNN for any $k' \le k$, then this $(k'+1)$-layer ReLU DNN has size at least $\frac{1}{2} k' w^{\frac{k}{k'}} - 1$.

In fact our family of hard functions described above has a very intricate structure as stated below.


Theorem 2.3.2. For every $k \ge 1$, $w \ge 2$, every member of the family of hard functions in Theorem 2.3.1 has $w^k$ pieces, and this family can be parametrized by
$$\bigcup_{M>0} \underbrace{\Delta^{w-1}_M \times \Delta^{w-1}_M \times \dots \times \Delta^{w-1}_M}_{k \text{ times}}, \quad (2.2)$$

i.e., for every point in the set above, there exists a distinct function with the stated properties.

The following is an immediate corollary of Theorem 2.3.1 by choosing the parameters carefully.

Corollary 2.3.3. For every $k \in \mathbb{N}$ and $\epsilon > 0$, there is a family of functions defined on the real line such that every function $f$ from this family can be represented by a $(k^{1+\epsilon})+1$-layer DNN with size $k^{2+\epsilon}$, and if $f$ is represented by a $k+1$-layer DNN, then this DNN must have size at least $\frac{1}{2} k \cdot k^{k^{\epsilon}} - 1$. Moreover, this family can be parametrized as $\bigcup_{M>0} \Delta^{k^{2+\epsilon}-1}_M$.

A particularly illuminative special case is obtained by setting ϵ = 1 in Corollary 2.3.3:

Corollary 2.3.4. For every natural number $k \in \mathbb{N}$, there is a family of functions parameterized by the set $\bigcup_{M>0} \Delta^{k^3-1}_M$ such that any $f$ from this family can be represented by a $k^2+1$-layer DNN with $k^3$ nodes, and every $k+1$-layer DNN that represents $f$ needs at least $\frac{1}{2} k^{k+1} - 1$ nodes.

Towards proving the above two theorems we first need the following definition and lemma,

Definition 5. For $p \in \mathbb{N}$ and $a \in \Delta^p_M$, we define a function $h_a : \mathbb{R} \to \mathbb{R}$ which is piecewise linear over the segments $(-\infty, 0], [0, a_1], [a_1, a_2], \dots, [a_p, M], [M, +\infty)$, defined as follows: $h_a(x) = 0$ for all $x \le 0$, $h_a(a_i) = M(i \bmod 2)$, and $h_a(M) = M - h_a(a_p)$, and for $x \ge M$, $h_a(x)$ is a linear continuation of the piece over the interval $[a_p, M]$. Note that the function has $p + 2$ pieces, with the leftmost piece having slope 0. Furthermore, for $a^1, \dots, a^k \in \Delta^p_M$, we denote the composition of the functions $h_{a^1}, h_{a^2}, \dots, h_{a^k}$ by
$$H_{a^1,\dots,a^k} := h_{a^k} \circ h_{a^{k-1}} \circ \dots \circ h_{a^1}.$$
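A small numerical sketch (our illustration; the specific breakpoints below are arbitrary choices) of Definition 5 and the effect of composition: each $h_a$ is built by linear interpolation of its prescribed values on $[0, M]$, and composing two of them multiplies the piece counts exactly as in Figure 2.1.

```python
import numpy as np

def make_h(a, M=1.0):
    """h_a of Definition 5 restricted to [0, M]: h_a(0) = 0,
    h_a(a_i) = M * (i mod 2) with indices starting at 1, and
    h_a(M) = M - h_a(a_p); linear interpolation in between."""
    a = np.asarray(a, dtype=float)
    vals = np.array([M * (i % 2) for i in range(1, len(a) + 1)])
    xs = np.concatenate(([0.0], a, [M]))
    ys = np.concatenate(([0.0], vals, [M - vals[-1]]))
    return lambda t: np.interp(t, xs, ys)

# a^1 in Delta^2_1 and a^2 in Delta^1_1, as in Figure 2.1
h1 = make_h([0.4, 0.7])              # 3 pieces in [0, 1]
h2 = make_h([0.5])                   # 2 pieces in [0, 1]
H = lambda t: h2(h1(t))              # H_{a^1,a^2}; should have 2 * 3 = 6 pieces

# Count the pieces of H on [0, 1] by counting slope changes on a fine grid
t = np.linspace(0.0, 1.0, 100001)
slopes = np.round(np.diff(H(t)) / np.diff(t), 4)
print(1 + np.count_nonzero(np.diff(slopes)))   # prints 6
```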


Lemma 2.3.5. For any $M > 0$, $p \in \mathbb{N}$, $k \in \mathbb{N}$ and $a^1, \dots, a^k \in \Delta^p_M$, if we compose the functions $h_{a^1}, h_{a^2}, \dots, h_{a^k}$, the resulting function is a piecewise linear function with at most $(p+1)^k + 2$ pieces, i.e.,
$$H_{a^1,\dots,a^k} := h_{a^k} \circ h_{a^{k-1}} \circ \dots \circ h_{a^1}$$
is piecewise linear with at most $(p+1)^k + 2$ pieces, with $(p+1)^k$ of these pieces in the range $[0, M]$ (see Figure 2.1). Moreover, in each piece in the range $[0, M]$, the function is affine with minimum value 0 and maximum value $M$.

Proof. Simple induction on k.

Proof of Theorem 2.3.2. Given $k \ge 1$ and $w \ge 2$, choose any point
$$(a^1, \dots, a^k) \in \bigcup_{M>0} \underbrace{\Delta^{w-1}_M \times \Delta^{w-1}_M \times \dots \times \Delta^{w-1}_M}_{k \text{ times}}.$$

By Definition 5, each $h_{a^i}$, $i = 1, \dots, k$, is a piecewise linear function with $w + 1$ pieces and the leftmost piece having slope 0. Thus, by Corollary 2.A.1, each $h_{a^i}$, $i = 1, \dots, k$, can be represented by a 2-layer ReLU DNN with size $w$. Using Lemma 2.A.2, $H_{a^1,\dots,a^k}$ can be represented by a $k + 1$ layer DNN with size $wk$; in fact, each hidden layer has exactly $w$ nodes.

Proof of Theorem 2.3.1. Follows from Theorem 2.3.2 and Lemma 2.A.7.


FIGURE 2.1: Top: $h_{a^1}$ with $a^1 \in \Delta^2_1$, with 3 pieces in the range $[0, 1]$. Middle: $h_{a^2}$ with $a^2 \in \Delta^1_1$, with 2 pieces in the range $[0, 1]$. Bottom: $H_{a^1,a^2} = h_{a^2} \circ h_{a^1}$ with $2 \cdot 3 = 6$ pieces in the range $[0, 1]$. The dotted line in the bottom panel corresponds to the function in the top panel. It shows that for every piece of the dotted graph, there is a full copy of the graph in the middle panel.

We can also get hardness-of-approximation versions of Theorem 2.3.1 and Corollaries 2.3.3 and 2.3.4, with the same gaps (up to constant terms), using the following theorem.


Theorem 2.3.6. For every $k \ge 1$, $w \ge 2$, there exists a function $f_{k,w}$ that can be represented by a $(k+1)$-layer ReLU DNN with $w$ nodes in each layer, such that for all $\delta > 0$ and $k' \le k$ the following holds:
$$\inf_{g \in \mathcal{G}_{k',\delta}} \int_{x=0}^{1} |f_{k,w}(x) - g(x)|\, dx > \delta,$$
where $\mathcal{G}_{k',\delta}$ is the family of functions representable by ReLU DNNs with depth at most $k'+1$, and size at most $k' \frac{w^{k/k'}(1-4\delta)^{1/k'}}{2^{1+1/k'}}$.

The depth-size trade-off results in Theorems 2.3.1 and 2.3.6 extend and improve Telgarsky's theorems from (Telgarsky, 2015; Telgarsky, 2016b) in the following three ways:

(i) If we apply our Theorem 2.3.6 to the pair of neural nets considered by Telgarsky in Theorem 1.1 of Telgarsky, 2016b, which are at depths $k^3$ (of size also scaling as $k^3$) and $k$, then for this purpose of approximation in the $\ell_1$-norm we would get a size lower bound for the shallower net which scales as $\Omega(2^{k^2})$, which is exponentially (in depth) larger than the lower bound of $\Omega(2^k)$ that Telgarsky can get for this scenario.

(ii) Telgarsky's family of hard functions is parameterized by a single natural number $k$. In contrast, we show that for every pair of natural numbers $w$ and $k$, and a point from the set in equation 2.2, there exists a "hard" function which, to be represented by a depth $k'$ network, would need a size of at least $w^{\frac{k}{k'}} k'$. With the extra flexibility of choosing the parameter $w$, for the purpose of showing gaps in the representation ability of deep nets, we can show size lower bounds which are super-exponential in depth, as explained in Corollaries 2.3.3 and 2.3.4.

(iii) A characteristic feature of the "hard" functions in Boolean circuit complexity is that they are usually a countable family of functions and not a "smooth" family of hard functions. In fact, in the last section of Telgarsky, 2015, Telgarsky states this as a "weakness" of the state-of-the-art results on "hard" functions for both Boolean circuit complexity and neural nets research. In contrast, we provide a smoothly parameterized family of "hard" functions in Section 2.3.1 (parametrized by the set in equation 2.2). Such a continuum of hard functions wasn't demonstrated before this work.

We point out that Telgarsky's results in (Telgarsky, 2016b) apply to deep neural nets with a host of different activation functions, whereas our results are specifically for neural nets with rectified linear units. In this sense, Telgarsky's results from (Telgarsky, 2016b) are more general than our results in this paper, but with weaker gap guarantees. Eldan-Shamir (Shamir, 2016; Eldan and Shamir, 2016) show that there exists an $\mathbb{R}^n \to \mathbb{R}$ function that can be represented by a 3-layer DNN, but takes a number of nodes exponential in $n$ to be approximated to within some constant by a 2-layer DNN. While their results are not immediately comparable with Telgarsky's or our results, it is an interesting open question to extend their results to a constant-depth hierarchy statement analogous to the recent result of Rossman et al. (Rossman, Servedio, and Tan, 2015). We also note that in the last few years, there has been much effort in the community to show size lower bounds on ReLU DNNs trying to approximate various classes of functions which are themselves not necessarily exactly representable by ReLU DNNs (Yarotsky, 2016; Liang and Srikant, 2016; Safran and Shamir, 2017).

Proof of Theorem 2.3.6. Given $k \ge 1$ and $w \ge 2$, define $q := w^k$ and $s_q := \underbrace{h_a \circ h_a \circ \dots \circ h_a}_{k \text{ times}}$, where $a = (\frac{1}{w}, \frac{2}{w}, \dots, \frac{w-1}{w}) \in \Delta^{w-1}_1$. Thus, $s_q$ is representable by a ReLU DNN of width $w + 1$ and depth $k + 1$ by Lemma 2.A.2. In what follows, we want to give a lower bound on the $\ell_1$ distance of $s_q$ from any continuous $p$-piecewise linear comparator $g_p : \mathbb{R} \to \mathbb{R}$. The function $s_q$ contains $\lfloor \frac{q}{2} \rfloor$ triangles of width $\frac{2}{q}$ and unit height. A $p$-piecewise linear function has $p - 1$ breakpoints in the interval $[0, 1]$, so in at least $\lfloor \frac{w^k}{2} \rfloor - (p-1)$ of the triangles, $g_p$ has to be affine. In the following we demonstrate that inside any triangle of $s_q$, any affine function will incur an $\ell_1$ error of at least $\frac{1}{2w^k}$.

$$\int_{x=\frac{2i}{w^k}}^{\frac{2i+2}{w^k}} |s_q(x) - g_p(x)|\, dx = \int_{x=0}^{\frac{2}{w^k}} \left|s_q(x) - \left(y_1 + (x - 0)\cdot\frac{y_2 - y_1}{\frac{2}{w^k} - 0}\right)\right| dx$$
$$= \int_{x=0}^{\frac{1}{w^k}} \left|x w^k - y_1 - \frac{w^k x}{2}(y_2 - y_1)\right| dx + \int_{x=\frac{1}{w^k}}^{\frac{2}{w^k}} \left|2 - x w^k - y_1 - \frac{w^k x}{2}(y_2 - y_1)\right| dx$$
$$= \frac{1}{w^k}\int_{z=0}^{1} \left|z - y_1 - \frac{z}{2}(y_2 - y_1)\right| dz + \frac{1}{w^k}\int_{z=1}^{2} \left|2 - z - y_1 - \frac{z}{2}(y_2 - y_1)\right| dz$$
$$= \frac{1}{w^k}\left(-3 + y_1 + y_2 + \frac{2y_2^2}{2 + y_2 - y_1} + \frac{2(-2 + y_1)^2}{2 - y_2 + y_1}\right)$$

The above expression attains its minimum value of $\frac{1}{2w^k}$ at $y_1 = y_2 = \frac{1}{2}$. Putting it together,

$$\|s_{w^k} - g_p\|_1 \ \ge\ \left(\left\lfloor \frac{w^k}{2} \right\rfloor - (p-1)\right) \cdot \frac{1}{2w^k} \ \ge\ \frac{w^k - 2(p-1) - 1}{4w^k} \ =\ \frac{1}{4} - \frac{2p-1}{4w^k}$$

Thus, for any δ > 0,

$$p \le \frac{w^k - 4w^k\delta + 1}{2} \ \Rightarrow\ 2p - 1 \le \left(\frac{1}{4} - \delta\right)4w^k \ \Rightarrow\ \frac{1}{4} - \frac{2p-1}{4w^k} \ge \delta \ \Rightarrow\ \|s_{w^k} - g_p\|_1 \ge \delta.$$

The result now follows from Lemma 2.A.7.


2.3.2 A continuum of hard functions for $\mathbb{R}^n \to \mathbb{R}$ for $n \ge 2$

One measure of complexity of a family of $\mathbb{R}^n \to \mathbb{R}$ "hard" functions represented by ReLU DNNs is the asymptotics of the number of pieces as a function of the dimension $n$, depth $k + 1$ and size $s$ of the ReLU DNNs. More precisely, suppose one has a family $\mathcal{H}$ of functions such that for every $n, k, w \in \mathbb{N}$ the family contains at least one $\mathbb{R}^n \to \mathbb{R}$ function representable by a ReLU DNN with depth at most $k + 1$ and maximum width at most $w$. The following definition formalizes a notion of complexity for such an $\mathcal{H}$.

Definition 6 ($\mathrm{comp}_{\mathcal{F}}(n, k, w)$). The measure $\mathrm{comp}_{\mathcal{F}}(n, k, w)$ is defined as the maximum number of pieces (see Definition 3) of an $\mathbb{R}^n \to \mathbb{R}$ function from $\mathcal{F}$ that can be represented by a ReLU DNN with depth at most $k + 1$ and maximum width at most $w$.

Similar measures have been studied in previous works (Montufar et al., 2014; Pascanu, Montufar, and Bengio, 2013; Raghu et al., 2016). The best known families $\mathcal{F}$ are the ones from Theorem 4 of (Montufar et al., 2014) and a mild generalization of Theorem 1.1 of Telgarsky, 2016b to $k$ layers of ReLU activations with width $w$; these constructions achieve $\big(\lfloor \frac{w}{n} \rfloor\big)^{(k-1)n}\big(\sum_{j=0}^{n} \binom{w}{j}\big)$ and $\mathrm{comp}_{\mathcal{F}}(n, k, s) = O(w^k)$, respectively. At the end of this section we explain the precise sense in which we improve on these numbers. An analysis of this complexity measure is done using integer programming techniques in Serra, Tjandraatmadja, and Ramalingam, 2017.

Definition 7. Let $b^1, \dots, b^m \in \mathbb{R}^n$. The zonotope formed by $b^1, \dots, b^m \in \mathbb{R}^n$ is defined as
$$Z(b^1, \dots, b^m) := \{\lambda_1 b^1 + \dots + \lambda_m b^m : -1 \le \lambda_i \le 1, \ i = 1, \dots, m\}.$$

The set of vertices of $Z(b^1, \dots, b^m)$ will be denoted by $\mathrm{vert}(Z(b^1, \dots, b^m))$. The support function $\gamma_{Z(b^1,\dots,b^m)} : \mathbb{R}^n \to \mathbb{R}$ associated with the zonotope $Z(b^1, \dots, b^m)$ is defined as
$$\gamma_{Z(b^1,\dots,b^m)}(r) = \max_{x \in Z(b^1,\dots,b^m)} \langle r, x \rangle.$$

The following results are well-known in the theory of zonotopes (Ziegler, 1995).

Theorem 2.3.7. The following are all true.

1. $|\mathrm{vert}(Z(b^1, \dots, b^m))| \le \sum_{i=0}^{n-1} \binom{m-1}{i}$. The set of $(b^1, \dots, b^m) \in \mathbb{R}^n \times \dots \times \mathbb{R}^n$ such that this does not hold at equality is a measure 0 set.


FIGURE 2.2: (A) $H_{\frac12,\frac12} \circ N_{\ell_1}$; (B) $H_{\frac12,\frac12} \circ \gamma_{Z(b^1,b^2,b^3,b^4)}$; (C) $H_{\frac12,\frac12,\frac12} \circ \gamma_{Z(b^1,b^2,b^3,b^4)}$. We fix the $a$ vectors for a two hidden layer $\mathbb{R} \to \mathbb{R}$ hard function as $a^1 = a^2 = (\frac12) \in \Delta^1_1$. Left: A specific hard function induced by the $\ell_1$ norm: $\mathrm{ZONOTOPE}^2_{2,2,2}[a^1, a^2, b^1, b^2]$ where $b^1 = (0, 1)$ and $b^2 = (1, 0)$. Note that in this case the function can be seen as a composition of $H_{a^1,a^2}$ with the $\ell_1$-norm $N_{\ell_1}(x) := \|x\|_1 = \gamma_{Z((0,1),(1,0))}$. Middle: A typical hard function $\mathrm{ZONOTOPE}^2_{2,2,4}[a^1, a^2, c^1, c^2, c^3, c^4]$ with generators $c^1 = (\frac14, \frac12)$, $c^2 = (-\frac12, 0)$, $c^3 = (0, -\frac14)$ and $c^4 = (-\frac14, -\frac14)$. Note how increasing the number of zonotope generators makes the function more complex. Right: A harder function from the $\mathrm{ZONOTOPE}^2_{3,2,4}$ family with the same set of generators $c^1, c^2, c^3, c^4$ but one more hidden layer ($k = 3$). Note how increasing the depth makes the function more complex. (For illustrative purposes we plot only the part of the function which lies above zero.)

2. $\gamma_{Z(b^1,\dots,b^m)}(r) = \max_{x \in Z(b^1,\dots,b^m)} \langle r, x \rangle = \max_{x \in \mathrm{vert}(Z(b^1,\dots,b^m))} \langle r, x \rangle$, and $\gamma_{Z(b^1,\dots,b^m)}$ is therefore a piecewise linear function with $|\mathrm{vert}(Z(b^1,\dots,b^m))|$ pieces.

3. $\gamma_{Z(b^1,\dots,b^m)}(r) = |\langle r, b^1 \rangle| + \dots + |\langle r, b^m \rangle|$.

Definition 8 (extremal zonotope set). The set $S(n, m)$ will denote the set of $(b^1, \dots, b^m) \in \mathbb{R}^n \times \dots \times \mathbb{R}^n$ such that $|\mathrm{vert}(Z(b^1, \dots, b^m))| = \sum_{i=0}^{n-1} \binom{m-1}{i}$. $S(n, m)$ is the so-called "extremal zonotope set", which is a subset of $\mathbb{R}^{nm}$ whose complement has zero Lebesgue measure in $\mathbb{R}^{nm}$.

Lemma 2.3.8. Given any $b^1, \dots, b^m \in \mathbb{R}^n$, there exists a 2-layer ReLU DNN with size $2m$ which represents the function $\gamma_{Z(b^1,\dots,b^m)}(r)$.

Proof of Lemma 2.3.8. By Theorem 2.3.7 (part 3), $\gamma_{Z(b^1,\dots,b^m)}(r) = |\langle r, b^1 \rangle| + \dots + |\langle r, b^m \rangle|$. It suffices to observe
$$|\langle r, b^1 \rangle| + \dots + |\langle r, b^m \rangle| = \max\{\langle r, b^1 \rangle, -\langle r, b^1 \rangle\} + \dots + \max\{\langle r, b^m \rangle, -\langle r, b^m \rangle\}.$$
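Lemma 2.3.8's construction is straightforward to verify numerically; in the sketch below (our illustration, with arbitrary random generators) the $2m$-ReLU circuit for $\gamma_Z$ is checked against a brute-force maximum over all $\pm 1$ combinations of the generators.

```python
import numpy as np
from itertools import product

def gamma_relu(r, B):
    """Support function of Z(b^1,...,b^m) via 2m ReLU gates:
    sum_i ( max{<r,b^i>, 0} + max{-<r,b^i>, 0} ) = sum_i |<r,b^i>|."""
    s = B @ r                                    # the inner products <r, b^i>
    return float(np.sum(np.maximum(s, 0.0) + np.maximum(-s, 0.0)))

def gamma_bruteforce(r, B):
    # max_{x in Z} <r, x> is attained at a +/-1 combination of the generators
    return max(float(r @ (np.asarray(lam) @ B))
               for lam in product([-1, 1], repeat=len(B)))

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 2))                  # m = 4 generators b^i in R^2
r = rng.standard_normal(2)
assert np.isclose(gamma_relu(r, B), gamma_bruteforce(r, B))
```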


Proposition 2.3.9. Given any tuple $(b^1, \dots, b^m) \in S(n, m)$ and any point
$$(a^1, \dots, a^k) \in \bigcup_{M>0} \underbrace{\Delta^{w-1}_M \times \Delta^{w-1}_M \times \dots \times \Delta^{w-1}_M}_{k \text{ times}},$$
the function $\mathrm{ZONOTOPE}^n_{k,w,m}[a^1, \dots, a^k, b^1, \dots, b^m] := H_{a^1,\dots,a^k} \circ \gamma_{Z(b^1,\dots,b^m)}$ has $(m-1)^{n-1} w^k$ pieces, and it can be represented by a $k + 2$ layer ReLU DNN with size $2m + wk$.

Proof of Proposition 2.3.9. The fact that $\mathrm{ZONOTOPE}^n_{k,w,m}[a^1, \dots, a^k, b^1, \dots, b^m]$ can be represented by a $k + 2$ layer ReLU DNN with size $2m + wk$ follows from Lemmas 2.3.8 and 2.A.2. The number of pieces follows from the fact that $\gamma_{Z(b^1,\dots,b^m)}$ has $\sum_{i=0}^{n-1} \binom{m-1}{i}$ distinct linear pieces by parts 1 and 2 of Theorem 2.3.7, and $H_{a^1,\dots,a^k}$ has $w^k$ pieces by Lemma 2.3.5.

Finally, we are ready to state the main result of this section.

Theorem 2.3.10. For every tuple of natural numbers $n, k, m \ge 1$ and $w \ge 2$, there exists a family of $\mathbb{R}^n \to \mathbb{R}$ functions, which we call $\mathrm{ZONOTOPE}^n_{k,w,m}$, with the following properties:

(i) Every $f \in \mathrm{ZONOTOPE}^n_{k,w,m}$ is representable by a ReLU DNN of depth $k + 2$ and size $2m + wk$, and has $\left(\sum_{i=0}^{n-1} \binom{m-1}{i}\right) w^k$ pieces.

(ii) Consider any $f \in \mathrm{ZONOTOPE}^n_{k,w,m}$. If $f$ is represented by a $(k'+1)$-layer DNN for any $k' \le k$, then this $(k'+1)$-layer DNN has size at least
$$\max\left\{ \frac{1}{2}\left(k' w^{\frac{k}{k'n}}\right)(m-1)^{\frac{1-\frac{1}{n}}{k'}} - 1, \ \frac{w^{\frac{k}{k'}}}{n^{1/k'}}\, k' \right\}.$$

(iii) The family $\mathrm{ZONOTOPE}^n_{k,w,m}$ is in one-to-one correspondence with
$$S(n, m) \times \bigcup_{M>0} \underbrace{\Delta^{w-1}_M \times \Delta^{w-1}_M \times \dots \times \Delta^{w-1}_M}_{k \text{ times}}.$$

Proof of Theorem 2.3.10. Follows from Proposition 2.3.9 (and invoking Lemma 2.A.7 to get the size lower bounds).

Comparison to the results in (Montufar et al., 2014)

Firstly, we note that the construction in Montufar et al., 2014 requires all the hidden layers to have width at least as big as the input dimensionality $n$. In contrast, we do not impose such restrictions, and the network size in our construction is independent of the input dimensionality. Thus our result probes networks with bottleneck architectures whose complexity can't be seen from their result.


Secondly, in terms of our complexity measure, there seem to be regimes where our bound does better. One such regime, for example, is when $n \le w < 2n$ and $k \in \Omega(\frac{n}{\log(n)})$, obtained by setting $m < n$ in our construction.

Thirdly, it is not clear to us whether the construction in Montufar et al., 2014 gives a smoothly parameterized family of functions other than by introducing small perturbations of the construction in their paper. In contrast, we have a smoothly parameterized family which is in one-to-one correspondence with a well-understood manifold like the higher-dimensional torus.

2.4 Training 2-layer $\mathbb{R}^n \to \mathbb{R}$ ReLU DNNs to global optimality

In this section we consider the following empirical risk minimization problem. Given $D$ data points $(x_i, y_i) \in \mathbb{R}^n \times \mathbb{R}$, $i = 1, \dots, D$, find the function $f$ represented by 2-layer $\mathbb{R}^n \to \mathbb{R}$ ReLU DNNs of width $w$ that minimizes the following optimization problem,
$$\min_{f \in \mathcal{F}_{\{n,w,1\}}} \frac{1}{D}\sum_{i=1}^{D} \ell(f(x_i), y_i) \ \equiv\ \min_{T_1 \in \mathcal{A}^w_n,\, T_2 \in \mathcal{L}^1_w} \frac{1}{D}\sum_{i=1}^{D} \ell\big(T_2(\sigma(T_1(x_i))), y_i\big) \quad (2.3)$$
where $\ell : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ is a convex loss function (common loss functions are the squared loss, $\ell(y, y') = (y - y')^2$, and the hinge loss function given by $\ell(y, y') = \max\{0, 1 - yy'\}$). Our main result of this section gives an algorithm to solve the above empirical risk minimization problem to global optimality.

Theorem 2.4.1. There exists an algorithm to find a global optimum of Problem (2.3) in time $O(2^w D^{nw}\, \mathrm{poly}(D, n, w))$. Note that this running time is polynomial in the data size $D$ for fixed $n, w$.

Proof Sketch: Before giving the full proof of Theorem 2.4.1 below, we first provide a sketch of it. When the empirical risk minimization problem is viewed as an optimization problem in the space of weights of the ReLU DNN, it is a nonconvex, quadratic problem. However, one can instead search over the space of functions representable by 2-layer DNNs by writing them in a form similar to (2.1). This breaks the problem into two parts: a combinatorial search and then a convex problem that is essentially linear regression with linear inequality constraints. This enables us to guarantee global optimality.

Let $T_1(x) = Ax + b$ and $T_2(y) = a' \cdot y$ for $A \in \mathbb{R}^{w \times n}$ and $b, a' \in \mathbb{R}^w$. If we denote the $i$-th row of the matrix $A$ by $a^i$, and write $b_i, a'_i$ to denote the $i$-th coordinates of the vectors $b, a'$ respectively, then due to the homogeneity of ReLU gates the network output can be represented as
$$f(x) = \sum_{i=1}^{w} a'_i \max\{0, a^i \cdot x + b_i\} = \sum_{i=1}^{w} s_i \max\{0, \tilde{a}^i \cdot x + \tilde{b}_i\},$$
where $\tilde{a}^i \in \mathbb{R}^n$, $\tilde{b}_i \in \mathbb{R}$ and $s_i \in \{-1, +1\}$ for all $i = 1, \dots, w$.


Algorithm 1 Empirical Risk Minimization
1: function ERM($\mathcal{D}$) ▷ Where $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{D} \subset \mathbb{R}^n \times \mathbb{R}$
2: $S = \{+1, -1\}^w$ ▷ All possible instantiations of top layer weights
3: $\mathcal{P}^i = \{(P^i_+, P^i_-)\}$, $i = 1, \dots, w$ ▷ All possible partitions of data into two parts
4: $\mathcal{P} = \mathcal{P}^1 \times \mathcal{P}^2 \times \cdots \times \mathcal{P}^w$
5: count = 1 ▷ Counter
6: for $s \in S$ do
7:   for $\{(P^i_+, P^i_-)\}_{i=1}^{w} \in \mathcal{P}$ do
8:     loss(count) = $\min_{\tilde{a}, \tilde{b}} \sum_{j=1}^{D} \sum_{i : j \in P^i_+} \ell(y_j, s_i(\tilde{a}^i \cdot x_j + \tilde{b}_i))$ subject to $\tilde{a}^i \cdot x_j + \tilde{b}_i \le 0 \ \forall j \in P^i_-$ and $\tilde{a}^i \cdot x_j + \tilde{b}_i \ge 0 \ \forall j \in P^i_+$
9:     count++
10:   end for
11: end for
12: OPT = argmin loss(count)
13: return the $\{\tilde{a}\}, \{\tilde{b}\}, s$ corresponding to OPT's iterate
14: end function


For any hidden node $i \in \{1, \dots, w\}$, the pair $(\tilde{a}^i, \tilde{b}_i)$ induces a partition $\mathcal{P}^i := (P^i_+, P^i_-)$ on the dataset, given by $P^i_- = \{j : \tilde{a}^i \cdot x_j + \tilde{b}_i \le 0\}$ and $P^i_+ = \{1, \dots, D\} \setminus P^i_-$. Algorithm 1 proceeds by generating all combinations of the partitions $\mathcal{P}^i$ as well as the top layer weights $s \in \{+1, -1\}^w$, and minimizing the loss $\sum_{j=1}^{D} \sum_{i : j \in P^i_+} \ell(s_i(\tilde{a}^i \cdot x_j + \tilde{b}_i), y_j)$ subject to the constraints $\tilde{a}^i \cdot x_j + \tilde{b}_i \le 0 \ \forall j \in P^i_-$ and $\tilde{a}^i \cdot x_j + \tilde{b}_i \ge 0 \ \forall j \in P^i_+$, which are imposed for all $i = 1, \dots, w$; this is a convex program.
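As a toy, runnable illustration of the enumeration that Algorithm 1 performs (a sketch, not the full algorithm: for readability we use the squared loss, enumerate all $2^D$ active/inactive patterns per hidden unit instead of only the hyperplane-induced partitions, and hand the constrained convex subproblem to scipy's generic SLSQP solver), consider:

```python
import itertools
import numpy as np
from scipy.optimize import minimize

def erm_2layer_toy(X, y, w):
    """Globally minimize the squared empirical risk of a width-w 2-layer
    ReLU net on tiny data, by exhausting sign vectors s and per-unit
    active sets P_+ and solving a constrained convex problem for each."""
    D, n = X.shape
    best = (np.inf, None)
    for s in itertools.product([-1, 1], repeat=w):                  # top layer signs
        for parts in itertools.product(itertools.product([0, 1], repeat=D), repeat=w):
            def loss(theta, s=s, parts=parts):
                A, b = theta[:w * n].reshape(w, n), theta[w * n:]
                pred = np.zeros(D)
                for i in range(w):                                  # unit i is linear on P_+^i
                    pred += s[i] * np.array(parts[i]) * (X @ A[i] + b[i])
                return np.sum((pred - y) ** 2)
            cons = [{'type': 'ineq',
                     'fun': lambda t, i=i, j=j, sg=(1.0 if parts[i][j] else -1.0):
                            sg * (X[j] @ t[:w * n].reshape(w, n)[i] + t[w * n + i])}
                    for i in range(w) for j in range(D)]            # sign pattern constraints
            res = minimize(loss, np.zeros(w * n + w), method='SLSQP', constraints=cons)
            if res.success and res.fun < best[0]:
                best = (res.fun, res.x)
    return best

X = np.array([[-1.0], [0.0], [1.0], [2.0]])
y = np.array([0.0, 0.0, 1.0, 2.0])        # exactly ReLU(x), so the optimal loss is ~0
print(erm_2layer_toy(X, y, w=1)[0])
```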

Proof of Theorem 2.4.1. Let $\ell : \mathbb{R} \to \mathbb{R}$ be any convex loss function, and let $(x_1, y_1), \dots, (x_D, y_D) \in \mathbb{R}^n \times \mathbb{R}$ be the given $D$ data points. As stated in (2.3), the problem requires us to find an affine transformation $T_1 : \mathbb{R}^n \to \mathbb{R}^w$ and a linear transformation $T_2 : \mathbb{R}^w \to \mathbb{R}$, so as to minimize the empirical loss as stated in (2.3). Note that $T_1$ is given by a matrix $A \in \mathbb{R}^{w \times n}$ and a vector $b \in \mathbb{R}^w$ so that $T_1(x) = Ax + b$ for all $x \in \mathbb{R}^n$. Similarly, $T_2$ can be represented by a vector $a' \in \mathbb{R}^w$ such that $T_2(y) = a' \cdot y$ for all $y \in \mathbb{R}^w$. If we denote the $i$-th row of the matrix $A$ by $a^i$, and write $b_i, a'_i$ to denote the $i$-th coordinates of the vectors $b, a'$ respectively, we can write the function represented by this network as

$$f(x) = \sum_{i=1}^{w} a'_i \max\{0, a^i \cdot x + b_i\} = \sum_{i=1}^{w} \mathrm{sgn}(a'_i) \max\{0, (|a'_i| a^i) \cdot x + |a'_i| b_i\}.$$


In other words, the family of functions over which we are searching is of the form
$$f(x) = \sum_{i=1}^{w} s_i \max\{0, \tilde{a}^i \cdot x + \tilde{b}_i\} \quad (2.4)$$
where $\tilde{a}^i \in \mathbb{R}^n$, $\tilde{b}_i \in \mathbb{R}$ and $s_i \in \{-1, +1\}$ for all $i = 1, \dots, w$.

We now make the following observation. For a given data point $(x_j, y_j)$, if $\tilde{a}^i \cdot x_j + \tilde{b}_i \le 0$, then the $i$-th term of (2.4) does not contribute to the loss function for this data point $(x_j, y_j)$. Thus, for every data point $(x_j, y_j)$, there exists a set $S_j \subseteq \{1, \dots, w\}$ such that $f(x_j) = \sum_{i \in S_j} s_i(\tilde{a}^i \cdot x_j + \tilde{b}_i)$. In particular, if we are given the set $S_j$ for $(x_j, y_j)$, then the expression on the right hand side of (2.4) reduces to a linear function of $\tilde{a}^i, \tilde{b}_i$. For any fixed $i \in \{1, \dots, w\}$, these sets $S_j$ induce a partition of the data set into two parts. In particular, we define $P^i_+ := \{j : i \in S_j\}$ and $P^i_- := \{1, \dots, D\} \setminus P^i_+$. Observe now that this partition is also induced by the hyperplane given by $\tilde{a}^i, \tilde{b}_i$: $P^i_+ = \{j : \tilde{a}^i \cdot x_j + \tilde{b}_i > 0\}$ and $P^i_- = \{j : \tilde{a}^i \cdot x_j + \tilde{b}_i \le 0\}$. Our strategy will be to guess the partitions $P^i_+, P^i_-$ for each $i = 1, \dots, w$, and then do linear regression with the constraint that the regression's decision variables $\tilde{a}^i, \tilde{b}_i$ induce the guessed partition.

More formally, the algorithm does the following. For each $i = 1, \dots, w$, the algorithm guesses a partition of the data set $(x_j, y_j)$, $j = 1, \dots, D$, by a hyperplane. Let us label the partitions as follows: $(P^i_+, P^i_-)$, $i = 1, \dots, w$. So, for each $i = 1, \dots, w$, $P^i_+ \cup P^i_- = \{1, \dots, D\}$, $P^i_+$ and $P^i_-$ are disjoint, and there exists a vector $c \in \mathbb{R}^n$ and a real number $\delta$ such that $P^i_- = \{j : c \cdot x_j + \delta \le 0\}$ and $P^i_+ = \{j : c \cdot x_j + \delta > 0\}$. Further, for each $i = 1, \dots, w$ the algorithm selects a vector $s \in \{+1, -1\}^w$.

For a fixed selection of partitions $(P^i_+, P^i_-)$, $i = 1, \dots, w$, and a vector $s \in \{+1, -1\}^w$, the algorithm solves the following convex optimization problem with decision variables $\tilde{a}^i \in \mathbb{R}^n$, $\tilde{b}_i \in \mathbb{R}$ for $i = 1, \dots, w$ (thus, we have a total of $(n+1) \cdot w$ decision variables). The feasible region of the optimization is given by the constraints
$$\tilde{a}^i \cdot x_j + \tilde{b}_i \le 0 \quad \forall j \in P^i_-, \qquad \tilde{a}^i \cdot x_j + \tilde{b}_i \ge 0 \quad \forall j \in P^i_+ \quad (2.5)$$
which are imposed for all $i = 1, \dots, w$. Thus, we have a total of $D \cdot w$ constraints. Subject to these constraints we minimize the objective $\sum_{j=1}^{D} \sum_{i : j \in P^i_+} \ell(s_i(\tilde{a}^i \cdot x_j + \tilde{b}_i), y_j)$. Assuming the loss function $\ell$ is a convex function in the first argument, the above objective is a convex function. Thus, we have to minimize a convex objective subject to the linear inequality constraints from (2.5).

We finally have to count how many possible partitions $(P^i_+, P^i_-)$ and vectors $s$ the algorithm has to search through. It is well-known (Matousek, 2002) that the total number of possible hyperplane partitions of a set of size $D$ in $\mathbb{R}^n$ is at most $2\binom{D}{n} \le D^n$ whenever $n \ge 2$. Thus with a guess for each $i = 1, \dots, w$, we have a total of at most $D^{nw}$ partitions. There are $2^w$ vectors $s$ in $\{-1, +1\}^w$. This gives us a total of $2^w D^{nw}$ guesses for the partitions $(P^i_+, P^i_-)$ and vectors $s$. For each such guess, we have a convex optimization problem with $(n+1) \cdot w$ decision variables and $D \cdot w$ constraints, which can be solved in time $\mathrm{poly}(D, n, w)$. Putting everything together, we have the running time claimed in the statement.

The above argument holds only for $n \ge 2$, since we used the inequality $2\binom{D}{n} \le D^n$ which only holds for $n \ge 2$. For $n = 1$, a similar algorithm can be designed, but one which uses the characterization achieved in Theorem 2.2.2.

Let $\ell : \mathbb{R} \to \mathbb{R}$ be any convex loss function, and let $(x_1, y_1), \dots, (x_D, y_D) \in \mathbb{R}^2$ be the given $D$ data points. Using Theorem 2.2.2, to solve problem (2.3) it suffices to find an $\mathbb{R} \to \mathbb{R}$ piecewise linear function $f$ with $w$ pieces that minimizes the total loss. In other words, the optimization problem (2.3) is equivalent to the problem

$$\min\left\{\sum_{i=1}^{D} \ell(f(x_i), y_i) : f \text{ is piecewise linear with } w \text{ pieces}\right\}. \quad (2.6)$$

We now use the observation that fitting piecewise linear functions to minimize loss is just a step away from linear regression, which is a special case where the function is constrained to have exactly one affine linear piece. Our algorithm will first guess the optimal partition of the data points such that all points in the same class of the partition correspond to the same affine piece of $f$, and then do linear regression in each class of the partition. Alternatively, one can think of this as guessing the intervals $(x_i, x_{i+1})$ of data points where the $w - 1$ breakpoints of the piecewise linear function will lie, and then doing linear regression between the breakpoints.

More formally, we parametrize piecewise linear functions with $w$ pieces by the $w$ slope-intercept values $(a_1, b_1), (a_2, b_2), \dots, (a_w, b_w)$ of the $w$ different pieces. This means that between breakpoints $j$ and $j+1$, $1 \le j \le w - 2$, the function is given by $f(x) = a_{j+1} x + b_{j+1}$, and the first and last pieces are $a_1 x + b_1$ and $a_w x + b_w$, respectively.

Define $\mathcal{I}$ to be the set of all $(w-1)$-tuples $(i_1, \dots, i_{w-1})$ of natural numbers such that $1 \le i_1 \le \dots \le i_{w-1} \le D$. Given a fixed tuple $I = (i_1, \dots, i_{w-1}) \in \mathcal{I}$, we wish to search through all piecewise linear functions whose breakpoints, in order, appear in the intervals $(x_{i_1}, x_{i_1+1}), (x_{i_2}, x_{i_2+1}), \dots, (x_{i_{w-1}}, x_{i_{w-1}+1})$. Define also $\mathcal{S} = \{-1, 1\}^{w-1}$. Any $S \in \mathcal{S}$ will have the following interpretation: if $S_j = 1$ then $a_j \le a_{j+1}$, and if $S_j = -1$ then $a_j \ge a_{j+1}$. Now for every $I \in \mathcal{I}$ and $S \in \mathcal{S}$, requiring a

piecewise linear function that respects the conditions imposed by $I$ and $S$ is easily seen to be equivalent to imposing the following linear inequalities on the parameters $(a_1, b_1), (a_2, b_2), \dots, (a_w, b_w)$:

$$S_j \cdot \big(b_{j+1} - b_j - (a_j - a_{j+1})x_{i_j}\big) \ge 0$$
$$S_j \cdot \big(b_{j+1} - b_j - (a_j - a_{j+1})x_{i_j+1}\big) \le 0 \quad (2.7)$$
$$S_j \cdot (a_{j+1} - a_j) \ge 0$$
Let the set of piecewise linear functions whose breakpoints satisfy the above be denoted by $\mathrm{PWL}^1_{I,S}$ for $I \in \mathcal{I}$, $S \in \mathcal{S}$.

Given a particular $I \in \mathcal{I}$, we define
$$D_1 := \{x_i : i \le i_1\}, \qquad D_j := \{x_i : i_{j-1} < i \le i_j\}, \ j = 2, \dots, w-1, \qquad D_w := \{x_i : i > i_{w-1}\}.$$

Observe that

$$\min\Big\{\sum_{i=1}^{D} \ell(f(x_i) - y_i) : f \in \mathrm{PWL}^1_{I,S}\Big\} = \min\Big\{\sum_{j=1}^{w} \sum_{i \in D_j} \ell(a_j \cdot x_i + b_j - y_i) : (a_j, b_j) \text{ satisfy } (2.7)\Big\} \quad (2.8)$$

The right hand side of the above equation is the problem of minimizing a convex objective subject to linear constraints. Now, to solve (2.6), we need to simply solve the problem (2.8) for all $I \in \mathcal{I}$, $S \in \mathcal{S}$ and pick the minimum. Since $|\mathcal{I}| = \binom{D}{w} = O(D^w)$ and $|\mathcal{S}| = 2^{w-1}$, we need to solve $O(2^w \cdot D^w)$ convex optimization problems, each taking time $O(\mathrm{poly}(D))$. Therefore, the total running time is $O((2D)^w \mathrm{poly}(D))$.

2.4.1 Discussion on the complexity of solving ERM on deep-nets

The running time of the algorithm (Algorithm 1) that we gave above to find the exact global minima of a two layer ReLU-DNN is exponential in the input dimension $n$ and the number of hidden nodes $w$. The exponential dependence on $n$ cannot be removed unless P = NP; see Shalev-Shwartz and Ben-David, 2014; Blum and Rivest, 1992; DasGupta, Siegelmann, and Sontag, 1995; Dey, Wang, and Xie, 2018. However, we are not aware of any complexity results which would rule out the possibility of an algorithm which trains to global optimality in time that is polynomial in the data size and/or the number of hidden nodes, assuming that the input dimension is a fixed constant. Resolving this dependence on network size would be another step towards clarifying the theoretical complexity of training ReLU DNNs and is, in our opinion, a good open question for future research. Thus our training result of solving the ERM on depth 2 nets in time polynomial in the number of data points is a step towards resolving this gap in the complexity literature.

A related result for improperly learning ReLUs has been recently obtained in Goel et al., 2016. In contrast, our algorithm returns a ReLU DNN from the class being learned. Another difference is that their result considers the notion of reliable learning as opposed to the empirical risk minimization objective considered in (2.3), for which we give a quick definition below.

Definition 9. Suppose distribution $\mathcal{D}$ is supported on $X \times [0, 1]$. For $[0,1] \subseteq Y'$, let $h : X \to Y'$ be some function and let $\ell : Y' \times [0,1] \to \mathbb{R}^+$ be a loss function. Then we define two notions of expected loss,

$$L_{=0}(h, \mathcal{D}) = \mathbb{P}_{(x,y)\sim\mathcal{D}}[h(x) \ne 0 \text{ and } y = 0]$$

$$L_{>0}(h, \mathcal{D}) = \mathbb{E}_{(x,y)\sim\mathcal{D}}[\ell(h(x), y) \cdot \mathbf{1}_{y>0}]$$

We say that a concept class $\mathcal{C} \subseteq [0,1]^X$ is "reliably agnostically learnable with respect to a loss function $\ell : Y' \times [0,1] \to \mathbb{R}^+$" (where $[0,1] \subseteq Y'$) if for every $\epsilon, \delta > 0$ there exists a learning algorithm which satisfies the following:

For all distributions $\mathcal{D}$ over $X \times [0,1]$, given access to examples drawn from $\mathcal{D}$, the algorithm outputs a hypothesis $h : X \to Y'$ such that,

$$L_{=0}(h, \mathcal{D}) \le \epsilon \quad \text{and} \quad L_{>0}(h, \mathcal{D}) \le \epsilon + \min_{c \in \mathcal{C}'(\mathcal{D})} L_{>0}(c)$$
where,

$$\mathcal{C}'(\mathcal{D}) = \{c \in \mathcal{C} \mid L_{=0}(c, \mathcal{D}) = 0\}$$

Further, if $X \subseteq \mathbb{R}^n$ and $s$ is a parameter that captures the representation complexity (i.e. description length) of the concepts $c \in \mathcal{C}$, then we say that $\mathcal{C}$ is "efficiently reliably agnostically learnable to error $\epsilon$" if the running time of the above algorithm is $\mathrm{poly}(n, s, \frac{1}{\delta})$.

Asking for $L_{=0}(h, \mathcal{D})$ to be low captures mathematically the idea of trying to minimize the rate of "false positives".

Perhaps a big breakthrough would be to get optimal training algorithms for DNNs with two or more hidden layers and this seems like a substantially harder nut to crack. We end this discussion by

pointing out some recent progress towards that which has been made in Boob, Dey, and Lan, 2018.

2.5 Understanding neural functions over Boolean inputs

The classic paper Maass, 1997, established complexity results for the entire class of functions represented by circuits where the gates can come from a very general family while the inputs are restricted to discrete domains. This is complemented by papers that study a very specific family of gates such as the sigmoid gate or the LTF gate ($\mathbb{R} \ni y \mapsto \mathbb{1}_{y \geq 0}$) (Impagliazzo, Paturi, and Saks, 1997), (Siu, Roychowdhury, and Kailath, 1994; Sherstov, 2007; Krause and Pudlák, 1994), (Buhrman, Vereshchagin, and Wolf, 2007; Sherstov, 2009; Razborov and Sherstov, 2010; Bun and Thaler, 2016). Many associated results can also be found in reviews like Lee, Shraibman, et al., 2009 and Razborov, 1992.

Recent circuit complexity results in Kane and Williams, 2016, Tamaki, 2016, Chen, Santhanam, and

Srinivasan, 2016, Kabanets, Kane, and Lu, 2017 stand out as significant improvements over known lower (and upper) bounds on circuit complexity with threshold gates. The results of Maass, 1997 also show that very general families of neural networks can be converted into circuits with only LTF gates with at most a constant factor blow up in depth and polynomial blow up in size of the circuits.

Some of the prior results which apply to general gates, such as the ones in Maass, 1997, also apply to ReLU gates, because those results apply to gates that compute a piecewise polynomial function

(ReLU is a piecewise linear function with only two pieces). However, as witnessed by results on LTF gates, one can usually make much stronger claims about specific classes of gates. The main focus of this work is to study circuits computing Boolean functions mapping $\{-1,1\}^m \to \{-1,1\}$ which use ReLU gates in their intermediate layers, and have an LTF gate at the output node (to ensure that the output is in $\{-1,1\}$). We remark that using an LTF gate at the output node while allowing more general analog gates in the intermediate nodes is a standard practice when studying the Boolean complexity of analog gates (see, for example, Maass, 1997).

Other than Williams, 2018, we are not aware of an analysis of lower bounds for ReLU circuits when applied to only Boolean inputs. In contrast, there has been recent work on the analysis of such circuits when viewed as a function from $\mathbb{R}^n$ to $\mathbb{R}$ (i.e., allowing real inputs and output). From Eldan and Shamir, 2016 and Daniely, 2017 (with restrictions on the domain and the weights) we know of (super-)exponential lower bounds on the size of Sum-of-ReLU circuits for certain easy Sum-of-ReLU-of-ReLU functions. Depth v/s size tradeoffs for such circuits have recently also been studied in Telgarsky, 2016a; Hanin, 2017; Liang and Srikant, 2016; Yarotsky, 2016; Safran and Shamir, 2016 and in this chapter so far. But to the best of our knowledge no lower bounds scaling exponentially with


the dimension are known for analog deep neural networks of depths more than 2.

In what follows, the depth of a circuit will be the length of the longest path from the output node

to an input variable, and the size of a circuit will be the total number of gates in the circuit. We will

also use the notation Sum-of-ReLU to refer to circuits whose inputs feed into a single layer of ReLU

gates, whose outputs are combined into a weighted sum to give the final output. Similarly, Sum-of-

ReLU-of-ReLU denotes the circuit with depth 3, where the output node is a simple weighted sum,

and the intermediate gates are all ReLU gates in the two “hidden” layers. We analogously define

Sum-of-LTF, LTF-of-LTF, LTF-of-ReLU, LTF-of-LTF-of-LTF, LTF-of-ReLU-of-ReLU and so on. We will

also use the notation LTF-of-(ReLU)$^k$ for a circuit of the form LTF-of-ReLU-of-ReLU-$\ldots$-ReLU with

$k \geq 1$ levels of ReLU gates.

2.5.1 Statement and discussion of our results over Boolean inputs

Boolean v/s real inputs. We begin our study with the following observation which shows that ReLU

circuits have markedly different behaviour when the inputs are restricted to be Boolean, as opposed

to arbitrary real inputs. Since AND and OR gates can both be implemented by ReLU gates, it follows

that any Boolean function can be implemented by a ReLU-of-ReLU circuit. In fact, it is not hard to

show something slightly stronger:

Lemma 2.5.1. Any function $f : \{-1,1\}^n \to \mathbb{R}$ can be implemented by a Sum-of-ReLU circuit using at most $\min\left\{ 2^n, \sum_{S : \hat{f}(S) \neq 0} |S| \right\}$ ReLU gates, where $\hat{f}(S)$ denotes the Fourier coefficient of $f$ for the set $S \subseteq \{1, \ldots, n\}$.

The Lemma follows by observing that the indicator function of each vertex of the Boolean hypercube $\{-1,1\}^n$ can be implemented by a single ReLU gate, and the parity function on $k$ variables can be implemented by $k$ ReLU gates (see Appendix 2.D). Thus, if one does not restrict the size of the circuit,

then Sum-of-ReLU circuits can represent any pseudo-Boolean function. In contrast, we will now

show that if one allows real inputs, then there exist functions with just 2 inputs (i.e., n = 2) which cannot be represented by any Sum-of-ReLU circuit, no matter how large.

Proposition 2.5.2. The function $\max\{0, x_1, x_2\}$ cannot be computed by any Sum-of-ReLU circuit, no matter how many ReLU gates are used. It can be computed by a Sum-of-ReLU-of-ReLU circuit.

The first part of the above proposition (the impossibility result) is proved in Appendix 2.B. The second part follows from Lemma 2.2.1, which stated that any $\mathbb{R}^n \to \mathbb{R}$ function that can be implemented by a circuit of ReLU gates can always be implemented with at most $\lceil \log(n+1) \rceil$ layers of ReLU gates (with a weighted sum to give the final output).

Restricting to Boolean inputs. From this point on, we will focus entirely on the situation where the inputs to the circuits are restricted to $\{-1,1\}$. One motivation behind our results is the desire to understand the strength of ReLU gates vis-a-vis LTF gates. It is not hard to see that any circuit with LTF gates can be simulated by a circuit with ReLU gates with at most a constant blow-up in size (because a single LTF gate can be simulated by 2 ReLU gates when the inputs are a discrete set – see Appendix 2.C). The question is whether ReLU gates can do significantly better than LTF gates in terms of depth and/or size.

A quick observation is that Sum-of-ReLU circuits can be linearly (in the dimension n) smaller than

Sum-of-LTF circuits. More precisely,

Proposition 2.5.3. The function $f : \{-1,1\}^n \to \mathbb{R}$ given by $f(x) = \sum_{i=1}^{n} 2^i \left( \frac{1+x_i}{2} \right)$ can be implemented by a Sum-of-ReLU circuit with 2 ReLU gates, and any Sum-of-LTF that implements $f$ needs $\Omega(n)$ gates.

The above result follows from the following two facts: 1) any linear function is implementable by 2 ReLU gates, and 2) any Sum-of-LTF circuit with $w$ LTF gates gives a piecewise constant function that takes at most $2^w$ different values. Since $f$ takes $2^n$ different values (it evaluates every vertex of the Boolean hypercube to the corresponding natural number expressed in binary), we need $w \geq n$ gates.
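The first fact above is just the identity $t = \max\{0, t\} - \max\{0, -t\}$ applied to the affine form of $f$. A quick numerical check of this 2-ReLU implementation (a hypothetical sketch, not code from any referenced paper):

import numpy as np
from itertools import product

def two_relu_f(x):
    # f(x) = sum_i 2^i (1+x_i)/2 is affine: <c, x> + d with c_i = 2^(i-1)
    # and d = sum_i 2^(i-1); any affine t equals ReLU(t) - ReLU(-t).
    n = len(x)
    c = np.array([2.0 ** i for i in range(n)])   # c_i = 2^(i-1) for i = 1..n
    t = c @ x + c.sum()
    return max(0.0, t) - max(0.0, -t)

n = 4
for x in product([-1, 1], repeat=n):
    expect = sum(2 ** (i + 1) * (1 + xi) / 2 for i, xi in enumerate(x))
    assert two_relu_f(np.array(x, dtype=float)) == expect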

In the context of these preliminary results, we now state our main contributions. For the next result we recall the definition of the Andreev function (Andreev, 1987) which has previously been used many times to prove computational lower bounds (Paterson and Zwick, 1993; Impagliazzo and Naor,

1988; Impagliazzo, Meka, and Zuckerman, 2012).

Definition 10 (Andreev's function). Andreev's function is the following mapping,

$$A_n : \{0,1\}^{\lfloor \frac{n}{2} \rfloor} \times \{0,1\}^{\lfloor \log(\frac{n}{2}) \rfloor \times \left\lfloor \frac{n/2}{\lfloor \log(n/2) \rfloor} \right\rfloor} \longrightarrow \{0,1\}$$

$$(x, [a_{ij}]) \longmapsto x_{\mathrm{bin}(z([a_{ij}]))}$$

where $z([a_{ij}]) = \left\{ \left( \sum_{j=1}^{\lfloor \frac{n/2}{\lfloor \log(n/2) \rfloor} \rfloor} a_{ij} \right) \bmod 2 \right\}_{i = 1, 2, \ldots, \lfloor \log(\frac{n}{2}) \rfloor}$ is the binary string constructed by noting down the odd/even parity of each of the row sums in the matrix $[a_{ij}]$ and "bin" is the function that gives the decimal number that can be represented by its input bit string.
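In words: the matrix half of the input computes a $\lfloor \log(\frac{n}{2}) \rfloor$ bit address (one parity bit per row), and that address selects a bit of $x$. A small Python rendering of our reading of Definition 10 (illustrative names, assuming $n$ a power of 2 for simplicity):

import math
import numpy as np

def andreev(x, a):
    """x: array of floor(n/2) bits; a: 0/1 matrix with floor(log(n/2)) rows.
    Each row's parity gives one bit of the address; that address indexes x."""
    parities = a.sum(axis=1) % 2                    # z([a_ij]): one bit per row
    address = int("".join(map(str, parities)), 2)   # "bin": bit string -> number
    return x[address]

n = 32
rows = int(math.floor(math.log2(n // 2)))   # floor(log(n/2)) = 4
cols = (n // 2) // rows                     # floor((n/2)/log(n/2)) = 4
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=n // 2)
a = rng.integers(0, 2, size=(rows, cols))
print(andreev(x, a))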

Kane and Williams, 2016 have recently established the first super linear lower bounds for approxi- mating the Andreev function using LTF-of-LTF circuits. In the following theorem we show that their techniques can be adapted to also establish an almost linear lower bound on the size of LTF-of-ReLU circuits approximating this Andreev function with no restriction on the weights w, b for each gate.

Theorem 2.5.4. For any $\delta \in (0, \frac{1}{2})$, there exists $N(\delta) \in \mathbb{N}$ such that for all $n \geq N(\delta)$ and $\epsilon > \sqrt{\frac{2 \log^{1-2\delta}(\frac{n}{2})}{n}}$, any LTF-of-ReLU circuit on $n$ bits that matches the Andreev function on $n$ bits for at least $\frac{1}{2} + \epsilon$ fraction of the inputs has size $\Omega\left( \epsilon^{2(1-\delta)}\, n^{1-\delta} \right)$.

It is well known that proving lower bounds without restrictions on the weights is much more chal- lenging even in the context of LTF circuits. In fact, the recent results in Kane and Williams, 2016 are the first superlinear lower bounds for LTF circuits with no restrictions on the weights. With restric- tions on some or all the weights, e.g., assuming poly(n) bounds on the weights (typically termed the

"small weight assumption") in certain layers, exponential lower bounds have been established for

LTF circuits (Hajnal et al., 1987; Impagliazzo, Paturi, and Saks, 1997; Sherstov, 2009; Sherstov, 2011).

Our next results are of this flavor: under certain kinds of weight restrictions, we prove exponential

size lower bounds on LTF-of-(ReLU)$^{d-1}$ circuits. We emphasize that our weight restrictions are assumed only on the bottom layer (closest to the input). The other layers can have gates with unbounded

weights. Nevertheless, our weight restrictions are somewhat unconventional.

Definition 11 (The polyhedral cones $P_{m,\sigma}$). Let $m \in \mathbb{N}$ and $\sigma$ be any permutation of $\{1, \ldots, 2^m\}$. Let us also consider an arbitrary sequencing $x^1, \ldots, x^{2^m}$ of the vertices of the hypercube $\{-1,1\}^m$. Define the following polyhedral cone,

$$P_{m,\sigma} := \left\{ a \in \mathbb{R}^m : \langle a, x^{\sigma(1)} \rangle \leq \langle a, x^{\sigma(2)} \rangle \leq \ldots \leq \langle a, x^{\sigma(2^m)} \rangle \right\}.$$

In words, $P_{m,\sigma}$ is the set of all linear objectives that order the vertices of the $m$-dimensional hypercube in the order specified by $\sigma$.

Definition 12 (Our weight restriction condition). Below, we shall be considering circuits on $2m$ inputs which come partitioned into two blocks $(x, y)$ so that $x, y \in \{-1,1\}^m$. The weight restriction we impose is that there exist permutations $\sigma_1$ and $\sigma_2$ of $\{1, \ldots, 2^m\}$ such that each ReLU gate in the bottom layer, mapping $(x, y) \mapsto \max\{0, b + \langle w_1, x \rangle + \langle w_2, y \rangle\}$ for some bias value $b$ and weight vectors $w_1$ and $w_2$, satisfies the following two conditions: (1) $w_i \in P_{m,\sigma_i}$ for $i = 1, 2$ (see Definition 11) and (2) all weights are integers with magnitude bounded by some $W > 0$.

We emphasize the existence of a single $\sigma_1$ defining a single polyhedral cone $P_{m,\sigma_1}$ which contains all the weight vectors corresponding to $x$ in the bottom most layer of the net (and similarly for all the weights corresponding to $y$). But the two cones, one for $x$ and one for $y$, are allowed to be different.

Remark. One can see that $\mathbb{R}^m$ is a disjoint union of the different face-sharing polyhedral cones $P_{m,\sigma}$ obtained for different $\sigma \in S_{2^m}$. Thus part (1) of the above weight restriction is equivalent to asking all the weight vectors in the bottom layer of the net corresponding to the $x$ part of the input to lie in any one of these special cones (and similarly for the $y$ part of the input).

Let OMB be the ODD-MAX-BIT function, a $\pm 1$ threshold gate which evaluates to $-1$ on, say, an $n$ bit input $x$ iff $\sum_{i=1}^{n} (-1)^{i+1} 2^i (1 + x_i) \geq \frac{1}{2}$. We will prove our lower bounds against the function proposed by Arkadev Chattopadhyay and Nikhil Mande in Chattopadhyay and Mande, 2017,

$$g := \mathrm{OMB}_n \circ \mathrm{OR}_{n^{1/3} + \log n} \circ \mathrm{XOR}_2 \;:\; \{-1,1\}^{2(n^{4/3} + n \log n)} \to \{-1,1\} \qquad (2.9)$$

which we will refer to as the Chattopadhyay-Mande function in the remainder of the paper. Here we use the notation from Chattopadhyay and Mande, 2017 whereby if $p_m$ and $q_n$ are two Boolean functions taking $m$ and $n$ bits respectively for input, we denote their composition as $p_m \circ q_n : \{-1,1\}^{mn} \to \{-1,1\}$. Here it is understood that the input implicitly comes grouped into $m$ blocks of size $n$, on each of which $q_n$ acts, and $p_m$ acts on the $m$-tuple of outputs of these $q_n$ functions.

We show the following exponential lower bound against this Chattopadhyay-Mande function.

Theorem 2.5.5. Let $m, d, W \in \mathbb{N}$. Any depth $d$ LTF-of-(ReLU)$^{d-1}$ circuit on $2m$ bits, with the weights in its bottom layer restricted as per Definition 12, that implements the Chattopadhyay-Mande function on $2m$ bits will require a circuit size of,

$$\Omega \left( (d-1) \left[ \frac{2^{m^{\frac{1}{8}}}}{mW} \right]^{\frac{1}{(d-1)}} \right).$$


Consequently, one obtains the same size lower bounds for circuits with only LTF gates of depth d.

Remark. Note that this is an exponential-in-dimension size lower bound, holding even for super-polynomially growing bottom layer weights (with the additional constraints as per Definition 12) and up to depths scaling as $d = O(m^{\xi})$ for any $\xi < \frac{1}{8}$.

We note that the Chattopadhyay-Mande function can be represented by an $O(m)$ size LTF-of-LTF circuit with no restrictions on weights (see Theorem 2.5.6 below). In light of this fact, Theorem 2.5.5 is somewhat surprising as it shows that for the purpose of representing Boolean functions a deep ReLU circuit (ending in an LTF gate) can get exponentially weakened when just its bottom layer weights are restricted as per Definition 12, even if the integers are allowed to be super-polynomially large.

Moreover, the lower bounds also hold for LTF circuits of arbitrary depth $d$, under the same weight restrictions on the bottom layer. We are unaware of any exponential lower bounds on LTF circuits of arbitrary depth under any kind of weight restrictions.

We will use the method of sign-rank to obtain the exponential lower bounds in Theorem 2.5.5. The sign-rank of a real matrix $A$ with all non-zero entries is the least rank of a matrix $B$ of the same dimension with all non-zero entries such that for each entry $(i,j)$, $\mathrm{sign}(B_{ij}) = \mathrm{sign}(A_{ij})$. For a Boolean function $f : \{-1,1\}^m \times \{-1,1\}^m \to \{-1,1\}$ one defines the "sign-rank of $f$" as the sign-rank of the $2^m \times 2^m$ dimensional matrix $[f(x,y)]_{x,y \in \{-1,1\}^m}$. This notion of sign-rank has been used to great effect in diverse fields from communication complexity to circuit complexity to learning theory. Explicit matrices with a high sign-rank were not known till the breakthrough work of Forster, 2002. Forster et al. made elegant use of this complexity measure to show exponential lower bounds against LTF-of-MAJ circuits in Forster et al., 2001. Much of the previous literature about sign-rank has been reviewed in the book Lokam et al., 2009. Most recently Chattopadhyay and Mande, 2017 have proven a strict containment of LTF-of-MAJ in LTF-of-LTF. The following theorem statement is a combination of their Theorem 5.2 and intermediate steps in their Corollary 1.2,

Theorem 2.5.6 (Chattopadhyay-Mande (2017)). The Chattopadhyay-Mande function $g$ in equation (2.9) can be represented by a linear sized LTF-of-LTF circuit and $\mathrm{sign\text{-}rank}(g) \geq 2^{\frac{n^{1/3}}{81} - 3}$.

In Appendix 2.E we will prove our Theorem 2.5.5 by showing a small upper bound on the sign-rank of LTF-of-(ReLU)$^{d-1}$ circuits which have their bottom most layer's weights restricted as given in Definition 12.


2.6 Lower bounds for LTF-of-ReLU against the Andreev function

(Proof of Theorem 2.5.4)

We will use the classic "method of random restrictions" (Subbotovskaya, 1961; Håstad, 1998; Håstad, 1986; Yao, 1985; Rossman, 2008) to show a lower bound for weight-unrestricted LTF-of-ReLU circuits representing the Andreev function. The basic philosophy of this method is to take any arbitrary LTF-of-ReLU circuit which supposedly matches the Andreev function on a large fraction of the inputs, and to randomly fix the values on some of its input coordinates, and also do the same fixing on the same coordinates of the input to the Andreev function. Then we show that upon doing this restriction the Andreev function collapses to an arbitrary Boolean function on the remaining inputs (what it collapses to depends on what values were fixed on its inputs that got restricted). But on the other hand we show that the LTF-of-ReLU collapses to a circuit which is of such a small size that with high probability it cannot possibly approximate a randomly chosen Boolean function on the remaining inputs. This contradiction leads to a lower bound.

There are two important concepts towards implementing the above idea. The first is to precisely define when a ReLU gate, upon a partial restriction of its inputs, can be considered removable from the circuit. Once this notion is clarified it will automatically turn out that doing random restrictions on a ReLU gate is the same as doing random restrictions on an LTF gate, as was recently done in Kane and Williams, 2016. Secondly, it needs to be true that at any fixed size, LTF-of-ReLU circuits cannot represent too many of all the Boolean functions possible at the same input dimension. For this very specific case of LTF-of-ReLU circuits, where ReLU gates necessarily have a fan-out of 1, Theorem 2.1 in Maass, 1997 applies and we have from there that LTF-of-ReLU circuits over $n$ bits with $w$ ReLU gates can represent at most $N = 2^{O((wn + w + w + 1 + 1) \log(wn + w + w + 1 + 1))} = 2^{O((wn + 2w + 2) \log(wn + 2w + 2))}$ Boolean functions. We note that, slightly departing from the usual convention with neural networks, in this work by Wolfgang Maass direct wires are allowed from the input nodes to the output LTF gate. This flexibility ties in nicely with how we want to define a ReLU gate to be becoming removable under the random restrictions that we use.

Random Boolean functions vs any circuit class. In everything that follows, all samplings being done (denoted as $\sim$) are to be understood as sampling from a uniform distribution unless otherwise specified. Firstly we note this well-known lemma,


Claim 1. Let $f : \{-1,1\}^n \to \{-1,1\}$ be any given Boolean function. Then the following is true,

$$\mathbb{P}_{g \sim \{\{-1,1\}^n \to \{-1,1\}\}} \left[ \mathbb{P}_{x \sim \{-1,1\}^n}[f(x) = g(x)] \geq \frac{1}{2} + \epsilon \right] \leq e^{-2^{n+1} \epsilon^2}$$

From the above it follows that if $N$ is the total number of functions in any circuit class (whose members be called $C$) then we have by union bound,

$$\mathbb{P}_{g \sim \{\{-1,1\}^n \to \{-1,1\}\}} \left[ \exists\, C \text{ s.t. } \mathbb{P}_{x \sim \{-1,1\}^n}[C(x) = g(x)] \geq \frac{1}{2} + \epsilon \right] \leq N e^{-2^{n+1} \epsilon^2} \qquad (2.10)$$
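Claim 1 is just a Chernoff-Hoeffding bound over the $2^n$ independent coordinates of the truth table of $g$. A quick numerical sanity check (an illustrative sketch, not part of the proof):

import numpy as np

rng = np.random.default_rng(0)
n, eps, trials = 8, 0.1, 20000
f = rng.choice([-1, 1], size=2 ** n)              # one fixed Boolean function
# sample random g's; how often does agreement exceed 1/2 + eps?
g = rng.choice([-1, 1], size=(trials, 2 ** n))
agree = (g == f).mean(axis=1)
empirical = (agree >= 0.5 + eps).mean()
print(empirical, np.exp(-2 ** (n + 1) * eps ** 2))  # empirical <= Hoeffding bound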

Equipped with these basics we are now ready to begin the proof of the lower bound against weight-unrestricted LTF-of-ReLU circuits,

Proof of Theorem 2.5.4.

Definition 13. Let $D$ denote an arbitrary LTF-of-ReLU circuit over $\lfloor \log(\frac{n}{2}) \rfloor$ bits.

For some $\frac{\epsilon}{3} \leq \frac{1}{2}$ and a size function denoted as $s(n, \epsilon)$, we use equation (2.10), the definition of $D$ above and the upper bound given earlier for the number of LTF-of-ReLU functions at a fixed circuit size (now used for circuits on $\lfloor \log(\frac{n}{2}) \rfloor$ bits) to get,

$$\mathbb{P}_{f \sim \{ \{0,1\}^{\lfloor \log(\frac{n}{2}) \rfloor} \to \{0,1\} \}} \left[ \forall D \text{ s.t. } |D| \leq s(n,\epsilon) \,:\, \mathbb{P}_{y \sim \{0,1\}^{\lfloor \log(\frac{n}{2}) \rfloor}}[f(y) = D(y)] \leq \frac{1}{2} + \frac{\epsilon}{3} \right]$$
$$\geq 1 - 2^{O\left(s^2 \log^2(\frac{n}{2}) \log(\log(\frac{n}{2}) s)\right)} e^{-\frac{\epsilon^2}{9} 2^{1 + \lfloor \log(\frac{n}{2}) \rfloor}} \geq 1 - 2^{O(s^2 k^2 \log(ks))} e^{-\frac{2\epsilon^2}{9} 2^k} \geq 1 - e^{O(s^2 k^2 \log(ks)) - \frac{2\epsilon^2}{9} 2^k}$$

whereby in the last inequality above we have assumed that $n = 2^{k+1}$. This assumption is legitimate because we want to estimate certain large $n$ asymptotics. Now for some $\theta > 0$, if for large $n$ we choose $\epsilon > \sqrt{\frac{2 \log^{\frac{2}{2+\theta}}(\frac{n}{2})}{n}}$ and $s = s(n, \epsilon) \leq O\left( \frac{\epsilon^{\frac{2}{2+\theta}}\, n^{\frac{1}{2+\theta}}}{\log(\frac{n}{2})} \right)$ then we have,

$$\mathbb{P}_{f \sim \{ \{0,1\}^{\lfloor \log(\frac{n}{2}) \rfloor} \to \{0,1\} \}} \left[ \forall D \text{ s.t. } |D| \leq s(n,\epsilon) \,:\, \mathbb{P}_{y \sim \{0,1\}^{\lfloor \log(\frac{n}{2}) \rfloor}}[f(y) = D(y)] \leq \frac{1}{2} + \frac{\epsilon}{3} \right] \geq 1 - \frac{\epsilon}{3} \qquad (2.11)$$


Definition 14 ($F^*$). Let $F^*$ be the subset of all the $f$ above for which the above event is true, i.e.,

$$F^* := \left\{ f : \{0,1\}^{\lfloor \log(\frac{n}{2}) \rfloor} \to \{0,1\} \;\middle|\; \forall D \text{ s.t. } |D| \leq s(n,\epsilon),\; \mathbb{P}_{y \sim \{0,1\}^{\lfloor \log(\frac{n}{2}) \rfloor}}[f(y) = D(y)] \leq \frac{1}{2} + \frac{\epsilon}{3} \right\}$$

Now we recall the definition of the Andreev function in Definition 10 for the following definition and claim,

Definition 15. Let $\rho$ be a choice of a "restriction" whereby one fixes all the input bits of $A_n$ except 1 bit in each row of the matrix $a$. So the restricted function (call it $A_n|_{\rho}$) computes a function of the form,

$$A_n|_{\rho} : \{0,1\}^{\lfloor \log(\frac{n}{2}) \rfloor} \to \{0,1\}$$

Note that we shall henceforth be implicitly fixing a bijection mapping $\{0,1\}^n \to \{0,1\}^{\lfloor \frac{n}{2} \rfloor} \times \{0,1\}^{\lfloor \log(\frac{n}{2}) \rfloor \times \lfloor \frac{n/2}{\lfloor \log(n/2) \rfloor} \rfloor}$, and hence for any function $C : \{0,1\}^n \to \{0,1\}$ it would be meaningful to talk of $C|_{\rho}$. From the definitions of $A_n$ and $\rho$ above, the following is immediate,

Claim 2. The truth table of $A_n|_{\rho}$ is the $x$ string in the input to $A_n$ that gets fixed by $\rho$. Thus we observe that if $\rho$ is chosen uniformly at random then $A_n|_{\rho}$ is a $\lfloor \log(\frac{n}{2}) \rfloor$ bit Boolean function chosen uniformly at random.

Let $f^*$ be any arbitrary member of $F^*$. Let $x^* \in \{0,1\}^{\lfloor \frac{n}{2} \rfloor}$ be the truth-table of $f^*$. Let $\rho(x^*)$ be restrictions on the input of $A_n$ which fix the $x$ part of its input to $x^*$. So when we are sampling restrictions uniformly at random from the restrictions of the type $\rho(x^*)$, these different instances differ in which bit of each row of the matrix $a$ (of the input to $A_n$) they left unfixed and to what values they fixed the other entries of $a$. Let $C$ be an $n$ bit LTF-of-ReLU Boolean circuit of size say $w(n, \epsilon)$. Thus under a restriction of the type $\rho(x^*)$, both $C|_{\rho(x^*)}$ and $A_n|_{\rho(x^*)}$ are $\lfloor \log(\frac{n}{2}) \rfloor$ bit Boolean functions.

Now we note that a ReLU gate over $n$ bits, upon a random restriction, becomes redundant (and hence removable) iff its affine argument either reduces to a non-positive function or to a non-negative function of the remaining inputs. In the former case the gate is computing the constant function zero, and in the latter case it is computing a linear function which can be simply implemented by introducing wires connecting the inputs directly to the output LTF gate. Thus in both cases the resultant function no longer needs the ReLU gate for it to be computed. (We note that such direct wires from the input


to the output gate were allowed in how the counting was done of the total number of LTF-of-ReLU

Boolean functions at a fixed circuit size.) Combining both the cases we note that the conditions for

collapse (in this sense) of a ReLU gate is identical to that of the conditions of collapse for a LTF gate

for which Kane and Williams, 2016 in their Lemma 1.1 had proven the following,

Lemma 2.6.1 (Lemma 1.1 of Kane and Williams, 2016). Let $f : \{0,1\}^n \to \{0,1\}$ be a linear threshold function. Let $P$ be a partition of $[n]$ into parts of equal size, and let $R_P$ be the distribution on restrictions $\rho : [n] \to \{0, 1, *\}$ that randomly fixes all but one element of each part of $P$. Then we have,

$$\mathbb{P}_{\rho \sim R_P}\left[ f \text{ is not forced to a constant by } \rho \right] = O\left( \frac{|P|}{\sqrt{n}} \right)$$
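A small Monte Carlo rendering of the quantity bounded in Lemma 2.6.1, for an arbitrary small-weight LTF (an illustrative experiment; the parameters here are not tied to the lemma's proof):

import numpy as np
from itertools import product

rng = np.random.default_rng(1)
n, parts = 24, 4                        # |P| = 4 parts of size 6 each
blocks = np.arange(n).reshape(parts, -1)
a = rng.integers(-3, 4, size=n)         # the LTF is 1{<a, x> >= 0}

not_forced, trials = 0, 2000
for _ in range(trials):
    free = [rng.choice(b) for b in blocks]   # one unfixed index per part
    x = rng.integers(0, 2, size=n)           # random values for the fixed bits
    signs = set()
    for bits in product([0, 1], repeat=parts):   # all completions of free bits
        x[free] = bits
        signs.add(a @ x >= 0)
    not_forced += (len(signs) == 2)              # both signs occur: not forced
print(not_forced / trials, parts / np.sqrt(n))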

In our context the above implies,

$$\mathbb{P}_{\rho(x^*)}\left[ \text{ReLU}|_{\rho(x^*)} \text{ is removable} \right] \geq \eta, \qquad \text{where } \eta = 1 - O\left( \frac{\log n}{\sqrt{n}} \right)$$

The above definition of $\eta$ implies,

$$\mathbb{P}_{\rho(x^*)}\left[ \text{a ReLU of } C \text{ is not forced to a constant} \right] \leq 1 - \eta$$
$$\Rightarrow \; \mathbb{E}_{\rho(x^*)}\left[ \#\{\text{ReLUs of } C \text{ not forced to a constant}\} \right] \leq w(n,\epsilon)(1 - \eta)$$
$$\Rightarrow \; \mathbb{P}_{\rho(x^*)}\left[ \#\{\text{ReLUs of } C \text{ not forced to a constant}\} \geq s(n,\epsilon) \right] \leq \frac{\mathbb{E}_{\rho(x^*)}\left[ \#\{\text{ReLUs of } C \text{ not forced to a constant}\} \right]}{s(n,\epsilon)} \leq \frac{w(n,\epsilon)(1 - \eta)}{s(n,\epsilon)}$$
$$\Rightarrow \; \mathbb{P}_{\rho(x^*)}\left[ \text{Size of } C|_{\rho(x^*)} \leq s(n,\epsilon) \right] \geq 1 - \frac{w(n,\epsilon)(1 - \eta)}{s(n,\epsilon)} \qquad (2.12)$$

where the second implication above is Markov's inequality.

Now we compare with the definitions of $\epsilon$ and $f^*$ to observe that (a) with probability at least $1 - \frac{w(n,\epsilon)(1-\eta)}{s(n,\epsilon)}$, $C|_{\rho(x^*)}$ is a circuit of the type called "$D$" in the event in equation (2.11), and (b) by definition of the Andreev function it follows that $A_n|_{\rho(x^*)}$ has its truth table given by $x^*$ and hence it specifies the same function as $f^* \in F^*$. Hence $\forall\, x^*$ and $\rho(x^*)$ we can read off from equation (2.11),


$$\mathbb{P}_{y \sim \{0,1\}^{\lfloor \log(\frac{n}{2}) \rfloor}}\left[ C|_{\rho(x^*)}(y) = A_n|_{\rho(x^*)}(y) \;\middle|\; \text{Size of } C|_{\rho(x^*)} \leq s(n,\epsilon) \right] \leq \frac{1}{2} + \frac{\epsilon}{3} \qquad (2.13)$$

Recalling Definition 14, equation (2.11) can be written as,

$$\mathbb{P}_{f \sim \{ \{0,1\}^{\lfloor \log(\frac{n}{2}) \rfloor} \to \{0,1\} \}}\left[ f \in F^* \right] \geq 1 - \frac{\epsilon}{3} \qquad (2.14)$$

Claim 3 (Circuits $C$ have low correlation with the Andreev function).

$$\mathbb{P}_{z \sim \{0,1\}^n}\left[ C(z) = A_n(z) \right] \leq \frac{\epsilon}{3} + \frac{w(n,\epsilon)(1-\eta)}{s(n,\epsilon)} + \frac{1}{2} + \frac{\epsilon}{3}$$

Proof. We think of sampling a $z \sim \{0,1\}^n$ as a two step process: first sampling $\tilde{f}$, a $\lfloor \log(\frac{n}{2}) \rfloor$ bit Boolean function, and fixing the first $\lfloor \frac{n}{2} \rfloor$ bits of $z$ to be the truth-table of $\tilde{f}$, and then randomly assigning values to the remaining $n - \lfloor \frac{n}{2} \rfloor$ bits of $z$. Call this latter bit string $x_{\text{other}}$.

$$\mathbb{P}_{z \sim \{0,1\}^n}[C(z) = A_n(z)] = \mathbb{P}_{z \sim \{0,1\}^n}[(C(z) = A_n(z)) \cap (\tilde{f} \in F^*)] + \mathbb{P}_{z \sim \{0,1\}^n}[(C(z) = A_n(z)) \cap (\tilde{f} \notin F^*)]$$
$$= \mathbb{P}_{z \sim \{0,1\}^n}[(C(z) = A_n(z)) \mid (\tilde{f} \in F^*)]\, \mathbb{P}_{z \sim \{0,1\}^n}[\tilde{f} \in F^*] + \mathbb{P}_{z \sim \{0,1\}^n}[(C(z) = A_n(z)) \cap (\tilde{f} \notin F^*)]$$
$$\leq \mathbb{P}_{z \sim \{0,1\}^n}[(C(z) = A_n(z)) \mid (\tilde{f} \in F^*)] + \mathbb{P}_{z \sim \{0,1\}^n}[\tilde{f} \notin F^*]$$
$$\leq \mathbb{P}_{z \sim \{0,1\}^n}[(C(z) = A_n(z)) \mid (\tilde{f} \in F^*)] + \frac{\epsilon}{3}$$

In the last line above we have invoked equation (2.14). Now we note that sampling the $n$ bit string $z$ such that $\tilde{f} \in F^*$ is the same as doing a random restriction of the type $\rho(\tilde{f})$ and then randomly picking a $\lfloor \log(\frac{n}{2}) \rfloor$ bit string, say $y$. So we can rewrite the last inequality as,

$$\mathbb{P}_{z \sim \{0,1\}^n}[C(z) = A_n(z)] \leq \mathbb{P}_{(\rho(\tilde{f}), y)}\left[ C(\rho(\tilde{f}), y) = A_n(\rho(\tilde{f}), y) \;\middle|\; \tilde{f} \in F^* \right] + \frac{\epsilon}{3}$$
$$\leq \mathbb{E}_{(\rho(\tilde{f}), y)}\left[ \mathbb{1}_{C(\rho(\tilde{f}), y) = A_n(\rho(\tilde{f}), y)} \cdot \mathbb{1}_{\text{Size of } C|_{\rho(\tilde{f})} \leq s(n,\epsilon)} \;\middle|\; \tilde{f} \in F^* \right] + \frac{w(n,\epsilon)(1-\eta)}{s(n,\epsilon)} + \frac{\epsilon}{3}$$
$$\leq \frac{1}{2} + \frac{\epsilon}{3} + \frac{w(n,\epsilon)(1-\eta)}{s(n,\epsilon)} + \frac{\epsilon}{3}$$

In the last step above we have used equations 2.13 and 2.12.

So after putting back the value of $\eta$ and the largest scaling of $s(n,\epsilon)$ that we can have (from equation (2.11)), the upper bound on the above probability becomes,

$$\frac{1}{2} + \frac{2\epsilon}{3} + O\left( \frac{w(n,\epsilon) \log(n)}{\sqrt{n}\, \left( \frac{\epsilon^{\frac{2}{2+\theta}}\, n^{\frac{1}{2+\theta}}}{\log(\frac{n}{2})} \right)} \right)$$

Thus the probability is upper bounded by $\frac{1}{2} + \epsilon$ as long as $w(n,\epsilon) = O\left( \frac{\epsilon^{1 + \frac{2}{2+\theta}}\, n^{\frac{1}{2} + \frac{1}{2+\theta}}}{\log^2 n} \right)$.

Stated as a lower bound, we have that if an LTF-of-ReLU has to match the $n$ bit Andreev function on more than $\frac{1}{2} + \epsilon$ fraction of the inputs for $\epsilon > \sqrt{\frac{2 \log^{\frac{2}{2+\theta}}(\frac{n}{2})}{n}}$ for some $\theta > 0$ (asymptotically this is like having a constant $\epsilon$), then the LTF-of-ReLU needs to be of size $\Omega\left( \epsilon^{\frac{4+\theta}{2+\theta}}\, n^{\frac{1}{2} + \frac{1}{2+\theta}} \right)$. Now we define $\delta \in (0, \frac{1}{2})$ such that $\delta = \frac{\theta}{2(2+\theta)}$ and that gives the form of the almost linear lower bound as stated in the theorem.
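For the reader's convenience, the substitution in this last step works out as follows (a routine check we spell out here):

$$\delta = \frac{\theta}{2(2+\theta)} \;\Longrightarrow\; 1 - \delta = \frac{4+\theta}{2(2+\theta)}, \qquad \text{so} \qquad \epsilon^{\frac{4+\theta}{2+\theta}} = \epsilon^{2(1-\delta)}, \qquad n^{\frac{1}{2} + \frac{1}{2+\theta}} = n^{\frac{4+\theta}{2(2+\theta)}} = n^{1-\delta}, \qquad \text{and} \qquad 1 - 2\delta = \frac{2}{2+\theta},$$

which recovers both the size bound $\Omega(\epsilon^{2(1-\delta)} n^{1-\delta})$ and the condition $\epsilon > \sqrt{\frac{2 \log^{1-2\delta}(\frac{n}{2})}{n}}$ of Theorem 2.5.4.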


Appendix To Chapter 2

2.A Expressing piecewise linear functions using ReLU DNNs

Proof of Theorem 2.2.2. Any continuous piecewise linear function $\mathbb{R} \to \mathbb{R}$ which has $m$ pieces can be specified by three pieces of information: (1) $s_L$, the slope of the leftmost piece, (2) the coordinates of the non-differentiable points specified by an $(m-1)$-tuple $\{(a_i, b_i)\}_{i=1}^{m-1}$ (indexed from left to right), and (3) $s_R$, the slope of the rightmost piece. A tuple $(s_L, s_R, (a_1, b_1), \ldots, (a_{m-1}, b_{m-1}))$ uniquely specifies an $m$ piece piecewise linear function from $\mathbb{R} \to \mathbb{R}$ and vice versa. Given such a tuple, we construct a 2-layer DNN which computes the same piecewise linear function.

One notes that for any $a, r \in \mathbb{R}$, the function

$$f(x) = \begin{cases} 0 & x \leq a \\ r(x - a) & x > a \end{cases} \qquad (2.15)$$

is equal to $\mathrm{sgn}(r) \max\{|r|(x - a), 0\}$, which can be implemented by a 2-layer ReLU DNN with size 1. Similarly, any function of the form,

$$g(x) = \begin{cases} t(x - a) & x \leq a \\ 0 & x > a \end{cases} \qquad (2.16)$$

is equal to $-\mathrm{sgn}(t) \max\{-|t|(x - a), 0\}$, which can be implemented by a 2-layer ReLU DNN with size 1. The parameters $r, t$ will be called the slopes of the function, and $a$ will be called the breakpoint of the function.

If we can write the given piecewise linear function as a sum of $m$ functions of the form (2.15) and (2.16), then by Lemma 2.A.3 we would be done. It turns out that such a decomposition of any $p$ piece PWL function $h : \mathbb{R} \to \mathbb{R}$ as a sum of $p$ flaps can always be arranged, where the breakpoints of the $p$ flaps are all contained in the $p - 1$ breakpoints of $h$. First, observe that adding a constant to a function


does not change the complexity of the ReLU DNN expressing it, since this corresponds to a bias on

the output node. Thus, we will assume that the value of $h$ at the last breakpoint $a_{m-1}$ is $b_{m-1} = 0$.

We now use a single function $f$ of the form (2.15) with slope $r$ and breakpoint $a = a_{m-1}$, and $m-1$ functions $g_1, \ldots, g_{m-1}$ of the form (2.16) with slopes $t_1, \ldots, t_{m-1}$ and breakpoints $a_1, \ldots, a_{m-1}$, respectively.

Thus, we wish to express $h = f + g_1 + \ldots + g_{m-1}$. Such a decomposition of $h$ would be valid if we can find values for $r, t_1, \ldots, t_{m-1}$ such that (1) the slope of the above sum is $s_L$ for $x < a_1$, (2) the slope of the above sum is $s_R$ for $x > a_{m-1}$, and (3) for each $i \in \{1, 2, 3, \ldots, m-1\}$ we have $b_i = f(a_i) + g_1(a_i) + \ldots + g_{m-1}(a_i)$.

The above corresponds to asking for the existence of a solution to the following set of simultaneous linear equations in $r, t_1, \ldots, t_{m-1}$:

$$s_R = r, \qquad s_L = t_1 + t_2 + \ldots + t_{m-1}, \qquad b_i = \sum_{j=i+1}^{m-1} t_j (a_{j-1} - a_j) \; \text{ for all } i = 1, \ldots, m-2$$

It is easy to verify that the above set of simultaneous linear equations has a unique solution. Indeed, $r$ must equal $s_R$, and then one can solve for $t_1, \ldots, t_{m-1}$ starting from the last equation $b_{m-2} = t_{m-1}(a_{m-2} - a_{m-1})$ and then back substitute to compute $t_{m-2}, t_{m-3}, \ldots, t_1$.

The lower bound of $p - 1$ on the size for any 2-layer ReLU DNN that expresses a $p$ piece function follows from Lemma 2.A.7.

One can do better in terms of size when the rightmost piece of the given function is flat, i.e., $s_R = 0$. In this case $r = 0$, which means that $f = 0$; thus, the decomposition of $h$ above is of size $p - 1$. A similar construction can be done when $s_L = 0$. This gives the following statement which will be useful for constructing our forthcoming hard functions.

Corollary 2.A.1. If the rightmost or leftmost piece of a $\mathbb{R} \to \mathbb{R}$ piecewise linear function has 0 slope, then we can compute such a $p$ piece function using a 2-layer DNN with size $p - 1$.
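The construction in the proof above is easy to mechanize. Below is a small, hypothetical Python sketch (all names ours) that takes the tuple $(s_L, s_R, \{(a_i, b_i)\})$, solves for the flap coefficients numerically, and evaluates the resulting sum of flaps of types (2.15)/(2.16):

import numpy as np

def pwl_to_flaps(sL, sR, pts):
    """pts = [(a_1,b_1),...,(a_{m-1},b_{m-1})] with b_{m-1} = 0 (shift first if not).
    Finds r, t_1..t_{m-1} so that h(x) = r*max(x - a_{m-1}, 0)
    + sum_j t_j*min(x - a_j, 0) matches the given slopes and breakpoint values."""
    a = np.array([p[0] for p in pts]); b = np.array([p[1] for p in pts])
    m1 = len(pts)                 # m-1 flaps of type (2.16), one of type (2.15)
    # unknowns: [r, t_1, ..., t_{m-1}]; equations: s_R, s_L, h(a_i) = b_i
    A, rhs = [], []
    A.append([1.0] + [0.0] * m1); rhs.append(sR)     # slope beyond a_{m-1}
    A.append([0.0] + [1.0] * m1); rhs.append(sL)     # slope before a_1
    for i in range(m1):
        row = [max(a[i] - a[-1], 0.0)] + [min(a[i] - a[j], 0.0) for j in range(m1)]
        A.append(row); rhs.append(b[i])
    coef, *_ = np.linalg.lstsq(np.array(A), np.array(rhs), rcond=None)
    r, t = coef[0], coef[1:]
    return lambda x: r * max(x - a[-1], 0.0) + sum(
        tj * min(x - aj, 0.0) for tj, aj in zip(t, a))

# e.g. a symmetric triangle with peak 1 at x = 1 and support [0, 2]:
h = pwl_to_flaps(0.0, 0.0, [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)])
print([round(h(x), 6) for x in (-1.0, 0.0, 0.5, 1.0, 2.0, 3.0)])

Each flap here is a single ReLU (up to sign), so the lambda is exactly a 2-layer ReLU DNN of the size claimed by Theorem 2.2.2 and Corollary 2.A.1.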

Proof of Theorem 2.2.3. Since any piecewise linear function $\mathbb{R}^n \to \mathbb{R}$ is representable by a ReLU DNN by Corollary 2.2.1, the proof simply follows from the fact that the family of continuous piecewise linear functions is dense in any $L^p(\mathbb{R}^n)$ space, for $1 \leq p \leq \infty$.


Now we will collect some straightforward observations that will be used often in constructing com- plex neural functions starting from simple ones. The following operations preserve the property of being representable by a ReLU DNN.

Lemma 2.A.2. [Function Composition] If $f_1 : \mathbb{R}^d \to \mathbb{R}^m$ is represented by a $d, m$ ReLU DNN with depth $k_1 + 1$ and size $s_1$, and $f_2 : \mathbb{R}^m \to \mathbb{R}^n$ is represented by an $m, n$ ReLU DNN with depth $k_2 + 1$ and size $s_2$, then $f_2 \circ f_1$ can be represented by a $d, n$ ReLU DNN with depth $k_1 + k_2 + 1$ and size $s_1 + s_2$.

Proof. Follows from (1.1) and the fact that a composition of affine transformations is another affine transformation.

Lemma 2.A.3. [Function Addition] If $f_1 : \mathbb{R}^n \to \mathbb{R}^m$ is represented by an $n, m$ ReLU DNN with depth $k + 1$ and size $s_1$, and $f_2 : \mathbb{R}^n \to \mathbb{R}^m$ is represented by an $n, m$ ReLU DNN with depth $k + 1$ and size $s_2$, then $f_1 + f_2$ can be represented by an $n, m$ ReLU DNN with depth $k + 1$ and size $s_1 + s_2$.

Proof. We simply put the two ReLU DNNs in parallel and combine the appropriate coordinates of the outputs.

Lemma 2.A.4. [Taking maximums/minimums] Let $f_1, \ldots, f_m : \mathbb{R}^n \to \mathbb{R}$ be functions that can each be represented by $\mathbb{R}^n \to \mathbb{R}$ ReLU DNNs with depths $k_i + 1$ and sizes $s_i$, $i = 1, \ldots, m$. Then the function $f : \mathbb{R}^n \to \mathbb{R}$ defined as $f(x) := \max\{f_1(x), \ldots, f_m(x)\}$ can be represented by a ReLU DNN of depth at most $\max\{k_1, \ldots, k_m\} + \lceil \log(m) \rceil + 1$ and size at most $s_1 + \ldots + s_m + 4(2m - 1)$. Similarly, the function $g(x) := \min\{f_1(x), \ldots, f_m(x)\}$ can be represented by a ReLU DNN of depth at most $\max\{k_1, \ldots, k_m\} + \lceil \log(m) \rceil + 1$ and size at most $s_1 + \ldots + s_m + 4(2m - 1)$.

Proof. We prove this by induction on $m$. The base case $m = 1$ is trivial. For $m \geq 2$, consider $g_1 := \max\{f_1, \ldots, f_{\lfloor \frac{m}{2} \rfloor}\}$ and $g_2 := \max\{f_{\lfloor \frac{m}{2} \rfloor + 1}, \ldots, f_m\}$. By the induction hypothesis (since $\lfloor \frac{m}{2} \rfloor, \lceil \frac{m}{2} \rceil < m$ when $m \geq 2$), $g_1$ and $g_2$ can be represented by ReLU DNNs of depths at most $\max\{k_1, \ldots, k_{\lfloor \frac{m}{2} \rfloor}\} + \lceil \log(\lfloor \frac{m}{2} \rfloor) \rceil + 1$ and $\max\{k_{\lfloor \frac{m}{2} \rfloor + 1}, \ldots, k_m\} + \lceil \log(\lceil \frac{m}{2} \rceil) \rceil + 1$ respectively, and sizes at most $s_1 + \ldots + s_{\lfloor \frac{m}{2} \rfloor} + 4(2 \lfloor \frac{m}{2} \rfloor - 1)$ and $s_{\lfloor \frac{m}{2} \rfloor + 1} + \ldots + s_m + 4(2 \lceil \frac{m}{2} \rceil - 1)$, respectively. Therefore, the function $G : \mathbb{R}^n \to \mathbb{R}^2$ given by $G(x) = (g_1(x), g_2(x))$ can be implemented by a ReLU DNN with depth at most $\max\{k_1, \ldots, k_m\} + \lceil \log(\lceil \frac{m}{2} \rceil) \rceil + 1$ and size at most $s_1 + \ldots + s_m + 4(2m - 2)$.


We now show how to represent the function $T : \mathbb{R}^2 \to \mathbb{R}$ defined as $T(x, y) = \max\{x, y\} = \frac{x+y}{2} + \frac{|x-y|}{2}$ by a 2-layer ReLU DNN with size 4 – see Figure 2.A.1. The result now follows from the fact that $f = T \circ G$ and Lemma 2.A.2.

FIGURE 2.A.1: A 2-layer ReLU DNN computing $\max\{x_1, x_2\} = \frac{x_1 + x_2}{2} + \frac{|x_1 - x_2|}{2}$. [The figure shows inputs $x_1, x_2$ feeding a hidden layer of 4 ReLU gates whose outputs are combined with weights $\pm \frac{1}{2}$.]
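In code, the size-4 circuit of Figure 2.A.1 is simply the following (a hypothetical sketch using the identities $|t| = \max\{0,t\} + \max\{0,-t\}$ and $t = \max\{0,t\} - \max\{0,-t\}$):

def relu(t):
    return max(0.0, t)

def max_of_two(x1, x2):
    # 4 ReLU gates on s = x1+x2 and d = x1-x2; output weights are all +/- 1/2
    s, d = x1 + x2, x1 - x2
    return 0.5 * (relu(s) - relu(-s)) + 0.5 * (relu(d) + relu(-d))

assert max_of_two(3.0, -1.0) == 3.0 and max_of_two(-2.0, -1.0) == -1.0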

Lemma 2.A.5. Any affine transformation $T : \mathbb{R}^n \to \mathbb{R}^m$ is representable by a 2-layer ReLU DNN of size $2m$.

Proof. Simply use the fact that $T = (I \circ \sigma \circ T) + (-I \circ \sigma \circ (-T))$, where $\sigma$ denotes the ReLU activation; the right hand side can be represented by a 2-layer ReLU DNN of size $2m$ using Lemma 2.A.3.

Lemma 2.A.6. Let $f : \mathbb{R} \to \mathbb{R}$ be a function represented by a $\mathbb{R} \to \mathbb{R}$ ReLU DNN with depth $k + 1$ and widths $w_1, \ldots, w_k$ of the $k$ hidden layers. Then $f$ is a PWL function with at most $2^{k-1} \cdot (w_1 + 1) \cdot w_2 \cdots w_k$ pieces.

FIGURE 2.A.2: The number of pieces increasing after activation. If the blue function is $f$, then the red function $g = \max\{0, f + b\}$ has at most twice the number of pieces as $f$ for any bias $b \in \mathbb{R}$.

Proof. We prove this by induction on $k$. The base case is $k = 1$, i.e., we have a 2-layer ReLU DNN.

Since every activation node can produce at most one breakpoint in the piecewise linear function, we can get at most w1 breakpoints, i.e., w1 + 1 pieces.


Now for the induction step, assume that for some $k \geq 1$, any $\mathbb{R} \to \mathbb{R}$ ReLU DNN with depth $k + 1$ and widths $w_1, \ldots, w_k$ of the $k$ hidden layers produces at most $2^{k-1} \cdot (w_1 + 1) \cdot w_2 \cdots w_k$ pieces.

Consider any $\mathbb{R} \to \mathbb{R}$ ReLU DNN with depth $k + 2$ and widths $w_1, \ldots, w_{k+1}$ of the $k + 1$ hidden layers. Observe that the input to any node in the last layer is the output of a $\mathbb{R} \to \mathbb{R}$ ReLU DNN with depth $k + 1$ and widths $w_1, \ldots, w_k$. By the induction hypothesis, the input to this node in the last layer is a piecewise linear function $f$ with at most $2^{k-1} \cdot (w_1 + 1) \cdot w_2 \cdots w_k$ pieces. When we apply the activation, the new function $g(x) = \max\{0, f(x)\}$, which is the output of this node, may have at most twice the number of pieces as $f$, because each original piece may be intersected by the $x$-axis; see Figure 2.A.2. Thus, after going through the layer, we take an affine combination of $w_{k+1}$ functions, each with at most $2 \cdot (2^{k-1} \cdot (w_1 + 1) \cdot w_2 \cdots w_k)$ pieces. In all, we can therefore get at most $2 \cdot (2^{k-1} \cdot (w_1 + 1) \cdot w_2 \cdots w_k) \cdot w_{k+1}$ pieces, which is equal to $2^k \cdot (w_1 + 1) \cdot w_2 \cdots w_k \cdot w_{k+1}$, and the induction step is completed.

Lemma 2.A.6 has the following consequence about the depth and size tradeoffs for expressing functions with a given number of pieces.

Lemma 2.A.7. Let $f : \mathbb{R} \to \mathbb{R}$ be a piecewise linear function with $p$ pieces. If $f$ is represented by a ReLU DNN with depth $k + 1$, then it must have size at least $\frac{1}{2} k p^{1/k} - 1$. Conversely, any piecewise linear function $f$ that is represented by a ReLU DNN of depth $k + 1$ and size at most $s$ can have at most $\left( \frac{2s}{k} \right)^k$ pieces.

Proof. Let the widths of the $k$ hidden layers be $w_1, \ldots, w_k$. By Lemma 2.A.6, we must have

$$2^{k-1} \cdot (w_1 + 1) \cdot w_2 \cdots w_k \geq p. \qquad (2.17)$$

By the AM-GM inequality, minimizing the size $w_1 + w_2 + \ldots + w_k$ subject to (2.17) means setting $w_1 + 1 = w_2 = \ldots = w_k$. This implies that $w_1 + 1 = w_2 = \ldots = w_k \geq \frac{1}{2} p^{1/k}$. The first statement follows. The second statement follows using the AM-GM inequality again, this time with a restriction on $w_1 + w_2 + \ldots + w_k$.

2.B Proof of Proposition 2.5.2

We first observe that the set of points where $\max\{0, x_1, x_2\}$ is not differentiable is precisely the union of the three half-lines (or rays) $\{(x_1, x_2) : x_1 = x_2, x_1 \geq 0\} \cup \{(0, x_2) : x_2 \leq 0\} \cup \{(x_1, 0) : x_1 \leq 0\}$. On the other hand, consider any Sum-of-ReLU circuit, which can be expressed as a function of the


form

$$f(x) = \sum_{i=1}^{w} c_i \max\{0, \langle a^i, x \rangle + b_i\},$$

where $w \in \mathbb{N}$ is the number of ReLU gates in the circuit, and $a^i \in \mathbb{R}^2$, $b_i, c_i \in \mathbb{R}$ for all $i = 1, \ldots, w$. This implies that $f(x)$ is piecewise linear and the set of points where $f(x)$ is not differentiable is precisely the union of the $w$ lines $\langle a^i, x \rangle + b_i = 0$, $i = 1, \ldots, w$. Since a union of lines cannot equal the union of the three half-lines $\{(x_1, x_2) : x_1 = x_2, x_1 \geq 0\} \cup \{(0, x_2) : x_2 \leq 0\} \cup \{(x_1, 0) : x_1 \leq 0\}$, we obtain the consequence that $\max\{0, x_1, x_2\}$ cannot be represented by a Sum-of-ReLU circuit, no matter how many ReLU gates are used.

2.C Simulating an LTF gate by a ReLU gate

Claim 4. Any LTF gate $\{-1,1\}^n \to \{-1,1\}$ can be simulated by a Sum-of-ReLU circuit with at most 2 ReLU gates.

Proof. A given LTF gate ($2 \cdot \mathbb{1}_{\langle a, x \rangle + b \geq 0} - 1$) separates the points in $\{-1,1\}^n$ into two subsets such that the plane $\langle a, x \rangle + b = 0$ is a separating hyperplane between the two sets. Let $-p < 0$ be the value of the function $\langle a, x \rangle + b$ at that hypercube vertex on the "$-1$" side which is closest to this separating plane. Now imagine a continuous piecewise linear function $f : \mathbb{R} \to \mathbb{R}$ such that $f(x) = -1$ for $x \leq -p$, $f(x) = 1$ for $x \geq 0$, and for $x \in (-p, 0)$, $f$ is the straight line function connecting $(-p, -1)$ to $(0, 1)$. It follows from Theorem 2.2.2 that this $f$ can be implemented by a $\mathbb{R} \to \mathbb{R}$ Sum-of-ReLU with at most 2 ReLU gates hinged at the points $-p$ and $0$ on the domain. Because the affine transformation $\langle a, x \rangle + b$ can be implemented by the wires connecting the $n$ input nodes to the layer of ReLUs, it follows that there exists a $\mathbb{R}^n \to \mathbb{R}$ Sum-of-ReLU with at most 2 ReLU gates implementing the function $g(x) = f(\langle a, x \rangle + b) : \mathbb{R}^n \to \mathbb{R}$. It's clear that $g(x) = \mathrm{LTF}(x)$ for all $x \in \{-1,1\}^n$.
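Concretely, one such 2-ReLU realization is $g(x) = -1 + \frac{2}{p}\left( \max\{0, z + p\} - \max\{0, z\} \right)$ with $z = \langle a, x \rangle + b$; a hypothetical sketch verifying it on a random integer-weight LTF:

import numpy as np
from itertools import product

rng = np.random.default_rng(2)
n = 6
a, b = rng.integers(-5, 6, size=n), int(rng.integers(-3, 4))
verts = np.array(list(product([-1, 1], repeat=n)), dtype=float)
z = verts @ a + b
neg = z[z < 0]
p = -neg.max() if neg.size else 1.0   # distance of the closest "-1"-side vertex
relu = lambda t: np.maximum(t, 0.0)
g = -1.0 + (2.0 / p) * (relu(z + p) - relu(z))
ltf = np.where(z >= 0, 1.0, -1.0)
assert np.allclose(g, ltf)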

2.D PARITY on k bits can be implemented by a O(k) Sum-of-ReLU circuit

For this proof it's convenient to think of the PARITY function as the following map,

$$\mathrm{PARITY} : \{0,1\}^k \to \{0,1\}, \qquad x \mapsto \left( \sum_{i=1}^{k} x_i \right) \bmod 2$$


It's clear that in the evaluation of the PARITY function as stated above, the required sum over the coordinates of the input Boolean vector will take as value every integer in the set $\{0, 1, 2, \ldots, k\}$. The PARITY function can then be lifted to an $f : \mathbb{R} \to \mathbb{R}$ function such that $f(y) = 0$ for all $y \leq 0$, $f(y) = y \bmod 2$ for all $y \in \{1, 2, \ldots, k\}$, $f(y) = k \bmod 2$ for all $y > k$, and for any $y \in (p, p+1)$ with $p \in \{0, 1, \ldots, k-1\}$, $f$ is the straight line function connecting the points $(p, p \bmod 2)$ and $(p+1, (p+1) \bmod 2)$. Thus $f$ is a continuous piecewise linear function on $\mathbb{R}$ with $k + 2$ linear pieces. Then it follows from Theorem 2.2.2 that this $f$ can be implemented by a $\mathbb{R} \to \mathbb{R}$ Sum-of-ReLU circuit with at most $k + 1$ ReLU gates hinged at the points $\{0, 1, 2, \ldots, k\}$ on the domain. The wires from the $k$ inputs to the ReLU gates can implement the linear function $\sum_{i=1}^{k} x_i$. Thus it follows that there exists a $\mathbb{R}^k \to \mathbb{R}$ Sum-of-ReLU circuit (say $C$) such that $C(x) = \mathrm{PARITY}(x)$ for all $x \in \{0,1\}^k$.
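The triangle-wave lift above has an explicit coefficient form; a hypothetical sketch (names ours) checking that $k + 1$ ReLU gates suffice:

import numpy as np
from itertools import product

def parity_via_relus(x):
    """Sum-of-ReLU circuit for PARITY on k bits, using k+1 ReLU gates
    hinged at 0, 1, ..., k; coefficients chosen so the slope on (p, p+1)
    is (-1)^p and the function is flat beyond k."""
    k = len(x)
    y = float(sum(x))                     # the linear layer: sum of the inputs
    coef = [1.0] + [2.0 * (-1) ** p for p in range(1, k)] + [(-1.0) ** k]
    hinge = list(range(k + 1))            # breakpoints 0, 1, ..., k
    return sum(c * max(y - h, 0.0) for c, h in zip(coef, hinge))

k = 5
for x in product([0, 1], repeat=k):
    assert parity_via_relus(x) == sum(x) % 2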

2.E Proof of Theorem 2.5.5 (Proving smallness of the sign-rank of LTF-of-(ReLU)$^{d-1}$ with weight restrictions only on the bottom most layer)

For a $\{-1,1\}^M \to \{-1,1\}$ LTF-of-ReLU circuit with any given weights on the network, the inputs to the threshold function of the top LTF gate are some set of $2^M$ real numbers (one for each input). Over all these inputs let $p > 0$ be the distance from 0 of the largest negative number on which the LTF gate ever gets evaluated. Then by increasing the bias at this last LTF gate by a quantity less than $p$ we can ensure that no input to this LTF gate is 0 while the entire circuit still computes the same Boolean function as originally. So we can assume without loss of generality that the input to the threshold function at the top LTF gate is never 0. We also recall that the weights at the bottom most layer are constrained to be integers of magnitude at most $W > 0$.

Let this depth $d$ LTF-of-(ReLU)$^{d-1}$ circuit map $\{-1,1\}^m \times \{-1,1\}^m \to \{-1,1\}$. Let $\{w_k\}_{k=1}^{d-1}$ be the widths of the ReLU layers, indexed by increasing $k$ with increasing distance from the input.

Thus, the output LTF gate gets $w_{d-1}$ inputs; the $j$-th input, for $j = 1, 2, \ldots, w_{d-1}$, is the output of a circuit $C_j$ of depth $d - 1$ composed of only ReLU gates. Let $f_j(x, y) : \{-1,1\}^m \times \{-1,1\}^m \to \mathbb{R}$ be the pseudo-Boolean function implemented by $C_j$.


Thus the output of the overall LTF-of-(ReLU)$^{d-1}$ circuit is,

$$f(x, y) := \mathrm{LTF}\left[ \beta + \sum_{j=1}^{w_{d-1}} \alpha_j f_j(x, y) \right] \qquad (2.18)$$

Lemma 2.E.1. Let $k \geq 1$ and $w_1, \ldots, w_k \geq 1$ be natural numbers. Consider a family of depth $k + 1$ circuits (say indexed by $i \in I$ for some index set $I$) with $2m$ inputs and a single output, consisting of only ReLU gates. Let all of them have $w_j$ ReLU gates at depth $j$, with $j = 1$ corresponding to the layer closest to the input (note that the single output ReLU gate is not counted here). Moreover, let all of the circuits in the family have the same weights in all their layers except for the layer closest to the output. We restrict the inputs to $\{-1,1\}^m \times \{-1,1\}^m$ and let the $i$th circuit ($i \in I$) implement a pseudo-Boolean function $g_i : \{-1,1\}^m \times \{-1,1\}^m \to \mathbb{R}$. Assume that the weights of the $w_1$ ReLU gates in the layer closest to the input are restricted as per Definition 12. For every $i \in I$, define the $2^m \times 2^m$ matrix $G_i(x, y)$ whose rows are indexed by $x \in \{-1,1\}^m$ and columns are indexed by $y \in \{-1,1\}^m$ as follows: $G_i(x, y) = g_i(x, y)$.

Then there exists a fixed way to order the rows and columns such that for each $G_i$ there exists a contiguous partitioning (which can depend on $i$) of its rows and columns into $O\left( (\prod_{i=1}^{k} w_i)(mW) \right)$ blocks (thus, $G_i$ has $O\left( (\prod_{i=1}^{k} w_i)^2 (mW)^2 \right)$ blocks), and within each block $G_i$ is constant valued.

Before we prove the above lemma, let us see why it implies Theorem 2.5.5.

Proof (of Theorem 2.5.5). Let $F_j(x, y)$ be the matrix obtained from the ReLU circuit outputs $f_j(x, y)$ from (2.18), and let $F(x, y)$ be the matrix obtained from $f(x, y)$. Let $J_{2^m \times 2^m}$ be the matrix of all ones. Then

$$\mathrm{sign\text{-}rank}(F(x,y)) = \mathrm{sign\text{-}rank}\left( \mathrm{sign}\left[ \beta J_{2^m \times 2^m} + \sum_{j=1}^{w_{d-1}} \alpha_j F_j(x,y) \right] \right)$$
$$\leq \mathrm{rank}\left( \beta J_{2^m \times 2^m} + \sum_{j=1}^{w_{d-1}} \alpha_j F_j(x,y) \right) \leq 1 + \sum_{j=1}^{w_{d-1}} \mathrm{rank}(F_j(x,y)) = O\left( \left( \prod_{k=1}^{d-1} w_k \right)^2 (mW)^2 \right)$$

where the first inequality follows from the definition of sign-rank, the second inequality follows from the subadditivity of rank, and the last equality is a consequence of using Lemma 2.E.1 at depth $k + 1 = d - 1$. Indeed, a matrix with block structure as in the conclusion of Lemma 2.E.1 has rank at most $O\left( (\prod_{i=1}^{k} w_i)^2 (mW)^2 \right)$, by expressing it as a sum of these many rank-one matrices and using subadditivity of rank.

Now we recall that the Chattopadhyay-Mande function $g$ (which is a linear sized depth 2 LTF circuit) on $2m = 2(n^{4/3} + n \log n)$ bits has sign-rank $\Omega\left( 2^{\frac{n^{1/3}}{81}} \right)$. It follows that we can find a constant $C > 1$ s.t. for all large enough $n$ we have $C^4 n^{4/3} \geq m$. Then we would have, $\mathrm{sign\text{-}rank}(g) = \Omega\left( 2^{\frac{m^{1/4}}{81 C}} \right)$. From the above upper bound on the sign-rank of our bottom layer weight restricted LTF-of-(ReLU)$^{d-1}$ with widths $\{w_k\}_{k=1}^{d-1}$, it follows that for this to represent the Chattopadhyay-Mande function it would need, $\left( \prod_{k=1}^{d-1} w_k \right)^2 (mW)^2 = \Omega\left( 2^{\frac{m^{1/4}}{81 C}} \right)$. Hence it follows by the AM-GM inequality that the size $(1 + \sum_{k=1}^{d-1} w_k)$ required for such LTF-of-(ReLU)$^{d-1}$ circuits to represent the Chattopadhyay-Mande function is

$$\Omega\left( (d-1) \left[ \frac{2^{m^{\frac{1}{8}}}}{mW} \right]^{\frac{1}{(d-1)}} \right).$$

The statement about LTF circuits is a straightforward consequence of the above result and Claim 4 in

Appendix 2.C which says that any LTF gate can be simulated by 2 ReLU gates.

Towards proving Lemma 2.E.1 we first make the following observation,

Claim 5. Let $w, M, D$ be fixed natural numbers. Let $A_1, \ldots, A_w$ be any $M \times M$ matrices such that there exists a fixed way to order the rows and columns for each of the $A_i$ such that they get partitioned contiguously into $D$ blocks (not necessarily equal in size) and this partitioning is such that $A_i$ is constant valued within each of the $D^2$ blocks. Then $A := A_1 + \ldots + A_w$ is an $M \times M$ matrix whose rows and columns can be partitioned contiguously into $w(D-1) + 1$ groups such that $A$ is constant valued within each block defined by this partition of the rows and columns.

Proof. The partition of the rows of $A_i$ into $D$ contiguous blocks is equivalent to a choice of $D - 1$ lines out of $M - 1$ lines (potentially a different set of $D - 1$ lines for each $A_i$). But the guarantee that this partitioning is induced in each of the $A_i$ by the same ordering of the rows means that when we sum the matrices, the refined partition in the sum corresponds to some selection of at most $w(D - 1)$ lines out of the $M - 1$ lines. This gives us at most $w(D - 1) + 1$ contiguous blocks among the rows of the sum matrix. The same argument holds for the columns.

Proof of Lemma 2.E.1. We will prove this Lemma by induction on k.

The base case of the induction: $k = 1$ (i.e., depth 2). A single ReLU gate in the bottom most layer of the net which receives a tuple of vectors $(x, y)$ as input gives as output the number $\max\{0, \langle a^1, x \rangle + \langle a^2, y \rangle + b\}$, for some $a^1, a^2 \in \mathbb{R}^m$ and $b \in \mathbb{R}$. Since the entries of $a^1, a^2$ and $b$ are assumed to be integers bounded by $W > 0$ and $x, y \in \{-1,1\}^m$, the terms $\langle a^1, x \rangle$ and $\langle a^2, y \rangle$ can each take at most $O(mW)$ different values. So we can arrange the rows and columns of the $2^m \times 2^m$ dimensional output matrix of this gate in increasing order of $\langle a^1, x \rangle$ and $\langle a^2, y \rangle$ and then partition the rows and columns contiguously according to these values. And we note that because of the weight restriction as in Definition 12 that applies to each of the ReLU gates in the bottom most layer, the ordering in increasing value of the inner-products as said above induces the same ordering of the rows for each of these output matrices at the different ReLU gates. Similarly, the same ordering is induced on the columns (note that the orderings for the rows may be different from the ordering for the columns; what is important is that the rows have the same ordering across the family and similarly for the columns).

Now we notice that the structure of the output matrices of the ReLU gates of the bottom most layer as described above is what is assumed in Claim 5.

Thus if $\{G_p(x, y)\}_{p=1,\ldots,w_1}$ are the output matrices at each of the ReLU gates in the bottom most layer, then for some $a \in \mathbb{R}^{w_1}$ and $b \in \mathbb{R}$, at depth 2 the output matrix of any of the ReLU gates is given by $\max\{0, b J_{2^m \times 2^m} + \sum_{i=1}^{w_1} a_i G_i(x, y)\}$, where $J_{2^m \times 2^m}$ is the matrix of all ones and the "max" is taken entrywise. Then the base case of the induction is settled by applying Claim 5 on the matrix $b J_{2^m \times 2^m} + \sum_{i=1}^{w_1} a_i G_i(x, y)$ with $D = O(mW)$ and $w = w_1$ (taking the entrywise max with 0 clearly preserves the block structure).

We further note that the computations happening at the depth 2 ReLU gates obviously do not change


the ordering of the rows and columns frozen in at depth 1, i.e., in the $G_i$ matrices. But with different depth 2 gates, i.e., different choices of the vector $a$ and the number $b$, because of the linearity (in $a$ and $b$) of the operation of forming $b J_{2^m \times 2^m} + \sum_{i=1}^{w_1} a_i G_i(x, y)$, they all have the same contiguous pattern of constant valued submatrices. Thus the depth 2 output matrices continue to satisfy the hypothesis of Claim 5.

To complete the induction step, we consider a family of ReLU circuits with depth $k + 1$ corresponding to different choices of $b \in \mathbb{R}$ and $a \in \mathbb{R}^{w_k}$, which can be seen as computing $g(x, y) = \max\{0, b + \sum_{p=1}^{w_k} a_p g_p(x, y)\}$, where $\{g_p(x, y)\}_{p=1,\ldots,w_k}$ is a family of ReLU circuits of depth $k$ which by induction satisfy the lemma. Thus the corresponding output matrices of these depth $k + 1$ circuits satisfy,

$$G(x, y) = \max\left\{ 0, \; b J_{2^m \times 2^m} + \sum_{p=1}^{w_k} a_p G_p(x, y) \right\}$$

where $G_p$ is the matrix form of $g_p$. Thus, the induction hypothesis applied to depth $k$ tells us that the rows and columns of each matrix $G_p$ can be partitioned contiguously into $O\left( (\prod_{i=1}^{k-1} w_i)(mW) \right)$ blocks such that $G_p$ is constant valued within each block. Then, by Claim 5, the rows and columns of the matrix $b J_{2^m \times 2^m} + \sum_{p=1}^{w_k} a_p G_p(x, y)$ can be partitioned into $O\left( (\prod_{i=1}^{k} w_i)(mW) \right)$ contiguous blocks. Moreover, this ordering of the rows and columns does not vary across the different circuits in the family, because they all have the same weights in the bottom most layer. Hence, the same ordering works for all the circuits in the family.


Chapter 3

Provable Training of a ReLU gate

3.1 A review of provable neural training

In this chapter we will prove results about trainability of a ReLU gate under more general settings than hitherto known. To the best of our knowledge about the state-of-the-art in deep-learning, both empirical and population risk minimization questions are typically solvable in either of the following two mutually exclusive scenarios. Scenario 1: Semi-Realizable Data, i.e., the data comes as tuples $z = (x, y)$ with $y$ being the noise corrupted output of a net (of known architecture) when given $x$ as an input. And Scenario 2: Semi-Agnostic Data, i.e., data comes as tuples $z = (x, y)$ with no obvious functional relationship between $x$ and $y$, but there could be geometrical or statistical assumptions about $x$ and $y$.

We note that it's not very interesting to work in the fully agnostic setting since in that case training even a single ReLU gate can be SPN-hard, as shown in Goel et al., 2016. On the other hand the simplifications that happen for infinitely large networks have been discussed since Neal, 1996 and this theme has had a recent resurgence in works like Chizat and Bach, 2018; Jacot, Gabriel, and Hongler, 2018. Eventually this led to an explosion of literature getting linear time training of various kinds of neural nets when their width is a high degree polynomial in training set size, inverse accuracy and inverse confidence parameters (a very unrealistic regime), (Lee et al., 2018; Wu, Du, and Ward, 2019; Du et al., 2018; Su and Yang, 2019; Kawaguchi and Huang, 2019; Huang and Yau, 2019; Allen-Zhu, Li, and Song, 2019a;

Allen-Zhu, Li, and Liang, 2019; Allen-Zhu, Li, and Song, 2019b; Du and Lee, 2018; Zou et al., 2018a;

Zou and Gu, 2019; Arora et al., 2019c; Arora et al., 2019b; Li et al., 2019; Arora et al., 2019a; Lee et al.,

2018). The essential proximity of this regime to kernel methods has been examined separately in works like Allen-Zhu and Li, 2019; Wei et al., 2019.


Even in the wake of this progress, it remains unclear as to how any of this can help establish rigorous guarantees about “smaller” neural networks or more pertinently for constant size neural nets which is a regime closer to what is implemented in the real world. Thus motivated we can summarize what is open about training depth 2 nets into the following two questions,

1. Question 1 Can any algorithm train a ReLU gate to $\epsilon$ accuracy in $\mathrm{poly}\left(\text{input dimension}, \frac{1}{\epsilon}\right)$ time using neither symmetry nor compact support assumptions on the distribution?

Question 1.5 Can a single ReLU gate be trained using (Stochastic) Gradient Descent with (a) random/arbitrary initialization and (b) weakly constrained data distribution - at least

allowing it to be non-Gaussian and preferably non-compactly supported?

2. Question 2 Can a neural training algorithm work with the following naturally wanted proper-

ties being simultaneously true?

(a) Nets of depth 2 with a constant/small number of gates.

(b) The training data instances (and maybe also the noise) would have non-Gaussian non-compactly

supported distributions.

(c) Less structural assumptions on the weight matrices than being of the single filter convolutional type.

(d) $\epsilon$ approximate answers be obtainable in at most $\mathrm{poly}\left(\text{input dimension}, \frac{1}{\epsilon}\right)$ time.

3.2 A summary of our results

We make progress on some of the above fronts by drawing inspiration from two distinct streams of literature and often generalizing and blending techniques from them. First of them are the different avatars of the iterative stochastic non-gradient “Tron” algorithms analyzed in the past like, Rosen- blatt, 1958; Pal and Mitra, 1992; Freund and Schapire, 1999; Kakade et al., 2011; Klivans and Meka,

2017; Goel and Klivans, 2017; Goel, Klivans, and Meka, 2018. The second kind of historical prece- dence that we are motivated by are the different works which have shown how some of the desired theorems about gradient descent can be proven if designed noise is injected into the algorithm in judicious ways, (Raginsky, Rakhlin, and Telgarsky, 2017; Xu et al., 2018; Zhang, Liang, and Charikar,

2017; Durmus and Majewski, 2019; Lee, Mangoubi, and Vishnoi, 2019; Jin et al., 2018; Mou et al.,

2018; Li, Luo, and Qiao, 2019). Here we will be working with the simplest neural net which is just a single ReLU gate mapping $\mathbb{R}^n \ni x \mapsto \max\{0, w^\top x\} \in \mathbb{R}$, for $w \in \mathbb{R}^n$ being its weight. Here already the corresponding empirical or population risk is neither convex nor smooth in how it depends on the weights. Thus to the best of our knowledge none of the convergence results among


these provable noise assisted algorithms cited above can be directly applied to this case because these

proofs crucially leverage either convexity or very strong smoothness assumptions on the optimiza-

tion objective. We show 3 kinds of results in this chapter.

In Section 3.3 we have shown a very simple iterative stochastic algorithm to recover the underlying

parameter $w_*$ of the ReLU gate when realizable data, allowed to be sampled online, is of the form $(x, \max\{0, w_*^\top x\})$. The distributional condition is very mild and essentially just captures the intuition that enough of our samples are such that $w_*^\top x > 0$.

Not only is our algorithm’s run-time near-optimal, but to the best of our knowledge the previous attempts at this problem have solved this only for the Gaussian distribution (Soltanolkotabi, 2017;

Kalan, Soltanolkotabi, and Avestimehr, 2019). Some results like Goel, Klivans, and Meka, 2018 included a solution to this above problem as a special case of their result while assuming that the data distribution has a p.d.f. symmetric about the origin. Thus in contrast to all previous attempts our assumptions on the distribution are significantly milder.

In Section 3.4 we show the first-of-its-kind analysis of gradient descent on a ReLU gate, albeit when assisted with the injection of certain kinds of noise. We assume that the labels in the data are realizable but we make no assumptions on the distribution of the domain. We make progress by showing that such a noise assisted GD in such a situation has a "diffusive" behaviour about the global minimum, i.e., after $T$ steps of the algorithm starting from anywhere, w.h.p. all the steps of the algorithm have been within $\sqrt{T}$ distance of the global minimum of the function. The key idea here is that of "coupling"

gate one can create a discrete bounded difference super-martingale.

Remark. We would like to emphasize to the reader that in such a distribution free regime as above,

no algorithm is expected to provably train. Also note that the result is parametric in the magnitude of

the added noise and hence one can make the algorithm be arbitrarily close to being a pure gradient

descent.

In Section 3.5 we re-analyze a known algorithm called “GLM-Tron” under more general conditions

than previously to show how well it can do (empirical) risk minimization on any Lipschitz gate with

Lipschitz constant < 2 (in particular a ReLU gate) in the noisily realizable setting while no assump- tions are being made on the distribution of the noise beyond their boundedness - hence the noise can

52 3.3. Almost distribution free learning of a ReLU gate be “adversarial”. We also point out how the result can be improved under some assumptions on the noise making it more benign. Note that in contrast to the training result in Section 3.3 which used a stochastic algorithm, here we are using full-batch iterative updates to gain these extra abilities to deal with more general gates, (adversarial) noise and essentially no distributional assumptions on training data.

3.3 Almost distribution free learning of a ReLU gate

If data, x, is being sampled from a distribution D and the corresponding true labels are being gen-

n erated from a ReLU gate as max 0, w⊤x for some w R unknown to us, then the question of { ∗ } ∗ ∈ learning this ReLU gate in this realizable setting is essentially the task of trying to solve the follow- [︂(︂ ing optimization problem while having only sample access to D, minw Rn Ex D max 0, w⊤x ∈ ∼ { } − )︂2]︂ max 0, w⊤x { ∗ }

In contrast to all previous work we show the following simple algorithm which solves this learning problem to arbitrarily good accuracy assuming only very mild conditions on D. We leverage the simple intuition that if we can get to see enough labels y = max 0, w⊤x where y > 0 then w∗ is just { ∗ } the answer to the linear regression problem on those samples.

Algorithm 2 Modified SGD for a ReLU gate 1: Input: Sampling access to a distribution D on Rn. 2: Input: Oracle access to the true labels when queried with some x Rn ∈ 3: Input: An arbitrarily chosen starting point of w Rn and a constant α < 0 1 ∈ 4: for t = 1, . . . do 5: Sample x D and query the oracle with it. t ∼ 6: The oracle replies back with yt = max 0, w⊤xt { ∗ } 7: Form the gradient-proxy, g := α1 (y w⊤x )x t yt>0 t − t t t 8: w := w ηg t+1 t − t 9: end for

[︂ ]︂ Theorem 3.3.1. We assume that the the data distribution D is s.t E x 4 and the covariance ma- [︂ ]︂ [︂ ]︂ ∥ ∥ trix E xx exist. Suppose w is s.t E 1 xx is positive definite. Then if Algorithm2 is ⊤ w⊤x>0 ⊤ (︂∗ [︂ ]︂)︂∗ λmin E 1 xx⊤ w⊤x>0 n run with α < 0 and η = [︂ ∗ ]︂ starting from starting from w1 R then for T = α E 1 x 4 ∈ | | w⊤x>0·∥ ∥ (︂ 2 )︂ ∗ w1 w O ∥ − ∗∥ log ϵ2δ we would have,

[︂ 2 2]︂ P wT w ϵ 1 δ ∥ − ∗∥ ≤ ≥ −

53 Chapter 3. Provable Training of a ReLU gate

2 2 2 [︂(︂ )︂ ]︂ 2 [︂ 2]︂ It’s clear that wT w ϵ = Ex D max 0, wT⊤x max 0, w⊤x ϵ E x and ∥ − ∗∥ ≤ ⇒ ∼ { } − { ∗ } ≤ ∥ ∥ hence Algorithm2 is in effect approximately solving the risk minimization problem that we set out to solve. Also note that (a) the above convergence hold starting from arbitrary initialization w1, (b) the proof will establish along the way that the assumptions being made in the theorem are enough to ensure that the choice of η above is strictly positive and (c) for ease of interpretation we can just set

α = 1 in the above and observe how closely the choice of g in Algorithm2 resembles the stochastic − t gradient that is commonly used and is known to have great empirical success.

Proof of Theorem 3.3.1. Let the training data sampled till the iterate t be S = (x , y ),..., (x , y ) t { 1 1 t t } From the algorithm we know that the weight vector update at t-th iteration is wt+1 = wt + ηgt. Thus,

2 2 wt+1 w = wt ηg w ∥ − ∗∥ ∥ − t − ∗∥ 2 2 2 = wt w + η g 2η wt w , g (3.1) ∥ − ∗∥ ∥ t∥ − ⟨ − ∗ t⟩

We overload the notation to also denote by S the σ algebra generated by the random variables t − x1, ..., xt. Conditioned on St 1, wt is determined while wt+1 and gt are random and dependent on − the random choice of xt.

[︂ 2 ]︂ [︂ 2 ]︂ E(x ,y ) wt+1 w St 1 = E(x ,y ) wt w St 1 t t ∥ − ∗∥ | − t t ∥ − ∗∥ | − [︂⟨︂ (︂ )︂ ⟩︂ ]︂ + ( 2αη)E(x ,y ) wt w , 1yt>0 yt wt⊤xt xt St 1 − t t − ∗ − | − ⏞ ⏟⏟ ⏞ Term 1 2 [︂ 2 ]︂ + η E(x ,y ) gt St 1 (3.2) t t ∥ ∥ | − ⏞ ⏟⏟ ⏞ Term 2

Now we simplify the last two terms of the RHS above, starting from the rightmost,

54 3.3. Almost distribution free learning of a ReLU gate

[︄ ]︄ [︄ ]︄ 2 2 2 2 2 Term 2 = E ηgt St 1 = η α E 1yt>0(yt wt⊤xt) xt St 1 ∥ ∥ | − − · ∥ ∥ | − [︄ ]︄ 2 2 (︁ )︁2 2 = η α E 1yt>0 max 0, w⊤xt wt⊤xt xt St 1 · { ∗ } − · ∥ ∥ | − [︄ ]︄ 2 2 2 4 η α E 1y >0 w wt xt St 1 ≤ t ∥ ∗ − ∥ · ∥ ∥ | − [︄ ]︄ 2 2 2 4 η α w wt E 1 xt w⊤xt>0 ≤ ∥ ∗ − ∥ × ∗ · ∥ ∥

[︂ ]︂ Note that in the above step the quantity, E 1 x 4 is finite and it is easy to see why this istrue w⊤x>0 [︂ ]︂ [︂ ]︂∗ · ∥ ∥ given that w , E 1 x 4 E x 4 and we recall that the quantity in this upperbound has w⊤x>0 ∀ ∗ ∗ · ∥ ∥ ≤ ∥ ∥ been assumed to be finite in the hypothesis of the theorem.

Now we simplify Term 1 to get,

[︄ ⃓ ]︄ (︂ )︂ ⃓ Term1 = 2ηαE 1y >0 yt wt⊤xt (wt w )⊤xt⃓St 1 − t − · − ∗ ⃓ − [︄ ⃓ ]︄ (︂ )︂ ⃓ = 2ηαE 1yt>0 max 0, w⊤xt wt⊤xt (wt w )⊤xt⃓St 1 − { ∗ } − × − ∗ ⃓ − [︄ ⃓ ]︄ ⃓ 2ηαE (w wt)⊤1y >0xtxt⊤(wt w )⃓St 1 ≤ − ∗ − t − ∗ ⃓ − [︄ ⃓ ]︄ ⃓ 2η α E (wt w )⊤1y >0xtxt⊤(wt w )⃓St 1 ≤ − | | − ∗ t − ∗ ⃓ − [︄ ]︄ (︂ )︂ 2 2η α λ E 1 xtx⊤ wt w∗ (3.3) min w⊤xt>0>0 t ≤ − | | ∗ ∥ − ∥

[︄ ]︄ In the above step we invoked that the quantity E 1 xtx exists and its easy to see why this w⊤xt>0>0 t⊤ ∗ [︄ ]︄ [︄ ]︄ is true given that w , E 1 xx E xx and we recall that the covariance occurring of the w⊤x>0 ⊤ ⊤ ∀ ∗ ∗ ≤ distribution has been assumed to be finite in the hypothesis of the theorem.

We can combine both the upper bounds obtained above into the RHS of equation (3.2) to get,

55 Chapter 3. Provable Training of a ReLU gate

[︂ 2 ]︂ E(x ,y ) wt+1 w St 1 t t ∥ − ∗∥ | − (︄ )︄ (︂ [︂ ]︂)︂ 2 2 [︂ 4]︂ 2 1 2η α λ E 1 xtx⊤ + η α E 1 xt wt w (3.4) min w⊤xt>0 t w⊤xt>0 ≤ − | | × ∗ × ∗ · ∥ ∥ ∥ − ∗∥

We note that the two expectations on the RHS are properties of the distribution of the data x i.e D and

w and we make the notation explicitly reflect that. xt is a random variable that is independent of wt ∗ since xt is independent of x1,..., xt 1. Hence by taking total expectation of the above we have, −

(︄ )︄ [︂ 2]︂ (︂ [︂ ]︂)︂ 2 2 [︂ 4]︂ [︂ 2]︂ E w w 1 2η α λ E 1 xx⊤ + η α E 1 x E wt w t+1 min w⊤x>0 w⊤x>0 ∥ − ∗∥ ≤ − | | ∗ × ∗ · ∥ ∥ ∥ − ∗∥ (3.5)

[︂ 2]︂ Now we see that for Xt := E wt w the above is a recursion of the form given in Lemma 3.E.1 ∥ − ∗∥ (︂ [︂ ]︂)︂ [︂ ]︂ with c = 0, C = w w 2, η = η α , b = 2λ E 1 xx and c = E 1 x 4 2 1 ′ min w⊤x>0 ⊤ 1 w⊤x>0 ∥ − ∗∥ | | ∗ ∗ · ∥ ∥

Now we note the following inequality,

[︂ ]︂ [︂ ]︂ [︂(︂ 1 1 )︂2]︂ [︂(︂ (︂ 1 1 )︂)︂2]︂ 4 2 4 4 4 4 E 1 x = E 1 (x⊤x) = E (1 x)⊤(1 x) = E Tr (1 x)⊤(1 x) w⊤x>0 w⊤x>0 w⊤x>0 w⊤x>0 w⊤x>0 w⊤x>0 ∗ · ∥ ∥ ∗ · ∗ ∗ ∗ ∗ [︂(︂ (︂ 1 )︂)︂2]︂ 2 = E Tr 1 xx⊤ w⊤x>0 ∗

We note that the function Rn n Y Tr2(Y) R is convex and hence by Jensen’s inequality we × ∋ ↦→ ∈ have,

[︂ ]︂ 2 [︂ 1 ]︂ (︂ n (︂ [︂ ]︂)︂)︂2 (︂ [︂ ]︂)︂ 4 2 2 2 E 1 x Tr(E 1 xx⊤ ) = λi E 1 xx⊤ n λmin E 1 xx⊤ w⊤x>0 w⊤x>0 ∑ w⊤x>0 w⊤x>0 ∗ · ∥ ∥ ≥ ∗ i=1 ∗ ≥ ∗

th In the above λi indicates the i largest eigenvalue of the PSD matrix in its argument. And in partic- b2 ular the above inequality implies that c1 > 4 . Now we recall that the assumptions in the theorem which ensure that b > 0 and hence now we have b > 0 and hence the step-length prescribed in the c1 theorem statement is strictly positive.

56 3.4. Dynamics of noise assisted gradient descent on a single ReLU gate

b Thus by invoking the first case of Lemma 3.E.1 we have that for η′ = η α = we have ϵ′ > 0, | | 2c1 ∀ [︂ ]︂ (︂ 2 )︂ 2 2 w1 w E wT w ϵ′ for T = O log ∥ −2 ∗∥ ∥ − ∗∥ ≤ ϵ′

Thus given a ϵ > 0, δ (0, 1) we choose ϵ 2 = ϵ2δ and then by Markov inequality we have what we ∈ ′ set out to prove, [︂ 2]︂ 2]︂ P wT w ϵ 1 δ ∥ − ∗∥ ≤ ≥ −

3.4 Dynamics of noise assisted gradient descent on a single ReLU

gate

As noted earlier it remains a significant challenge to prove the convergence of SGD or GD for aReLU gate except for Gaussian data distributions. Towards this open question, we draw inspiration from ideas in Lee, Mangoubi, and Vishnoi, 2019 and we focus on analyzing a noise assisted version of gradient dynamics on a ReLU gate in the realizable case as given in Algorithm3. In this setting we will see that we have some non-trivial control on the behaviour of the iterates despite making no distributional assumptions about the training data beyond realizability.

Algorithm 3 Noise Assisted Gradient Dynamics on a single ReLU gate (realizable data)

1: Input: We assume being given a step-length sequence ηt t=1,2,... and (xi, yi) i=1,...,S tuples n { } { } where yi = fw (xi) for some w R where fw is s.t ∗ ∗ ∈ n R x f (x) = ReLU(w⊤x) = max 0, w⊤x R ∋ ↦→ w { } ∈

2: Start at w0 3: for t = 0, . . . do 1 S (︂ )︂ 4: Choice of Sub-Gradient := gt = S ∑i=1 1w x 0 yi fwt (xi) xi − t⊤ i≥ − 5: wt+1 := wt ηt(gt + ξt,1) + √ηtξt,2 ▷ ξt,1 is 0 mean bounded random variable − [︂ ]︂ 6: ▷ ξ is a 0 mean random variable s.t E ξ 2 < n t,2 ∥ t,2∥ 7: end for

Note that in the above algorithm the indicator functions occurring in the definition of gt are for the condition w x 0 for the ith data point. Whereas for the g used in Algorithm2 in the previous t⊤ i ≥ − t section the indicator was for the condition yt > 0 and hence dependent on w rather than wt. ∗

57 Chapter 3. Provable Training of a ReLU gate

Theorem 3.4.1. We analyze Algorithm3 with constant step length ηt = η Let C := maxi=1,...,S xi , [︂ ]︂ ∥ ∥ S > 0 be s.t t = 1, . . . , ξ S and ξ be mean 0, i.i.d as say ξ s.t E ξ 2 < n, t = 1 ∀ ∥ t,1∥ ≤ 1 { t,2}t=1,... 2 ∥ t,2∥ ∀ 1, . . .

+ 1 2 2 Then for any imax Z , λ > 0, CL (0, √n), 0 < η < 2 and r λ + w0 w + ∈ ∈ C √2imax ∗ ≥ ∥ − ∗∥ {︂ 2 4 2 2 }︂ imax 2η (C r + S1) + ηn we have, ∗

[︄ ]︄ P i 1, . . . , imax wi w > r ∃ ∈ { } | ∥ − ∗∥ ∗ (︄ {︄ }︄)︄ [︂ ]︂ λ2 1 imax P ξ2 > CL + exp (︂ )︂ (︂ )︂ ≤ ∥ ∥ − 2imax × 2 2 2 4 2 2 2√ηCL r + η(C r + S1) + η 2S1r + n + 2C r + 2η(C r + S1) ∗ ∗ ∗ ∗ ∗

[︂ ]︂ Remark. Thus for η small enough and if P ξ > C is small then with significant probability the ∥ 2∥ L noise assisted gradient dynamics on a single ReLU gate in its first imax steps remains confined inside a ball around the true parameter of radius,

√︄ 2 2 2 λ + w0 w + imax(ηd + 2η S ) r ∥ − ∗∥ 1 ∗ ≥ 1 2i η2C4 − max

Larger the λ > 0 we choose greater the (exponential) suppression in the probability that we get of

finding the iterates outside the ball of radius r around the origin whereby r scales as √λ. ∗ ∗

Also we note that for the above two natural choices of the distribution for ξ , t = 1, . . . are (a) { t,2 } ξ = 0, t = 1, . . . and (b) (ξ ) N(0, σ ), i = 1, . . . , n, t = 1, . . . where the σ , i = 1, . . . , n { t,2 } { t,2 i ∼ i } { i } can be chosen as follows : corresponding to this choice of distributions we invoke Equation 3.5 from C2 [︂ L ]︂ − 2 [︂ ]︂ 8 E ξ2 Ledoux and Talagrand, 2013 to note that P ξ > C 4e × ∥ ∥ . Thus for the guarantee ∥ 2∥ L ≤ 2 CL − [︂ ]︂ 8 E ξ 2 [︂ ]︂ × ∥ 2∥ 1 2 n 2 in the theorem to be non-trivial we need, e < . Now note that E ξ2 = ∑ σ 4imax ∥ ∥ i=1 i and hence the above condition puts a smallness constraint on the variances of the Gaussian noise 2 n 2 CL depending on how large an imax we want, ∑ σ < i=1 i 8 log(4imax)

58 3.4. Dynamics of noise assisted gradient descent on a single ReLU gate

Proof of Theorem 3.4.1.

th For convenience we will use the notation, g˜ t := gt + ξt,1. Suppose that at the t iterate we have that

wt w r . Given this we will get an upperbound on how far can wt+1 be from w . Towards ∥ − ∗∥ ≤ ∗ ∗ this we observe that,

2 2 wt+1 w = wt ηtg˜ + √ηtξt,2 w ∥ − ∗∥ ∥ − t − ∗∥ 2 2 = wt w + ηtg˜ + √ηtξt,2 + 2 wt w , ηtg˜ + √ηtξt,2 (3.6) ∥ − ∗∥ ∥ − t ∥ ⟨ − ∗ − t ⟩

2 2 2 2 3/2 Expanding the second term above as, ηtg˜ + √ηtξ = η g˜ + ηt ξ 2η g˜ , ξ ∥ − t t,2∥ t ∥ t∥ ∥ t,2∥ − t ⟨ t t,2⟩ and combining into 3.6, we have,

2 2 wt+1 w wt w ∥ − ∗∥ − ∥ − ∗∥ 3/2 2 2 2 = ξt,2, 2η g˜ + 2√ηtwt 2√ηtw + ηt ξt,2 + ηt g˜ 2ηt wt w , g˜ ⟨ − t t − ∗⟩ ∥ ∥ ∥ t∥ − ⟨ − ∗ t⟩ 3/2 2 2 2 = ξt,2, 2η g˜ + 2√ηtwt 2√ηtw + ηt ξt,2 + ηt g˜ 2ηt wt w , ξt,1 ⟨ − t t − ∗⟩ ∥ ∥ ∥ t∥ − ⟨ − ∗ ⟩

2ηt wt w , g − ⟨ − ∗ t⟩ 2 2 2 2ηt wt w , ξt,1 + ηt g˜ + 2√ηt ξt,2, ηtg˜ + wt w + ηt ξt,2 (3.7) ≤ − ⟨ − ∗ ⟩ ∥ t∥ ⟨ − t − ∗⟩ ∥ ∥

In the last line we have used the Lemma 3.4.2 which shows this critical fact that wt w , g 0. ⟨ − ∗ t⟩ ≥

nd Now we use the definition g˜ t = gt + ξt,1 on the 2 term in the RHS of equation 3.7 to get,

2 2 wt+1 w wt w ∥ − ∗∥ − ∥ − ∗∥ 2 2 2 2 2ηt wt w , ξt,1 + 2ηt ( g + ξt,1 ) + 2√ηt ξt,2, ηtg˜ + wt w + ηt ξt,2 ≤ − ⟨ − ∗ ⟩ ∥ t∥ ∥ ∥ ⟨ − t − ∗⟩ ∥ ∥ 2 2 2 2 2ηt wt w , ξt,1 + 2ηt ( g + S1) 2√ηt ξt,2, ηtg˜ + 2√ηt ξt,2, wt w + ηt ξt,2 ≤ − ⟨ − ∗ ⟩ ∥ t∥ − ⟨ t⟩ ⟨ − ∗⟩ ∥ ∥ 2 2 2 [︂ 2 ]︂ 2ηt ( g + S1) + ηtn + 2ηt wt w , ξt,1 2√ηt ξt,2, ηtg˜ + 2√ηt ξt,2, wt w + ηt( ξt,2 n) ≤ ∥ t∥ − ⟨ − ∗ ⟩ − ⟨ t⟩ ⟨ − ∗⟩ ∥ ∥ − (3.8)

Now we will get a finite bound on g 2 by invoking the definition of r and C as follows, ∥ t∥ ∗

59 Chapter 3. Provable Training of a ReLU gate

S S 1 (︂ )︂ 1 2 gt = yi ReLU(wt⊤xi) 1 ( xi) = gt C (w wt)⊤xi C r (3.9) ∑ wt⊤xi 0 ∑ ∗ ∗ S i=1 − ≥ − ⇒ ∥ ∥ ≤ S × i=1 | − | ≤

Substituting this back into equation 3.8 we have,

2 2 wt+1 w wt w ∥ − ∗∥ − ∥ − ∗∥ {︂ 2 4 2 2 }︂ 2ηt (C r + S1) + ηtn ≤ ∗ [︂ 2 ]︂ + 2ηt wt w , ξt,1 2√ηt ξt,2, ηtg˜ + 2√ηt ξt,2, wt w + ηt( ξt,2 n) (3.10) − ⟨ − ∗ ⟩ − ⟨ t⟩ ⟨ − ∗⟩ ∥ ∥ −

{︂ }︂ ξt,2 We define w0′ = w0 and ξt′,2 = min CL, ξt,2 ξ and CL (0, √n). ∥ ∥ ∥ t,2∥ ∈

Now we define a delayed stochastic process associated to the given algorithm,

(︂ )︂ w = w 1 + w η g˜ + η ξ 1 t′+1 t′ w′ w r t′ t t √ t t′,2 w′ w

In the above we note that whenever the primed iterate steps out of the r ball it is made to stop. ∗ Associated to the above we define another stochastic process as follows,

2 {︂ 2 4 2 2 }︂ zt := wt′ w t 2ηt (C r + S1) + ηtn (3.11) ∥ − ∗∥ − ∗

In Lemma 3.4.3 we prove the crucial property that for ηt = η > 0 a constant, the stochastic process z is a bounded difference process i.e z z k for all t = 1, . . . and { t}t=0,1,... | t+1 − t| ≤

(︂ 2 )︂ (︂ 2 2 4 2 2 )︂ k = 2√ηCL r + η(C r + S1) + η 2S1r + n + 2C r + 2η(C r + S1) ∗ ∗ ∗ ∗ ∗

Now note that z0 is a constant since w0 is so. The proof of Lemma 3.4.3 splits the analysis into two cases which we revisit again : in Case 1 in there we have z z < 0 for η = η > 0. And in Case t+1 − t t 2 therein we take a conditional expectation of the RHS of equation 3.16 w.r.t the sigma-algebra Ft generated by z ,..., z . Then the first two terms will go to 0 and the last term will give anegative { 0 t}

60 3.4. Dynamics of noise assisted gradient descent on a single ReLU gate contribution since ξ C and C2 < n by definition. ∥ t′,2∥ ≤ L L

Thus the stochastic process z0, . . . satisfies the conditions of the concentration of measure Theorem 3.D.1 and thus we get that for any λ > 0 and t > 0 and k as defined above,

[︂ ]︂ λ2 P z z λ e− 2tk t − 0 ≥ ≤

And explicitly the above is equivalent to,

[︂ 2 {︂ 2 4 2 2 }︂ 2 ]︂ P wt′ w t 2η (C r + S1) + ηn w0 w λ ∥ − ∗∥ − ∗ − ∥ − ∗∥ ≥ {︄ }︄ λ2 1 exp (︂ )︂ (︂ )︂ (3.12) ≤ − 2t × 2 2 2 4 2 2 2√ηCL r + η(C r + S1) + η 2S1r + n + 2C r + 2η(C r + S1) ∗ ∗ ∗ ∗ ∗

2 2 The definition of r given in the theorem statement is that it satisfies, r λ + w0 w + ∗ ∗ ≥ ∥ − ∗∥ {︂ 2 4 2 2 }︂ t 2η (C r + S1) + ηn . Then the following is implied by equation 3.12, ∗

[︂ 2 2]︂ P wt′ w r ∥ − ∗∥ ≥ ∗ {︄ }︄ λ2 1 exp (︂ )︂ (︂ )︂ (3.13) ≤ − 2t × 2 2 2 4 2 2 2√ηCL r + η(C r + S1) + η 2S1r + n + 2C r + 2η(C r + S1) ∗ ∗ ∗ ∗ ∗

For the given positive integer imax consider the event,

{︂ }︂ E := i 1, . . . , imax wi w > r (3.14) ∃ ∈ { } | ∥ − ∗∥ ∗

Define the event E := ξ > C . Thus, if E never happens, then the primed and the unprimed t {∥ t,2∥ L} t sequences both evolve the same unless wt′ leaves the r ball around w i.e., wt′ and wt both leave the ∗ ∗ r ball around w . ∗ ∗

The sample space can be written as a disjoint union of events A := imax E and B := imax Ec. ∪t=1 t ∩t=1 t

Let Lt be the event that t is the first time instant when wt leaves the ball. Let Lt′ be the event that t is the first time instant when wt′ leaves the ball. And we have argued above that when B happens

61 Chapter 3. Provable Training of a ReLU gate

[︂ ]︂ the two sequences evolve the same which in turn implies B Lt = B Lt′. Thus we have, P Lt = [︂ ]︂ [︂ ]︂ [︂ ]︂ [︂ ]︂ ∩ ∩ P L A + P L B = P L A + P L B t ∩ t ∩ t ∩ t′ ∩

And combining this with E defined in 3.14 we have,

[︂ ]︂ imax [︂ ]︂ imax (︂ [︂ ]︂ [︂ ]︂)︂ [︂ ]︂ imax [︂ ]︂ imax (︂ [︂ ]︂ [︂ ]︂)︂ P E = ∑ P Lt = ∑ P Lt A + P Lt′ B P A + ∑ P Lt′ ∑ P Et + P Lt′ t=1 t=1 ∩ ∩ ≤ t=1 ≤ t=1

The first equality and the first inequality above are true because Lt are disjoint events.

[︂ ]︂ [︂ ]︂ We further note that, P Lt′ P wt′ w > r ≤ ∥ − ∗∥ ∗

Hence combining the above two inequalities we have,

[︄ ]︄ imax (︂ [︂ ]︂ [︂ ]︂)︂ P i 1, . . . , imax wi w > r P ξt,2 > CL + P wt′ w > r ∗ ∗ ∑ ∗ ∗ ∃ ∈ { } | ∥ − ∥ ≤ t=1 ∥ ∥ ∥ − ∥

We invoke (a) the definition of the random variable ξ2 and (b) equation 3.13 on each of the summands in the RHS above and we can infer that,

[︄ ]︄ P i 1, . . . , imax wi w > r ∃ ∈ { } | ∥ − ∗∥ ∗ (︄ {︄ }︄)︄ [︂ ]︂ λ2 1 imax P ξ2 > CL + exp (︂ )︂ (︂ )︂ ≤ ∥ ∥ − 2imax × 2 2 2 4 2 2 2√ηCL r + η(C r + S1) + η 2S1r + n + 2C r + 2η(C r + S1) ∗ ∗ ∗ ∗ ∗

This proves the theorem we wanted.

62 3.4. Dynamics of noise assisted gradient descent on a single ReLU gate

Lemma 3.4.2. wt w , g 0 ⟨ − ∗ t⟩ ≥

Proof. We can obtain a (positive) lower bound on the inner product term wt w , g , ⟨ − ∗ t⟩

wt w , g ⟨ − ∗ t⟩ 1 S ⟨︂ (︂ )︂ ⟩︂ = wt w , yi ReLU(wt⊤xi) 1(wt⊤xi 0)xi ∑ ∗ − S i=1 − − ≥ 1 S (︂ )︂(︂ )︂ = ∑ wt⊤xi w∗⊤xi ReLU(w∗⊤xi) ReLU(wt⊤xi) 1(wt⊤xi 0) − S i=1 − − ≥ 1 S (︂ )︂(︂ )︂ = ∑ w∗⊤xi wt⊤xi ReLU(w∗⊤xi) ReLU(wt⊤xi) 1(wt⊤xi 0) S i=1 − − ≥ 1 S (︂ )︂2 ∑ ReLU(w∗⊤xi) ReLU(wt⊤xi) 1(wt⊤xi 0) ≥ S i=1 − ≥

Lemma 3.4.3. If for all t = 0, . . . we have η = η a constant > 0 then the stochastic process z t { t}t=0,1,... defined in equation 3.11 is a bounded difference stochastic process i.e there exists a constant k > 0 s.t for all t = 1, . . ., z z k | t+1 − t| ≤

Proof. We have 2 cases to consider.

Case 1 : wt′ w r ∥ − ∗∥ ≥ ∗

2 4 2 2 2 4 2 2 zt+1 zt = (t + 1)(2ηt+1(C r + S1) + ηt+1n) + t(2ηt (C r + S1) + ηtn) (3.15) − − ∗ ∗ 4 2 2 {︂ 2 2 }︂ {︂ }︂ = 2(C r + S1) tηt (t + 1)ηt+1 + n tηt (t + 1)ηt+1 ∗ − −

Case 2 : wt′ w < r ∥ − ∗∥ ∗

Repeating the calculations as used to get equation 3.10 but with ξt′,2 instead of ξt,2 we will get,

2 zt+1 zt 2ηt wt w , ξt,1 2√ηt ξt′,2, ηtg˜ + 2√ηt ξt′,2, wt w + ηt( ξt′,2 n) (3.16) − ≤ − ⟨ − ∗ ⟩ − ⟨ t⟩ ⟨ − ∗⟩ ∥ ∥ −

63 Chapter 3. Provable Training of a ReLU gate

And by Cauchy-Schwartz the above implies,

2 2 zt+1 zt 2ηtS1r + 2√ηt(r + ηt(C r + S1))CL + ηt(CL d) (3.17) − ≤ ∗ ∗ ∗ −

Further repeating the calculations as used to get equation 3.10 but with ξt′,2 instead of ξt,2 we will get,

2 2 ⟨︂ ⟩︂ wt′ w wt′+1 w 2ηtr ( g + ξt,1 ) + 2√ηt ξt′,2, ηtg˜ + (wt w ) ∥ − ∗∥ − ∥ − ∗∥ ≤ ∗ ∥ t∥ ∥ ∥ − t − ∗

In the above we invoke the definition of S1 and equation 3.9 to get,

2 2 2 ⟨︂ ⟩︂ wt′ w wt′+1 w 2ηtr (C r + S1) + 2√ηt ξt′,2, ηtg˜ + (wt w ) ∥ − ∗∥ − ∥ − ∗∥ ≤ ∗ ∗ − t − ∗ 2 2 2ηtr (C r + S1) + 2√ηtCL(ηt(C r + S1) + r ) ≤ ∗ ∗ ∗ ∗

Hence we have,

2 4 2 2 2 2 zt zt+1 2ηt (C r + S1) + ηtn + 2ηtr (C r + S1) + 2√ηtCL(ηt(C r + S1) + r ) (3.18) − ≤ ∗ ∗ ∗ ∗ ∗

Combining equations 3.17 and 3.18 we have,

(︂ 2 )︂ zt zt+1 2ηtS1r + 2√ηtCL r + ηt(C r + S1) | − | ≤ ∗ ∗ ∗ {︂ 2 [︂ 2 2 4 2 2 ]︂}︂ + max ηt(CL n), ηt (n + 2C r ) + 2ηt(C r + S1) (3.19) − ∗ ∗

If we now invoke the that ηt = η, a positive constant then the above and the previous equation 3.15 can be further combined to get for all t = 0, . . .,

64 3.5. GLM-Tron converges on certain Lipschitz gates with no symmetry assumption on the data

{︄ 4 2 2 2 zt zt+1 max nη + 2(C r + S1)η | − | ≤ ∗ }︄ (︂ 2 )︂ {︂ 2 [︂ 2 2 4 2 2 ]︂}︂ , 2ηS1r + 2√ηCL r + η(C r + S1) + max η(CL n), η (n + 2C r ) + 2η(C r + S1) ∗ ∗ ∗ − ∗ ∗ {︄ 4 2 2 2 max nη + 2(C r + S1)η ≤ ∗ }︄ (︂ 2 )︂ (︂ 2 2 4 2 2 )︂ , 2√ηCL r + η(C r + S1) + η 2S1r + n + 2C r + 2η(C r + S1) ∗ ∗ ∗ ∗ ∗

(︂ 2 )︂ (︂ 2 2 4 2 2 )︂ 2√ηCL r + η(C r + S1) + η 2S1r + n + 2C r + 2η(C r + S1) (3.20) ≤ ∗ ∗ ∗ ∗

2 In the second inequality above we are using invoking our assumption that CL < n. And this proves the boundedness of the stochastic process z as we set out to prove and a candidate k is the RHS { t} above.

3.5 GLM-Tron converges on certain Lipschitz gates with no sym-

metry assumption on the data

Algorithm 4 GLM-Tron 1: Input: (x , y ) and an “activation function” σ : R R { i i }i=1,...,m → 2: w1 = 0

3: for t = 1, . . . do

1 m (︂ )︂ (︂ )︂ 4: w := wt + ∑ y σ( wt, x ) x ▷ Define ht(x) := σ wt, x t+1 m i=1 i − ⟨ i⟩ i ⟨ ⟩ 5: end for

First we state the following crucial lemma,

Lemma 3.5.1. Assume that for all i = 1, . . . , S x 1 and in Algorithm4, σ is a L Lipschitz non- ∥ i∥ ≤ − decreasing function. Given any w and W s.t at iteration t, we have wt w W, define η > 0 s.t (︂ )︂ ∥ − ∥ ≤ 1 ∑S y σ( w, x ) x η. Then it follows that t = 1, 2, . . ., ∥ S i=1 i − ⟨ i⟩ i∥ ≤ ∀

(︂ 2 )︂ (︂ )︂ w w 2 w w 2 1 L˜ (h ) + η2 + 2ηW(L + 1) ∥ t+1 − ∥ ≤ ∥ t − ∥ − L − S t

2 2 1 S (︂ )︂ 1 S (︂ )︂ where we have defined, L˜ (ht) := ∑ ht(x ) σ( w, x ) = ∑ σ( wt, x ) σ( w, x ) S S i=1 i − ⟨ i⟩ S i=1 ⟨ i⟩ − ⟨ i⟩

65 Chapter 3. Provable Training of a ReLU gate

The above algorithm was introduced in Kakade et al., 2011 for bounded activations. Here we show the applicability of that idea for more general activations and also while having adversarial attacks on the labels. We give the proof of the above Lemma in Appendix 3.A. Now we will see in the following theorem and its proof as to how the above Lemma leads to convergence of the “effective-ERM”, L˜ S by GLM-Tron on a single gate.

Theorem 3.5.2. [GLM-Tron (Algorithm4) solves the effective-ERM on a ReLU gate upto noise bound with minimal distributional assumptions] Assume that for all i = 1, . . . , S x 1 and ∥ i∥ ≤ th the label of the i data point yi is generated as, yi = σ( w , xi ) + ξi s.t i, ξi θ for some θ > 0 ⟨ ∗ ⟩ ∀ | | ≤ n w and w R . If σ is a L Lipschitz non-decreasing function for L < 2 then in at most T = ∥ ∗∥ ∗ ∈ − ϵ GLM-Tron steps we would attain parameter value wT s.t,

S 2 1 (︂ )︂ L (︂ 2 )︂ L˜ S(hT) = σ( wT, xi ) σ( w , xi ) < ϵ + (θ + 2θW(L + 1)) S ∑ ⟨ ⟩ − ⟨ ∗ ⟩ 2 L i=1 −

Remark. Firstly Note that in the realizable setting i.e when θ = 0, the above theorem is giving an upperbound on the number of steps needed to solve the ERM on say a ReLU gate to O(ϵ) accuracy.

Secondly observe that the above theorem does not force any distributional assumption on the ξi be- yond the assumption of its boundedness. Thus the noise could as well be chosen “adversarially” upto the constraint on its norm.

The above Theorem is proven in Appendix 3.B. If we make some assumptions on the noise being somewhat benign then we can get the following.

Theorem 3.5.3 (Performance guarantees on the GLM-Tron (Algorithm4) in solving the ERM prob- lem with data labels being output of a ReLU gate corrupted by benign noise). Assume that the noise random variables ξi, i = 1, . . . , S are identically distributed as a centered random variable say w ξ. Then for T = ∥ ϵ ∥ , we have the following guarantee on the (true) empirical risk after T iterations of GLM-Tron (say L˜ S(hT)),

[︂ ]︂ 2 L (︂ 2 )︂ E (x ,ξ ) LS(hT) Eξ [ξ ] + ϵ + (θ + 2θW(L + 1)) { i i }i=1,...S ≤ 2 L −

66 3.6. Conclusion

The above is proven in Appendix 3.C. Here we note a slight generalization of the above that can be easily read off from the above.

Corollary 3.5.4. Suppose that instead of assuming i = 1, . . . , S ξi θ we instead assume that the [︂ ∀ ]︂| | ≤ joint distribution of ξ is s.t P ξ θ i 1, . . . , S 1 δ Then it would follow that { i}i=1,...,S | i| ≤ ∀ ∈ { } ≥ − the guarantee of the above Theorem 3.5.3 still holds but now with probability 1 δ over the noise − distribution.

3.6 Conclusion

In this chapter we have initiated a number of directions of investigation towards understanding the trainability of finite sized nets while making minimal assumptions about the distribution ofthe data. A lot of open questions emanate from here which await answers. Of them we would like to particularly emphasize the issue of seeking a generalization of the results of Section 3.3 and Section

3.4 to single filter depth 2 nets as given in Definition 16 below, which in many ways can be said to be the next more complicated case to consider,

Definition 16 (Single Filter Neural Nets Of Depth 2). Given a set of k matrices A Rr n, a w Rr i ∈ × ∈ and an activation function σ : R R we call the following depth 2, width k neural net to be a “single → filter neural net” defined by the matrices1 A ,...,Ak

k n 1 (︂ )︂ R x fw(x) = ∑ σ w⊤Aix R ∋ ↦→ k i=1 ∈ and where σ is the “Leaky-ReLU” which maps as, R y σ(y) = y1y 0 + αy1y<0 for some α 0 ∋ ↦→ ≥ ≥

Note that the above class of nets includes any single ReLU gate for α = 0, k = 1, A1 = In n and it also × includes any depth 2 convolutional neural net with a single filter by setting the Ai′s to be 0/1 matrices such that each row has exactly one 1 and each column has at most one 1.

We would like to point out that towards this goal it would be interesting to settle a critical intermedi- ate problem which is to know whether the sequence of random variables generated by noisy gradient descent on a ReLU gate as given in Algorithm3 have distributional convergence and if they do then to find the corresponding rate.

67 Chapter 3. Provable Training of a ReLU gate

Appendix To Chapter3

3.A Proof of Lemma 3.5.1

Proof. We observe that,

S 2 2 2 (︂ 1 (︂ )︂ )︂ 2 wt w wt+1 w = wt w wt + ∑ yi σ( wt, xi ) xi w ∥ − ∥ − ∥ − ∥ ∥ − ∥ − ∥ S i=1 − ⟨ ⟩ − ∥ S S 2 ⟨︂(︂ )︂ ⟩︂ 1 (︂ )︂ 2 = ∑ yi σ( wt, xi ) xi, wt w ∑ yi σ( wt, xi ) xi − S i=1 − ⟨ ⟩ − − ∥ S i=1 − ⟨ ⟩ ∥ S S 2 (︂ )︂(︂ )︂ 1 (︂ )︂ 2 = ∑ yi σ( wt, xi ) w, xi wt, xi ∑ yi σ( wt, xi ) xi S i=1 − ⟨ ⟩ ⟨ ⟩ − ⟨ ⟩ − ∥ S i=1 − ⟨ ⟩ ∥ (3.21)

Analyzing the first term in the RHS above we get,

2 S (︂ )︂(︂ )︂ ∑ yi σ( wt, xi ) w, xi wt, xi S i=1 − ⟨ ⟩ ⟨ ⟩ − ⟨ ⟩ 2 S (︂ )︂(︂ )︂ = ∑ yi σ( w, xi ) + σ( w, xi ) σ( wt, xi ) w, xi wt, xi S i=1 − ⟨ ⟩ ⟨ ⟩ − ⟨ ⟩ ⟨ ⟩ − ⟨ ⟩ 2 S ⟨︂(︂ )︂ ⟩︂ 2 S (︂ )︂(︂ )︂ = ∑ yi σ( w, xi ) xi, w wt + ∑ σ( w, xi ) σ( wt, xi ) xi, w xi, wt S i=1 − ⟨ ⟩ − S i=1 ⟨ ⟩ − ⟨ ⟩ ⟨ ⟩ − ⟨ ⟩ 2 S (︂ )︂(︂ )︂ 2ηW + ∑ σ( w, xi ) σ( wt, xi ) xi, w xi, wt ≥ − S i=1 ⟨ ⟩ − ⟨ ⟩ ⟨ ⟩ − ⟨ ⟩

In the first term above we have invoked the definition of η and W given in the Lemma. Further since

we are given that σ is non-decreasing and L Lipschitz, we have for the second term in the RHS − above,

2 2 S (︂ )︂(︂ )︂ 2 S (︂ )︂ 2 ∑ σ( w, x ) σ( wt, x ) x , w x , wt ∑ σ( w, x ) σ( wt, x ) =: L˜ (ht) S i=1 ⟨ i⟩ − ⟨ i⟩ ⟨ i ⟩ − ⟨ i ⟩ ≥ SL i=1 ⟨ i⟩ − ⟨ i⟩ L S

Thus together we have,

68 3.A. Proof of Lemma 3.5.1

S (︂ )︂(︂ )︂ 2 2 ˜ ∑ yi σ( wt, xi ) w, xi wt, xi 2ηW + LS(ht) (3.22) S i=1 − ⟨ ⟩ ⟨ ⟩ − ⟨ ⟩ ≥ − L

Now we look at the second term in the RHS of equation 3.21 and that gives us,

S S 1 (︂ )︂ 2 1 (︂ )︂ 2 ∑ yi σ( wt, xi ) xi = ∑ yi σ( w, xi ) + σ( w, xi ) σ( wt, xi ) xi ∥ S i=1 − ⟨ ⟩ ∥ ∥ S i=1 − ⟨ ⟩ ⟨ ⟩ − ⟨ ⟩ ∥ S S S 1 (︂ )︂ 2 1 (︂ )︂ 1 (︂ )︂ ∑ yi σ( w, xi ) xi + 2 ∑ yi σ( w, xi ) xi ∑ σ( w, xi ) σ( wt, xi ) xi ≤ ∥ S i=1 − ⟨ ⟩ ∥ ∥ S i=1 − ⟨ ⟩ ∥ × ∥ S i=1 ⟨ ⟩ − ⟨ ⟩ ∥ S 1 (︂ )︂ 2 + ∑ σ( w, xi ) σ( wt, xi ) xi ∥ S i=1 ⟨ ⟩ − ⟨ ⟩ ∥ S S 2 1 (︂ )︂ 1 (︂ )︂ 2 η + 2η ∑ σ( w, xi ) σ( wt, xi ) xi + ∑ σ( w, xi ) σ( wt, xi ) xi (3.23) ≤ ∥ S i=1 ⟨ ⟩ − ⟨ ⟩ ∥ ∥ S i=1 ⟨ ⟩ − ⟨ ⟩ ∥

Now by Jensen’s inequality we have,

2 1 S (︂ )︂ 2 1 S (︂ )︂ ∑ σ( w, x ) σ( wt, x ) x ∑ σ( w, x ) σ( wt, x ) = L˜ (ht) ∥ S i=1 ⟨ i⟩ − ⟨ i⟩ i∥ ≤ S i=1 ⟨ i⟩ − ⟨ i⟩ S

And we have from the definition of L and W,

1 S (︂ )︂ L S ∑ σ( w, xi ) σ( wt, xi ) xi ∑ w wt L W ∥ S i=1 ⟨ ⟩ − ⟨ ⟩ ∥ ≤ S i=1∥ − ∥ ≤ ×

Substituting the above two into the RHS of equation 3.23 we have,

S (︂ )︂ 1 2 2 ˜ ∑ yi σ( wt, xi ) xi η + 2ηLW + LS(ht) (3.24) ∥ S i=1 − ⟨ ⟩ ∥ ≤

Now we substitute equations 3.22 and 3.24 into equation 3.21 to get,

(︂ 2 )︂ w w 2 w w 2 2ηW + L˜ (h ) (η2 + 2ηLW + L˜ (h )) ∥ t − ∥ − ∥ t+1 − ∥ ≥ − L S t − S t

The above simplifies to the inequality we claimed in the lemma i.e,

69 Chapter 3. Provable Training of a ReLU gate

(︂ 2 )︂ (︂ )︂ w w 2 w w 2 1 L˜ (h ) + η2 + 2ηW(L + 1) ∥ t+1 − ∥ ≤ ∥ t − ∥ − L − S t

3.B Proof of Theorem 3.5.2

Proof. The equation defining the labels in the data-set i.e yi = σ( w , xi ) + ξi with ξi θ along ⟨ ∗ ⟩ | | ≤ 1 S (︂ )︂ with our assumption that, xi 1 implies that , ∑ = yi σ( w , xi ) xi θ. Thus we can ∥ ∥ ≤ ∥ S i 1 − ⟨ ∗ ⟩ ∥ ≤ invoke the above Lemma 3.5.1 between the tth and the t + 1th iterate with η = θ and W as defined

there to get,

[︃ ]︃ 2 2 (︂ 2 )︂ 2 wt+1 w wt w 1 L˜ S(ht) (θ + 2θW(L + 1)) ∥ − ∗∥ ≤ ∥ − ∗∥ − L − −

(︂ )︂ ˜ L 2 2 2 If LS(ht) 2 L ϵ + (θ + 2θW(L + 1)) then, wt+1 w wt w ϵ. Thus if the above ≥ − ∥ − ∥ ≤ ∥ − ∥ − lowerbound on L˜ (h ) holds in the tth step then at the start of the (t + 1)th step we still satisfy, w s t ∥ t+1 − w W. Since the iterations start with w1 = 0, in the first step we can choose W = w . Thus in at ∥ ≤ ∥ ∗∥ w most ∥ ϵ ∥ steps of the above kind we can have a decrease in distance of the iterate to w.

w Thus in at most T = ∥ ϵ ∥ steps we have attained,

S 2 1 (︂ )︂ L (︂ 2 )︂ L˜ S(hT) = σ( wT, xi ) σ( w , xi ) < ϵ + (θ + 2θW(L + 1)) S ∑ ⟨ ⟩ − ⟨ ∗ ⟩ 2 L i=1 −

And that proves the theorem we wanted.

3.C Proof of Theorem 3.5.3

Proof. Let the true empirical risk at the Tth iterate be defined as, −

1 S (︂ )︂2 LS(hT) = σ( wT, xi ) σ( w , xi ) ξi ∑ ∗ S i=1 ⟨ ⟩ − ⟨ ⟩ −

Then it follows that,

70 3.D. Reviewing a variant of the Azuma-Hoeffding Inequality

1 S (︂ )︂2 1 S (︂ )︂2 L˜ S(hT) LS(hT) = σ( wT, xi ) σ( w , xi ) σ( wT, xi ) σ( w , xi ) ξi ∑ ∗ ∑ ∗ − S i=1 ⟨ ⟩ − ⟨ ⟩ − S i=1 ⟨ ⟩ − ⟨ ⟩ − S S S 1 (︂ )︂ 1 2 2 (︂ )︂ = ξi ξi + 2σ( wT, xi ) 2σ( w , xi ) = ξi + ξi σ( wT, xi ) σ( w , xi ) ∑ ∗ ∑ ∑ ∗ S i=1 − ⟨ ⟩ − ⟨ ⟩ − S i=1 S i=1 ⟨ ⟩ − ⟨ ⟩

By the assumption of ξi being an unbiased noise the second term vanishes when we compute, [︂ ]︂ ˜ E (x ,ξ ) LS(hT) LS(hT) Thus we are led to, { i i }i=1,...S −

[︂ ]︂ 1 [︂ m ]︂ 1 m [︂ ]︂ E L˜ (h ) L (h ) = E ξ2 = E ξ2 = E [ξ2] (xi,ξi) i=1,...S S T S T ξi i=1,...S ∑ i ∑ ξi i ξ { } − − m { } i=1 − m i=1 { } −

w ˜ For T = ∥ ϵ ∥ , we invoke the upperbound on LS(hT) from the previous theorem and we can combine it with the above to say,

[︂ ]︂ 2 L (︂ 2 )︂ E (x ,ξ ) LS(hT) Eξ [ξ ] + ϵ + (θ + 2θW(L + 1)) { i i }i=1,...S ≤ 2 L −

This proves the theorem we wanted.

3.D Reviewing a variant of the Azuma-Hoeffding Inequality

Theorem 3.D.1. Suppose we have a real valued discrete stochastic process given as, X , X ,... and { 0 i } the following properties hold,

X is a constant • 0

(The bounded difference property) i = 0, 1, . . . ci > 0 s.t Xi Xi 1 ci • ∀ ∃ | − − | ≤ [︂ ]︂ (︂{︂ }︂)︂ (The super-martingale property) i = 0, 1, . . ., E Xi Xi 1 Fi 1 0 with Fi 1 = σ X0,..., Xi 1 • ∀ − − | − ≤ − −

Then for any λ > 0 and a positive integer n we have the following concentration inequality,

1 λ2 [︂ ]︂ 2 n c2 P X X λ e− ∑i=1 i n − 0 ≥ ≤

Proof. We note that for any c, t > 0, the function f (x) = etx lies below the straight line connecting the

(︂ tc tc )︂ two points ( c, f ( c)) and (c, f (c)). This gives the inequality, etx e tc + e e− (x + c). This − − ≤ − −2c simplifies to,

71 Chapter 3. Provable Training of a ReLU gate

tc tc tx 1 tc tc (︂ e + e− )︂ e (e e− )x + (3.25) ≤ 2c − 2

Note that the above inequality holds only when x c Now we can invoke the bounded difference | | ≤ property of Xi Xi 1 ci and use equation 3.25 with x = Xi Xi 1 and c = ci to get, | − − | ≤ − −

[︂ ]︂ [︂ etci e tci (︂ )︂ (︂ etci + e tci )︂ ]︂ etci + e tci t(Xi Xi 1) − − − E e − − Fi 1 E − Xi Xi 1 + Fi 1 | − ≤ 2ci − − 2 | − ≤ 2

[︂ ]︂ The last inequality follows from the given property that, E Xi Xi 1 Fi 1 0 − − | − ≤

2 ex+e x x Now we invoke the inequality − e 2 on the RHS above to get, 2 ≤

t2c2 [︂ ]︂ i t(Xi Xi 1) 2 E e − − Fi 1 e | − ≤

t2c2 [︂ ]︂ i tXi tXi 1 2 Further since Xi 1 is Fi 1 measurable we can write the above as, E e Fi 1 e − e − − | − ≤

Now we recurse the above as follows,

2 2 2 2 n t2c2 [︂ ]︂ [︂ [︂ ]︂]︂ [︂ t cn ]︂ t cn [︂ ]︂ i tXn tXn tXn 1 tXn 1 tX0 E e = E E e Fn 1 E e 2 e − = e 2 E e − ... ∏ e 2 E[e ] | − ≤ ≤ i=1

2 [︂ ( )]︂ t n c2 Now invoking that X is a constant we can rewrite the above as, E et Xn X0 e 2 ∑i=1 i 0 − ≤

Hence for any λ > 0 we have by invoking the above,

[︂ ]︂ [︂ ]︂ [︂ ]︂ t2 n 2 t(Xn X ) tλ tλ t(Xn X ) tλ ∑ c P X X λ = P e − 0 e e− E e − 0 e− e 2 i=1 i n − 0 ≥ ≥ ≤ ≤

1 λ2 [︂ ]︂ 2 n c2 λ − ∑i=1 i Now choose t = n 2 and we get, P Xn X0 λ e ∑i=1 ci − ≥ ≤

72 3.E. A recursion estimate

3.E A recursion estimate

Lemma 3.E.1. Given constants η′, b, c1, c2 > 0 suppose one has a sequence of real numbers X1 =

C, X2, .. s.t,

2 2 X (1 η′b + η′ c )X + η′ c t+1 ≤ − 1 t 2

Given any ϵ > 0 in the following two cases we have, X ϵ 2 ′ T ≤ ′

2 If c = 0, c > b , C > 0, δ > 0, • 2 1 4 b (︂ C )︂ η′ = and T = O log 2 2c1 ϵ′

(︂ )︂2 2 b2 √ 1 If 0 < c2 c1, ϵ′ C, c ϵ′ + √ , • 1 ϵ′ ≤ ≤ ≤ (︃ )︃ 2 ϵ′ (c1 c2) (︄ log − 2 )︄ 2 Cc1 c2ϵ′ η = b ϵ′ and T = O − . ′ c1 (1+ϵ 2) (︃ )︃ · ′ 2 2 b ϵ′ log 1 c 2 2 − 1 · (1+ϵ′ )

Proof. Suppose we define α = 1 η b + η 2c and β = η 2c . Then we have by unrolling the recursion, − ′ ′ 1 ′ 2

t 1 t 1 1 α − Xt αXt 1 + β α(αXt 1 + β) + β ... α − X1 + β − . ≤ − ≤ − ≤ ≤ 1 α −

We recall that X1 = C to realize that our Lemma gets proven if we can find T s.t,

T 1 T 1 1 α − 2 α − C + β − = ϵ′ 1 α −

2 T 1 ϵ′ (1 α) β Thus we need to solve the following for T s.t, α − = C(1 −α) −β − − C 2 log T 1 ϵ′ ϵ2δ Case 1 : β = 0 In this case we see that if η > 0 is s.t α (0, 1) then α − = C = T = 1 + 1 ∈ ⇒ log α

2 2 (︂ b )︂ (︂ b2 )︂ But α = η′ c1 η′b + 1 = η′√c1 + 1 Thus α (0, 1) is easily ensured by choosing − − 2√c1 − 4c1 ∈ 2 η = b and ensuring c > b . This gives us the first part of the theorem. ′ 2c1 1 4

Case 2 : β > 0

This time we are solving,

73 Chapter 3. Provable Training of a ReLU gate

2 T 1 ϵ′ (1 α) β α − = − − (3.26) C(1 α) β − −

Towards showing convergence, we want to set η such that αt 1 (0, 1) for all t. Since ϵ 2 < C, it is ′ − ∈ ′ sufficient to require,

2 2 2 β b (︂ b )︂ β β < ϵ′ δ(1 α) = α < 1 2 1 + η′√c1 1 2 − ⇒ − ϵ′ ⇔ − 4c1 − 2√c1 ≤ − ϵ′ 2 2 2 2 2 η′ c2 b (︂ b )︂ c2 b (︂ b )︂ 2 η′√c1 2 2 √c1 ⇔ ϵ′ ≤ 4c1 − − 2√c1 ⇔ ϵ′ ≤ 4c1η′ − − 2√c1η′

Set η = b for some constant θ > 0 to be chosen such that, ′ θc1

c b2 (︂ b )︂2 c θ2 (︂ θ )︂2 2 √c = 2 c c 1 = c ϵ 2 c (θ 1) 2 b2 1 b 2 1 1 2 ′ 1 ϵ′ ≤ 4c − − 2√c ⇒ ϵ′ ≤ 4 − · 2 − ⇒ ≤ · − 1 2 2 1 θc1 · θ c1 ·

1 t 1 Since c2 c1 we can choose, θ = 1 + 2 and we have α − < 1. Also note that, ≤ ϵ′

b2 b2 b2 1 1 α = 1 + η 2c η b = 1 + = 1 (︁ )︁. ′ 1 ′ 2 2 2 − θ c1 − θc1 − c1 · θ − θ b2 ϵ 2 b2 1 = 1 ′ = 1 − c · (1 + ϵ 2)2 − c · (︂ )︂2 1 ′ 1 ϵ + 1 ′ ϵ′

2 And here we recall that the condition that the lemma specifies on the ratio b which ensures that the c1 above equation leads to α > 0

Now in this case we get the given bound on T in the Lemma by solving equation 3.26. To see this,

note that,

2 2 2 2 2 2 b ϵ′ 2 b b c2 (ϵ′ ) α = 1 2 2 and β = η′ c2 = 2 c2 = 2 2 . − c1 · (1 + ϵ′ ) θ c1 · c1 · (1 + ϵ′ )

(︃ )︃ ϵ 2(c c ) log ′ 1− 2 2 Cc c ϵ 2 ϵ δ(c c ) 1− 2 ′ T 1 = ′ 1− 2 = = + Plugging the above into equation 3.26 we get, α − Cc c ϵ 2 T 1 (︃ )︃ 1 2 ′ ⇒ 2 2 − b ϵ′ log 1 c 2 2 − 1 · (1+ϵ′ )

74 Chapter 4. Sparse Coding and Autoencoders

Chapter 4

Sparse Coding and Autoencoders

4.1 Introduction

One of the fundamental themes in learning theory is to consider data being sampled from a genera- tive model and to provide efficient methods to recover the original model parameters exactly orwith tight approximation guarantees. Classic examples include learning a mixture of gaussians (Moitra and Valiant, 2010), certain graphical models (Anandkumar et al., 2014), full rank square dictionar- ies (Spielman, Wang, and Wright, 2012; Błasiok and Nelson, 2016) and overcomplete dictionaries

(Agarwal et al., 2014; Arora et al., 2014; Arora et al., 2015; Arora, Ge, and Moitra, 2014) The prob- lem is usually distilled down to a non-convex optimization problem whose solution can be used to obtain the model parameters. With these hard non-convex problems it has been difficult to find any universal view as to why sometimes gradient descent gives very good and sometimes even exact recovery. In recent times progress has been made towards achieving a geometric understanding of the landscape of such non-convex optimization problems (Ge, Jin, and Zheng, 2017), (Mei, Bai, and

Montanari, 2016), (Wu, Zhu, et al., 2017). The corresponding question of parameter recovery for neural nets with one layer of activation has been solved in some special cases, (Du, Lee, and Tian,

2017; Allen-Zhu, 2017; Janzamin, Sedghi, and Anandkumar, 2015; Sedghi and Anandkumar, 2014; Li and Yuan, 2017; Tian, 2017; Zhang et al., 2017). Almost all of these cases are in the supervised setting where it has also been assumed that the labels are being generated from a net of the same architecture as is being trained. In contrast to these works we address an unsupervised learning problem, and possibly more realistically, we do not tie the data generation model (sensing of sparse vectors by an overcomplete incoherent dictionary) to the neural architecture being analyzed except for assuming knowledge of a few parameters about the ground truth.

Here we specialize to the generative model of dictionary learning/sparse coding where one receives samples of vectors y Rn that have been generated as y = A x where A Rn h and x Rh. ∈ ∗ ∗ ∗ ∈ × ∗ ∈

75 Chapter 4. Sparse Coding and Autoencoders

We typically assume that the number of non-zero entries in x∗ to be no larger than some function of the dimension h and that A∗ satisfies certain incoherence properties. The question now is to recover

A∗ from samples of y. There have been renewed investigations into the hardness of this problem (Tillmann, 2015) and many former results have recently been reviewed in these lectures “CBMS Con- ference on Sparse Approximation and Signal Recovery Algorithms, May 22-26, 2017 and 16th New

Mexico Analysis Seminar, May 21”. This question has been a cornerstone of learning theory ever since the ground-breaking paper by Olshausen and Field (Olshausen and Field, 1997) (a recent review by the same authors can be found in Olshausen and Field, 2005). Over the years many algorithms have been developed to solve this problem and a detailed comparison among these various approaches can be found in Błasiok and Nelson, 2016.

Autoencoder neural networks that map Rn Rn were defined in Section 1.1.1. These networks have → been used extensively (Baldi, 2012; Bengio et al., 2013; Rifai et al., 2011; Vincent et al., 2008; Vincent et al., 2010) in the past for unsupervised feature learning tasks, and have been found to be successful in generating discriminative features (Coates, Ng, and Lee, 2011). A number of different autoencoder architectures and regularizers have been proposed which purportedly induce sparsity, at the hidden layer (Arpit et al., 2016; Coates and Ng, 2011; Li et al., 2016; Ng, 2011). There has also been some investigation into what autoencoders learn about the data distribution (Alain and Bengio, 2014).

Olshausen and Field had, as early as 1996, already made the connection between sparse coding and training neural architectures and in today’s terminology this problem is very naturally reminiscent of the architecture of an autoencoder (Olshausen and Field, 1996). However, to the best of our knowl- edge, there has not been sufficient progress to rigorously establish whether autoencoders cando sparse coding.

In this work, we present our progress towards bridging the above mentioned mathematical gap. To the best of our knowledge, there is no theoretical evidence (even under the usual generative assump- tions of sparse coding) that the stationary points of any of the usual squared loss functions (with or without any of the usual regularizers) have any resemblance to the original dictionary that is being sought to be learned. The main point of this paper is to rigorously prove that for autoencoders with

ReLU activation, the standard squared loss function has a neighborhood around the dictionary A∗ where the norm of the expected gradient is very small (for large enough sparse code dimension h). Thus, all points in a neighborhood of A∗, including A∗, are all asymptotic critical points of this standard squared loss. We supplement our theoretical result with experimental evidence for it

76 4.1. Introduction in Section 4.6, which also strongly suggests that the standard squared loss function has a local mini- mum in a neighborhood around A∗. We believe that our results provide theoretical and experimental evidence that the sparse coding problem can be tackled by training autoencoders.

4.1.1 A motivating experiment on MNIST using TensorFlow

We used TensorFlow (Abadi et al., 2016) to train two ReLU autoencoders mapping R784 R784 → (since the MNIST images vectorize to elements in R784). These networks were trained on a subset of the MNIST dataset of handwritten digits. One of the nets had a single hidden layer of size 10000 and the other one had two hidden layers of size 5000 and 784 (and a fixed identity matrix giving the output from the second layer of activations). In both the cases the weights of the encoder and decoder were maintained as transposes of each other. We trained the autoencoders on the standard squared loss function using RMSProp “RMSprop Gradient Optimization”. The training was done on

6000 images of the digits 6 and 7 from the MNIST dataset. In the following panel we show four pairs

(two for each net) of “reconstructed” image i.e output of the trained net when its given as input the

“actual” photograph as input.

77 Chapter 4. Sparse Coding and Autoencoders

In our opinion, the above figures add support to the belief that a single and a double layerReLU

activated Rn Rn network can learn an implicit high dimensional structure about the handwrit- → ten digits dataset. In particular this demonstrates that though adding more hidden layers obviously helps enhance the reconstruction ability, the single hidden layer autoencoder do hold within them significant power for unsupervised learning of representations. Unfortunately analyzing theRM-

SProp update rule used in the above experiment seems to be currently beyond our analytic means - though in the next chapter we shall make some progress about understanding this algorithm. How- ever, we take inspiration from these experiments to devise a different mathematical set-up which is much more amenable to analysis taking us towards a better understanding of the power of autoen- coders.

4.2 Introducing the neural architecture and the distributional as-

sumptions

For the autoencoders we continue to use the same variables as defined in equation 1.3.

Assumptions on the dictionary and the sparse code. We assume that our signal y is generated using sparse linear combinations of atoms/vectors of an overcomplete dictionary, i.e., y = A∗x∗, where A Rn h is a dictionary, and x (R 0)h is a non-negative sparse vector, with at most ∗ ∈ × ∗ ∈ ≥ p k = h (for some 0 < p < 1) non zero elements. The columns of the original dictionary A∗ (la- beled as A h ) are assumed to be normalized and we parameterize its incoherence property as, { i∗}i=1 A A µ = h ξ > maxi,j=1,..,h i∗, ∗j √n − for some ξ 0. i=j |⟨ ⟩| ≤ ̸

We assume that the sparse code x∗ is sampled from a distribution with the following properties. We fix a set of possible supports of x , denoted by S 2[h], where each element of S has at most ∗ ⊆ p k = h elements. We consider any arbitrary discrete probability distribution DS on S such that the

probability q1 := PS S[i S] is independent of i [h], and the probability q2 := PS S[i, j S] is ∼ ∈ ∈ ∈ ∈ independent of i, j [h]. A special case is when S is the set of all subsets of size k, and D is the ∈ S uniform distribution on S. For every S S there is a distribution say D on (R 0)h which is sup- ∈ S ≥ ported on vectors whose support is contained in S and which is uncorrelated for pairs of coordinates

i, j S. Further, we assume that the distributions D are such that each coordinate x is compactly ∈ S i∗ supported over an interval [a(h), b(h)], where a(h) and b(h) are independent of both i and S but will

2 be functions of h. Moreover, m1(h) := Ex D [x∗], and m2(h) := Ex D [x∗ ] are assumed to be in- ∗∼ S i ∗∼ S i dependent of both i and S but allowed to depend on h. For ease of notation henceforth we will keep

78 4.3. Main Results

the h dependence of these variables implicit and refer to them as a, b, m1 and m2. All of our results will hold in the special case when a, b, m1, m2 are constants (no dependence on h).

4.3 Main Results

4.3.1 Recovery of the support of the sparse code by a layer of ReLUs

First we prove the following theorem which precisely quantifies the sense in which a layer ofReLU gates is able to recover the support of the sparse code when the weight matrix of the deep net is close

p to the original dictionary. We recall that the size of the support of the sparse vector x∗ is k = h for some 0 < p < 1. We also recall the parameters a, b as defining the support of the marginal distribution of each coordinate of x∗ and m1 is the expected value of this marginal distribution (recall that none of these depend on the coordinate or the actual support). These parameters will be referenced in the results below.

Theorem 4.3.1.

We recall from equation 1.3 that our autoencoding neural net under consideration is mapping,

Rn Rn → y WTr where r = ReLU (Wy ϵ) ↦→ − where the h columns of W are denoted as W Rn i = 1, . . . , h ⊤ { i ∈ | }

(︂ p ν2 )︂ Let each column of W⊤ be within a δ-ball of the corresponding column of A∗, where δ = O h− − 2 ξ for some ν > 0, such that p + ν < ξ (where h− is the coherence parameter). We further assume (︂ ν2 )︂ that a = Ω bh− . Let the bias of the hidden layer of the autoencoder as given above, be ϵ = (︂ )︂ 2m k δ + µ . Then r = 0 if i supp(x ), and r = 0 if i / supp(x ) with probability at least 1 √n i ̸ ∈ ∗ i ∈ ∗ (︃ p 2 )︃ 2h m1 1 exp (b a)2 (with respect to the distribution on x∗). − − −

p 2 h m1 As long as (b a)2 is large, i.e., an increasing function of h, we can interpret this as saying that the − probability of the adverse event is small, and we have successfully achieved support recovery at the hidden layer in the limit of large sparse code dimension.

79 Chapter 4. Sparse Coding and Autoencoders

4.3.2 Asymptotic Criticality of the Autoencoder around A∗

In this work we analyze the following standard squared loss function for the autoencoder,

1 L = yˆ y 2 (4.1) 2 || − ||

If we consider a generative model in which A∗ is a square, orthogonal matrix and x∗ is a non-negative vector (not necessarily sparse), it is easily seen that the standard squared reconstruction error loss

function for the autoencorder has a global minimum at W = A∗⊤. In our generative model, however,

A∗ is an incoherent and overcomplete dictionary.

Theorem 4.3.2. (The Main Theorem) Assume that the hypotheses of Theorem 4.3.1 hold, and p < (︃ )︃ hpm2 1 2 > p 1 min 2 , ν (and hence ξ 2 ). Further, assume the distribution parameters satisfy exp 2(b a)2 is { } − superpolynomial in h (which holds, for example, when m1, a, b are O(1)). Then for i = 1, . . . , h,

⃦ [︃ ∂L ]︃ ⃦ (︃ max m2, m )︃ ⃦E ⃦ 1 2 ⃦ ⃦ o {1 p } . ⃦ ∂Wi ⃦2 ≤ h −

Roadmap. We present the proof of the support recovery result, i.e., Theorem 4.3.1, in Section 4.4.

Section 4.5 gives the proof of our main result, Theorem 4.3.2. The argument rests on Lemmas 4.5.1 and 4.5.2), whose proofs appear in Appendix 4.7 In Section 4.6, we run simulations to verify Theo- rem 4.3.2. We also run experiments that strongly suggest that the standard squared loss function has a local minimum in a neighborhood around A∗.

4.4 A Layer of ReLU Gates can Recover the Support of the Sparse

Code (Proof of Theorem 4.3.1)

Most sparse coding algorithms are based on an alternating minimization approach, where one itera- tively finds a sparse code based on the current estimate of the dictionary, and then uses theestimated sparse code to update the dictionary. The analogue of the sparse coding step in an autoencoder, is the passing through the hidden layer of activations of a certain affine transformationW ( which behaves as the current estimate of the dictionary) of the input vectors. We show that under certain stochastic assumptions, the hidden layer of ReLU gates in an autoencoder recovers with high probability the support of the sparse vector which corresponds to the present input.

Proof of Theorem 4.3.1. From the model assumptions, we know that the dictionary A∗ is incoherent,

80 4.4. A Layer of ReLU Gates can Recover the Support of the Sparse Code (Proof of Theorem 4.3.1) and has unit norm columns. So, A , A µ for all i = j, and A = 1 for all i. This means that |⟨ i∗ ∗j ⟩| ≤ √n ̸ || i∗|| for i = j, ̸

W , A∗ = W A∗, A∗ + A∗, A∗ |⟨ i j ⟩| |⟨ i − i j ⟩| |⟨ i j ⟩| µ µ W A∗ A∗ + (δ + ) (4.2) ≤ || i − i ||2|| j ||2 √n ≤ √n

Otherwise for i = j,

W , A∗ = W A∗, A∗ + A∗, A∗ = W A∗, A∗ + 1, ⟨ i i ⟩ ⟨ i − i i ⟩ ⟨ i i ⟩ ⟨ i − i i ⟩ and thus,

1 δ W , A∗ 1 + δ, (4.3) − ≤ ⟨ i i ⟩ ≤ where we use the fact that W A , A δ. |⟨ i − i∗ i∗⟩| ≤

Let y = A∗x∗ and let S be the support of x∗. Then we define the input to the ReLU activation Q ϵ = Wy ϵ as − −

Qi = Wi, A∗j x∗j = Wi, Ai∗ xi∗1i S + Wi, A∗j x∗j = Wi, Ai∗ xi∗1i S + Zi. ∑ ∈ ∑ ∈ j S⟨ ⟩ ⟨ ⟩ j S i⟨ ⟩ ⟨ ⟩ ∈ ∈ \

First we try to get bounds on Q when i supp(x ). From our assumptions on the distribution of x i ∈ ∗ i∗ we have, 0 a x b and E[x ] = m for all i in the support of x . For i supp(x ), ≤ ≤ i∗ ≤ i∗ 1 ∗ ∈ ∗

Q = W , A∗ x∗ + Z = Q (1 δ)a + Z i ⟨ i i ⟩ i i ⇒ i ≥ − i

where we use (4.3). Using (4.2), Zi has the following bounds:

(︃ µ )︃ (︃ µ )︃ bk δ + Z bk δ + − √n ≤ i ≤ √n

Plugging in the lower bound for Zi and the proposed value for the bias, we get

(︃ µ )︃ (︃ µ )︃ Q ϵ (1 δ)a bk δ + 2m k δ + i − ≥ − − √n − 1 √n

81 Chapter 4. Sparse Coding and Autoencoders

For Q ϵ 0, we need: i − ≥ (︂ )︂ (b + m ) + µ k 2 1 δ √n a ≥ 1 δ − (︂ 2 )︂ µ = h ξ k = hp = O h p ν Now plugging in the values for the various quantities, √n − and and δ − − , if (︂ 2 )︂ we have a = Ω bh ν , then Q ϵ 0. − i − ≥

Now, for i / supp(x ) we would like to analyze the following probability: ∈ ∗

Pr[Q ϵ 0 i / supp(x∗)] i − ≥ | ∈

We first simplify the quantity Pr[Q ϵ 0 i / supp(x )] as follows i − ≥ | ∈ ∗

⎡ ⎤

Pr[Qi ϵ i / supp(x∗)] = Pr[Zi ϵ] = Pr ⎣ ∑ Wi, A∗j x∗j ϵ⎦ ≥ | ∈ ≥ j S i⟨ ⟩ ≥ ∈ \

0 h We recall that we had assumed that for every possible support S (of x∗) the distribution DS on (R≥ ) , which is supported on vectors whose support is contained in S, is s.t the random variables corre- sponding to coordinates i, j S are uncorrelated. Now using the Chernoff’s bound, we can obtain ∈ ⎡ ⎤ [︂ ]︂ [︂ ]︂ tϵ t Wi,A∗j x∗j tϵ t Wi,A∗j x∗j Pr[Zi ϵ] infe− E ⎣ e ⟨ ⟩ ⎦ = infe− E e ⟨ ⟩ ≥ ≤ t 0 ∏ t 0 ∏ ≥ j S i ≥ j S i ∈ \ ∈ \ [︃ (︂ µ )︂ ]︃ tϵ k t δ+ x∗j infe− E e √n ≤ t 0 ≥ ⎛ ⎞k (︃ µ )︃2 (︂ )︂ t2 δ+ (b a)2 µ √n − tϵ t δ+ m1 infe− ⎜e √n e 8 ⎟ ≤ t 0 ⎝ ⎠ ≥

where the second inequality follows from (4.2) and the fact that t and xi∗ are both nonnegative, and the third inequality follows from Hoeffding’s Lemma. Next, we also have

(︂ (︂ )︂ )︂ (︂ )︂2 t ϵ k δ+ µ m +t2 k δ+ µ (b a)2 √n 1 8 √n Pr[Zi ϵ] infe− − − ≥ ≤ t 0 ≥ µ (ϵ k(δ+ )m )2 − √n 1 µ − k (δ+ )2(b a)2 = e 2 √n − .

82 4.5. Criticality of a neighborhood of A∗ (Proof of Theorem 4.3.2)

(︂ )︂ k = hp = m k + µ Finally, since and ϵ 2 1 δ √n , we have

µ (︄ ( km ( + ))2 )︄ (︄ p 2 )︄ 2 ϵ 1 δ √n 2h m exp − = exp 1 − hp(δ + µ )2(b a)2 − (b a)2 √n − −

4.5 Criticality of a neighborhood of A∗ (Proof of Theorem 4.3.2)

It turns out that the expectation of the full gradient of the loss function (4.1) is difficult to analyze directly. Hence corresponding to the true gradient with respect to the ith column of W we create a − ⊤ [︂ ∂L ]︂ proxy, denoted by ˆ︃i L, by replacing in the expression for the true expectation i L = E every ∇ ∇ ∂Wi occurrence of the random variable 1W y ϵ 0 = Th(Wi⊤y ϵi) = Th(Wi⊤ A∗x∗ ϵi) by the indicator i⊤ − i≥ − − random variable 1i supp(x ). This proxy is shown to be a good approximant of the expected gradient ∈ ∗ in the following lemma.

Lemma 4.5.1. Assume that the hypotheses of Theorem 4.3.1 hold and additionally let b be bounded by a polynomial in h. Then we have for each i (indexing the columns of W⊤),

⃓⃓ ⃓⃓ (︄ )︄ ⃓⃓ [︃ ∂L ]︃ ⃓⃓ hpm2 ⃓⃓ E ⃓⃓ 1 ⃓⃓ ˆ︃i L ⃓⃓ poly(h)exp 2 ⃓⃓∇ − ∂Wi ⃓⃓ ≤ − 2(b a) 2 −

Proof. This lemma has been proven in Section 4.A of the Appendix.

Lemma 4.5.2.

Assume that the hypotheses of Theorem 4.3.1 hold, and p < min 1 , ν2 (and hence ξ > 2p). Then for { 2 } each i indexing the columns of W⊤, there exist real valued functions αi and βi, and a vector ei such that ˆ︃L = α W β A + e , and ∇i i i − i i∗ i

p 1 2 p 1 αi = Θ(m2h − ) + o(m1h − )

p 1 2 p 1 βi = Θ(m2h − ) + o(m1h − )

2 p 1 α β = o(max m , m h − ) i − i { 1 2} 2 p 1 e = o(max m , m h − ) || i||2 { 1 2}

Proof. In subsection 4.5.1 we first get explicit forms of the above defined quantities αi, βi and ei. Then the proof is completed by estimating them which is done in Appendix 4.B

83 Chapter 4. Sparse Coding and Autoencoders

With the above asymptotic results, we are in a position to assemble the proof of Theorem 4.3.2.

Proof of Theorem 4.3.2. Consider any i indexing the columns of W⊤. Recall the definition of the proxy [︂ ∂L ]︂ gradient ˆ︃i L at the beginning of this section. Let us define γi = ˆ︃i L E . Using αi, βi and ei ∇ ∇ − ∂Wi [︂ ∂L ]︂ as defined in Lemma 4.5.2, we can write the expectation of the true gradient as, E = αiWi ∂Wi − β A + e γ . Further, by Lemma 4.5.1, i i∗ i − i (︄ )︄ hpm2 γ poly(h)exp 1 . ∥ i∥ ≤ − 2(b a)2 −

(︃ )︃ hpm2 1 h Since exp 2(b a)2 is superpolynomial in , we obtain −

⃦ [︃ ]︃ ⃦ ⃦ ∂L ⃦ ⃦E ⃦ = αiWi βi Ai∗ + ei γi 2 ⃦ ∂Wi ⃦2 || − − ||

= α (W A∗) + (α β )A∗ + e γ || i i − i i − i i i − i||2

α W A∗ + α β + e γ ≤ | i|∥ i − i ∥2 | i − i| || i − i||2 p 1 Θ(m2h − ) 2 p 1 + o(max m , m2 h − ) ≤ h2p+θ2 { 1 } 2 p 1 + o(max m , m h − ) { 1 2} 2 p 1 = o(max m , m h − ) { 1 2}

4.5.1 Simplifying the proxy gradient of the autoencoder under the sparse-coding generative model - to get explicit forms of the coefficients α, β and e as required towards proving Lemma 4.5.2

To recap we imagine being given as input signals y Rn (imagined as column vectors), which are ∈ generated from an overcomplete dictionary A Rn h of fixed incoherence. Let x Rh (imagined ∗ ∈ × ∗ ∈ as column vectors) be the sparse code that generates y. The model of the autoencoder that we now have is yˆ = W ReLU(Wy ϵ). W is a h n matrix and the ith column of W is to be denoted as the ⊤ − × ⊤ column vector Wi.

Using the above notation the squared loss of the autoencoder is 1 yˆ y 2. But we introduce a 2 || − || dummy constant D = 1 to be multiplied to y because this helps read the complicated equations that

84 4.5. Criticality of a neighborhood of A∗ (Proof of Theorem 4.3.2)

would now follow. This marker helps easily spot those terms which depend on the sensing of x∗ (those with a factor of D) as opposed to the terms which are “purely” dependent on the neural net

(those without the factor of D). Thus we think of the squared loss L of our autoencoder as,

1 2 1 1 T L = yˆ Dy = (W⊤ReLU(Wy ϵ) Dy)⊤(W⊤ReLU(Wy ϵ) Dy) = f f 2 || − || 2 − − − − 2

where we have defined f Rn as, ∈

f = W⊤ReLU(Wy ϵ) Dy − −

Then we have, ∂ fa T JWi ( f )ab = = ReLU(Wi⊤y ϵ)δab + Th(Wi y ϵ)Wiayb ∂Wib − −

In the form of a n n derivative matrix this means, × [︃ ]︃ ∂ fa JWi ( f ) = = ReLU(Wi⊤y ϵ)I + Th(Wi⊤y ϵ)Wiy⊤ ∂Wib − −

This helps us write,

∂L = JWi ( f ))⊤ f ∂Wi

= (ReLU(W⊤y ϵ)I + Th(W⊤y ϵ)W y⊤)⊤[W⊤ReLU(Wy ϵ) Dy] i − i − i − − (︄ )︄ [︂ ]︂ h = Th(Wi⊤y ϵi) (Wi⊤y ϵi)I + yWi⊤ ∑ ReLU(Wj⊤y ϵj)Wj Dy − − j=1 − −

Now going over to the proxy gradient ˆ︃L corresponding to this term and we define the vector G as, ∇i i

[︄ [︄ (︄ )︄]︄]︄ [︂ ]︂ L = E S 1 E (W y ϵ )I + yW (W y ϵ )W Dy ˆ︃i S i S xS∗ i⊤ i i⊤ ∑ j⊤ j j ∇ ∈ ∈ × − j S − − ∈

= ES S [1i S Gi] ∈ ∈ ×

85 Chapter 4. Sparse Coding and Autoencoders

Thus we have,

[︄ (︄ )︄]︄ [︂ ]︂ G = E (W A x ϵ )I + (A x )W (W A x ϵ )W DA x i xS∗ i⊤ ∗ ∗ i ∗ ∗ i⊤ ∑ j⊤ ∗ ∗ j j ∗ ∗ − j S − − [︄ (︄ ∈ )︄]︄ = E (W A x ϵ ) (W A x ϵ )W DA x xS∗ i⊤ ∗ ∗ i ∑ j⊤ ∗ ∗ j j ∗ ∗ − j S − − ⏞ ∈ ⏟⏟ ⏞ Term 1 [︄ (︄ )︄]︄ + E (A x )W (W A x ϵ )W DA x xS∗ ∗ ∗ i⊤ ∑ j⊤ ∗ ∗ j j ∗ ∗ j S − − ⏞ ∈ ⏟⏟ ⏞ Term 2

which can be decomposed into the following convenient parts,

$$G_i = \underbrace{\mathbb{E}_{x_S^*}\left[\sum_{j\in S}\epsilon_i\epsilon_j W_j - \sum_{j,k\in S}\epsilon_i(W_j^\top A_k^*)W_j x_k^* - \sum_{j,k\in S}\epsilon_j(W_i^\top A_k^*)W_j x_k^* + \sum_{j,k,l\in S}(W_i^\top A_k^*)(W_j^\top A_l^*)W_j x_l^* x_k^*\right]}_{\text{From Term 1}}$$
$$+ \underbrace{\mathbb{E}_{x_S^*}\left[-D\sum_{j,k\in S}(W_i^\top A_k^*)A_j^* x_k^* x_j^* + D\sum_{j\in S}\epsilon_i A_j^* x_j^*\right]}_{\text{From Term 1}} + \underbrace{\mathbb{E}_{x_S^*}\left[-D\sum_{j,k\in S}(A_k^{*\top}W_i)A_j^* x_k^* x_j^*\right]}_{\text{From Term 2}}$$
$$+ \underbrace{\mathbb{E}_{x_S^*}\left[-\sum_{j,k\in S}\epsilon_j(W_i^\top W_j)A_k^* x_k^*\right]}_{\text{From Term 2}} + \underbrace{\mathbb{E}_{x_S^*}\left[\sum_{j,k,l\in S}(W_i^\top W_j)(W_j^\top A_l^*)A_k^* x_k^* x_l^*\right]}_{\text{From Term 2}}$$

Now we invoke the distributional assumption about i.i.d. sampling of the coordinates for a fixed support and the definition of $m_1$ and $m_2$ to write $\mathbb{E}_{x_S^*}[x_i^* x_j^*] = \mathbb{E}_{x_S^*}[x_i^*]^2 = m_1^2$ for all $i \neq j$, and $m_2 = \mathbb{E}_{x_S^*}[x_i^* x_i^*]$ for $i = j$. Thus we get,


$$G_i = \underbrace{\sum_{j\in S}\epsilon_i\epsilon_j W_j - m_1\sum_{j,k\in S}(W_j^\top A_k^*)W_j\,\epsilon_i - m_1\sum_{j,k\in S}\epsilon_j(W_i^\top A_k^*)W_j}_{G_i^1,\ \text{from Term 1}}$$

$$+ \underbrace{m_2\sum_{j,k\in S}(W_i^\top A_k^*)(W_j^\top A_k^*)W_j + m_1^2\sum_{\substack{j,k,l\in S\\ k\neq l}}(W_i^\top A_k^*)(W_j^\top A_l^*)W_j}_{G_i^2,\ \text{from Term 1}}$$

$$+ \underbrace{\left[-Dm_1^2\sum_{\substack{j,k\in S\\ j\neq k}}(W_i^\top A_k^*)A_j^* - Dm_2\sum_{j\in S}(W_i^\top A_j^*)A_j^* + m_1 D\sum_{j\in S}\epsilon_i A_j^*\right]}_{G_i^3,\ \text{from Term 1}}$$

$$- \underbrace{\left[Dm_1^2\sum_{\substack{j,k\in S\\ j\neq k}}(A_k^{*\top} W_i)A_j^* + Dm_2\sum_{j\in S}(A_j^{*\top} W_i)A_j^*\right]}_{G_i^4,\ \text{from Term 2}}$$

$$\underbrace{-\, m_1\sum_{j,k\in S}\epsilon_j(W_i^\top W_j)A_k^* + \left[m_2\sum_{j,k\in S}(W_i^\top W_j)(W_j^\top A_k^*)A_k^* + m_1^2\sum_{\substack{j,k,l\in S\\ k\neq l}}(W_i^\top W_j)(W_j^\top A_l^*)A_k^*\right]}_{G_i^5,\ \text{from Term 2}}$$

Each term in the above sum is a vector. Now we separate out from the sums the terms which are in the directions of $W_i$ or $A_i^*$ and the rest. We remember that this is being done under the condition that $i \in S$. To make this easy to read we do this separation for each line of the above equation in a separate equation block, and inside every block we do the separation for each summation term on a separate line.


$$G_i^1 = \sum_{j\in S}\epsilon_i\epsilon_j W_j - m_1\sum_{j,k\in S}(W_j^\top A_k^*)W_j\,\epsilon_i - m_1\sum_{j,k\in S}\epsilon_j(W_i^\top A_k^*)W_j$$
$$= \left[\epsilon_i^2 W_i + \sum_{\substack{j\in S\\ j\neq i}}\epsilon_i\epsilon_j W_j\right] - m_1\left[\sum_{k\in S}\epsilon_i(W_i^\top A_k^*)W_i + \sum_{\substack{j,k\in S\\ j\neq i}}(W_j^\top A_k^*)W_j\,\epsilon_i\right] - m_1\left[\sum_{k\in S}\epsilon_i(W_i^\top A_k^*)W_i + \sum_{\substack{j,k\in S\\ j\neq i}}\epsilon_j(W_i^\top A_k^*)W_j\right]$$

$$G_i^2 = m_2\sum_{j,k\in S}(W_i^\top A_k^*)(W_j^\top A_k^*)W_j + m_1^2\sum_{\substack{j,k,l\in S\\ k\neq l}}(W_i^\top A_k^*)(W_j^\top A_l^*)W_j$$
$$= m_2\left[\sum_{k\in S}(W_i^\top A_k^*)^2\, W_i + \sum_{\substack{j,k\in S\\ j\neq i}}(W_i^\top A_k^*)(W_j^\top A_k^*)W_j\right] + m_1^2\left[\sum_{\substack{k,l\in S\\ k\neq l}}(W_i^\top A_k^*)(W_i^\top A_l^*)W_i + \sum_{\substack{j,k,l\in S\\ j\neq i,\ k\neq l}}(W_i^\top A_k^*)(W_j^\top A_l^*)W_j\right]$$


$$G_i^3 = -D\left[m_1^2\sum_{\substack{j,k\in S\\ j\neq k}}(W_i^\top A_k^*)A_j^* + m_2\sum_{j\in S}(W_i^\top A_j^*)A_j^* - m_1\sum_{j\in S}\epsilon_i A_j^*\right]$$
$$= -D\left[m_1^2\sum_{\substack{k\in S\\ k\neq i}}(W_i^\top A_k^*)A_i^* + m_1^2\sum_{\substack{j,k\in S\\ j\neq i,\ j\neq k}}(W_i^\top A_k^*)A_j^*\right] - D\left[m_2(W_i^\top A_i^*)A_i^* + m_2\sum_{\substack{j\in S\\ j\neq i}}(W_i^\top A_j^*)A_j^*\right] - D\left[-m_1\epsilon_i A_i^* - m_1\sum_{\substack{j\in S\\ j\neq i}}\epsilon_i A_j^*\right]$$

$$G_i^4 = -\left[Dm_1^2\sum_{\substack{j,k\in S\\ j\neq k}}(A_k^{*\top}W_i)A_j^* + Dm_2\sum_{j\in S}(A_j^{*\top}W_i)A_j^*\right]$$
$$= -D\left[m_1^2\sum_{\substack{k\in S\\ k\neq i}}(A_k^{*\top}W_i)A_i^* + m_1^2\sum_{\substack{j,k\in S\\ j\neq k,\ j\neq i}}(A_k^{*\top}W_i)A_j^*\right] - D\left[m_2(A_i^{*\top}W_i)A_i^* + m_2\sum_{\substack{j\in S\\ j\neq i}}(A_j^{*\top}W_i)A_j^*\right]$$


$$G_i^5 = -m_1\sum_{j,k\in S}\epsilon_j(W_i^\top W_j)A_k^* + \left[m_2\sum_{j,k\in S}(W_i^\top W_j)(W_j^\top A_k^*)A_k^* + m_1^2\sum_{\substack{j,k,l\in S\\ k\neq l}}(W_i^\top W_j)(W_j^\top A_l^*)A_k^*\right]$$
$$= -m_1\sum_{j\in S}\epsilon_j(W_i^\top W_j)A_i^* - m_1\sum_{\substack{j,k\in S\\ k\neq i}}\epsilon_j(W_i^\top W_j)A_k^*$$
$$+ m_2\sum_{j\in S}(W_i^\top W_j)(W_j^\top A_i^*)A_i^* + m_2\sum_{\substack{j,k\in S\\ k\neq i}}(W_i^\top W_j)(W_j^\top A_k^*)A_k^* + m_1^2\sum_{\substack{j,l\in S\\ l\neq i}}(W_i^\top W_j)(W_j^\top A_l^*)A_i^* + m_1^2\sum_{\substack{j,k,l\in S\\ k\neq i,\ k\neq l}}(W_i^\top W_j)(W_j^\top A_l^*)A_k^*$$

Thus combining the $G_i^1, \dots, G_i^5$ above, we have $\widehat{\nabla}_i L = \alpha_i W_i - \beta_i A_i^* + e_i$, where

$$\alpha_i = \mathbb{E}_{S}\left[\mathbb{1}_{i\in S}\times\left\{m_2\sum_{k\in S}(W_i^\top A_k^*)^2 + m_1^2\sum_{\substack{k,l\in S\\ k\neq l}}(W_i^\top A_k^*)(W_i^\top A_l^*) - 2m_1\sum_{k\in S}\epsilon_i(W_i^\top A_k^*) + \epsilon_i^2\right\}\right]$$

$$\beta_i = \mathbb{E}_{S}\Bigg[\mathbb{1}_{i\in S}\times\Bigg\{2Dm_1^2\sum_{\substack{k\in S\\ k\neq i}}(W_i^\top A_k^*) + 2Dm_2(W_i^\top A_i^*) - Dm_1\epsilon_i + m_1\sum_{j\in S}\epsilon_j(W_i^\top W_j) - m_2\sum_{j\in S}(W_i^\top W_j)(W_j^\top A_i^*) - m_1^2\sum_{\substack{j,l\in S\\ l\neq i}}(W_i^\top W_j)(W_j^\top A_l^*)\Bigg\}\Bigg]$$

$$e_i = \mathbb{E}_{S}\Bigg[\mathbb{1}_{i\in S}\times\Bigg\{\sum_{\substack{j\in S\\ j\neq i}}\epsilon_i\epsilon_j W_j - m_1\sum_{\substack{j,k\in S\\ j\neq i}}\epsilon_i(W_j^\top A_k^*)W_j - m_1\sum_{\substack{j,k\in S\\ j\neq i}}\epsilon_j(W_i^\top A_k^*)W_j + m_2\sum_{\substack{j,k\in S\\ j\neq i}}(W_i^\top A_k^*)(W_j^\top A_k^*)W_j + m_1^2\sum_{\substack{j,k,l\in S\\ j\neq i,\ k\neq l}}(W_i^\top A_k^*)(W_j^\top A_l^*)W_j$$
$$- 2Dm_1^2\sum_{\substack{j,k\in S\\ j\neq i,\ j\neq k}}(W_i^\top A_k^*)A_j^* - 2Dm_2\sum_{\substack{j\in S\\ j\neq i}}(W_i^\top A_j^*)A_j^* + Dm_1\sum_{\substack{j\in S\\ j\neq i}}\epsilon_i A_j^* - m_1\sum_{\substack{j,k\in S\\ k\neq i}}\epsilon_j(W_i^\top W_j)A_k^* + m_2\sum_{\substack{j,k\in S\\ k\neq i}}(W_i^\top W_j)(W_j^\top A_k^*)A_k^* + m_1^2\sum_{\substack{j,k,l\in S\\ k\neq i,\ k\neq l}}(W_i^\top W_j)(W_j^\top A_l^*)A_k^*\Bigg\}\Bigg]$$

Thus we have laid the groundwork by finding a convenient decomposition of the proxy gradient in terms of the quantities $\alpha_i$, $\beta_i$ and $e_i$. Now we can go over to Appendix 4.B, where their magnitudes are estimated towards completing the proof of Lemma 4.5.2.


4.6 Simulations

We conduct some experiments on synthetic data in order to check whether the gradient norm is indeed small within the columnwise $\delta$-ball of $A^*$. We also make some observations about the landscape of the squared loss function, which has implications for being able to recover the ground-truth dictionary $A^*$.

Data Generation Model. We generate random Gaussian dictionaries $A^*$ of size $n \times h$ where $n = 50$ and $h = 256, 512, 1024, 2048$ and $4096$. For each $h$, we generate a dataset containing $N = 5000$ sparse vectors with $h^p$ non-zero entries, for various $p \in [0.01, 0.5]$. In our experiments, the coherence parameter $\xi$ was approximately $0.1$. The support of each sparse vector $x^*$ is drawn uniformly from all sets of indices of size $h^p$, and the non-zero entries in the sparse vectors are drawn from a uniform distribution between $a = 1$ and $b = 10$. Once we have generated the sparse vectors, we collect them in a matrix $X^* \in \mathbb{R}^{h \times N}$ and then compute the signals $Y = A^* X^*$. We set up the autoencoder as defined through equation 1.3. We analyze the squared loss function in (4.1) and its gradient with respect to a column of $W$ through their empirical averages over the signals in $Y$.
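A minimal sketch of this generative model could look as follows; the helper name generate_data and the exact normalization of the dictionary columns are our assumptions about details not fully spelled out above.

```python
import numpy as np

def generate_data(n=50, h=256, N=5000, p=0.1, a=1.0, b=10.0, seed=0):
    """Sample a random Gaussian dictionary A* (unit-norm columns) and
    N sparse codes, each with h^p non-zero entries drawn uniformly
    from [a, b] on a uniformly random support; return Y = A* X*."""
    rng = np.random.default_rng(seed)
    A_star = rng.standard_normal((n, h))
    A_star /= np.linalg.norm(A_star, axis=0)      # normalize columns
    sparsity = max(1, int(round(h ** p)))         # support size h^p
    X_star = np.zeros((h, N))
    for t in range(N):
        support = rng.choice(h, size=sparsity, replace=False)
        X_star[support, t] = rng.uniform(a, b, size=sparsity)
    Y = A_star @ X_star                           # the signals
    return A_star, X_star, Y
```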

Results. Once we have generated the data, we compute the empirical average of the gradient of the loss function in (4.1) at 200 random points which are columnwise $\delta = \frac{1}{2h^{2p}}$ away from $A^*$. We average the gradient over the 200 points, which are all at the same distance from $A^*$, and compare the average column norm of the gradient to $h^{p-1}$. Our experimental results shown in Table 4.6.1 demonstrate that the average column norm of the gradient is of the order of $h^{p-1}$ (and thus falling with $h$ for any fixed $p$), as expected from Theorem 4.3.2.

h \ p  | 0.01             | 0.02             | 0.05             | 0.1              | 0.2
256    | (0.0137, 0.0041) | (0.0138, 0.0044) | (0.0126, 0.0052) | (0.0095, 0.0068) | (0.0284, 0.0118)
512    | (0.0058, 0.0021) | (0.0058, 0.0022) | (0.0054, 0.0027) | (0.0071, 0.0036) | (0.0104, 0.0068)
1024   | (0.0025, 0.0010) | (0.0024, 0.0011) | (0.0026, 0.0014) | (0.0079, 0.0020) | (0.0078, 0.0039)
2048   | (0.0011, 0.0005) | (0.0012, 0.0006) | (0.0025, 0.0007) | (0.0031, 0.0010) | (0.0032, 0.0022)
4096   | (0.0006, 0.0003) | (0.0012, 0.0003) | (0.0013, 0.0004) | (0.0026, 0.0006) | (0.0020, 0.0013)

h \ p  | 0.3              | 0.5
256    | (0.0464, 0.0206) | (0.0343, 0.0625)
512    | (0.0214, 0.0127) | (0.0028, 0.0442)
1024   | (0.0099, 0.0078) | (0.00, 0.0313)
2048   | (0.0036, 0.0048) | (0.00, 0.0221)
4096   | (0.0008, 0.0030) | (0.00, 0.0156)

TABLE 4.6.1: Average gradient norm for points that are columnwise $\delta = \frac{1}{2h^{2p}}$ away from $A^*$. For each $h$ and $p$ we report the pair $\left(\mathbb{E}\left[\left\|\frac{\partial L}{\partial W_i}\right\|\right],\ h^{p-1}\right)$. We note that the gradient norm and $h^{p-1}$ are of the same order, and for any fixed $p$ the gradient norm decreases with $h$, as expected from Theorem 4.3.2.
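The experiment in Table 4.6.1 can be reproduced in outline as below, using the closed-form gradient derived in Section 4.5.1. The perturbation scheme (adding a random unit direction scaled by $\delta$ to each column of $A^*$) is our guess at how points at a fixed columnwise distance were sampled, and grad_column is our own helper name.

```python
import numpy as np

def grad_column(W, y, eps, i, D=1.0):
    """dL/dW_i = Th(W_i^T y - eps_i) [(W_i^T y - eps_i) I + y W_i^T]
                 (sum_j ReLU(W_j^T y - eps_j) W_j - D y)."""
    pre = W @ y - eps                                  # pre-activations
    resid = W.T @ np.maximum(pre, 0.0) - D * y         # the bracketed sum
    th = 1.0 if pre[i] > 0 else 0.0                    # Th = ReLU'
    return th * (pre[i] * resid + y * (W[i] @ resid))

def avg_grad_norm_at_delta(A_star, Y, eps, delta, seed=0):
    """Average column norm of the empirical gradient at one random
    point whose columns are all at distance delta from those of A*."""
    rng = np.random.default_rng(seed)
    n, h = A_star.shape
    U = rng.standard_normal((h, n))
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    W = A_star.T + delta * U
    norms = []
    for i in range(h):
        g = np.mean([grad_column(W, Y[:, t], eps, i)
                     for t in range(Y.shape[1])], axis=0)
        norms.append(np.linalg.norm(g))
    return float(np.mean(norms))
```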


We also plot the squared loss of the autoencoder along a randomly chosen direction to understand the geometry of the landscape of the loss function around $A^*$. We draw a matrix $\Delta W$ from a standard normal distribution, and normalize its columns. We then plot $f(t) = L((A^* + t\Delta W)^\top)$, as well as the gradient norm averaged over all the columns. For purposes of illustration, we show these plots for $p = 0.01, 0.1, 0.3$. The plots for $h = 256$ are in Figure 4.6.1, and those for $h = 4096$ in Figure 4.6.2. From the plots for $p = 0.01$ and $0.1$, we can observe that the loss function value and the gradient norm keep decreasing as we get close to $A^*$. Figures 4.6.1 and 4.6.2 are representative of the shapes obtained for every direction $\Delta W$ that we checked. This suggests that $A^*$ might conveniently lie at the bottom of a well in the landscape of the loss function. For the value of $p = 0.3$ (which is much larger than the coherence parameter $\xi$), Theorem 4.3.1 is no longer valid. We see that the value of the loss function decreases a little as we move away from $A^*$, and then increases. We suspect that $A^*$ is here in a region where $\mathrm{ReLU}(A^{*\top}y - \epsilon) = 0$, which means the function is flat in a small neighborhood of $A^*$.

FIGURE 4.6.1: Loss function plot for h = 256, n = 50

FIGURE 4.6.2: Loss function plot for h = 4096, n = 50

We also tried to minimize the squared loss of the autoencoder using gradient descent. In these experiments, we initialized $W^\top$ far away from $A^*$ (precisely at a columnwise distance of $\frac{h}{5}\times\delta$), and did gradient descent until the gradient norm dropped below a factor of $2\times 10^{-5}$ of the initial norm of the gradient. We then computed the average columnwise distance between $W^\top_{\mathrm{final}}$ and $A^*$, and report the % decrease in the average columnwise distance from the initial point. These results are reported in Table 4.6.2 below. These experiments suggest that there is a neighborhood of $A^*$ (the radius of which is increasing with $h$) such that gradient descent initialized at the edge of that neighborhood greatly reduces the average columnwise distance between $W^\top$ and $A^*$.

h    | p = 0.05 | p = 0.1
256  | 97.7%    | 96.9%
512  | 98.6%    | 98.2%
1024 | 99%      | 98.8%
2048 | 99.2%    | 99%
4096 | 99.4%    | 99.2%

TABLE 4.6.2: Fraction of initial columnwise distance covered by the gradient descent procedure
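A bare-bones version of this gradient-descent experiment, written against the helpers sketched earlier in this section (generate_data and grad_column, which are our own illustrative names), might look like the following; step size and iteration budget are assumptions, not the thesis settings.

```python
import numpy as np

def train_autoencoder(A_star, Y, eps, delta0, lr=0.01, tol=2e-5,
                      max_iters=10000, seed=0):
    """Gradient descent on the autoencoder loss, initialized at a
    columnwise distance delta0 from A*, stopped once the gradient norm
    falls below tol times its initial value. Returns the % decrease in
    the average columnwise distance to A*."""
    rng = np.random.default_rng(seed)
    n, h = A_star.shape
    U = rng.standard_normal((h, n))
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    W = A_star.T + delta0 * U

    def full_grad(W):
        # empirical average of dL/dW_i over the signals, for every i
        return np.stack([np.mean([grad_column(W, Y[:, t], eps, i)
                                  for t in range(Y.shape[1])], axis=0)
                         for i in range(h)])

    g0 = np.linalg.norm(full_grad(W))
    for _ in range(max_iters):
        g = full_grad(W)
        if np.linalg.norm(g) < tol * g0:
            break
        W -= lr * g
    dist = np.mean(np.linalg.norm(W - A_star.T, axis=1))
    return 100.0 * (1.0 - dist / delta0)
```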

4.7 Conclusion

In this chapter we have undertaken a rigorous analysis of the squared loss of an autoencoder when the data is assumed to be generated by sensing of sparse high dimensional vectors by an overcomplete dictionary. We have shown that the expected gradient of this loss function is very close to zero in a neighborhood of the generating overcomplete dictionary.

Our simulations complement this theoretical result by providing further empirical support. Firstly, they show that the gradient norm in this $\delta$-ball of $A^*$ indeed falls with $h$ and is of the same order as $\frac{1}{h^{1-p}}$, as expected from our proof. Secondly, the experiments also strongly suggest ranges of values of $h$ and $p$ where $A^*$ is a local minimum of this loss function and has a neighborhood where the reconstruction error is low.

This suggests sparse coding problems can be solved by training autoencoders using gradient descent based algorithms. Further, recent investigations have led to the conjecture/belief that many important unsupervised learning tasks, e.g. recognizing handwritten digits, are sparse coding problems in disguise (Makhzani and Frey, 2013; Makhzani and Frey, 2015). Thus, our results could shed some light on the observed phenomenon that gradient descent based algorithms train autoencoders to low reconstruction error for natural data sets, like MNIST.

It remains to rigorously show whether a gradient descent algorithm can be initialized randomly (possibly far away from $A^*$) and still be shown to converge to this neighborhood of critical points around the dictionary. Towards that it might be helpful to understand the structure of the Hessian outside this neighborhood. Since our analysis applies to the expected gradient, it also remains to analyze the sample complexity at which these results take effect.


The possibility also remains open that this standard loss, or some other loss function for the autoencoder, has the provable property of having a global minimum at the ground-truth dictionary. We have mentioned one such example in a special case (when $A^*$ is square orthogonal and $x^*$ is nonnegative), and even in this special case it remains open to find a provable optimization algorithm.

On the simulation front we have a couple of open challenges yet to be tackled. Firstly, it is left to find efficient implementations of the iterative update rule based on the exact gradient of the proposed loss function given in (4.1). This would open up avenues for testing the power of this loss function on real data rather than the synthetic data used here. Secondly, a simulation of the main Theorem 4.3.2 that can probe deeper into its claim would need to be able to sample $A^*$ for different $h$ at a fixed value of the incoherence parameter $\xi$. This sampling question of $A^*$ with these constraints is an unresolved one that is left for future work.

Autoencoders with more than one hidden layer have been used for unsupervised feature learning (Le, 2013), and recently there has been an analysis of the sparse coding performance of convolutional neural networks with one layer (Gilbert et al., 2017) and two layers of nonlinearities (Vardan, Romano, and Elad, 2016). The connections between neural networks and sparse coding have also been recently explored in Bora et al., 2017. It remains an exciting open avenue of research to do a study similar to this work to determine if and how deeper architectures under the same generative model might provide better means of doing sparse coding.


Appendix to Chapter 4

4.A The proxy gradient is a good approximation of the true expectation of the gradient (Proof of Lemma 4.5.1)

Proof. To make it easy to present this argument, let us abstractly think of the function $f$ (defined for any $i \in \{1, 2, 3, .., h\}$) as $f(y, W, X) = \frac{\partial L}{\partial W_i}$, where we have defined the random variable $X = \mathrm{Th}[W_i^\top y - \epsilon_i]$. It is to be noted that because of the ReLU term and its derivative, this function $f$ has a dependency on $y = A^* x^*$ even outside its dependency through $X$. Let us define another random variable $Y = \mathbb{1}_{i \in \mathrm{Support}(x^*)}$. Then we have,

$$\left\|\mathbb{E}_{x^*}[f(y, W, X)] - \mathbb{E}_{x^*}[f(y, W, Y)]\right\|_{\ell_2} \le \mathbb{E}_{x^*}\left[\left|f(y, W, X) - f(y, W, Y)\right|_{\ell_2}\right]$$
$$\le \mathbb{E}_{x^*}\left[\left|f(y, W, X)(\mathbb{1}_{X=Y} + \mathbb{1}_{X\neq Y}) - f(y, W, Y)(\mathbb{1}_{X=Y} + \mathbb{1}_{X\neq Y})\right|_{\ell_2}\right]$$
$$\le \mathbb{E}_{x^*}\left[\left|f(y, W, X) - f(y, W, Y)\right|_{\ell_2}\,\mathbb{1}_{X\neq Y}\right] \le \sqrt{\mathbb{E}_{x^*}\left[\left|f(y, W, X) - f(y, W, Y)\right|_2^2\right]}\ \sqrt{\mathbb{E}_{x^*}\left[\mathbb{1}_{X\neq Y}\right]}$$

In the last step above we have used the Cauchy-Schwarz inequality for random variables. We recognize that $\mathbb{E}_{x^*}[f(y, W, Y)]$ is precisely what we defined as the proxy gradient $\widehat{\nabla}_i L$. Further, for such $W$ as in this lemma the support recovery theorem (Theorem 4.3.1) holds, and that is precisely the statement that the term $\mathbb{E}_{x^*}[\mathbb{1}_{X\neq Y}]$ is small. So we can rewrite the above inequality as,

$$\left\|\mathbb{E}_{x^*}\left[\frac{\partial L}{\partial W_i}\right] - \widehat{\nabla}_i L\right\|_2 \le \sqrt{\mathbb{E}_{x^*}\left[\left|f(y, W, X) - f(y, W, Y)\right|_2^2\right]}\ \exp\left(-\frac{h^p m_1^2}{2(b-a)^2}\right)$$

We remember that $f$ is bounded by a polynomial in $h$ because its $h$-dependency is through Frobenius norms of submatrices of $W$ and $\ell_2$ norms of projections of $Wy$. But the $\ell_\infty$ norm of the training vectors $y$ (that is, $b$) has been assumed to be bounded by $\mathrm{poly}(h)$. Also we have the assumption that the columns of $W^\top$ are within a $\frac{1}{h^{p+\nu_2}}$ ball of the corresponding columns of $A^*$, which in turn is an $n \times h$ dimensional matrix of bounded norm because all its columns are normalized. So summarizing we have,

$$\left\|\mathbb{E}_{x^*}\left[\frac{\partial L}{\partial W_i}\right] - \widehat{\nabla}_i L\right\|_2 \le \mathrm{poly}(h)\,\exp\left(-\frac{h^p m_1^2}{2(b-a)^2}\right)$$

The above inequality immediately implies the claimed lemma.

4.B The asymptotics of the coefficients of the gradient of the squared loss (Proof of Lemma 4.5.2)

We will pick up from where subsection 4.5.1 left off and will now estimate bounds on each of the terms $\alpha_i$, $\beta_i$ and $\|e_i\|$, which were defined at the end of that segment. We will separate them as $\alpha_i = \tilde\alpha_i + \hat\alpha_i$ (and similarly for the other terms), where the tilde terms are those that come as a coefficient of $m_2$, and the hat terms are the ones that come as a coefficient of $m_1$ or $\epsilon$ or both. (Note: given the previous definitions of $q_1$ and $q_2$, it is obvious from context what the quantities $q_i$, $q_{ij}$, $q_{ijk}$ and $q_S$ mean, and we shall use this notation in this Appendix.)

4.B.1 Estimating the $m_2$-dependent parts of the derivative

Since $\|A_i^*\| = 1$ and $W_i$ is assumed to be within a $0 < \delta < 1$ ball of $A_i^*$, we can use the following inequalities:

$$\|W_i\| = \|W_i - A_i^* + A_i^*\| \le \|W_i - A_i^*\| + \|A_i^*\| = \delta + 1, \qquad \|W_i\| \ge 1 - \delta$$

$$\langle W_i, A_i^*\rangle = \langle W_i - A_i^*, A_i^*\rangle + \langle A_i^*, A_i^*\rangle \le \|W_i - A_i^*\|\,\|A_i^*\| + 1 \le \delta + 1, \qquad \langle W_i, A_i^*\rangle \ge 1 - \delta$$

$$|\langle W_j, A_i^*\rangle| = |\langle W_j - A_j^*, A_i^*\rangle + \langle A_j^*, A_i^*\rangle| \le \|W_j - A_j^*\|\,\|A_i^*\| + \frac{\mu}{\sqrt n} = \delta + \frac{\mu}{\sqrt n}$$

$$|\langle W_i, W_j\rangle| = |\langle W_i - A_i^*, W_j\rangle + \langle A_i^*, W_j\rangle| \le \delta(1+\delta) + \left(\delta + \frac{\mu}{\sqrt n}\right) = \delta^2 + 2\delta + \frac{\mu}{\sqrt n}$$

$$\langle W_i, W_i\rangle = \|W_i\|^2 \ge (1-\delta)^2, \qquad \langle W_i, W_i\rangle = \|W_i\|^2 \le (1+\delta)^2$$
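These are direct consequences of the triangle and Cauchy-Schwarz inequalities. A quick numerical sanity check of the $\delta$-dependent ones (with our own toy sampler; the incoherence-dependent bounds additionally need $|\langle A_j^*, A_i^*\rangle| \le \frac{\mu}{\sqrt n}$, which random vectors only satisfy approximately) is:

```python
import numpy as np

rng = np.random.default_rng(1)
n, delta = 50, 0.1

def unit(v):
    return v / np.linalg.norm(v)

A_i = unit(rng.standard_normal(n))          # a unit-norm dictionary column
W_i = A_i + delta * unit(rng.standard_normal(n))  # within the delta-ball

assert np.linalg.norm(W_i) <= 1 + delta + 1e-12
assert np.linalg.norm(W_i) >= 1 - delta - 1e-12
assert abs(W_i @ A_i - 1) <= delta + 1e-12  # 1 - delta <= <W_i, A_i*> <= 1 + delta
assert W_i @ W_i <= (1 + delta) ** 2 + 1e-12
```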


Bounding $\tilde\beta_i$

$$\tilde\beta_i = \mathbb{E}_{S}\left[\mathbb{1}_{i\in S}\left\{2Dm_2(W_i^\top A_i^*) - m_2\sum_{j\in S}(W_i^\top W_j)(W_j^\top A_i^*)\right\}\right]$$
$$= \mathbb{E}_{S}\left[\mathbb{1}_{i\in S}\left\{2Dm_2\langle W_i, A_i^*\rangle - m_2\|W_i\|^2\langle W_i, A_i^*\rangle - m_2\sum_{\substack{j\in S\\ j\neq i}}\langle W_i, W_j\rangle\langle W_j, A_i^*\rangle\right\}\right]$$

Evaluating the outer expectation we get,

$$\tilde\beta_i = 2Dm_2\langle W_i, A_i^*\rangle\sum_{\{S:\, i\in S\}} q_S - m_2\|W_i\|^2\langle W_i, A_i^*\rangle\sum_{\{S:\, i\in S\}} q_S - m_2\sum_{\substack{j=1\\ j\neq i}}^{h}\langle W_i, W_j\rangle\langle W_j, A_i^*\rangle\sum_{\{S:\, i,j\in S,\, i\neq j\}} q_S$$
$$= 2Dq_i m_2\langle W_i, A_i^*\rangle - q_i m_2\|W_i\|^2\langle W_i, A_i^*\rangle - m_2\sum_{\substack{j=1\\ j\neq i}}^{h} q_{ij}\langle W_i, W_j\rangle\langle W_j, A_i^*\rangle$$

Upper bounding the above, we get

$$\tilde\beta_i \le 2Dm_2 h^{p-1}(1+\delta) - m_2 h^{p-1}(1-\delta)^3 + m_2 h^{2p-1}\left(\delta^2 + 2\delta + \frac{\mu}{\sqrt n}\right)\left(\delta + \frac{\mu}{\sqrt n}\right)$$
$$= 2Dm_2 h^{p-1}(1 + h^{-p-\nu_2}) - m_2 h^{p-1}(1 - 3h^{-p-\nu_2} + 3h^{-2p-2\nu_2} - h^{-3p-3\nu_2})$$
$$+ m_2 h^{2p-1}(h^{-3p-3\nu_2} + 2h^{-2p-2\nu_2} + h^{-2p-2\nu_2-\xi} + 3h^{-p-\nu_2-\xi} + h^{-2\xi}) \qquad (4.4)$$

Similarly, for the lower bound on $\tilde\beta_i$ we get

$$\tilde\beta_i \ge 2Dm_2 h^{p-1}(1-\delta) - m_2 h^{p-1}(1+\delta)^3 - m_2 h^{2p-1}\left(\delta^2 + 2\delta + \frac{\mu}{\sqrt n}\right)\left(\delta + \frac{\mu}{\sqrt n}\right)$$
$$= 2Dm_2 h^{p-1}(1 - h^{-p-\nu_2}) - m_2 h^{p-1}(1 + 3h^{-p-\nu_2} + 3h^{-2p-2\nu_2} + h^{-3p-3\nu_2})$$
$$- m_2 h^{2p-1}(h^{-3p-3\nu_2} + 2h^{-2p-2\nu_2} + h^{-2p-2\nu_2-\xi} + 3h^{-p-\nu_2-\xi} + h^{-2\xi}) \qquad (4.5)$$

Thus for $0 < p < 2\xi$ and $D = 1$, we have $\tilde\beta_i = \Theta\left(m_2 h^{p-1}\right)$.


Bounding $\tilde\alpha_i$

$$\tilde\alpha_i = \mathbb{E}_S\left[\mathbb{1}_{i\in S}\left\{m_2\sum_{k\in S}(W_i^\top A_k^*)^2\right\}\right] = \mathbb{E}_S\left[\mathbb{1}_{i\in S}\left\{m_2\langle W_i, A_i^*\rangle^2 + m_2\sum_{\substack{k\in S\\ k\neq i}}\langle W_i, A_k^*\rangle^2\right\}\right]$$
$$= m_2\langle W_i, A_i^*\rangle^2\sum_{\{S:\, i\in S\}} q_S + m_2\sum_{\substack{k=1\\ k\neq i}}^{h}\langle W_i, A_k^*\rangle^2\left(\sum_{\{S:\, i,k\in S,\, i\neq k\}} q_S\right) = q_i m_2\langle W_i, A_i^*\rangle^2 + m_2\sum_{\substack{k=1\\ k\neq i}}^{h} q_{ik}\langle W_i, A_k^*\rangle^2$$
$$= h^{p-1} m_2\langle W_i, A_i^*\rangle^2 + m_2 h^{2p-1}\max_{k\neq i}\langle W_i, A_k^*\rangle^2$$

The above implies the following bounds,

$$h^{p-1} m_2\left(1 - h^{-p-\nu_2}\right)^2 \le \tilde\alpha_i \le h^{p-1} m_2\left(1 + h^{-p-\nu_2}\right)^2 + m_2 h^{2p-1}\left(h^{-p-\nu_2} + h^{-\xi}\right)^2 \qquad (4.6)$$

As long as $0 < p < 2\xi$, $\tilde\alpha_i = \Theta\left(m_2 h^{p-1}\right)$.

Bounding $\|\tilde e_i\|_2$

$$\tilde e_i = \mathbb{E}_S\left[\mathbb{1}_{i\in S}\times\left\{m_2\sum_{\substack{j,k\in S\\ j\neq i}}(W_i^\top A_k^*)(W_j^\top A_k^*)W_j + (-2D)m_2\sum_{\substack{j\in S\\ j\neq i}}(W_i^\top A_j^*)A_j^*\right\}\right] + \mathbb{E}_S\left[\mathbb{1}_{i\in S}\times\left\{m_2\sum_{\substack{j,k\in S\\ k\neq i}}(W_i^\top W_j)(W_j^\top A_k^*)A_k^*\right\}\right]$$

Expanding further over the summation of the j and the k indices we have,


$$\tilde e_i = \mathbb{E}_S\Bigg[\mathbb{1}_{i\in S}\times m_2\Bigg\{\sum_{j(=k)\in S\setminus i}(W_i^\top A_j^*)(W_j^\top A_j^*)W_j + \sum_{\substack{j\in S\setminus i\\ k\in S\setminus\{i,j\}}}(W_i^\top A_k^*)(W_j^\top A_k^*)W_j + \sum_{\substack{j\in S\setminus i\\ k=i}}(W_i^\top A_i^*)(W_j^\top A_i^*)W_j\Bigg\}\Bigg]$$
$$+ \mathbb{E}_S\left[\mathbb{1}_{i\in S}\times(-2D)\,m_2\sum_{\substack{j\in S\\ j\neq i}}(W_i^\top A_j^*)A_j^*\right]$$
$$+ \mathbb{E}_S\Bigg[\mathbb{1}_{i\in S}\times m_2\Bigg\{\sum_{k(=j)\in S\setminus i}(W_i^\top W_k)(W_k^\top A_k^*)A_k^* + \sum_{\substack{k\in S\setminus i\\ j\in S\setminus\{i,k\}}}(W_i^\top W_j)(W_j^\top A_k^*)A_k^* + \sum_{\substack{k\in S\setminus i\\ j=i}}(W_i^\top W_i)(W_i^\top A_k^*)A_k^*\Bigg\}\Bigg]$$

Expanding the above in terms of qS we have,

$$\tilde e_i = m_2\Bigg\{\sum_{\substack{j=1\\ j\neq i}}^{h}(W_i^\top A_j^*)(W_j^\top A_j^*)W_j\sum_{\{S:\, i,j\in S,\, i\neq j\}} q_S + \sum_{\substack{j,k=1\\ j\neq k\neq i}}^{h}(W_i^\top A_k^*)(W_j^\top A_k^*)W_j\sum_{\{S:\, i,j,k\in S,\, i\neq j\neq k\}} q_S + \sum_{\substack{j=1\\ j\neq i}}^{h}(W_i^\top A_i^*)(W_j^\top A_i^*)W_j\sum_{\{S:\, i,j\in S,\, i\neq j\}} q_S\Bigg\}$$
$$+ (-2D)\,m_2\Bigg\{\sum_{\substack{j=1\\ j\neq i}}^{h}(W_i^\top A_j^*)A_j^*\sum_{\{S:\, i,j\in S,\, i\neq j\}} q_S\Bigg\}$$
$$+ m_2\Bigg\{\sum_{\substack{k=1\\ k\neq i}}^{h}(W_i^\top W_k)(W_k^\top A_k^*)A_k^*\sum_{\{S:\, i,k\in S,\, i\neq k\}} q_S + \sum_{\substack{j,k=1\\ j\neq i,\ j\neq k,\ k\neq i}}^{h}(W_i^\top W_j)(W_j^\top A_k^*)A_k^*\sum_{\{S:\, i,j,k\in S,\, i\neq j\neq k\}} q_S + \sum_{\substack{k=1\\ k\neq i}}^{h}(W_i^\top W_i)(W_i^\top A_k^*)A_k^*\sum_{\{S:\, i,k\in S,\, i\neq k\}} q_S\Bigg\}$$

Expanding the qS dependency in terms of qij and qijk we have,


$$\tilde e_i = m_2\Bigg\{\sum_{\substack{j=1\\ j\neq i}}^{h} q_{ij}(W_i^\top A_j^*)(W_j^\top A_j^*)W_j + \sum_{\substack{j,k=1\\ j\neq k\neq i}}^{h} q_{ijk}(W_i^\top A_k^*)(W_j^\top A_k^*)W_j + \sum_{\substack{j=1\\ j\neq i}}^{h} q_{ij}(W_i^\top A_i^*)(W_j^\top A_i^*)W_j\Bigg\} + (-2D)\,m_2\Bigg\{\sum_{\substack{j=1\\ j\neq i}}^{h} q_{ij}(W_i^\top A_j^*)A_j^*\Bigg\}$$
$$+ m_2\Bigg\{\sum_{\substack{k=1\\ k\neq i}}^{h} q_{ik}(W_i^\top W_k)(W_k^\top A_k^*)A_k^* + \sum_{\substack{j,k=1\\ j\neq i,\ j\neq k,\ k\neq i}}^{h} q_{ijk}(W_i^\top W_j)(W_j^\top A_k^*)A_k^* + \sum_{\substack{k=1\\ k\neq i}}^{h} q_{ik}(W_i^\top W_i)(W_i^\top A_k^*)A_k^*\Bigg\}$$

Upper bounding the norm of this vector e˜i we get,

$$\|\tilde e_i\| \le m_2 h^{2p-1}\left(\delta + \frac{\mu}{\sqrt n}\right)(1+\delta)^2 + m_2 h^{3p-1}\left(\delta + \frac{\mu}{\sqrt n}\right)^2(1+\delta) + m_2 h^{2p-1}\left(\delta + \frac{\mu}{\sqrt n}\right)(1+\delta)^2 + 2Dm_2 h^{2p-1}\left(\delta + \frac{\mu}{\sqrt n}\right)$$
$$+ m_2 h^{2p-1}\left(\delta^2 + 2\delta + \frac{\mu}{\sqrt n}\right)(1+\delta) + m_2 h^{3p-1}\left(\delta^2 + 2\delta + \frac{\mu}{\sqrt n}\right)\left(\delta + \frac{\mu}{\sqrt n}\right) + m_2 h^{2p-1}\left(\delta + \frac{\mu}{\sqrt n}\right)(1+\delta)^2$$

$$\le m_2 h^{2p-1}(h^{-p-\nu_2} + 2h^{-2p-2\nu_2} + h^{-3p-3\nu_2} + 2h^{-p-\nu_2-\xi} + h^{-2p-2\nu_2-\xi} + h^{-\xi})$$
$$+ m_2 h^{3p-1}(h^{-2p-2\nu_2} + h^{-3p-3\nu_2} + 2h^{-p-\nu_2-\xi} + 2h^{-2p-2\nu_2-\xi} + h^{-2\xi} + h^{-p-\nu_2-2\xi})$$
$$+ m_2 h^{2p-1}(h^{-p-\nu_2} + 2h^{-2p-2\nu_2} + h^{-3p-3\nu_2} + 2h^{-p-\nu_2-\xi} + h^{-2p-2\nu_2-\xi} + h^{-\xi})$$
$$+ 2Dm_2 h^{2p-1}(h^{-p-\nu_2} + h^{-\xi})$$
$$+ m_2 h^{2p-1}(2h^{-p-\nu_2} + 3h^{-2p-2\nu_2} + h^{-3p-3\nu_2} + h^{-p-\nu_2-\xi} + h^{-\xi})$$
$$+ m_2 h^{3p-1}(2h^{-2p-2\nu_2} + h^{-3p-3\nu_2} + 3h^{-p-\nu_2-\xi} + h^{-2p-2\nu_2-\xi} + h^{-2\xi})$$
$$+ m_2 h^{2p-1}(h^{-p-\nu_2} + 2h^{-2p-2\nu_2} + h^{-3p-3\nu_2} + 2h^{-p-\nu_2-\xi} + h^{-2p-2\nu_2-\xi} + h^{-\xi}) \qquad (4.7)$$

If $D = 1$ and $0 < p < \xi$, we get $\|\tilde e_i\| = o(m_2 h^{p-1})$.


4.B.2 Estimating the $m_1$-dependent parts of the derivative

We continue working in the same regime for the W matrix as in the previous subsection. Hence the same inequalities as listed at the beginning of the previous subsection continue to hold and we use them to get the following bounds,

Bounding $\hat\alpha_i$

$$\hat\alpha_i = \mathbb{E}_S\left[\mathbb{1}_{i\in S}\times\left\{m_1^2\sum_{\substack{k,l\in S\\ k\neq l}}(W_i^\top A_k^*)(W_i^\top A_l^*) - 2m_1\sum_{k\in S}\epsilon_i(W_i^\top A_k^*) + \epsilon_i^2\right\}\right]$$
$$= \mathbb{E}_S\Bigg[\mathbb{1}_{i\in S}\times\Bigg\{m_1^2\sum_{\substack{k\in S\\ k\neq i}}\langle W_i, A_k^*\rangle\langle W_i, A_i^*\rangle + m_1^2\sum_{\substack{l\in S\\ l\neq i}}\langle W_i, A_i^*\rangle\langle W_i, A_l^*\rangle + m_1^2\sum_{\substack{k,l\in S\\ k\neq l,\ k\neq i,\ l\neq i}}\langle W_i, A_k^*\rangle\langle W_i, A_l^*\rangle - 2m_1\epsilon_i\langle W_i, A_i^*\rangle - 2m_1\sum_{\substack{k\in S\\ k\neq i}}\epsilon_i\langle W_i, A_k^*\rangle + \epsilon_i^2\Bigg\}\Bigg]$$
$$\Rightarrow \hat\alpha_i = 2m_1^2\sum_{\substack{k=1\\ k\neq i}}^{h} q_{ik}\langle W_i, A_k^*\rangle\langle W_i, A_i^*\rangle + m_1^2\sum_{\substack{k,l=1\\ k\neq l,\ k\neq i,\ l\neq i}}^{h} q_{ikl}\langle W_i, A_k^*\rangle\langle W_i, A_l^*\rangle - 2m_1 q_i\epsilon_i\langle W_i, A_i^*\rangle - 2m_1\sum_{\substack{k=1\\ k\neq i}}^{h} q_{ik}\epsilon_i\langle W_i, A_k^*\rangle + q_i\epsilon_i^2$$

We plug in $\epsilon_i = 2m_1 h^p\left(\delta + \frac{\mu}{\sqrt n}\right)$ for $i = 1, \dots, h$.


$$|\hat\alpha_i| \le 2m_1^2 h^{2p-1}\left(\delta + \frac{\mu}{\sqrt n}\right)(1+\delta) + m_1^2 h^{3p-1}\left(\delta + \frac{\mu}{\sqrt n}\right)^2 + 4m_1^2 h^{2p-1}(1+\delta)\left(\delta + \frac{\mu}{\sqrt n}\right) + 4m_1^2 h^{3p-1}\left(\delta + \frac{\mu}{\sqrt n}\right)^2 + 4m_1^2 h^{3p-1}\left(\delta + \frac{\mu}{\sqrt n}\right)^2$$

$$= 2m_1^2 h^{2p-1}(h^{-p-\nu_2} + h^{-2p-2\nu_2} + h^{-p-\nu_2-\xi} + h^{-\xi}) + m_1^2 h^{3p-1}(h^{-2p-2\nu_2} + 2h^{-p-\nu_2-\xi} + h^{-2\xi})$$
$$+ 4m_1^2 h^{2p-1}(h^{-p-\nu_2} + h^{-2p-2\nu_2} + h^{-\xi} + h^{-p-\nu_2-\xi}) + 4m_1^2 h^{3p-1}(h^{-2p-2\nu_2} + 2h^{-p-\nu_2-\xi} + h^{-2\xi})$$
$$+ 4m_1^2 h^{3p-1}(h^{-2p-2\nu_2} + 2h^{-p-\nu_2-\xi} + h^{-2\xi})$$

This means that if $p < \xi$, $|\hat\alpha_i| = o(m_1^2 h^{p-1})$. Putting this together with the bounds obtained below equation 4.6, we get that $\alpha_i = \Theta(m_2 h^{p-1}) + o(m_1^2 h^{p-1})$.

Bounding $\hat\beta_i$

$$\hat\beta_i = \mathbb{E}_S\left[\mathbb{1}_{i\in S}\times\left\{2Dm_1^2\sum_{\substack{k\in S\\ k\neq i}}(W_i^\top A_k^*) - Dm_1\epsilon_i + m_1\sum_{j\in S}\epsilon_j(W_i^\top W_j) - m_1^2\sum_{\substack{j,l\in S\\ l\neq i}}(W_i^\top W_j)(W_j^\top A_l^*)\right\}\right]$$
$$= 2Dm_1^2\sum_{\substack{k=1\\ k\neq i}}^{h}\langle W_i, A_k^*\rangle\sum_{\{S:\, i,k\in S,\, k\neq i\}} q_S - Dm_1\epsilon_i\sum_{\{S:\, i\in S\}} q_S + m_1\epsilon_i\|W_i\|^2\sum_{\{S:\, i\in S\}} q_S + m_1\sum_{\substack{j=1\\ j\neq i}}^{h}\epsilon_j\langle W_i, W_j\rangle\sum_{\{S:\, i,j\in S,\, j\neq i\}} q_S$$
$$- m_1^2\sum_{\substack{l=1\\ l\neq i}}^{h}\|W_i\|^2\langle W_i, A_l^*\rangle\sum_{\{S:\, i,l\in S,\, l\neq i\}} q_S - m_1^2\sum_{\substack{l=1\\ l\neq i}}^{h}\langle W_i, W_l\rangle\langle W_l, A_l^*\rangle\sum_{\{S:\, i,l\in S,\, l\neq i\}} q_S - m_1^2\sum_{\substack{j,l=1\\ l\neq i,\ j\neq l,\ j\neq i}}^{h}\langle W_i, W_j\rangle\langle W_j, A_l^*\rangle\sum_{\{S:\, i,j,l\in S,\, l\neq i\neq j\}} q_S$$
$$= 2Dm_1^2\sum_{\substack{k=1\\ k\neq i}}^{h} q_{ik}\langle W_i, A_k^*\rangle - Dm_1\epsilon_i q_i + m_1\epsilon_i\|W_i\|^2 q_i + m_1\sum_{\substack{j=1\\ j\neq i}}^{h}\epsilon_j q_{ij}\langle W_i, W_j\rangle - m_1^2\sum_{\substack{l=1\\ l\neq i}}^{h}\|W_i\|^2\langle W_i, A_l^*\rangle q_{il} - m_1^2\sum_{\substack{l=1\\ l\neq i}}^{h}\langle W_i, W_l\rangle\langle W_l, A_l^*\rangle q_{il} - m_1^2\sum_{\substack{j,l=1\\ l\neq i,\ j\neq l,\ j\neq i}}^{h}\langle W_i, W_j\rangle\langle W_j, A_l^*\rangle q_{ijl}$$

We plug in $\epsilon_i = 2m_1 h^p\left(\delta + \frac{\mu}{\sqrt n}\right)$ for $i = 1, \dots, h$.


$$|\hat\beta_i| \le 4Dm_1^2 h^{2p-1}\left(\delta + \frac{\mu}{\sqrt n}\right) + 2m_1^2 h^{2p-1}\left(\delta + \frac{\mu}{\sqrt n}\right)(1+\delta)^2 + 2m_1^2 h^{3p-1}\left(\delta + \frac{\mu}{\sqrt n}\right)\left(\delta^2 + 2\delta + \frac{\mu}{\sqrt n}\right)$$
$$+ m_1^2 h^{2p-1}(1+\delta)^2\left(\delta + \frac{\mu}{\sqrt n}\right) + m_1^2 h^{2p-1}\left(\delta^2 + 2\delta + \frac{\mu}{\sqrt n}\right)(1+\delta) + m_1^2 h^{3p-1}\left(\delta^2 + 2\delta + \frac{\mu}{\sqrt n}\right)\left(\delta + \frac{\mu}{\sqrt n}\right)$$

$$= 4Dm_1^2 h^{2p-1}(h^{-p-\nu_2} + h^{-\xi})$$
$$+ 2m_1^2 h^{2p-1}(h^{-p-\nu_2} + 2h^{-2p-2\nu_2} + h^{-3p-3\nu_2} + h^{-\xi} + 2h^{-p-\nu_2-\xi} + h^{-2p-2\nu_2-\xi})$$
$$+ 2m_1^2 h^{3p-1}(2h^{-2p-2\nu_2} + h^{-3p-3\nu_2} + 3h^{-p-\nu_2-\xi} + h^{-2p-2\nu_2-\xi} + h^{-2\xi})$$
$$+ m_1^2 h^{2p-1}(h^{-p-\nu_2} + 2h^{-2p-2\nu_2} + h^{-3p-3\nu_2} + h^{-\xi} + 2h^{-p-\nu_2-\xi} + h^{-2p-2\nu_2-\xi})$$
$$+ m_1^2 h^{2p-1}(3h^{-2p-2\nu_2} + h^{-3p-3\nu_2} + h^{-p-\nu_2-\xi} + 2h^{-p-\nu_2} + h^{-\xi})$$
$$+ m_1^2 h^{3p-1}(2h^{-2p-2\nu_2} + h^{-3p-3\nu_2} + 3h^{-p-\nu_2-\xi} + h^{-2p-2\nu_2-\xi} + h^{-2\xi})$$

This means that if $p < \xi$, $|\hat\beta_i| = o(m_1^2 h^{p-1})$. Putting this together with the bounds obtained below equation 4.4, we get that $\beta_i = \Theta(m_2 h^{p-1}) + o(m_1^2 h^{p-1})$.

Bounding $\|\hat e_i\|_2$

$$\hat e_i = \underbrace{\mathbb{E}_S\left[\mathbb{1}_{i\in S}\times\left\{\sum_{\substack{j\in S\\ j\neq i}}\epsilon_i\epsilon_j W_j - m_1\sum_{\substack{j,k\in S\\ j\neq i}}(W_j^\top A_k^*)W_j\epsilon_i - m_1\sum_{\substack{j,k\in S\\ j\neq i}}\epsilon_j(W_i^\top A_k^*)W_j\right\}\right]}_{\hat e_{i1}} + \underbrace{\mathbb{E}_S\left[\mathbb{1}_{i\in S}\times\left\{m_1^2\sum_{\substack{j,k,l\in S\\ j\neq i,\ k\neq l}}(W_i^\top A_k^*)(W_j^\top A_l^*)W_j\right\}\right]}_{\hat e_{i2}}$$
$$+ \underbrace{\mathbb{E}_S\left[\mathbb{1}_{i\in S}\times\left\{-2Dm_1^2\sum_{\substack{j,k\in S\\ j\neq i,\ k\neq i}}(W_i^\top A_k^*)A_j^* + Dm_1\sum_{\substack{j\in S\\ j\neq i}}\epsilon_i A_j^*\right\}\right]}_{\hat e_{i3}} + \underbrace{\mathbb{E}_S\left[\mathbb{1}_{i\in S}\times\left\{-m_1\sum_{\substack{j,k\in S\\ k\neq i}}\epsilon_j(W_i^\top W_j)A_k^* + m_1^2\sum_{\substack{j,k,l\in S\\ k\neq i,\ k\neq l}}(W_i^\top W_j)(W_j^\top A_l^*)A_k^*\right\}\right]}_{\hat e_{i4}}$$


We estimate the different summands separately.

$$\hat e_{i1} = \mathbb{E}_S\left[\mathbb{1}_{i\in S}\times\left\{\sum_{\substack{j\in S\\ j\neq i}}\epsilon_i\epsilon_j W_j\right\}\right]$$
$$+ \mathbb{E}_S\left[\mathbb{1}_{i\in S}\times(-m_1)\left\{\sum_{j(=k)\in S\setminus i}(W_j^\top A_j^*)W_j\epsilon_i + \sum_{\substack{j\in S\setminus i\\ k\in S\setminus\{i,j\}}}(W_j^\top A_k^*)W_j\epsilon_i + \sum_{\substack{j\in S\setminus i\\ k=i}}(W_j^\top A_i^*)W_j\epsilon_i\right\}\right]$$
$$+ \mathbb{E}_S\left[\mathbb{1}_{i\in S}\times(-m_1)\left\{\sum_{j(=k)\in S\setminus i}\epsilon_j(W_i^\top A_j^*)W_j + \sum_{\substack{j\in S\setminus i\\ k\in S\setminus\{i,j\}}}\epsilon_j(W_i^\top A_k^*)W_j + \sum_{\substack{j\in S\setminus i\\ k=i}}\epsilon_j(W_i^\top A_i^*)W_j\right\}\right]$$

We substitute $\epsilon = 2m_1 h^p(h^{-p-\nu_2} + h^{-\xi})$, and for any two vectors $x$ and $y$ and any two scalars $a$ and $b$ we use the inequality $\|ax + by\|_2 \le |a|_{\max}\|x\|_{2,\max} + |b|_{\max}\|y\|_{2,\max}$ to get,


$$\|\hat e_{i1}\|_2 \le 4m_1^2 h^{2p}\left(\delta + \frac{\mu}{\sqrt n}\right)^2\sum_{\substack{j=1\\ j\neq i}}^{h} q_{ij}\|W_j\| + 2m_1^2 h^p\left(\delta + \frac{\mu}{\sqrt n}\right)\left(\sum_{\substack{j=1\\ j\neq i}}^{h} q_{ij}\langle W_j, A_j^*\rangle\|W_j\| + \sum_{\substack{j,k=1\\ j\neq i,\ k\neq i,j}}^{h} q_{ijk}\langle W_j, A_k^*\rangle\|W_j\| + \sum_{\substack{j=1\\ j\neq i}}^{h} q_{ij}\langle W_j, A_i^*\rangle\|W_j\|\right)$$
$$+ 2m_1^2 h^p\left(\delta + \frac{\mu}{\sqrt n}\right)\left(\sum_{\substack{j=1\\ j\neq i}}^{h} q_{ij}\langle W_i, A_j^*\rangle\|W_j\| + \sum_{\substack{j,k=1\\ j\neq i,\ k\neq i,j}}^{h} q_{ijk}\langle W_i, A_k^*\rangle\|W_j\| + \sum_{\substack{j=1\\ j\neq i}}^{h} q_{ij}\langle W_i, A_i^*\rangle\|W_j\|\right)$$
$$\Rightarrow \|\hat e_{i1}\|_2 \le 4m_1^2 h^{2p} h^{2p-1}(1+\delta)\left(\delta + \frac{\mu}{\sqrt n}\right)^2 + 2m_1^2 h^p\left(\delta + \frac{\mu}{\sqrt n}\right)\left(h^{2p-1}(1+\delta)^2 + h^{3p-1}\left(\delta + \frac{\mu}{\sqrt n}\right)(1+\delta) + h^{2p-1}\left(\delta + \frac{\mu}{\sqrt n}\right)(1+\delta)\right)$$
$$+ 2m_1^2 h^p\left(\delta + \frac{\mu}{\sqrt n}\right)\left(h^{2p-1}\left(\delta + \frac{\mu}{\sqrt n}\right)(1+\delta) + h^{3p-1}\left(\delta + \frac{\mu}{\sqrt n}\right)(1+\delta) + h^{2p-1}(1+\delta)^2\right)$$
$$\Rightarrow \|\hat e_{i1}\|_2 \le 8m_1^2 h^{4p-1}(1+\delta)\left(\delta + \frac{\mu}{\sqrt n}\right)^2 + 4m_1^2 h^{3p-1}\left(\delta + \frac{\mu}{\sqrt n}\right)(1+\delta)^2 + 4m_1^2 h^{3p-1}\left(\delta + \frac{\mu}{\sqrt n}\right)^2(1+\delta)$$


$$\Rightarrow \|\hat e_{i1}\|_2 \le 8m_1^2 h^{4p-1}(h^{-2p-2\nu_2} + h^{-3p-3\nu_2} + 2h^{-p-\nu_2-\xi} + 2h^{-2p-2\nu_2-\xi} + h^{-p-\nu_2-2\xi} + h^{-2\xi})$$
$$+ 4m_1^2 h^{3p-1}(h^{-p-\nu_2} + h^{-3p-3\nu_2} + 2h^{-2p-2\nu_2} + h^{-\xi} + h^{-2p-2\nu_2-\xi} + 2h^{-p-\nu_2-\xi})$$
$$+ 4m_1^2 h^{3p-1}(h^{-2p-2\nu_2} + h^{-3p-3\nu_2} + 2h^{-p-\nu_2-\xi} + 2h^{-2p-2\nu_2-\xi} + h^{-p-\nu_2-2\xi} + h^{-2\xi})$$
$$= 8m_1^2 h^{p-1}(h^{p-2\nu_2} + h^{-3\nu_2} + 2h^{2p-\nu_2-\xi} + 2h^{p-2\nu_2-\xi} + h^{2p-\nu_2-2\xi} + h^{3p-2\xi})$$
$$+ 4m_1^2 h^{p-1}(h^{p-\nu_2} + h^{-p-3\nu_2} + 2h^{-2\nu_2} + h^{2p-\xi} + h^{-2\nu_2-\xi} + 2h^{p-\nu_2-\xi})$$
$$+ 4m_1^2 h^{p-1}(h^{-2\nu_2} + h^{-p-3\nu_2} + 2h^{p-\nu_2-\xi} + 2h^{-2\nu_2-\xi} + h^{p-\nu_2-2\xi} + h^{2p-2\xi})$$

From the above it follows that $\|\hat e_{i1}\|_2 = o(m_1^2 h^{p-1})$ for $p < \nu_2$ and $2p < \xi$. And now we start to estimate $\hat e_{i2}$.


$$\hat e_{i2} = \mathbb{E}_S\left[\mathbb{1}_{i\in S}\times\left\{m_1^2\sum_{\substack{j,k,l\in S\\ j\neq i,\ k\neq l}}(W_i^\top A_k^*)(W_j^\top A_l^*)W_j\right\}\right]$$
$$= \mathbb{E}_S\Bigg[\mathbb{1}_{i\in S}\times m_1^2\Bigg\{\sum_{\substack{j\in S\\ j\neq i}}(W_i^\top A_j^*)(W_j^\top A_i^*)W_j + \sum_{\substack{j,k\in S\\ k\neq j\neq i}}(W_i^\top A_k^*)(W_j^\top A_i^*)W_j + \sum_{\substack{j\in S\\ j\neq i}}(W_i^\top A_i^*)(W_j^\top A_j^*)W_j$$
$$+ \sum_{\substack{j,l\in S\\ l\neq j\neq i}}(W_i^\top A_i^*)(W_j^\top A_l^*)W_j + \sum_{\substack{j,l\in S\\ l\neq j\neq i}}(W_i^\top A_j^*)(W_j^\top A_l^*)W_j + \sum_{\substack{j,k\in S\\ k\neq j\neq i}}(W_i^\top A_k^*)(W_j^\top A_j^*)W_j + \sum_{\substack{j,k,l\in S\\ l\neq k\neq j\neq i}}(W_i^\top A_k^*)(W_j^\top A_l^*)W_j\Bigg\}\Bigg]$$
$$\Rightarrow \hat e_{i2} = m_1^2\Bigg\{\sum_{\substack{j=1\\ j\neq i}}^{h} q_{ij}(W_i^\top A_j^*)(W_j^\top A_i^*)W_j + \sum_{\substack{j,k=1\\ k\neq j\neq i}}^{h} q_{ijk}(W_i^\top A_k^*)(W_j^\top A_i^*)W_j + \underbrace{\sum_{\substack{j=1\\ j\neq i}}^{h} q_{ij}(W_i^\top A_i^*)(W_j^\top A_j^*)W_j}_{a}$$
$$+ \sum_{\substack{j,l=1\\ l\neq j\neq i}}^{h} q_{ijl}(W_i^\top A_i^*)(W_j^\top A_l^*)W_j + \sum_{\substack{j,l=1\\ l\neq j\neq i}}^{h} q_{ijl}(W_i^\top A_j^*)(W_j^\top A_l^*)W_j + \sum_{\substack{j,k=1\\ k\neq j\neq i}}^{h} q_{ijk}(W_i^\top A_k^*)(W_j^\top A_j^*)W_j + \sum_{\substack{j,k,l\in S\\ l\neq k\neq j\neq i}} q_{ijkl}(W_i^\top A_k^*)(W_j^\top A_l^*)W_j\Bigg\}$$
$$\Rightarrow \|\hat e_{i2}\| \le m_1^2\Bigg\{h^{2p-1}\left(\delta + \frac{\mu}{\sqrt n}\right)^2(1+\delta) + h^{3p-1}\left(\delta + \frac{\mu}{\sqrt n}\right)^2(1+\delta) + \|a\| + h^{3p-1}\left(\delta + \frac{\mu}{\sqrt n}\right)(1+\delta)^2 + h^{3p-1}\left(\delta + \frac{\mu}{\sqrt n}\right)^2(1+\delta) + h^{3p-1}\left(\delta + \frac{\mu}{\sqrt n}\right)(1+\delta)^2 + h^{4p-1}\left(\delta + \frac{\mu}{\sqrt n}\right)^2(1+\delta)\Bigg\}$$


$$\Rightarrow \|\hat e_{i2}\| \le m_1^2\Big\{h^{2p-1}(h^{-2p-2\nu_2} + h^{-3p-3\nu_2} + 2h^{-p-\nu_2-\xi} + 2h^{-2p-2\nu_2-\xi} + h^{-p-\nu_2-2\xi} + h^{-2\xi})$$
$$+ h^{3p-1}(h^{-2p-2\nu_2} + h^{-3p-3\nu_2} + 2h^{-p-\nu_2-\xi} + 2h^{-2p-2\nu_2-\xi} + h^{-p-\nu_2-2\xi} + h^{-2\xi}) + \|a\|$$
$$+ h^{3p-1}(h^{-p-\nu_2} + h^{-3p-3\nu_2} + 2h^{-2p-2\nu_2} + h^{-2p-2\nu_2-\xi} + 2h^{-p-\nu_2-\xi} + h^{-\xi})$$
$$+ h^{3p-1}(h^{-2p-2\nu_2} + h^{-3p-3\nu_2} + 2h^{-p-\nu_2-\xi} + 2h^{-2p-2\nu_2-\xi} + h^{-p-\nu_2-2\xi} + h^{-2\xi})$$
$$+ h^{3p-1}(h^{-p-\nu_2} + h^{-3p-3\nu_2} + 2h^{-2p-2\nu_2} + h^{-2p-2\nu_2-\xi} + 2h^{-p-\nu_2-\xi} + h^{-\xi})$$
$$+ h^{4p-1}(h^{-2p-2\nu_2} + h^{-3p-3\nu_2} + 2h^{-p-\nu_2-\xi} + 2h^{-2p-2\nu_2-\xi} + h^{-p-\nu_2-2\xi} + h^{-2\xi})\Big\}$$

$$\Rightarrow \|\hat e_{i2}\| \le m_1^2\Big\{h^{p-1}(h^{-p-2\nu_2} + h^{-2p-3\nu_2} + 2h^{-\nu_2-\xi} + 2h^{-p-2\nu_2-\xi} + h^{-\nu_2-2\xi} + h^{p-2\xi})$$
$$+ h^{p-1}(h^{-2\nu_2} + h^{-p-3\nu_2} + 2h^{p-\nu_2-\xi} + 2h^{-2\nu_2-\xi} + h^{p-\nu_2-2\xi} + h^{2p-2\xi}) + \|a\|$$
$$+ h^{p-1}(h^{p-\nu_2} + h^{-p-3\nu_2} + 2h^{-2\nu_2} + h^{-2\nu_2-\xi} + 2h^{p-\nu_2-\xi} + h^{2p-\xi})$$
$$+ h^{p-1}(h^{-2\nu_2} + h^{-p-3\nu_2} + 2h^{p-\nu_2-\xi} + 2h^{-2\nu_2-\xi} + h^{p-\nu_2-2\xi} + h^{2p-2\xi})$$
$$+ h^{p-1}(h^{p-\nu_2} + h^{-p-3\nu_2} + 2h^{-2\nu_2} + h^{-2\nu_2-\xi} + 2h^{p-\nu_2-\xi} + h^{2p-\xi})$$
$$+ h^{p-1}(h^{p-2\nu_2} + h^{-3\nu_2} + 2h^{2p-\nu_2-\xi} + 2h^{p-2\nu_2-\xi} + h^{2p-\nu_2-2\xi} + h^{3p-2\xi})\Big\}$$

Now let us find a bound for $\|a\|$.

$$a = \sum_{\substack{j=1\\ j\neq i}}^{h} q_{ij}(W_i^\top A_i^*)(W_j^\top A_j^*)W_j = \langle W_i, A_i^*\rangle\, q_{ij}\, W_{-j}^\top\, \mathrm{diag}(W_{-j} A^*_{-j})$$

where $A^*_{-j}$ is the dictionary $A^*$ with the $j$th column set to zero, $W_{-j}$ is the dictionary $W$ with the $j$th row set to zero, and $\mathrm{diag}(W_{-j} A^*_{-j})$ is the $h$-dimensional vector containing the diagonal elements of the matrix $W_{-j} A^*_{-j}$. We also make use of the distributional assumption that $q_{ij}$ is the same for all $i, j$ in order to pull $q_{ij}$ out of the sum.

$$\|a\|_2 = h^{2p-2}\langle W_i, A_i^*\rangle\,\|W_{-j}^\top\,\mathrm{diag}(W_{-j} A^*_{-j})\|_2 \le h^{2p-2}(1+\delta)\,\|W_{-j}^\top\|_2\,\|\mathrm{diag}(W_{-j} A^*_{-j})\|_2$$
$$\le h^{2p-2}(1+\delta)^2\, h^{1/2}\sqrt{\lambda_{\max}(W_{-j}^\top W_{-j})} \le h^{2p-2}(1+\delta)^2\, h^{1/2}\sqrt{h\left(\delta^2 + 2\delta + \frac{\mu}{\sqrt n}\right) + (1+\delta)^2}$$
$$= h^{p-1}\sqrt{h^{2p-2}\times h\times(1+\delta)^4\times\left(h\left(\delta^2 + 2\delta + \frac{\mu}{\sqrt n}\right) + (1+\delta)^2\right)}$$
$$= h^{p-1}\sqrt{h^{2p-1}(1+h^{-p-\nu_2})^4\left(h(h^{-2p-2\nu_2} + 2h^{-p-\nu_2} + h^{-\xi}) + (1+h^{-p-\nu_2})^2\right)}$$
$$= h^{p-1}\sqrt{(1+h^{-p-\nu_2})^4\left(h^{-2\nu_2} + 2h^{p-\nu_2} + h^{2p-\xi} + h^{2p-1}(1+h^{-p-\nu_2})^2\right)}$$

Here $\|W_{-j}^\top\|_2$ is the spectral norm of $W_{-j}^\top$, i.e., the top singular value of the matrix. We use Gershgorin's Circle Theorem to bound the top eigenvalue of $W_{-j}^\top W_{-j}$ by its maximum row sum.
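For completeness, a small numerical illustration of this Gershgorin step (bounding the top eigenvalue of a Gram matrix by its maximum absolute row sum); the matrix below is our own toy example, not the dictionaries of the theorem.

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.standard_normal((10, 50))
W /= np.linalg.norm(W, axis=1, keepdims=True)   # unit-norm rows

G = W @ W.T                                     # Gram matrix W W^T
top_eig = np.linalg.eigvalsh(G).max()
gershgorin = np.abs(G).sum(axis=1).max()        # max absolute row sum
assert top_eig <= gershgorin + 1e-10            # Gershgorin's bound holds
print(top_eig, gershgorin)
```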

If $p < \frac{\xi}{2}$, $p < \frac{1}{2}$, and $p < \nu_2$, then $\|\hat e_{i2}\| = o(m_1^2 h^{p-1})$. And now we start to estimate $\hat e_{i3}$ as follows.

$$\hat e_{i3} = \mathbb{E}_S\left[\mathbb{1}_{i\in S}\times\left\{Dm_1\sum_{\substack{j\in S\\ j\neq i}}\epsilon_i A_j^* - 2Dm_1^2\sum_{\substack{j,k\in S\\ j\neq i,\ k\neq i}}(W_i^\top A_k^*)A_j^*\right\}\right]$$
$$= \mathbb{E}_S\left[\mathbb{1}_{i\in S}\times\left\{Dm_1\sum_{\substack{j\in S\\ j\neq i}}\epsilon_i A_j^* - 2Dm_1^2\sum_{\substack{j\in S\\ j\neq i}}(W_i^\top A_j^*)A_j^* - 2Dm_1^2\sum_{\substack{j,k\in S\\ k\neq j\neq i}}(W_i^\top A_k^*)A_j^*\right\}\right]$$
$$= Dm_1\sum_{\substack{j=1\\ j\neq i}}^{h}\epsilon_i A_j^*\sum_{\{S:\, i,j\in S,\, i\neq j\}} q_S - 2Dm_1^2\sum_{\substack{j=1\\ j\neq i}}^{h}(W_i^\top A_j^*)A_j^*\sum_{\{S:\, i,j\in S,\, i\neq j\}} q_S - 2Dm_1^2\sum_{\substack{j,k=1\\ k\neq j\neq i}}^{h}(W_i^\top A_k^*)A_j^*\sum_{\{S:\, i,j,k\in S,\, i\neq j\neq k\}} q_S$$
$$= Dm_1\sum_{\substack{j=1\\ j\neq i}}^{h} q_{ij}\epsilon_i A_j^* - 2Dm_1^2\sum_{\substack{j=1\\ j\neq i}}^{h} q_{ij}(W_i^\top A_j^*)A_j^* - 2Dm_1^2\sum_{\substack{j,k=1\\ k\neq j\neq i}}^{h} q_{ijk}(W_i^\top A_k^*)A_j^*$$

We plug in $\epsilon_i = 2m_1 h^p\left(\delta + \frac{\mu}{\sqrt n}\right)$ for $i = 1, \dots, h$.


$$\|\hat e_{i3}\| \le 2Dm_1^2 h^{3p-1}\left(\delta + \frac{\mu}{\sqrt n}\right) + 2Dm_1^2 h^{2p-1}\left(\delta + \frac{\mu}{\sqrt n}\right) + 2Dm_1^2 h^{3p-1}\left(\delta + \frac{\mu}{\sqrt n}\right)$$
$$= 4Dm_1^2 h^{3p-1}(h^{-p-\nu_2} + h^{-\xi}) + 2Dm_1^2 h^{2p-1}(h^{-p-\nu_2} + h^{-\xi})$$
$$= 4Dm_1^2 h^{p-1}(h^{p-\nu_2} + h^{2p-\xi}) + 2Dm_1^2 h^{p-1}(h^{-\nu_2} + h^{p-\xi})$$

This means that for $D = 1$, $p < \nu_2$ and $p < \frac{\xi}{2}$, we have $\|\hat e_{i3}\| = o(m_1^2 h^{p-1})$. And now we start to estimate $\hat e_{i4}$ as follows.


$$\hat e_{i4} = \mathbb{E}_S\left[\mathbb{1}_{i\in S}\times\left\{-m_1\sum_{\substack{j,k\in S\\ k\neq i}}\epsilon_j(W_i^\top W_j)A_k^* + m_1^2\sum_{\substack{j,k,l\in S\\ k\neq i,\ k\neq l}}(W_i^\top W_j)(W_j^\top A_l^*)A_k^*\right\}\right]$$
$$= \mathbb{E}_S\left[\mathbb{1}_{i\in S}\times(-m_1)\left\{\sum_{k(=j)\in S\setminus i}\epsilon_k(W_i^\top W_k)A_k^* + \sum_{\substack{k\in S\setminus i\\ j\in S\setminus\{i,k\}}}\epsilon_j(W_i^\top W_j)A_k^* + \sum_{\substack{k\in S\setminus i\\ j=i}}\epsilon_i(W_i^\top W_i)A_k^*\right\}\right]$$
$$+ \mathbb{E}_S\Bigg[\mathbb{1}_{i\in S}\times m_1^2\Bigg\{\sum_{\substack{k\in S\\ k\neq i}}(W_i^\top W_i)(W_i^\top A_i^*)A_k^* + \sum_{\substack{k\in S\\ k\neq i}}(W_i^\top W_k)(W_k^\top A_i^*)A_k^* + \sum_{\substack{j,k\in S\\ j\neq k\neq i}}(W_i^\top W_j)(W_j^\top A_i^*)A_k^*$$
$$+ \sum_{\substack{k,l\in S\\ l\neq k\neq i}}(W_i^\top W_i)(W_i^\top A_l^*)A_k^* + \sum_{\substack{k,l\in S\\ l\neq k\neq i}}(W_i^\top W_k)(W_k^\top A_l^*)A_k^* + \sum_{\substack{k,l\in S\\ l\neq k\neq i}}(W_i^\top W_l)(W_l^\top A_l^*)A_k^* + \sum_{\substack{j,k,l\in S\\ j\neq k\neq l\neq i}}(W_i^\top W_j)(W_j^\top A_l^*)A_k^*\Bigg\}\Bigg]$$
$$\Rightarrow \hat e_{i4} = (-m_1)\left\{\sum_{\substack{k=1\\ k\neq i}}^{h} q_{ik}\epsilon_k(W_i^\top W_k)A_k^* + \sum_{\substack{j,k=1\\ j\neq k\neq i}}^{h} q_{ijk}\epsilon_j(W_i^\top W_j)A_k^* + \sum_{\substack{k=1\\ k\neq i}}^{h} q_{ik}\epsilon_i(W_i^\top W_i)A_k^*\right\}$$
$$+ m_1^2\Bigg\{\underbrace{\sum_{\substack{k=1\\ k\neq i}}^{h} q_{ik}(W_i^\top W_i)(W_i^\top A_i^*)A_k^*}_{b} + \sum_{\substack{k=1\\ k\neq i}}^{h} q_{ik}(W_i^\top W_k)(W_k^\top A_i^*)A_k^* + \sum_{\substack{j,k=1\\ j\neq k\neq i}}^{h} q_{ijk}(W_i^\top W_j)(W_j^\top A_i^*)A_k^*$$
$$+ \sum_{\substack{k,l=1\\ l\neq k\neq i}}^{h} q_{ikl}(W_i^\top W_i)(W_i^\top A_l^*)A_k^* + \sum_{\substack{k,l=1\\ l\neq k\neq i}}^{h} q_{ikl}(W_i^\top W_k)(W_k^\top A_l^*)A_k^* + \sum_{\substack{k,l=1\\ l\neq k\neq i}}^{h} q_{ikl}(W_i^\top W_l)(W_l^\top A_l^*)A_k^* + \sum_{\substack{j,k,l=1\\ j\neq k\neq l\neq i}}^{h} q_{ijkl}(W_i^\top W_j)(W_j^\top A_l^*)A_k^*\Bigg\}$$

We plug in $\epsilon_i = 2m_1 h^p\left(\delta + \frac{\mu}{\sqrt n}\right)$ for $i = 1, \dots, h$ in the above to get,


$$\|\hat e_{i4}\| \le 2m_1^2 h^{3p-1}\left(\delta + \frac{\mu}{\sqrt n}\right)^2 + 2m_1^2 h^{4p-1}\left(\delta + \frac{\mu}{\sqrt n}\right)\left(\delta^2 + 2\delta + \frac{\mu}{\sqrt n}\right) + 2m_1^2 h^{3p-1}\left(\delta + \frac{\mu}{\sqrt n}\right)(1+\delta)^2$$
$$+ m_1^2\|b\| + m_1^2 h^{2p-1}\left(\delta + \frac{\mu}{\sqrt n}\right)\left(\delta^2 + 2\delta + \frac{\mu}{\sqrt n}\right) + m_1^2 h^{3p-1}\left(\delta + \frac{\mu}{\sqrt n}\right)\left(\delta^2 + 2\delta + \frac{\mu}{\sqrt n}\right) + m_1^2 h^{3p-1}(1+\delta)^2\left(\delta + \frac{\mu}{\sqrt n}\right)$$
$$+ m_1^2 h^{3p-1}\left(\delta + \frac{\mu}{\sqrt n}\right)\left(\delta^2 + 2\delta + \frac{\mu}{\sqrt n}\right) + m_1^2 h^{3p-1}(1+\delta)^2\left(\delta^2 + 2\delta + \frac{\mu}{\sqrt n}\right) + m_1^2 h^{4p-1}\left(\delta + \frac{\mu}{\sqrt n}\right)\left(\delta^2 + 2\delta + \frac{\mu}{\sqrt n}\right)$$

$$\Rightarrow \|\hat e_{i4}\| \le 2m_1^2 h^{3p-1}\left(\delta + \frac{\mu}{\sqrt n}\right)^2 + 3m_1^2 h^{4p-1}\left(\delta + \frac{\mu}{\sqrt n}\right)\left(\delta^2 + 2\delta + \frac{\mu}{\sqrt n}\right) + 3m_1^2 h^{3p-1}\left(\delta + \frac{\mu}{\sqrt n}\right)(1+\delta)^2$$
$$+ m_1^2\|b\| + m_1^2 h^{2p-1}\left(\delta + \frac{\mu}{\sqrt n}\right)\left(\delta^2 + 2\delta + \frac{\mu}{\sqrt n}\right) + 2m_1^2 h^{3p-1}\left(\delta + \frac{\mu}{\sqrt n}\right)\left(\delta^2 + 2\delta + \frac{\mu}{\sqrt n}\right) + m_1^2 h^{3p-1}(1+\delta)^2\left(\delta^2 + 2\delta + \frac{\mu}{\sqrt n}\right)$$

$$\Rightarrow \|\hat e_{i4}\| \le 2m_1^2 h^{3p-1}(h^{-2p-2\nu_2} + 2h^{-p-\nu_2-\xi} + h^{-2\xi}) + 3m_1^2 h^{4p-1}(h^{-3p-3\nu_2} + 2h^{-2p-2\nu_2} + 3h^{-p-\nu_2-\xi} + h^{-2p-2\nu_2-\xi} + h^{-2\xi})$$
$$+ 3m_1^2 h^{3p-1}(h^{-3p-3\nu_2} + 2h^{-2p-2\nu_2} + 2h^{-p-\nu_2-\xi} + h^{-2p-2\nu_2-\xi} + h^{-\xi} + h^{-p-\nu_2})$$
$$+ m_1^2\|b\| + m_1^2 h^{2p-1}(h^{-3p-3\nu_2} + 2h^{-2p-2\nu_2} + 3h^{-p-\nu_2-\xi} + h^{-2p-2\nu_2-\xi} + h^{-2\xi})$$
$$+ 2m_1^2 h^{3p-1}(h^{-3p-3\nu_2} + 2h^{-2p-2\nu_2} + 3h^{-p-\nu_2-\xi} + h^{-2p-2\nu_2-\xi} + h^{-2\xi})$$
$$+ m_1^2 h^{3p-1}(h^{-3p-3\nu_2} + 3h^{-2p-2\nu_2} + h^{-p-\nu_2-\xi} + h^{-\xi} + 2h^{-p-\nu_2})$$

$$\Rightarrow \|\hat e_{i4}\| \le 2m_1^2 h^{p-1}(h^{-2\nu_2} + 2h^{p-\nu_2-\xi} + h^{2p-2\xi}) + 3m_1^2 h^{p-1}(h^{-3\nu_2} + 2h^{p-2\nu_2} + 3h^{2p-\nu_2-\xi} + h^{p-2\nu_2-\xi} + h^{3p-2\xi})$$
$$+ 3m_1^2 h^{p-1}(h^{-p-3\nu_2} + 2h^{-2\nu_2} + 2h^{p-\nu_2-\xi} + h^{-2\nu_2-\xi} + h^{2p-\xi} + h^{p-\nu_2})$$
$$+ m_1^2\|b\| + m_1^2 h^{p-1}(h^{-2p-3\nu_2} + 2h^{-p-2\nu_2} + 3h^{-\nu_2-\xi} + h^{-p-2\nu_2-\xi} + h^{p-2\xi})$$
$$+ 2m_1^2 h^{p-1}(h^{-p-3\nu_2} + 2h^{-2\nu_2} + 3h^{p-\nu_2-\xi} + h^{-2\nu_2-\xi} + h^{2p-2\xi})$$
$$+ m_1^2 h^{p-1}(h^{-p-3\nu_2} + 3h^{-2\nu_2} + h^{p-\nu_2-\xi} + h^{2p-\xi} + 2h^{p-\nu_2})$$


Now let us find a bound for $\|b\|$.

$$b = \sum_{\substack{k=1\\ k\neq i}}^{h} q_{ik}(W_i^\top W_i)(W_i^\top A_i^*)A_k^* = \langle W_i, W_i\rangle\langle W_i, A_i^*\rangle\, q_{ik}\, A^*_{-i}\mathbb{1}_h$$

where $A^*_{-i}$ is the dictionary $A^*$ with the $i$th column set to zero, and $\mathbb{1}_h \in \mathbb{R}^h$ is the $h$-dimensional vector of all ones. Here we make use of the distributional assumption that $q_{ik}$ is the same for all $i, k$ in order to pull $q_{ik}$ out of the sum.

$$\|b\|_2 = h^{2p-2}\langle W_i, W_i\rangle\langle W_i, A_i^*\rangle\,\|A^*_{-i}\mathbb{1}_h\|_2 \le h^{2p-2}(1+\delta)^3\,\|A^*_{-i}\|_2\,\|\mathbb{1}_h\|_2 = h^{2p-2}(1+\delta)^3\, h^{1/2}\sqrt{\lambda_{\max}(A^{*\top}_{-i}A^*_{-i})}$$
$$= h^{2p-2}(1+\delta)^3\, h^{1/2}\sqrt{h\frac{\mu}{\sqrt n} + 1} = h^{p-1}\sqrt{h^{2p-2}\times h\times(1+\delta)^6\left(h\frac{\mu}{\sqrt n} + 1\right)}$$
$$= h^{p-1}\sqrt{h^{2p-1}(1+h^{-p-\nu_2})^6\left(h^{1-\xi} + 1\right)} = h^{p-1}\sqrt{(1+h^{-p-\nu_2})^6\left(h^{2p-\xi} + h^{2p-1}\right)}$$

Here $\|A^*_{-i}\|_2$ is the spectral norm of $A^*_{-i}$, i.e., its top singular value. We use Gershgorin's Circle Theorem to bound the top eigenvalue of $A^{*\top}_{-i}A^*_{-i}$ by its maximum row sum. If $p < \frac{\xi}{2}$, $p < \frac{1}{2}$, and $p < \nu_2$, then $\|\hat e_{i4}\| = o(m_1^2 h^{p-1})$. Now we combine the above obtained bounds on $\|\hat e_{it}\|$ (for $t \in \{1, 2, 3, 4\}$) with the bound obtained below equation 4.7 to conclude that $\|e_i\| = o(\max\{m_1^2, m_2\}\, h^{p-1})$.

4.B.3 About $\alpha_i - \beta_i$

Remembering that $D = 1$, a close scrutiny of the terms in 4.6 and 4.4 indicates that the coefficients of the $m_2 h^{p-1}$ term (the term with the highest $h$-scaling in the $m_2$-dependent parts of $\alpha_i$ and $\beta_i$) are the same in each of them. So this largest term cancels in the difference, and we are left with the sub-leading order terms coming from both their $m_1^2$ as well as their $m_2$ parts, which gives us,

$$\alpha_i - \beta_i = o(\max\{m_1^2, m_2\}\, h^{p-1})$$


Chapter 5

Understanding Adaptive Gradient Algorithms

5.1 Introduction

Many optimization questions arising in machine learning can be cast as a finite sum optimization problem of the form $\min_x f(x)$ where $f(x) = \frac{1}{k}\sum_{i=1}^{k} f_i(x)$. Most neural network problems also fall under a similar structure where each function $f_i$ is typically non-convex. A well-studied algorithm to solve such problems is Stochastic Gradient Descent (SGD), which uses updates of the form $x_{t+1} := x_t - \alpha\,\nabla f_{i_t}(x_t)$, where $\alpha$ is a step size and $f_{i_t}$ is a function chosen randomly from $\{f_1, f_2, \dots, f_k\}$ at time $t$.

Often in neural networks, "momentum" is added to the SGD update to yield a two-step update process given as $v_{t+1} = \mu v_t - \alpha\,\nabla \tilde f_{i_t}(x_t)$ followed by $x_{t+1} = x_t + v_{t+1}$. This algorithm is typically called the Heavy-Ball (HB) method (or sometimes classical momentum), with $\mu > 0$ called the momentum parameter (Polyak, 1987). In the context of neural nets, another variant of SGD that is popular is Nesterov's Accelerated Gradient (NAG), which can also be thought of as a momentum method (Sutskever et al., 2013), and has updates of the form $v_{t+1} = \mu v_t - \alpha\,\nabla\tilde f_{i_t}(x_t + \mu v_t)$ followed by $x_{t+1} = x_t + v_{t+1}$ (see Algorithm 5 for more details).

Momentum methods like HB and NAG have been shown to have superior convergence properties compared to gradient descent, both for convex and non-convex functions (Nesterov, 1983; Polyak, 1987; Zavriev and Kostyuk, 1993; Ochs, 2016; O'Neill and Wright, 2017; Jin, Netrapalli, and Jordan, 2017). To the best of our knowledge, when using a stochastic gradient oracle there is no clear theoretical justification yet known of the benefits of NAG and HB over regular SGD in general (Yuan, Ying, and Sayed, 2016; Kidambi et al., 2018; Wiegerinck, Komoda, and Heskes, 1994; Yang, Lin, and Li, 2016; Gadat, Panloup, Saadane, et al., 2018), unless considering specialized function classes (Loizou and Richtárik, 2017). But in practice, these momentum methods, and in particular NAG, have been repeatedly shown to have good convergence and generalization on a range of neural net problems (Sutskever et al., 2013; Lucas, Zemel, and Grosse, 2018; Kidambi et al., 2018).

The performance of NAG (as well as HB and SGD), however, is typically quite sensitive to the selection of its hyper-parameters: step size, momentum and batch size (Sutskever et al., 2013). Thus, "adaptive gradient" algorithms such as RMSProp (Algorithm 6) (Tieleman and Hinton, 2012) and ADAM (Algorithm 7) (Kingma and Ba, 2014) have become very popular for optimizing deep neural networks (Melis, Dyer, and Blunsom, 2017; Denkowski and Neubig, 2017; Gregor et al., 2015; Radford, Metz, and Chintala, 2015; Bahar et al., 2017). The reason for their widespread popularity seems to be that they are easier to tune than SGD, NAG or HB. Adaptive gradient methods use as their update direction a vector which is the image of a linear combination of all the gradients seen till now, under a linear transformation (often called the "diagonal pre-conditioner") constructed out of the history of the gradients. It is generally believed that this "pre-conditioning" makes these algorithms much less sensitive to the selection of their hyper-parameters. A precursor to RMSProp and ADAM was the AdaGrad algorithm (Duchi, Hazan, and Singer, 2011).

Despite their widespread use in the deep-learning community, till our work adaptive gradient methods like RMSProp and ADAM lacked any theoretical justification in the non-convex setting, even with exact/deterministic gradients (Bernstein et al., 2018). On the contrary, intriguing recent works like Wilson et al., 2017 and Keskar and Socher, 2017 have shown cases where SGD (no momentum) and HB (classical momentum) generalize much better than RMSProp and ADAM with stochastic gradients. In particular, Wilson et al., 2017 showed that ADAM generalizes poorly for large enough nets and that RMSProp generalizes better than ADAM on a couple of neural network tasks (most notably in the character-level language modeling task). But in general it is not clear, and to the best of our knowledge no heuristics are known for deciding, whether these insights about relative performances (generalization or training) between algorithms hold for other models or carry over to the full-batch setting.

Most notably, in Reddi, Kale, and Kumar, 2018 the authors showed that in the setting of online convex optimization there are certain sequences of convex functions where ADAM and RMSProp fail to converge to asymptotically zero average regret.


5.1.1 A summary of our contributions

In this work we shed light on the above described open questions about adaptive gradient methods in the following two ways.

• To the best of our knowledge, this work gives the first convergence guarantees for RMSProp and ADAM under any setting. Specifically, (a) in Section 5.3 we show stochastic gradient oracle conditions for which RMSProp can converge to approximate criticality for smooth non-convex objectives. Most interesting among these is the "interpolating" oracle condition, which we motivate and which we show helps stochastic RMSProp converge at gradient-descent speeds. (b) In Section 5.4 we show run-time bounds s.t. for certain regimes of hyper-parameters and classes of smooth non-convex functions, deterministic RMSProp and ADAM can reach approximate criticality.

• Our second contribution (in Section 5.6) is to undertake a detailed empirical investigation into adaptive gradient methods, targeted to probe the competitive advantages of RMSProp and ADAM. We compare the convergence and generalization properties of RMSProp and ADAM against NAG on (a) a variety of autoencoder experiments on MNIST data, in both full and mini-batch settings, and (b) an image classification task on CIFAR-10 using a VGG-9 convolutional neural network in the mini-batch setting.

In the full-batch setting, we demonstrate that ADAM with very high values of the momentum parameter ($\beta_1 = 0.99$) matches or outperforms carefully tuned NAG and RMSProp, in terms of getting lower training and test losses. We show that as the autoencoder size keeps increasing, RMSProp fails to generalize pretty soon. In the mini-batch experiments we see exactly the same behaviour for large enough nets.

We also demonstrate the enhancement in ADAM's ability to get lower population risk values and gradient norms when the $\xi$ parameter is increased. Thus we conclude that this is a crucial hyperparameter that was incidentally not tuned in studies like Wilson et al., 2017.

Remark. The counterexample to ADAM's convergence constructed in Theorem 3 of Reddi, Kale, and Kumar, 2018 is in the stochastic optimization framework and is incomparable to our result about deterministic ADAM. Thus our result establishes a key conceptual point: for adaptive gradient algorithms one cannot transfer intuitions about convergence from online setups to their more common use case in offline setups.


On the experimental side, we note that recently it has been shown by Lucas, Zemel, and Grosse, 2018 that there are problems where NAG generalizes better than ADAM even after tuning $\beta_1$ (see Algorithm 7). In contrast, our experiments reveal controlled setups where tuning ADAM's $\beta_1$ closer to 1 than usual practice helps close the generalization gap with NAG and HB which exists at standard values of $\beta_1$.

5.1.2 Comparison with concurrent proofs in literature

Much after this work was completed, we came to know of Li and Orabona, 2018 and Ward, Wu, and Bottou, 2019, which analyzed similar questions as us, though neither of them addresses RMSProp or ADAM. The latter of these two shows convergence on smooth non-convex objectives of a form of AdaGrad where adaptivity is limited to only rescaling the currently sampled stochastic gradient. In a similar setup, the former reference analyzes convergence rates of a modification of AdaGrad where the currently sampled stochastic gradient does not affect the pre-conditioner. We emphasize that this is a conceptually significant departure from the framework of the famously successful adaptive gradient algorithms, and experimentally this modification can be shown to hurt performance. After the initial version of our work, De, Mukherjee, and Ullah, 2018, was made public, a flurry of activity happened in this field towards trying to prove better convergence results for ADAM- and RMSProp-like algorithms (Chen et al., 2018; Zhou et al., 2018a; Zou et al., 2018b; Zaheer et al., 2018; Chen and Gu, 2018). Most recently, in Staib et al., 2019 a massive modification of RMSProp has been shown to have the ability to converge to approximate second-order critical points.

For the convergence proofs to work, the above papers have introduced one or more of the following modifications: (1) they introduce time-dependence in the adaptivity parameters $\beta_1$ (which controls the momentum adaptivity) and $\beta_2$ (which controls the historical contribution of the squared gradients), and (2) they introduce many extra steps (most notably in Staib et al., 2019 and Chen and Gu, 2018) beyond those in the standard software implementations of ADAM or RMSProp which are successful in the real world.

Unlike all the above results, we do not modify the structure of the standard implementations of RMSProp or ADAM, and we give first-of-their-kind characterizations of different conditions for their convergence.

117 Chapter 5. Understanding Adaptive Gradient Algorithms

5.2 Pseudocodes

Towards stating the pseudocodes used for NAG, RMSProp and ADAM in theory and experiments, we need the following definition of the square root of diagonal matrices,

Definition 17 (Square root of the Penrose inverse). If $v \in \mathbb{R}^d$ and $V = \mathrm{diag}(v)$ then we define $V^{-\frac{1}{2}} := \sum_{i \in \mathrm{Support}(v)} \frac{1}{\sqrt{v_i}}\, e_i e_i^\top$, where $\{e_i\}_{i=1,\dots,d}$ is the standard basis of $\mathbb{R}^d$.

Algorithm 5 Nesterov’s Accelerated Gradient (NAG)

1: Input : A step size α, momentum µ [0, 1), and an initial starting point x Rd, ∈ 1 ∈ and we are given query access to a (possibly noisy) oracle for gradients of

f : Rd R. → 2: function NAG(x1, α, µ)

3: Initialize : v1 = 0

4: for t = 1, 2, . . . do

5: v = µv + f (x ) t+1 t ∇ t 6: x = x α( f (x ) + µv ) t+1 t − ∇ t t+1 7: end for

8: end function
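A minimal Python rendering of Algorithm 5 (our own sketch, not the code used in the thesis experiments; the quadratic test function in the usage line is just for illustration):

```python
import numpy as np

def nag(grad, x1, alpha, mu, T=1000):
    """Nesterov's Accelerated Gradient as in Algorithm 5.
    grad : a (possibly noisy) gradient oracle for f."""
    x, v = x1.copy(), np.zeros_like(x1)
    for _ in range(T):
        v = mu * v + grad(x)                 # v_{t+1} = mu v_t + grad f(x_t)
        x = x - alpha * (grad(x) + mu * v)   # x_{t+1} = x_t - alpha (grad f(x_t) + mu v_{t+1})
    return x

# usage on f(x) = 0.5 ||x||^2, whose gradient is x
x_final = nag(lambda x: x, np.ones(5), alpha=0.1, mu=0.9)
```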

Algorithm 6 RMSProp

1: Input: A constant vector $\mathbb{R}^d \ni \xi\mathbb{1}_d \ge 0$, parameter $\beta_2 \in [0, 1)$, step size $\alpha$, initial starting point $x_1 \in \mathbb{R}^d$, and we are given query access to a (possibly noisy) oracle for gradients of $f : \mathbb{R}^d \to \mathbb{R}$.
2: function RMSProp($x_1, \beta_2, \alpha, \xi$)
3:   Initialize: $v_0 = 0$
4:   for $t = 1, 2, \dots$ do
5:     $g_t = \nabla f(x_t)$
6:     Define $g_t^2 \in \mathbb{R}^d$ s.t. $(g_t^2)_i = (g_{t,i})^2$, $\forall i \in \{1, \dots, d\}$
7:     $v_t = \beta_2 v_{t-1} + (1 - \beta_2)(g_t^2 + \xi\mathbb{1}_d)$
8:     $V_t = \mathrm{diag}(v_t)$
9:     $x_{t+1} = x_t - \alpha V_t^{-\frac{1}{2}} g_t$
10:  end for
11: end function
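To make Algorithm 6 concrete, here is a minimal NumPy sketch of these updates (again our own rendering); the safe handling of zero coordinates of $v$, mimicking the Penrose-inverse square root of Definition 17, is our implementation choice.

```python
import numpy as np

def rmsprop(grad, x1, beta2, alpha, xi, T=1000):
    """RMSProp as in Algorithm 6: the xi shift sits inside the
    second-moment estimate, and the pre-conditioner is V_t^{-1/2}."""
    x, v = x1.copy(), np.zeros_like(x1)
    for _ in range(T):
        g = grad(x)
        v = beta2 * v + (1 - beta2) * (g ** 2 + xi)   # v_t
        # V_t^{-1/2} g_t, acting only on the support of v (Definition 17)
        safe_v = np.where(v > 0, v, 1.0)
        x = x - alpha * np.where(v > 0, g / np.sqrt(safe_v), 0.0)
    return x

# usage on f(x) = 0.5 ||x||^2
x_final = rmsprop(lambda x: x, np.ones(5), beta2=0.999, alpha=0.01, xi=1e-8)
```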


Algorithm 7 ADAM

1: Input: A constant vector $\mathbb{R}^d \ni \xi\mathbb{1}_d > 0$, parameters $\beta_1, \beta_2 \in [0, 1)$, a sequence of step sizes $\{\alpha_t\}_{t=1,2,\dots}$, initial starting point $x_1 \in \mathbb{R}^d$, and we are given oracle access to the gradients of $f : \mathbb{R}^d \to \mathbb{R}$.
2: function ADAM($x_1, \beta_1, \beta_2, \alpha, \xi$)
3:   Initialize: $m_0 = 0$, $v_0 = 0$
4:   for $t = 1, 2, \dots$ do
5:     $g_t = \nabla f(x_t)$
6:     $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$
7:     Define $g_t^2 \in \mathbb{R}^d$ s.t. $(g_t^2)_i = (g_{t,i})^2$, $\forall i \in \{1, \dots, d\}$
8:     $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$
9:     $V_t = \mathrm{diag}(v_t)$
10:    $x_{t+1} = x_t - \alpha_t \left(V_t^{\frac{1}{2}} + \mathrm{diag}(\xi\mathbb{1}_d)\right)^{-1} m_t$
11:  end for
12: end function
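And a matching NumPy sketch of Algorithm 7 (our own rendering); note that, faithfully to the pseudocode above, no bias-correction terms are applied, and the $(V_t^{1/2} + \xi I)^{-1}$ pre-conditioner acts coordinatewise.

```python
import numpy as np

def adam(grad, x1, beta1, beta2, alphas, xi):
    """ADAM as in Algorithm 7 (no bias correction)."""
    x = x1.copy()
    m, v = np.zeros_like(x1), np.zeros_like(x1)
    for alpha_t in alphas:
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g        # first-moment estimate m_t
        v = beta2 * v + (1 - beta2) * g ** 2   # second-moment estimate v_t
        x = x - alpha_t * m / (np.sqrt(v) + xi)  # (V_t^{1/2} + xi I)^{-1} m_t
    return x

# usage: constant step size on f(x) = 0.5 ||x||^2
x_final = adam(lambda x: x, np.ones(5), beta1=0.9, beta2=0.999,
               alphas=[0.01] * 1000, xi=1e-8)
```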

5.3 Sufficient conditions for convergence to criticality for stochastic RMSProp

Previously it has been shown in Rangamani et al., 2017 that mini-batch RMSProp can off-the-shelf do autoencoding on depth-2 autoencoders trained on MNIST data, while similar results using non-adaptive gradient descent methods require much tuning of the step-size schedule. Here we give the first results about convergence to criticality for stochastic RMSProp. Towards that we need the following definitions,

Definition 18 ($L$-smoothness). If $f : \mathbb{R}^d \to \mathbb{R}$ is at least once differentiable, then we call it $L$-smooth for some $L > 0$ if for all $x, y \in \mathbb{R}^d$ the following inequality holds: $f(y) \le f(x) + \langle\nabla f(x), y - x\rangle + \frac{L}{2}\|y - x\|^2$.

Definition 19 ($(\xi, c, f)$-Interpolating Oracle). For some $\xi > 0$ and $c > 0$ and an at least once differentiable objective function $f : \mathbb{R}^d \to \mathbb{R}$, a $(\xi, c, f)$-Interpolating Oracle, when queried at $x_t \in \mathbb{R}^d$, replies with a vector $g_t \in \mathbb{R}^d$ s.t. it satisfies the following inequality,

$$\mathbb{E}\left[\left(\|g_t\| + \frac{\sqrt{d\xi}}{2}\right)^2\right] \le \sqrt{c}\left(\|\nabla f(x_t)\| - \frac{\sqrt{d\xi}}{2}\right)^2$$

Remark. Seeing the stochastic algorithm as a stochastic process $\{x_1, \dots\}$, in the proof we will need the above inequality to hold only for the conditional expectation of $\left(\|g_t\| + \frac{\sqrt{d\xi}}{2}\right)^2$ w.r.t. the sigma-algebra generated by $\{x_1, \dots, x_t\}$.

Intuition for the above oracle condition. In a typical use-case of ADAM or RMSProp, $g_t$ is an unbiased estimate of the gradient of the empirical loss $f : \mathbb{R}^d \to \mathbb{R}$ given as $f = \frac{1}{k}\sum_{i=1}^{k} f_i$, where $f_i$ is the loss function evaluated on the $i$th data point. If one were training, say, neural nets, then the "$d$" above would be the number of trainable parameters of the net, which is typically in tens of millions. When queried at parameter value $x_t \in \mathbb{R}^d$, a standard instantiation of the oracle is that it returns $g_t = \nabla f_i(x_t)$ after sampling $f_i$ uniformly at random from $\{f_j\}_{j=1}^{k}$. Suppose $x_c$ is a parameter value s.t. it is critical for all the $\{f_j\}_{j=1}^{k}$; then $x_c$ is also a critical point of $f$. If the class of functions is large enough (like those corresponding to deep nets used in practice) that for some parameter values it can interpolate the training data, then for loss functions lower-bounded by 0 such candidate $x_c$'s are these interpolating parameter values.

By continuity of the $f_i$'s, the above oracle, when queried in a neighbourhood of the interpolating $x$'s, returns a vector of infinitesimal norm, and in those neighbourhoods the true gradient is also infinitesimal. Thus we can see that the oracle condition proposed in Definition 19 is a way to abstractly capture this phenomenon.

Now we can demonstrate the power of this definition by proving the following theorem, which leverages this condition to give the first proof of convergence of stochastic RMSProp.

Theorem 5.3.1 (Fast Stochastic RMSProp with an "Interpolating Oracle"; proof in Section 5.7.1). Suppose $f : \mathbb{R}^d \to \mathbb{R}$ is $L$-smooth and there exists $\sigma > 0$ s.t. $\|\nabla_i f(x)\| \le \sigma$ for all $x \in \mathbb{R}^d$ and $i \in \{1, \dots, d\}$. Now suppose that we run the RMSProp algorithm as defined in Algorithm 6 (with query access to a conditionally unbiased estimator of the gradient of $f$), with the oracle additionally satisfying the $(\xi, c, f)$-overparameterization condition given in Definition 19 s.t. $\sigma^3 < \left(\frac{\xi}{2c}\right)^2$, and with $\beta_2$ chosen so that $\frac{c\sigma^{1.5}}{\xi} < \sqrt{\beta_2(1 - \beta_2)}$.¹ Then there exists a choice of constant step-size $\alpha$ for the algorithm such that for $T = O\left(\frac{1}{\epsilon^2}\right)$ steps we have,

$$\mathbb{E}\left[\min_{i=1,\dots,T}\|\nabla f(x_i)\|^2\right] \le O(\epsilon^2)$$

¹ Since the constants $c$, $\sigma$ and $\xi$ are constrained s.t. $\frac{c\sigma^{1.5}}{\xi} < \frac{1}{2}$, it follows that a choice of $\beta_2$ as required always exists.

Note that here we see a stochastic algorithm being able to converge at the same fast speed as is characteristic of SGD on convex functions. This result can be contrasted with Corollary 3 in Zaheer et al., 2018, where similar speeds were motivated for RMSProp with mini-batch sizes being unrealistically large, i.e., as big as the number of steps required to converge. In our above theorem such convergence is seen to arise as a more general phenomenon, because of a certain control on the expected value of the norm of the gradient oracle's reply. Also note that it is not necessary that the "$\xi$" parameter in the oracle's definition exactly match the "$\xi$" parameter of RMSProp. We note the following corollary, which follows from the above theorem and can be proven similarly,

Remark. (a) Note that here we see a stochastic algorithm being able to converge at the same fast speed as is characteristic of SGD on convex functions, i.e., in $T$ iterations the expected value of the minimum gradient norm found is $O\left(\frac{1}{\sqrt T}\right)$. (b) Long after this work was completed, we became aware of works like Vaswani, Bach, and Schmidt, 2018, where oracle conditions of a similar kind as above were introduced to show enhanced convergence speeds of much simpler algorithms like SGD.

Corollary 5.3.2. The fast convergence of RMSProp as claimed in Theorem 5.3.1 above also holds if the oracle satisfies the overparameterization condition as given in Definition 19 with a dynamic $\xi$, say $\xi_x$, s.t. $\frac{\xi_x - \frac{\xi}{2}}{\sqrt{\xi_x}} \le \sqrt{\frac{c}{d}}\,\|\nabla f(x)\|$ (where we recall that $\xi$ is the hyperparameter in Algorithm 6).

Now we demonstrate yet another situation in which stochastic RMSProp can be shown to converge, and this time we directly put constraints on the training data to get the convergence, instead of using oracle conditions as above. Towards this we need the following definition,

Definition 20 (The sign function). We define the function $\mathrm{sign} : \mathbb{R}^d \to \{-1, 1\}^d$ s.t. it maps $v \mapsto \left(1 \text{ if } v_i \ge 0 \text{ else } -1\right)_{i=1,\dots,d}$.


Theorem 5.3.3 (Standard-speed stochastic RMSProp with a sign-constrained oracle; proof in Appendix 5.A). Let $f : \mathbb{R}^d \to \mathbb{R}$ be $L$-smooth and be of the form $f = \frac{1}{k}\sum_{p=1}^{k} f_p$ s.t. (a) each $f_i$ is at least once differentiable, (b) the gradients are s.t. $\forall x \in \mathbb{R}^d$, $\forall p, q \in \{1, \dots, k\}$, $\mathrm{sign}(\nabla f_p(x)) = \mathrm{sign}(\nabla f_q(x))$, (c) $\sigma_f < \infty$ is an upper bound on the norm of the gradients of the $f_i$, and (d) $f$ has a minimizer, i.e., there exists $x_*$ such that $f(x_*) = \min_{x \in \mathbb{R}^d} f(x)$. Let the gradient oracle be s.t. when invoked at some $x_t \in \mathbb{R}^d$ it uniformly at random picks $i_t \sim \{1, 2, .., k\}$ and returns $\nabla f_{i_t}(x_t) = g_t$. Then corresponding to any $\epsilon, \xi > 0$ and a starting point $x_1$ for Algorithm 6, we can define
$$T \le \frac{2L\sigma_f^2(\sigma_f^2 + \xi)(f(x_1) - f(x_*))}{\epsilon^4(1 - \beta_2)\xi}$$
s.t. we are guaranteed that the iterates of Algorithm 6, using a constant step-length of $\alpha = \sqrt{\frac{2\xi(1 - \beta_2)(f(x_1) - f(x_*))}{T\sigma_f^2 L}}$, will find an $\epsilon$-critical point in at most $T$ steps, in the sense that $\min_{t=1,2,\dots,T}\mathbb{E}[\|\nabla f(x_t)\|^2] \le \epsilon^2$.

Remark. We note that the theorem above continues to hold even if the constraint (b) that we have

about the signs of the gradients of the f only holds on the points in Rd that the stochastic { p}p=1,...,k RMSProp visits and its not necessary for the constraint to be true everywhere in the domain. Further

we can say in otherwords that this constraint ensures all the options for the gradient that this stochas-

tic oracle has at any point, to lie in the same orthant of Rd though this orthant itself can change from

one iterate of the next. (And the assumption also ensures that if for some coordinate i, f = 0 then ∇i for all p 1, . . . , k , f = 0) ∈ { } ∇i p

5.4 Sufficient conditions for convergence to criticality for non-convex

deterministic adaptive gradient algorithms

We note that there are important motivations to study the behavior of neural net training algorithms

in the deterministic setting because of use cases where the amount of noise is controlled during opti-

mization, either by using larger batches (Martens and Grosse, 2015; De et al., 2017; Babanezhad et al.,

2015) or by employing variance-reducing techniques (Johnson and Zhang, 2013; Defazio, Bach, and

Lacoste-Julien, 2014). Inspired by these we also investigate the full-batch RMSProp and ADAM in

our controlled autoencoder experiments in Section 5.6.3. Towards that we will now demonstrate that

such oracle conditions as in the previous section are not necessary to guarantee convergence of the

deterministic RMSProp.

Theorem 5.4.1 (Convergence of deterministic RMSProp - the version with standard speeds (Proof

in Appendix 5.B)). Let f : Rd R be L smooth and let σ < ∞ be an upperbound on the norm of the → −

122 5.4. Sufficient conditions for convergence to criticality for non-convex deterministic adaptive gradient algorithms

gradient of f . Assume also that f has a minimizer, i.e., there exists x such that f (x ) = minx Rd f (x). ∗ ∗ ∈ Then the following holds for Algorithm6 with a deterministic gradient oracle:

(1 β2)ξ For any ϵ, ξ > 0, using a constant step length of αt = α = − for t = 1, 2, ..., guarantees that L√σ2+ξ 2 1 2L(σ +ξ)( f (x1) f (x )) f (x ) t − ∗ x t ϵ for some ϵ2 (1 β )ξ , where 1 is the first iterate of the algorithm. ∥∇ ∥ ≤ ≤ × − 2

One might wonder if the ξ parameter introduced in all the algorithms above is necessary to get convergence guarantees for RMSProp. Towards that in the following theorem we show convergence of another variant of deterministic RMSProp which does not use the ξ parameter and instead uses other assumptions on the objective function and step size modulation. But these tweaks to eliminate the need of ξ come at the cost of the convergence rates getting weaker.

Theorem 5.4.2 (Convergence of deterministic RMSProp - the version with no ξ shift (Proof in

Appendix 5.C)). Let f : Rd R be L smooth and let σ < ∞ be an upperbound on the norm of the → − gradient of f . Assume also that f has a minimizer, i.e., there exists x such that f (x ) = minx Rd f (x), ∗ ∗ ∈ and the function f be bounded from above and below by constants B and B as B f (x) B for ℓ u l ≤ ≤ u all x Rd. Then for any ϵ > 0, T = O( 1 ) s.t. the Algorithm6 with a deterministic gradient oracle ∈ ∃ ϵ4 and ξ = 0 is guaranteed to reach a t-th iterate s.t. 1 t T and f (x ) ϵ. ≤ ≤ ∥∇ t ∥ ≤

Next we analyze deterministic ADAM albeit in the small β1 regime. We note that a small β1 does not cut-off contributions to the update direction from gradients in the arbitrarily far past (which are typically significantly large), and neither does it affect the non-triviality of the pre-conditioner which does not depend on β1 at all.

Theorem 5.4.3. Deterministic ADAM converges to criticality (Proof in subsection 5.7.2) Let f :

Rd R be L smooth and let σ < ∞ be an upperbound on the norm of the gradient of f . Assume → − also that f has a minimizer, i.e., there exists x such that f (x ) = minx Rd f (x). Then the following ∗ ∗ ∈ holds for Algorithm7:

2 ϵ σ β1 For any ϵ > 0, β1 < ϵ+σ and ξ > β σ+ϵ(1 β ) , there exist step sizes αt > 0, t = 1, 2, . . . and a natural − 1 − 1 number T (depending on β , ξ) such that f (x ) ϵ for some t T. 1 ∥∇ t ∥ ≤ ≤

g 2 In particular if one sets β = ϵ , ξ = 2σ, and α = ∥ t∥ 4ϵ where g is the gradient of the 1 ϵ+2σ t L(1 βt )2 3(ϵ+2σ)2 t − 1 th 9Lσ2 objective at the t iterate, then T can be taken to be 6 [ f (x2) f (x )], where x2 is the second iterate ϵ − ∗ of the algorithm.

123 Chapter 5. Understanding Adaptive Gradient Algorithms

In other words in T iterates the lowest norm of the gradient encountered by deterministic/“full-

(︂ 1 )︂ batch” ADAM for smooth non-convex objectives falls at least as fast as O 1 T 6

Our motivations towards the above theorem were primarily rooted in trying to understand the sit-

uations where ADAM as an offline optimizer can converge at all (given the negative results about

ADAM in the online setting as in Reddi, Kale, and Kumar, 2018). But we point out that it remains

open to tighten the analysis of deterministic ADAM and obtain faster rates than what we have shown

in the theorem above and also to be able to characterize conditions when stochastic ADAM can con-

verge.

Remark. It is often believed that ADAM gains over RMSProp because of its so-called “bias correction

term” which refers to the step length of ADAM having an iteration dependence of the following √︂ form, 1 βt /(1 βt ). As a key success of the above theorem, we note that the 1/(1 βt ) term of − 2 − 1 − 1 this “bias correction term” naturally comes out from theory!

5.5 The Experimental setup

For testing the empirical performance of ADAM and RMSProp, we perform experiments on fully connected autoencoders using ReLU activations and shared weights and on CIFAR-10 using VGG-9, a convolutional neural network. The experiment on VGG-9 has been described in subsection 5.6.5.

To the best of our knowledge there have been very few comparisons of ADAM and RMSProp with other methods on a regression setting and that is one of the main gaps in the literature that we aim to fix by our study here. In a way this also builds on our previous work (Rangamani etal., 2017) (Chap- ter4) where we had undertaken a theoretical analysis of autoencoders and in their experiments and had found RMSProp to have good reconstruction error for MNIST when used on even just 2 layer

ReLU autoencoders.

To keep our experiments as controlled as possible, we make all layers in a network have the same width (which we denote as h). Thus, we fix the dimensions of the weight matrices of thedepth

2ℓ 1, Rd Rd autoencoders (as defined in Chapter1) as : W Rh d, W Rh h, i = 2, . . . , ℓ. This − → 1 ∈ × i ∈ × allowed us to study the effect of increasing depth ℓ or width h without having to deal with added

confounding factors. For all experiments, we use the standard “Glorot initialization” for the weights

(Glorot and Bengio, 2010), where each element in the weight matrix is initialized by sampling from a

124 5.5. The Experimental setup

√︁ uniform distribution with [ limit, limit], limit = 6/(fan + fan ), where fan denotes the num- − in out in ber of input units in the weight matrix, and fanout denotes the number of output units in the weight matrix. All bias vectors were initialized to zero. No regularization was used.

We performed autoencoder experiments on the MNIST dataset for various network sizes (i.e., dif- ferent values of ℓ and h). We implemented all experiments using TensorFlow (Abadi et al., 2016) using an NVIDIA GeForce GTX 1080 Ti graphics card. We compared the performance of ADAM and

RMSProp with Nesterov’s Accelerated Gradient (NAG). All experiments were run for 105 iterations.

We tune over the hyper-parameters for each optimization algorithm using a grid search as described in Appendix 5.D.

To pick the best set of hyper-parameters, we choose the ones corresponding to the lowest loss on the training set at the end of 105 iterations. Further, to cut down on the computation time so that we can test a number of different neural net architectures, we crop the MNIST image from 28 28 down × to a 22 22 image by removing 3 pixels from each side (almost all of which is whitespace). ×

Full-batch experiments We are interested in first comparing these algorithms in the full-batch set- ting. To do this in a computationally feasible way, we consider a subset of the MNIST dataset (we call this: mini-MNIST), which we build by extracting the first 5500 images in the training set and first

1000 images in the test set in MNIST. Thus, the training and testing datasets in mini-MNIST is 10% of the size of the MNIST dataset. Thus the training set in mini-MNIST contains 5500 images, while the test set contains 1000 images. This subset of the dataset is a fairly reasonable approximation of the full MNIST dataset (i.e., contains roughly the same distribution of labels as in the full MNIST dataset), and thus a legitimate dataset to optimize on.

Mini-batch experiments To test if our conclusions on the full-batch case extend to the mini-batch case, we then perform the same experiments in a mini-batch setup where we fix the mini-batch size at 100. For the mini-batch experiment, we consider the full training set of MNIST, instead of the mini-MNIST dataset considered for the full-batch experiments and we also test on CIFAR-10 using

VGG-9, a convolutional neural network.

125 Chapter 5. Understanding Adaptive Gradient Algorithms

5.6 Experimental Results

5.6.1 RMSProp and ADAM are sensitive to choice of ξ

The ξ parameter is a feature of the default implementations of RMSProp and ADAM such as in

TensorFlow. Most interestingly this strictly positive parameter is crucial for our proofs. In this section we present experimental evidence that attempts to clarify that this isn’t merely a theoretical artefact but its value indeed has visible effect on the behaviours of these algorithms. We see in Figure 5.6.1 that on increasing the value of this fixed shift parameter ξ, ADAM in particular, is strongly helped towards getting lower gradient norms and lower test losses though it can hurt its ability to get lower training losses. The plots are shown for optimally tuned values for the other hyper-parameters.

FIGURE 5.6.1: Optimally tuned parameters for different ξ values. 1 hidden layer net- work of 1000 nodes; Left: Loss on training set; Middle: Loss on test set; Right: Gradient norm on training set

5.6.2 Tracking λmin(Hessian) of the loss function

To check whether NAG, ADAM or RMSProp is capable of consistently moving from a “bad” sad- dle point to a “good” saddle point region, we track the most negative eigenvalue of the Hessian

5 λmin(Hessian). Even for a very small neural network with around 10 parameters, it is still intractable to store the full Hessian matrix in memory to compute the eigenvalues. Instead, we use the Scipy li- brary function scipy.sparse.linalg.eigsh that can use a function that computes the matrix-vector products to compute the eigenvalues of the matrix (Lehoucq, Sorensen, and Yang, 1998). Thus, for finding the eigenvalues of the Hessian, it is sufficient to be able to do Hessian-vector products.This can be done exactly in a fairly efficient way (Townsend, 2008).

We display a representative plot in Figure 5.6.2 which shows that NAG in particular has a distinct ability to gradually, but consistently, keep increasing the minimum eigenvalue of the Hessian while continuing to decrease the gradient norm. However unlike as in deeper autoencoders in this case the gradient norms are consistently bigger for NAG, compared to RMSProp and ADAM. In contrast,

RSMProp and ADAM quickly get to a high value of the minimum eigenvalue and a small gradient

126 5.6. Experimental Results norm, but somewhat stagnate there. In short, the trend looks better for NAG, but in actual numbers

RMSProp and ADAM do better.

FIGURE 5.6.2: Tracking the smallest eigenvalue of the Hessian on a 1 hidden layer network of size 300. Left: Minimum Hessian eigenvalue. Right: Gradient norm on training set.

5.6.3 Comparing performance in the full-batch setting

In Figure 5.6.3, we show how the training loss, test loss and gradient norms vary through the itera- tions for RMSProp, ADAM (at β1 = 0.9 and 0.99) and NAG (at µ = 0.9 and 0.99) on a 3 hidden layer autoencoder with 1000 nodes in each hidden layer trained on mini-MNIST. Appendix 5.F.1 and 5.F.2 have more such comparisons for various neural net architectures with varying depth and width and input image sizes, where the following qualitative results also extend.

Conclusions from the full-batch experiments of training autoencoders on mini-MNIST

Pushing β closer to 1 significantly helps ADAM in getting lower training and test losses and • 1 at these values of β1, it has better performance on these metrics than all the other algorithms.

One sees cases like the one displayed in Figure 5.6.3 where ADAM at β1 = 0.9 was getting

comparable or slightly worse test and training errors than NAG. But once β1 gets closer to 1, ADAM’s performance sharply improves and gets better than other algorithms.

FIGURE 5.6.3: Full-batch experiments on a 3 hidden layer network with 1000 nodes in each layer; Left: Loss on training set; Middle: Loss on test set; Right: Gradient norm on training set

127 Chapter 5. Understanding Adaptive Gradient Algorithms

FIGURE 5.6.4: Mini-batch experiments on a network with 5 hidden layers of 1000 nodes each; Left: Loss on training set; Middle: Loss on test set; Right: Gradient norm on train- ing set

Increasing momentum helps NAG get lower gradient norms though on larger nets it might hurt • its training or test performance. NAG does seem to get the lowest gradient norms compared to

the other algorithms, except for single hidden layer networks like in Figure 5.6.2.

5.6.4 Corroborating the full-batch behaviors in the mini-batch setting

In Figure 5.6.4, we show how training loss, test loss and gradient norms vary when using mini- batches of size 100, on a 5 hidden layer autoencoder with 1000 nodes in each hidden layer trained on the full MNIST dataset. The same phenomenon as here has been demonstrated in more such mini- batch comparisons on autoencoder architectures with varying depths and widths in Appendix 5.F.3 and on VGG-9 with CIFAR-10 in the next subsection 5.6.5.

Conclusions from the mini-batch experiments of training autoencoders on the full MNIST dataset:

Mini-batching does seem to help NAG do better than ADAM on small nets. However, for • larger nets, the full-batch behavior continues, i.e., when ADAM’s momentum parameter β1 is pushed closer to 1, it gets better generalization (significantly lower test losses) than NAG atany

momentum tested.

In general, for all metrics (test loss, training loss and gradient norm reduction) both ADAM as • well as NAG seem to improve in performance when their momentum parameter (µ for NAG

and β1 for ADAM) is pushed closer to 1. This effect, which was present in the full-batch setting, seems to get more pronounced here.

As in the full-batch experiments, NAG continues to have the best ability to reduce gradient • norms while for larger enough nets, ADAM at large momentum continues to have the best

training error.

128 5.6. Experimental Results

(A) Training loss (B) Test set accuracy

FIGURE 5.6.5: Mini-batch image classification experiments with CIFAR-10 using VGG- 9

5.6.5 Image Classification on Convolutional Neural Nets

To test whether these results might qualitatively hold for other datasets and models, we train an im- age classifier on CIFAR-10 (containing 10 classes) using VGG-like convolutional neural networks

(Simonyan and Zisserman, 2014). In particular, we train VGG-9 on CIFAR-10, which contains 7 con- volutional layers and 2 fully connected layers, a total of 9 layers. The convolutional layers contain

64, 64, 128, 128, 256, 256, 256 filters each of size 3 3, respectively. We use batch normalization (Ioffe × and Szegedy, 2015) and ReLU activations after each convolutional layer, and the first fully connected layer. Table 5.6.1 contains more details of the VGG-9 architecture. We use minibatches of size 100,

5 and weight decay of 10− . We use fixed step sizes, and all hyperparameters were tuned as indicated in Section 5.D.

We present results in Figure 5.6.5. As before, we see that this task is another example where tun- ing the momentum parameter (β1) of ADAM helps. While attaining approximately the same loss value, ADAM with β1 = 0.99 generalizes as good as NAG and better than when β1 = 0.9. Thus tuning β1 of ADAM helped in closing the generalization gap with NAG.

TABLE 5.6.1: VGG-9 on CIFAR-10.

layer type kernel size input size output size Conv 1 3 3 3 32 32 64 32 32 × × × × × Conv 2 3 3 64 32 32 64 32 32 × × × × × Max Pooling 2 2 64 32 32 64 16 16 × × × × × Conv 3 3 3 64 16 16 128 16 16 × × × × × Conv 4 3 3 128 16 16 128 16 16 × × × × × Max Pooling 2 2 128 16 16 128 8 8 × × × × × Conv 5 3 3 128 8 8 256 8 8 × × × × × Conv 6 3 3 256 8 8 256 8 8 × × × × × Conv 7 3 3 256 8 8 256 8 8 × × × × × Max Pooling 2 2 256 8 8 256 4 4 × × × × × Linear 1 1 1 4096 1 256 × × × Linear 1 1 1 256 1 10 × × ×

129 Chapter 5. Understanding Adaptive Gradient Algorithms

5.7 Proofs of convergence of (stochastic) RMSProp and ADAM

5.7.1 Fast convergence of stochastic RMSProp with “Over Parameterization” (Proof of Theorem 5.3.1)

Proof. By L smoothness of the objective we have the following relationship between the values at − consecutive updates,

L f (x ) f (x ) + f (x ), x x + x x 2 (5.1) t+1 ≤ t ⟨∇ t t+1 − t⟩ 2 ∥ t+1 − t∥ d d L 2 f (xt) + ∑ i f (xt)(xt+1 xt)i + ∑(xt+1 xt)i (5.2) ≤ i=1 ∇ − 2 i=1 −

th αtgt,i In the last step above we substitute the update rule for the i coordinate of xt as, xt+1,i = xt,i − − √vt,i to get,

d 2 d g2 gt,i Lαt t,i f (xt+1) f (xt) αt ∑ i f (xt) + ∑ 2 ≤ − i=1 ∇ √vt,i 2 i=1 (√vt,i) d (︄ )︄ gt,i gt,i gt,i f (xt) αt ∑ i f (xt) √︁ + √︁ ≤ − i=1 ∇ √vt,i − β2vt 1,i β2vt 1,i − − Lα2 d g2 + t ∑ t,i 2 i=1 vt,i

Now recall that E[g x ] = f (x ). We substitute this in the above to get, t | { i}i=1,...,t ∇ t

130 5.7. Proofs of convergence of (stochastic) RMSProp and ADAM

E[ f (x ) x ] t+1 | { i}i=1,...,t d [︄ ]︄ (︂ i f (xt) gt,i gt,i )︂ f (xt) αt ∑ i f (xt) √︁∇ + E √︁ xi i=1,...,t (5.3) ≤ − i=1 ∇ β2vt 1,i √vt,i − β2vt 1,i | { } [︄ ]︄ − − 2 d g2 Lαt t,i + ∑ E xi i=1,...,t 2 i=1 vt,i | { } d 2 ( i f (xt)) f (xt) αt ∑ √︁∇ ≤ − i=1 β2vt 1,i ⃓ [︄ − ]︄⃓ d ⃓ g g ⃓ ⃓ t,i t,i ⃓ + αt ∑ i f (xt) ⃓E √︁ xi i=1,...,t ⃓ i=1 |∇ | ⃓ √vt,i − β2vt 1,i | { } ⃓ [︄ ]︄ − 2 d g2 Lαt t,i + ∑ E xi i=1,...,t (5.4) 2 i=1 vt,i | { }

Now observe that,

⃓ ⃓ g g ⃓ ⃓ t,i t,i ⃓ 1 1 ⃓ √︁ gt,i ⃓ √︁ ⃓ √vt,i − β2vt 1,i ≤ | | ⃓ √vt,i − β2vt 1,i ⃓ − ⃓ ⃓ − ⃓ g ⃓ √︂ ⃓ t,i ⃓ ⃓ √︁ ⃓ β2vt 1,i √vt,i ≤ ⃓ √vt,i β2vt 1,i ⃓ | − − | ⃓ − ⃓ ⃓ ⃓ ⃓ g ⃓ ⃓ ⃓ ⃓ t,i ⃓ ⃓ β2vt 1,i vt,i ⃓ ⃓ √︁ ⃓ ⃓ √︁ − − ⃓ ≤ ⃓ √vt,i β2vt 1,i ⃓ ⃓ β2vt 1,i + √vt,i ⃓ − −

2 From the algorithm we have, vt,i = β2vt 1,i + (1 β2)(gt i + ξ). We substitute this into the numerator − − , and the denominator of the second factor of the RHS above to get,

gt,i gt,i √︁ √vt,i − β2vt 1,i − ⃓ ⃓ ⃓ ⃓ ⃓ g ⃓ ⃓ (1 β )(g2 + ξ) ⃓ ⃓ t,i ⃓ ⃓ 2 t,i ⃓ √︁ ⃓ √︂− ⃓ ≤ ⃓ v β v ⃓ ⃓ √︁ 2 ⃓ ⃓ √ t,i 2 t 1,i ⃓ ⃓ β2vt 1,i + β2vt 1,i + (1 β2)(g + ξ) ⃓ − − − − t,i ⃓ ⃓ ⃓ √︂ ⃓ ⃓ ⃓ 2 2 ⃓ g ⃓ ⃓ (1 β )(g + ξ) ⃓ ⃓ gt i g + ξ ⃓ ⃓ t,i ⃓ ⃓ 2 t,i ⃓ √︁ ⃓ , t,i ⃓ ⃓ √︁ ⃓ ⃓ √︂ − ⃓ = 1 β2 ⃓ √︁ ⃓ ≤ ⃓ √vt,i β2vt 1,i ⃓ ⃓ (1 β )(g2 + ξ) ⃓ − ⃓ √vt,i β2vt 1,i ⃓ − ⃓ − 2 t,i ⃓ ⃓ − ⃓

131 Chapter 5. Understanding Adaptive Gradient Algorithms

Now we substitute the above into equation 5.3 to get,

E[ f (x ) x ] t+1 | { i}i=1,...,t d 2 ( i f (xt)) f (xt) αt ∑ √︁∇ (5.5) ≤ − i=1 β2vt 1,i − ⎡⃓ √︂ ⃓ ⎤ d ⃓ g g2 + ξ ⃓ √︁ ⃓ t,i t,i ⃓ + αt 1 β2 ∑ i f (xt) E ⎣⃓ √︁ ⃓ xi i=1,...,t⎦ − i=1 |∇ | ⃓ √vt,i β2vt 1,i ⃓ | { } ⃓ − ⃓ [︄ ]︄ 2 d g2 Lαt t,i + ∑ E xi i=1,...,t (5.6) 2 i=1 vt,i | { }

Now by definition we have, vt,i β2vt 1,i and the definition of σ we infer from the above, ≥ −

d 2 ( i f (xt)) E[ f (xt+1) xi i=1,...,t] f (xt) αt ∑ √︁∇ | { } ≤ − i=1 β2vt 1,i − ⎡⃓ √︂ ⃓ ⎤ d ⃓ g g2 + ξ ⃓ √︁ ⃓ t,i t,i ⃓ + σαt 1 β2 ∑ E ⎣⃓ ⃓ xi i=1,...,t⎦ (5.7) − ⃓ β2vt 1,i ⃓ | { } i=1 ⃓ − ⃓ [︄ ]︄ 2 d g2 Lαt t,i + ∑ E xi i=1,...,t (5.8) 2 β2vt 1,i | { } i=1 −

t t k 2 t We have, vt = (1 β ) ∑ β − (g + ξ) This implies, v (1 β )ξ (1 β )ξ. The last − 2 k=1 2 k t,i ≥ − 2 ≥ − 2 inequality follows because we have, β (0, 1) and t 1 Substituting this in the above (along with 2 ∈ ≥ the fact that vt,i > 0 ) we get,

132 5.7. Proofs of convergence of (stochastic) RMSProp and ADAM

E[ f (x ) x ] t+1 | { j}j=2,...,t [︂√︂ ]︂ 4 2 d 2 E g + ξ g x = (︂ ( i f (xt)) √︁ t,i | t,i| | { j}j 2,...,t f (xt) + ∑ αt √︁∇ + σαt 1 β2 ≤ i=1 − β2vt 1,i − β2vt 1,i [︂ ]︂ − − E 2 Lα2 gt,i xj j=2,...,t )︂ + t | { } 2 β2vt 1,i − 2 f (xt) f (xt) αt ∥∇√︁ ∥ ≤ − β2σ [︂√︂ ]︂ [︂ ]︂ d E g4 + ξ g 2 x 2 E g2 x (︂ √︁ t,i t,i j j=2,...,t Lα t,i j j=2,...,t )︂ + σα 1 β | | | { } + t | { } ∑ t − 2 ξβ (1 β ) 2 ξβ (1 β ) i=1 2 − 2 2 − 2 2 f (xt) f (xt) αt ∥∇√︁ ∥ ≤ − β2σ [︂√︂ ]︂ d 2 E g4 + ξ g 2 x (︂ √︁ Lα )︂ t,i t,i j j=2,...,t + σα 1 β + t | | | { } (5.9) ∑ t − 2 2 ξβ (1 β ) i=1 2 − 2

Now we note that,

[︄ ]︄ d √︂ E 4 2 ∑ gt,i + ξ gt,i xj j=2,...,t i=1 | | | { } [︄ d ]︄ 2 √︁ [︂ 2 √︁ ]︂ E ∑(gt,i + ξ gt,i ) xj j=2,...,t E gt + dξ gt,i ) xj j=2,...,t ≤ i=1 | | | { } ≤ ∥ ∥ ∥ ∥ | { } [︄ √︁ ]︄ (︂ dξ )︂2 dξ E g + x ≤ ∥ t∥ 2 − 4 | { j}j=2,...,t

Now we invoke the the property of the oracle given in definition 19 to say that,

[︄ ]︄ d √︂ E 4 2 ∑ gt,i + ξ gt,i xj j=2,...,t i=1 | | | { } [︄ √︁ ]︄ (︂ dξ )︂2 dξ E √c f (x ) x ≤ ∥∇ t ∥ − 2 − 4 | { j}j=2,...,t

c f (x ) 2 ≤ ∥∇ t ∥

Thus substituting the above back into equation 5.9 we get,

133 Chapter 5. Understanding Adaptive Gradient Algorithms

E[ f (x ) x ] f (x ) t+1 | { j}j=2,...,t ≤ t (︂ 2 )︂ √︁ Lαt d σαt 1 β2 + {︂ αt − 2 }︂ 2 + + c i f (xt) (5.10) ∑ − √︁β σ ξβ (1 β ) ∥∇ ∥ i=1 2 2 − 2

(︂ )︂ ξβ2(1 β2) 1 cσ Further we make the optimal choice of αt = − (which is positive by cL √β σ ξβ √1 β 2 − 2 − 2 assumptions) and we get,

√︃ 2 1 (︂ cσ ξ(1 β2) )︂ 2 E[ f (xt+1) xj j=2,...,t] f (xt) √︁ − f (xt) | { } ≤ − 2cL β2ξ − σ ∥∇ ∥

Taking expectation and rearranging we get,

T 2 1 2 f (x1) f E[ min f (xt) ] E[ f (xt) ] − ∗ t= T ∑ (︂ √︂ )︂2 1,..., ∥∇ ∥ ≤ T t=1 ∥∇ ∥ ≤ T cσ ξ(1 β2) 2cL −σ √β2ξ −

From here the result follows.

5.7.2 Proving ADAM (Proof of Theorem 5.4.3)

Proof. Let us assume to the contrary that g > ϵ for all t = 1, 2, 3. . . .. We will show that this ∥ t∥ assumption will lead to a contradiction. By L smoothness of the objective we have the following − relationship between the values at consecutive updates,

L f (x ) f (x ) + f (x ), x x + x x 2 t+1 ≤ t ⟨∇ t t+1 − t⟩ 2 ∥ t+1 − t∥

Substituting the update rule using a dummy step length ηt > 0 we have,

134 5.7. Proofs of convergence of (stochastic) RMSProp and ADAM

(︂ 1 )︂ 1 f (x ) f (x ) η f (x ), V 2 + diag(ξ1 ) − m t+1 ≤ t − t⟨∇ t t d t⟩ Lη2 (︂ 1 )︂ 1 + t V 2 + diag(ξ1 ) − m 2 (5.11) 2 ∥ t d t∥ = f (x ) f (x ) (5.12) ⇒ t+1 − t (︃ (︂ 1 )︂ 1 Lη (︂ 1 )︂ 1 )︃ η g , V 2 + diag(ξ1 ) − m + t V 2 + diag(ξ1 ) − m 2 (5.13) ≤ t −⟨ t t d t⟩ 2 ∥ t d t∥

1 (︂ 1 )︂− g , V 2 +diag(ξ1 ) m ⟨ t t d t⟩ The RHS in equation 5.11 above is a quadratic in ηt with two roots: 0 and 1 . (︂ 1 )︂− L V 2 +diag(ξ1 ) m 2 2 ∥ t d t∥ So the quadratic’s minimum value is at the midpoint of this interval, which gives us a candidate tth step length i.e − (︂ 1 )︂ 1 g , V 2 + diag(ξ1 ) − m 1 ⟨ t t d t⟩ α∗ := t 2 · (︂ 1 )︂ 1 L V 2 + diag(ξ1 ) − m 2 2 ∥ t d t∥

(︂ 1 )︂ 1 2 − 2 ( g , V +diag(ξ1d) mt ) 1 ⟨ t t ⟩ and the value of the quadratic at this point is 4 1 1 . That is with step lengths − · (︂ )︂− L V 2 +diag(ξ1 ) m 2 2 ∥ t d t∥ being this αt∗ we have the following guarantee of decrease of function value between consecutive steps,

(︂ 1 )︂ 1 ( g , V 2 + diag(ξ1 ) − m )2 1 ⟨ t t d t⟩ f (x ) f (xt) (5.14) t+1 − ≤ − 2L · (︂ 1 )︂ 1 V 2 + diag(ξ1 ) − m 2 ∥ t d t∥

Now we separately lower bound the numerator and upper bound the denominator of the RHS above.

(︂ 1 )︂ 1 Upperbound on V 2 + diag(ξ1 ) − m ∥ t d t∥ (︂(︂ 1 )︂ 1)︂ 2 − 1 We have, λmax Vt + diag(ξ1d) Further we note that the recursion of vt can ≤ ξ+mini=1..d √(vt)i t t k 2 2 be solved as, vt = (1 β ) ∑ β − g . Now we define, ϵt := min (g ) and this gives us, − 2 k=1 2 k k=1,..,t,i=1,..,d k i

(︂(︂ 1 )︂ 1)︂ 2 − 1 λmax Vt + diag(ξ1d) √︂ (5.15) ≤ ξ + (1 βt )ϵ − 2 t

135 Chapter 5. Understanding Adaptive Gradient Algorithms

t t k We solve the recursion for mt to get, mt = (1 β ) ∑ β − g . Then by triangle inequality and − 1 k=1 1 k defining σ := max f (x ) we have, m (1 βt )σ . Thus combining this estimate of t i=1,..,t∥∇ i ∥ ∥ t∥ ≤ − 1 t m with equation 5.15 we have, ∥ t∥

(︂ 1 )︂ 1 ( t ) ( t ) 2 − 1 β1 σt 1 β1 σt Vt + diag(ξ1d) mt √︂− − (5.16) ∥ ∥ ≤ ξ + ϵ (1 βt ) ≤ ξ t − 2

(︂ 1 )︂ 1 Lowerbound on g , V 2 + diag(ξ1 ) − m ⟨ t t d t⟩ To analyze this we define the following sequence of functions for each i = 0, 1, 2.., t

(︂ 1 )︂ 1 Q = g , V 2 + diag(ξ1 ) − m i ⟨ t t d i⟩

This gives us the following on substituting the update rule for mt,

(︂ 1 )︂ 1 2 − Qi β1Qi 1 = gt, Vt + diag(ξ1d) (mi β1mi 1) − − ⟨ − − ⟩ (︂ 1 )︂ 1 = (1 β ) g , V 2 + diag(ξ1 ) − g − 1 ⟨ t t d i⟩

(︂(︂ 1 )︂ 1)︂ 2 2 − At i = t we have, Qt β1Qt 1 (1 β1) gt λmin Vt + diag(ξ1d) − − ≥ − ∥ ∥

Lets define, σt 1 := maxi=1,..,t 1 f (xi) and this gives us for i 1, .., t 1 , − − ∥∇ ∥ ∈ { − }

(︂(︂ 1 )︂ 1)︂ 2 − Qi β1Qi 1 (1 β1) gt σt 1λmax Vt + diag(ξ1d) − − ≥ − − ∥ ∥ −

136 5.7. Proofs of convergence of (stochastic) RMSProp and ADAM

We note the following identity,

t 2 Qt β1Q0 = (Qt β1Qt 1) + β1(Qt 1 β1Qt 2) + β1(Qt 2 β1Qt 3) + .. − − − − − − − − − t 1 + β − (Q β Q ) 1 1 − 1 0

Now we use the lowerbounds proven on Qi β1Qi 1 for i 1, .., t 1 and Qt β1Qt 1 to lower- − − ∈ { − } − − bound the above sum as,

(︂(︂ 1 )︂ 1)︂ Q βt Q (1 β ) g 2λ V 2 + diag(ξ1 ) − t − 1 0 ≥ − 1 ∥ t∥ min t d (︂(︂ 1 )︂ 1)︂ t 1 2 − − j (1 β1) gt σt 1λmax Vt + diag(ξ1d) ∑ β1 − − ∥ ∥ − j=1 (︂(︂ 1 )︂ 1)︂ (1 β ) g 2λ V 2 + diag(ξ1 ) − ≥ − 1 ∥ t∥ min t d (︂(︂ 1 )︂ 1)︂ t 2 − (β1 β1) gt σt 1λmax Vt + diag(ξ1d) (5.17) − − ∥ ∥ −

We can evaluate the following lowerbound,

(︂(︂ 1 )︂ 1)︂ 2 − 1 λmin Vt + diag(ξ1d) √︁ ≥ ξ + maxi=1,..,d(vt)i

t t k 2 Next we remember that the recursion of vt can be solved as, vt = (1 β ) ∑ β − g and we define, − 2 k=1 2 k σ := max f (x ) to get, t i=1,..,t∥∇ i ∥

(︂(︂ 1 )︂ 1)︂ 2 − 1 λmin Vt + diag(ξ1d) √︂ (5.18) ≥ ξ + (1 βt )σ2 − 2 t

Now we combine the above and equation 5.15 and the known value of Q0 = 0 (from definition and initial conditions) to get from the equation 5.17,

137 Chapter 5. Understanding Adaptive Gradient Algorithms

t 1 Qt (β1 β1) gt σt 1 √︂ ≥ − − ∥ ∥ − ξ + (1 βt )ϵ − 2 t 2 1 + (1 β1) gt √︂ − ∥ ∥ ξ + (1 βt )σ2 − 2 t ⎛ ⎞ t 2 (1 β1) (β1 β1)σ gt ⎝ √︂− − ⎠ (5.19) ≥ ∥ ∥ ξ + σ (1 βt ) − ξ gt − 2 ∥ ∥

In the above inequalities we have set ϵt = 0 and we have set, σt = σt 1 = σ. Now we examine the − following part of the lowerbound proven above,

t (1 β1) (β1 β1)σ √︂ − − ξ + (1 βt )σ2 − ξ gt − 2 ∥ ∥ √︂ ξ g (1 β ) σ(β βt )(ξ + σ (1 βt )) ∥ t∥ − 1 − 1 − 1 − 2 = √︂ ξ g (ξ + σ (1 βt )) ∥ t∥ − 2 (︂ g (1 β ) )︂ √︂ ξ ∥ t∥ − 1 1 σ (1 βt ) σ(β βt ) 2 t 1− 1 − − − = σ(β1 β1) √︂ − ξ g (ξ + σ (1 βt )) ∥ t∥ − 2 ⎛ ⎞ t σ√(1 β2) ξ ⎝ (1 −β ) g ⎠ (︄ )︄ − 1+ − 1 ∥ t∥ (β βt )σ t gt (1 β1) − 1− 1 = σ(β1 β1) ∥ ∥ − t 1 √︂ − σ(β1 β1) − ξ g (ξ + σ (1 βt )) − ∥ t∥ − 2

Now we remember the assumption that we are working under i.e g > ϵ. Also by definition ∥ t∥ t (1 β1) gt (1 β1)ϵ 0 < β < 1 and hence we have 0 < β β < β . This implies, − ∥t ∥ > − > 1 where the 1 1 1 1 (β β )σ β1σ − 1− 1 last inequality follows because of our choice of ϵ as stated in the theorem statement. This allows us

ϵ(1 β1) (1 β1) gt to define a constant, − 1 := θ > 0 s.t − ∥t ∥ 1 > θ Similarly our definition of ξ allows β1σ 1 (β β )σ 1 − 1− 1 − us to define a constant θ2 > 0 to get,

⎛ √︂ ⎞ t σ (1 β2) σ ⎜ − ⎟ < = ξ θ ⎝ (1 β ) g ⎠ 2 1 + − 1 ∥ t∥ θ1 − (β βt )σ − 1− 1

Putting the above back into the lowerbound for Qt in equation 5.19 we have,

138 5.7. Proofs of convergence of (stochastic) RMSProp and ADAM

(︄ )︄ σ(β β2)θ θ Q g 2 1 − 1 1 2 (5.20) t ≥ ∥ t∥ ξσ(ξ + σ)

Now we substitute the above and equation 5.16 into equation 5.14 to get,

(︂ 1 )︂ 1 ( g , V 2 + diag(ξ1 ) − m )2 1 ⟨ t t d t⟩ f (x ) f (xt) t+1 − ≤ − 2L · (︂ 1 )︂ 1 V 2 + diag(ξ1 ) − m 2 ∥ t d t∥ 2 1 Qt ≤ − 2L (︂ 1 )︂ 1 V 2 + diag(ξ1 ) − m 2 ∥ t d t∥ (︃ )︃2 (β β2)θ θ g 4 1− 1 1 2 1 ∥ t∥ ξ(ξ+σ) ≤ − 2L (︂ (1 βt )σ )︂2 − 1 ξ (︄ )︄ g 4 (β β2)2θ2θ2 ∥ t∥ 1 − 1 1 2 (5.21) ≤ − 2L (ξ + σ)2(1 βt )2σ2 − 1

Thus we get from the above,

(︄ 2 2 2 2 )︄ (β1 β1) θ1θ2 4 − f (xt) [ f (xt) f (xt+ )] 2L(ξ + σ)2(1 βt )2σ2 ∥∇ ∥ ≤ − 1 − 1 T (︄ 2 2 2 2 )︄ (β1 β1) θ1θ2 4 = ∑ − 2 2 f (xt) [ f (x2) f (xT+1)] ⇒ t=2 2L(ξ + σ) σ ∥∇ ∥ ≤ − 2 2 4 2L(ξ + σ) σ = min f (xt) [ f (x2) f (x )] ⇒ t=2,..,T∥∇ ∥ ≤ T(β β2)2θ2θ2 − ∗ 1 − 1 1 2

2 2 Observe that if, T 2Lσ (ξ+σ) [ f (x ) f (x )] then the RHS of the inequality above is less than 2ϵ4(β β2)2θ2θ2 2 ≥ 1− 1 1 2 − ∗ or equal to ϵ4 and this would contradict the assumption that f (x ) > ϵ for all t = 1, 2, . . .. ∥∇ t ∥

As a consequence we have proven the first part of the theorem which guarantees the existence of positive step lengths, αt s.t ADAM finds an approximately critical point in finite time.

= ϵ = β1σ = ϵ = ( ) = ϵ ( ϵ ) = 2σϵ Now choose θ1 1 i.e 2 1 β i.e β1 ϵ+2σ β1 1 β1 ϵ+2σ 1 ϵ+2σ (ϵ+2σ)2 . This − 1 ⇒ − − also gives a easier-to-read condition on ξ in terms of these parameters i.e ξ > σ. Now choose ξ = 2σ i.e θ2 = σ and making these substitutions gives us,

139 Chapter 5. Understanding Adaptive Gradient Algorithms

18Lσ4 18L T [ f (x ) f (x )] [ f (x ) f (x )] (︂ )︂2 2 (︂ )︂2 2 ≥ 4 2σϵ 2 − ∗ ≥ 6 1 − ∗ 2ϵ (ϵ+2σ) σ 8ϵ ϵ+2σ 9Lσ2 [ f (x2) f (x )] ≥ ϵ6 − ∗

We substitute these choices in the step length found earlier to get,

(︂ 1 )︂ 1 2 − g , V + diag(ξ1 ) mt 1 ⟨ t t d ⟩ 1 Qt α∗ = = t L · (︂ 1 )︂ 1 L · (︂ 1 )︂ 1 V 2 + diag(ξ1 ) − m 2 V 2 + diag(ξ1 ) − m 2 ∥ t d t∥ ∥ t d t∥ (︃ )︃ σ2(β β2) 2 1− 1 gt ξσ(ξ+σ) 2 1 ∥ ∥ gt 4ϵ = ∥ ∥ t := αt ≥ L (︂ (1 βt )σ )︂2 L(1 β )2 3(ϵ + 2σ)2 − 1 1 ξ −

In the theorem statement we choose to call as the final αt the lowerbound proven above. We check

below that this smaller value of αt still guarantees a decrease in the function value that is sufficient for the statement of the theorem to hold.

(︃ σ2(β β2) )︃ 2 1− 1 gt 1 ∥ ∥ ξσ(ξ+σ) A consistency check! Let us substitute the above final value of the step length αt = 2 = L (︃ (1 βt )σ )︃ − 1 ξ (︃ )︃ ξ (β β2) g 2 1− 1 , the bound in equation 5.16 (with σ replaced by σ), and the bound in equa- L(1 βt )2 t σ(ξ+σ) t − 1 ∥ ∥ tion 5.20 (at the chosen values of θ1 = 1 and θ2 = σ) in the original equation 5.14 to measure the decrease in the function value between consecutive steps,

f (x ) f (x ) t+1 − t (︃ (︂ 1 )︂ 1 Lα (︂ 1 )︂ 1 )︃ α g , V 2 + diag(ξ1 ) − m + t V 2 + diag(ξ1 ) − m 2 ≤ t −⟨ t t d t⟩ 2 ∥ t d t∥ (︃ Lα (︂ 1 )︂ 1 )︃ α Q + t V 2 + diag(ξ1 ) − m 2 ≤ t − t 2 ∥ t d t∥ (︄ )︄ (︄ (︄ )︄)︄ ξ (β β2) σ(β β2)θ θ g 2 1 − 1 g 2 1 − 1 1 2 ≤ L(1 βt )2 ∥ t∥ σ(ξ + σ) −∥ t∥ ξσ(ξ + σ) − 1 (︄ (︄ )︄ )︄2 L ξ (β β2) (1 βt )σ + g 2 1 − 1 − 1 2 L(1 βt )2 ∥ t∥ σ(ξ + σ) ξ − 1

140 5.8. Conclusion

The RHS above can be simplified to be shown to be equal to the RHS in equation 5.21 at the same values of θ1 and θ2 as used above. And we recall that the bound on the running time was derived from this equation 5.21.

5.8 Conclusion

To the best of our knowledge, we present the first theoretical guarantees of convergence to criticality for the immensely popular algorithms RMSProp and ADAM in their most commonly used setting of optimizing a non-convex objective.

By our experiments, we have sought to shed light on the important topic of the interplay between adaptivity and momentum in training nets. By choosing to study textbook autoencoder architectures where various parameters of the net can be changed controllably we highlight the following two aspects that (a) the value of the gradient shifting hyperparameter ξ has a significant influence on the performance of ADAM and RMSProp and (b) ADAM seems to perform particularly well (often supersedes Nesterov accelerated gradient method) when its momentum parameter β1 is very close to 1. On VGG-9 with CIFAR-10 and for the task of training autoencoders on MNIST we have verified these conclusions across different widths and depths of nets as well as in the full-batch and the mini- batch setting (with large nets) and also under compression of the input/output image size.

Curiously enough, this regime of β1 being close to 1 is currently not within the reach of our proof tech- niques of showing convergence for ADAM. Our experiments give strong reasons to try to advance theory in this direction in future work. Though we note that it is still open to find a characterization of the class of objectives for which ADAM and RMSProp in their standard stochastic forms converge to criticality using just a bounded moment and unbiased gradient estimating oracle. Hence theo- retically we are still far from being able to explain the unique advantages of the standard versions of RMSProp or ADAM, which in turn we have thoroughly demonstrated in the experiments in this work.

141 Chapter 5. Understanding Adaptive Gradient Algorithms

Appendix To Chapter5

5.A Proving stochastic RMSProp (Proof of Theorem 5.3.3)

t t k 2 Proof. We define σt := max f (x ) and we solve the recursion for vt as, vt = (1 β ) ∑ β − (g + k=1,..,t ∥∇ ik k ∥ − 2 k=1 2 k ξ). This lets us write the following bounds,

1 2 1 1 λmin(Vt− ) √︁ √︂ ≥ max (v ) ≥ t t k 2 i=1,..,d t i max ((1 β ) ∑ β − (g + ξ1 ) ) i=1,..,d − 2 k=1 2 k d i 1 √︂ √︂ ≥ 1 βt σ2 + ξ − 2 t

Now we define, ϵ := min ( f (x ))2 and this lets us get the following bounds, t k=1,..,t,i=1,..,d ∇ ik k i

1 1 1 λ (V− 2 ) max t √︁ √︂ √︁ ≤ mini=1,..,d( (vt)i) ≤ (1 βt ) (ξ + ϵ ) − 2 t

Now we invoke the bounded gradient assumption about the fi functions and replace in the above

equation the eigenvalue bounds of the pre-conditioner by worst-case estimates µmax and µmin defined as,

1 2 1 λmin(V− ) := µmin t ≥ √︂ 2 σf + ξ

1 2 1 λmax(Vt− ) √︁ := µmax ≤ (1 β )√ξ − 2

142 5.A. Proving stochastic RMSProp (Proof of Theorem 5.3.3)

Using the L-smoothness of f between consecutive iterates xt and xt+1 we have,

L f (x ) f (x ) + f (x ), x x + x x 2 t+1 ≤ t ⟨∇ t t+1 − t⟩ 2 ∥ t+1 − t∥

1 We note that the update step of stochastic RMSProp is x = x α(V ) 2 g where g is the stochastic t+1 t − t − t t gradient at iterate x . Let H = x , x , .., x be the set of random variables corresponding to the t t { 1 2 t} first t iterates. The assumptions we have about the stochastic oracle give us the following relations,

E[g H ] = f (x ) and E[ g 2 H ] σ2 . Now we can invoke these stochastic oracle’s properties t | t ∇ t ∥ t∥ | t ≤ f and take a conditional (on H ) expectation over g of the L smoothness in equation to get, t t −

2 [︂ 1 ]︂ α L [︂ 1 2 ]︂ E[ f (x ) H ] f (x ) αE f (x ), (V )− 2 g H + E (V )− 2 g H t+1 | t ≤ t − ⟨∇ t t t⟩ | t 2 ∥ t t∥ | t 2 [︂ 1 ]︂ 2 α L [︂ 2 ]︂ f (x ) αE f (x ), (V )− 2 g H + µ E g H ≤ t − ⟨∇ t t t⟩ | t max 2 ∥ t∥ | t α2σ2L [︂ 1 ]︂ 2 f f (x ) αE f (x ), (V )− 2 g H + µ (5.22) ≤ t − ⟨∇ t t t⟩ | t max 2

We now separately analyze the middle term in the RHS above. In Lemma 5.A.1 below and we get,

1 2 E[ f (x ), (V )− 2 g H ] µ f (x ) ⟨∇ t t t⟩ | t ≥ min∥∇ t ∥

We substitute the above into equation 5.22 and take expectations over Ht to get,

2 2 α σf L E[ f (x ) f (x )] αµ E[ f (x ) 2] + µ2 t+1 − t ≤ − min ∥∇ t ∥ max 2 ασ2L 2 2 1 f µmax = E[ f (xt) ] E[ f (xt) f (xt+1)] + (5.23) ⇒ ∥∇ ∥ ≤ αµmin − 2 µmin

143 Chapter 5. Understanding Adaptive Gradient Algorithms

Doing the above replacements to upperbound the RHS of equation 5.23 and summing the inequation

over t = 1 to t = T and taking the average and replacing the LHS by a lowerbound of it, we get,

ασ2L 2 2 1 f µmax min E[ f (xt) ] E[ f (x1) f (xT+1)] + t=1...T ∥∇ ∥ ≤ αTµmin − 2 µmin ασ2L 2 1 f µmax ( f (x1) f (x )) + ≤ αTµmin − ∗ 2 µmin

Replacing into the RHS above the optimal choice of,

√︄ √︄ 1 2 ( f (x1) f (x )) 1 2ξ(1 β2) ( f (x1) f (x )) α = ∗ = ∗ 2 −2 − 2 − √T σf Lµmax √T σf L

we get,

√︄ Lσ2 2 2 1 f µmax min E[ f (xt) ] 2 ( f (x1) f (x )) t=1...T ∥∇ ∥ ≤ Tµmin − ∗ × 2 µmin √︄ 2 2 1 2Lσf (σf + ξ) ( f (x1) f (x )) = − ∗ √T (1 β2)ξ −

Thus stochastic RMSProp with the above step-length is guaranteed is reach ϵ criticality in number of (︃ 2 2 )︃ 2Lσ (σ +ξ)( f (x1) f (x )) T 1 f f − ∗ iterations given by, ϵ4 (1 β )ξ ≤ − 2

Lemma 5.A.1. At any time t, the following holds,

1/2 2 E[ f (x ), V− g H ] µ f (x ) ⟨∇ t t t⟩ | t ≥ min∥∇ t ∥

Proof.

[︄ d ]︄ [︂ 1/2 ]︂ 1/2 E f (xt), Vt− gt Ht = E ∑ i f (xt)(Vt− )ii(gt)i Ht ⟨∇ ⟩ | i=1 ∇ | d [︂ 1/2 ]︂ = ∑ i f (xt)E (Vt− )ii(gt)i Ht (5.24) i=1 ∇ |

Now we introduce some new variables to make the analysis easier to present. Let a := [︁ f (x )]︁ pi ∇ p t i where p indexes the training data set, p 1, . . . , k . (conditioned on H , a s are constants) This ∈ { } t pi 1 k implies, f (xt) = ∑ a We recall that E [(g ) ] = f (xt) where the expectation is taken over ∇i k p=1 pi t i ∇i the oracle call at the tth update step. Further our instantiation of the oracle is equivalent to doing the uniformly at random sampling, (g ) a . t i ∼ { pi}p=1,...,k

t t k 2 1/2 Given that we have, Vt = diag(vt) with vt = (1 β ) ∑ β − (g + ξ1 ) this implies, (V− ) = − 2 k=1 2 k d t ii

144 5.A. Proving stochastic RMSProp (Proof of Theorem 5.3.3)

1 t 1 t k 2 where we have defined di := (1 β2)ξ + (1 β2) ∑ − β − ((g ) + ξ). (conditioned √(1 β )(g )2+d k=1 2 k i − 2 t i i − − on H , d is a constant) This leads to an explicit form of the needed expectation over the tth oracle t i − call as,

[︂ 1/2 ]︂ [︂ 1/2 ]︂ E (V− ) (g ) H = E (V− ) (g ) H t ii t i | t t ii t i | t ⎡ ⎤ (g ) = E t i H (gt)i api p=1,...,k ⎣ √︂ t⎦ ∼{ } (1 β )(g )2 + d | − 2 t i i k 1 api = ∑ √︂ 2 k p=1 (1 β )a + d − 2 pi i

Substituting the above (and the definition of the constants api) back into equation 5.24 we have,

⎛ ⎞ d (︄ k )︄ k [︂ 1/2 ]︂ 1 1 api E f (xt), Vt− gt Ht = api ⎝ ⎠ ∑ ∑ ∑ √︂ 2 ⟨∇ ⟩ | i=1 k p=1 k p=1 (1 β )a + d − 2 pi i

a h Rk (a ) = a (h ) = 1 We define two vectors i, i s.t i p pi and i p √︂ 2 ∈ (1 β2)a +di − pi Substituting this, the above expression can be written as,

[︂ ]︂ 1 d (︂ )︂ (︂ )︂ 1 d (︂ )︂ E 1/2 f (xt), Vt− gt Ht = 2 ∑ ai⊤1k hi⊤ai = 2 ∑ ai⊤ 1khi⊤ ai (5.25) ⟨∇ ⟩ | k i=1 k i=1

Note that with this substitution, the RHS of the claimed lemma becomes,

d (︄ k )︄2 2 1 µmin f (xt) = µmin ∑ ∑ p f (xt) ∥∇ ∥ i=1 k p=1 ∇ d µmin 2 = 2 ∑(ai⊤1k) k i=1 d µmin = 2 ∑ ai⊤1k1k⊤ai k i=1

Therefore our claim is proved if we show that for all i,

1 (︂ )︂ µmin a⊤ 1 h⊤ a a⊤1 1⊤a 0 k2 i k i i − k2 i k k i ≥

145 Chapter 5. Understanding Adaptive Gradient Algorithms

. This can be simplified as,

1 (︂ )︂ 1 1 (︂ )︂ a⊤ 1 h⊤ a µ a⊤1 1⊤a = a⊤ 1 (h µ 1 )⊤ a k2 i k i i − min k2 i k k i k2 i k i − min k i

q Rk (q ) = (h ) = 1 To further simplify, we define i , i p i p µmin √︂ 2 µmin. We therefore ∈ − (1 β2)a +di − − pi need to show,

1 (︂ )︂ a⊤ 1 q⊤ a 0 k2 i k i i ≥

We first bound d by recalling the definition of σ (from which it follows that a2 σ2), i f pi ≤ f

[︄ t 1 ]︄ [︄ t 1 ]︄ − t k 2 β2 β2− 2 d (1 β ) ξ + β − (σ + ξ) = (1 β ) ξ + − (σ + ξ) i ≤ − 2 ∑ 2 f − 2 1 β f k=1 − 2 t 1 t 1 2 t 1 t 1 2 (1 β )ξ + (β β − )ξ + (β β − )σ = (1 β − )ξ + (β β − )σ ≤ − 2 2 − 2 2 − 2 f − 2 2 − 2 f √︂ √︂ 2 2 t 1 t 1 2 = (1 β )a + d (1 β )σ + (1 β − )ξ + (β β − )σ ⇒ − 2 pi i ≤ − 2 f − 2 2 − 2 f √︂ t 1 2 = (1 β − )(σ + ξ) − 2 f 1 1 = µmin + √︂ µmin + √︂ ⇒ − 2 ≥ − t 1 2 (1 β )a + d (1 β − )(σ + ξ) − 2 pi i − 2 f 1 1 = √︂ + √︂ − 2 t 1 2 σ + ξ (1 β − )(σ + ξ) f − 2 f 1 = µmin + √︂ 0 (5.26) ⇒ − (1 β )a2 + d ≥ − 2 pi i

The inequality follows since β (0, 1] 2 ∈

146 5.B. Proving deterministic RMSProp - the version with standard speed (Proof of Theorem 5.4.1)

Putting this all together, we get,

(ai⊤1k)(qi⊤ai) ⎛ ⎡ ⎤⎞ (︄ k )︄ k api = api ⎝ ⎣ µminapi + ⎦⎠ ∑ ∑ √︂ 2 p=1 p=1 − (1 β )a + d − 2 pi i ⎡ ⎤ k apiaqi = ⎣ µminapiaqi + ⎦ ∑ √︂ 2 p,q=1 − (1 β )a + d − 2 pi i ⎡ ⎤ k 1 = apiaqi ⎣ µmin + ⎦ ∑ √︂ 2 p,q=1 − (1 β )a + d − 2 pi i

Now our assumption that for all x, sign( fp(x)) = sign( fq(x)) for all p, q 1, . . . , k leads to the ∇ ∇ ∈ { [︄ } ]︄ a a + 1 conclusion that the term pi qi 0. And we had already shown in equation 5.26 that µmin √︂ 2 ≥ − (1 β2)a +di ≥ − pi 0. Thus we have shown that (a 1 )(q a ) 0 and this finishes the proof. i⊤ k i⊤ i ≥

5.B Proving deterministic RMSProp - the version with standard

speed (Proof of Theorem 5.4.1)

Proof. By the L smoothness condition and the update rule in Algorithm6 we have, −

1 L 1 f (x ) f (x ) α f (x ), V− 2 f (x ) + α2 V− 2 f (x ) 2 t+1 ≤ t − t⟨∇ t t ∇ t ⟩ t 2 ∥ t ∇ t ∥ (︃ )︃ Lαt 1 1 = f (x ) f (x ) α V− 2 f (x ) 2 f (x ), V− 2 f (x ) (5.27) ⇒ t+1 − t ≤ t 2 ∥ t ∇ t ∥ − ⟨∇ t t ∇ t ⟩

For 0 < δ2 < 1 we now show a strictly positive lowerbound on the following function, t √1 βt √σ2+ξ − 2 t

⎛ 1 ⎞ 2 2 2 2 f (xt), V− f (xt) δ f (xt) ⎝ ⟨∇ t ∇ ⟩ − t ∥∇ ∥ ⎠ (5.28) L 1 V− 2 f (x ) 2 ∥ t ∇ t ∥

t t k 2 We define σt := max f (x ) and we solve the recursion for vt as, vt = (1 β ) ∑ β − (g + i=1,..,t ∥∇ i ∥ − 2 k=1 2 k ξ). This lets us write the following bounds,

147 Chapter 5. Understanding Adaptive Gradient Algorithms

1 1 2 2 2 2 f (xt) f (xt), Vt− f (xt) λmin(Vt− ) f (xt) √︁ ∥∇ ∥ ⟨∇ ∇ ⟩ ≥ ∥∇ ∥ ≥ maxi=1,..,d(vt)i 2 f (xt) √︂ ∥∇ ∥ ≥ t t k 2 max ((1 β ) ∑ β − (g + ξ1 ) ) i=1,..,d − 2 k=1 2 k d i 2 f (xt) √︂ ∥∇ √︂ ∥ (5.29) ≥ 1 βt σ2 + ξ − 2 t

Now we define, ϵ := min ( f (x ))2 and this lets us get the following sequence of in- t k=1,..,t,i=1,..,d ∇ k i equalities,

1 1 2 2 2 2 2 2 f (xt) V− f (xt) λ (V− ) f (xt) ∥∇ ∥ (5.30) t max t √︁ 2 ∥ ∇ ∥ ≤ ∥∇ ∥ ≤ (mini=1,..,d( (vt)i)) f (x ) 2 ∥∇ t ∥ (5.31) ≤ (1 βt )(ξ + ϵ ) − 2 t

So combining equations 5.30 and 5.29 into equation 5.28 and from the exit line in the loop we are

assured that f (x ) 2 = 0 and combining these we have, ∥∇ t ∥ ̸

⎛ 1 ⎞ 2 2 2 2 δ f (xt) + f (xt), V− f (xt) ⎝ − t ∥∇ ∥ ⟨∇ t ∇ ⟩ ⎠ L 1 V− 2 f (x ) 2 ∥ t ∇ t ∥ ⎛ δ2 + 1 ⎞ 2 t √1 βt √σ2+ξ − − 2 t ⎝ 1 ⎠ ≥ L t (1 β2)(ξ+ϵt) − ⎛ ⎞ ( t )( + ) 2 1 β2 ξ ϵt ⎜ 2 1 ⎟ − ⎝ δt + √︂ √︂ ⎠ ≥ L − 1 βt σ2 + ξ − 2 t

2 1 2 Now our definition of δ allows us to define a parameter 0 < βt := δ and rewrite the t √1 βt √σ2+ξ t − 2 t − above equation as,

⎛ 1 ⎞ 2 2 2 t 2 δ f (xt) + f (xt), V− f (xt) 2(1 β )(ξ + ϵt)βt ⎝ − t ∥∇ ∥ ⟨∇ t ∇ ⟩ ⎠ − 2 (5.32) L 1 ≥ L V− 2 f (x ) 2 ∥ t ∇ t ∥

We can as well satisfy the conditions needed on the variables, βt and δt by choosing,

148 5.B. Proving deterministic RMSProp - the version with standard speed (Proof of Theorem 5.4.1)

1 1 1 1 δ2 = min = =: δ2 t √︂ √︂ √︁ 2 2 t=1,... 1 βt σ2 + ξ 2 σ + ξ − 2 t and

1 2 1 1 βt = min δ = √︂ √︂ √︁ 2 t=1,.. 1 βt σ2 + ξ − 2 σ + ξ − 2 t

Then the worst-case lowerbound in equation 5.32 becomes,

⎛ 1 ⎞ 2 2 2 2 δ f (xt) + f (xt), V− f (xt) 2(1 β2)ξ 1 1 ⎝ − t ∥∇ ∥ ⟨∇ t ∇ ⟩ ⎠ − L 1 ≥ L × 2 √︁ 2 V− 2 f (x ) 2 σ + ξ ∥ t ∇ t ∥

(1 β2)ξ This now allows us to see that a constant step length αt = α > 0 can be defined as, α = − and L√σ2+ξ 1 1 this is such that the above equation can be written as, Lα V− 2 f (x ) 2 f (x ), V− 2 f (x ) 2 ∥ t ∇ t ∥ − ⟨∇ t t ∇ t ⟩ ≤ δ2 f (x ) 2 . This when substituted back into equation 5.27 we have, − ∥∇ t ∥

f (x ) f (x ) δ2α f (x ) 2 = δ2α f (x ) 2 t+1 − t ≤ − ∥∇ t ∥ − ∥∇ t ∥

This gives us,

2 1 f (xt) [ f (xt) f (x )] ∥∇ ∥ ≤ δ2α − t+1 T 2 1 = f (xt) [ f (x1) f (x )] (5.33) ∑ 2 ∗ ⇒ t=1∥∇ ∥ ≤ δ α − 2 1 = min f (xt) [ f (x1) f (x )] (5.34) ⇒ t=1,,.T ∥∇ ∥ ≤ Tδ2α − ∗

1 2 Thus for any given ϵ > 0, T satisfying, 2 [ f (x1) f (x )] ϵ is a sufficient condition to ensure Tδ α − ∗ ≤ that the algorithm finds a point x := argmin f (x ) 2 with f (x ) 2 ϵ2. result t=1,,.T ∥∇ t ∥ ∥∇ result ∥ ≤

(1 β )ξ Thus we have shown that using a constant step length of α = − 2 deterministic RMSProp can L√σ2+ξ 2 1 f (x1) f (x ) 1 2L(σ +ξ)( f (x1) f (x )) T = − ∗ = − ∗ find an ϵ critical point in ϵ2 δ2α ϵ2 (1 β )ξ steps. − × × − 2

149 Chapter 5. Understanding Adaptive Gradient Algorithms

5.C Proving deterministic RMSProp - the version with no added

shift (Proof of Theorem 5.4.2)

Proof. From the L smoothness condition on f we have between consecutive iterates of the above − algorithm,

1 f (x ) f (x ) α f (x ), V− 2 f (x ) t+1 ≤ t − t⟨∇ t t ∇ t ⟩ L 1 + α2 V− 2 f (x ) 2 (5.35) 2 t ∥ t ∇ t ∥ 1 1 2 1 Lαt 2 2 = f (xt), Vt− f (xt) ( f (xt) f (xt+1)) + Vt− f (xt) (5.36) ⇒ ⟨∇ ∇ ⟩ ≤ αt − 2 ∥ ∇ ∥

t t k 2 Now the recursion for vt can be solved to get, vt = (1 β ) ∑ β − g . Then − 2 k=1 2 k

1 2 1 Vt √︁ ∥ ∥ ≥ maxi Support(v ) (vt)i ∈ t 1 = √︂ t t k 2 maxi Support(vt) (1 β2) ∑k=1 β2− (gk)i ∈ − 1 1 = √︂ = √︂ t t k σ (1 βt ) maxi Support(vt) σ (1 β2) ∑k=1 β2− 2 ∈ − −

Substituting this in a lowerbound on the LHS of equation 5.35 we get,

1 1 2 2 1 √︂ f (xt) f (xt), Vt− f (xt) ( f (xt) f (xt+1)) σ (1 βt ) ∥∇ ∥ ≤ ⟨∇ ∇ ⟩ ≤ αt − − 2 Lαt 1 + V− 2 f (x ) 2 2 ∥ t ∇ t ∥

Summing the above we get,

T T T 1 1 Lαt 1 f (x ) 2 ( f (x ) f (x )) + V− 2 f (x ) 2 (5.37) ∑ √︂ t ∑ α t t+1 ∑ 2 t t t=1 σ (1 βt ) ∥∇ ∥ ≤ t=1 t − t=1 ∥ ∇ ∥ − 2

Now we substitute α = α and invoke the definition of B and B to write the first term on the RHS t √t ℓ u of equation 5.37 as,

150 5.C. Proving deterministic RMSProp - the version with no added shift (Proof of Theorem 5.4.2)

T T (︃ )︃ 1 f (x1) f (xt+1) f (xt+1) f (xT+1) ∑ [ f (xt) f (xt+1)] = + ∑ t=1 αt − α t=1 αt+1 − αt − αT+1 ( ) ( ) T f x1 f xT+1 1 √ √ = + ∑ f (xt+1)( t + 1 t) α − αT+1 α t=1 − B B √T + 1 B u ℓ + u (√T + 1 1) ≤ α − α α −

Now we bound the second term in the RHS of equation 5.37 as follows. Lets first define a function 1 T 2 2 P(T) as follows, P(T) = ∑ αt V− f (xt) and that gives us, t=1 ∥ t ∇ ∥

d g2 d g2 P(T) P(T 1) = α T,i = α T,i T ∑ T ∑ T T k 2 − − vT,i (1 β ) ∑ β − g i=1 i=1 − 2 k=1 2 k,i α d g2 dα = T T,i ∑ T T k 2 (1 β2) ∑ β − g ≤ (1 β )√T − i=1 k=1 2 k,i − 2 T = ∑[P(t) P(t 1)] = P(T) P(1) ⇒ t=2 − − − dα T 1 dα ∑ (√T 2) ≤ (1 β2) √t ≤ 2(1 β2) − − t=2 − dα = P(T) P(1) + (√T 2) ⇒ ≤ 2(1 β ) − − 2

So substituting the above two bounds back into the RHS of the above inequality 5.37and removing √︂ the factor of 1 βT < 1 from the numerator, we can define a point x as follows, − 2 result

T 2 2 1 2 f (xresult) := argmin f (xt) ∑ f (xt) ∥∇ ∥ t=1,..,T ∥∇ ∥ ≤ T t=1 ∥∇ ∥ (︄ σ B B √T + 1 B u l + u (√T + 1 1) ≤ T α − α α − )︄ L [︃ dα ]︃ + P(1) + (√T 2) 2 2(1 β ) − − 2

T = O( 1 ) Thus it follows that for ϵ4 the algorithm6 is guaranteed to have found at least one point x such that, f (x ) 2 ϵ2 result ∥∇ result ∥ ≤

151 Chapter 5. Understanding Adaptive Gradient Algorithms

5.D Hyperparameter Tuning

Here we describe how we tune the hyper-parameters of each optimization algorithm. NAG has two hyper-parameters, the step size α and the momentum µ. The main hyper-parameters for RMSProp are the step size α, the decay parameter β2 and the perturbation ξ. ADAM, in addition to the ones in

RMSProp, also has a momentum parameter β1. We vary the step-sizes of ADAM in the conventional √︂ way of α = α 1 βt /(1 βt ). t − 2 − 1

For tuning the step size, we follow the same method used in Wilson et al., 2017. We start out with a logarithmically-spaced grid of five step sizes. If the best performing parameter was at oneofthe extremes of the grid, we tried new grid points so that the best performing parameters were at one of the middle points in the grid. While it is computationally infeasible even with substantial resources to follow a similarly rigorous tuning process for all other hyper-parameters, we do tune over them somewhat as described below.

NAG The initial set of step sizes used for NAG were: 3e 3, 1e 3, 3e 4, 1e 4, 3e 5 We tune the { − − − − − } momentum parameter over values µ 0.9, 0.99 . ∈ { }

RMSProp The initial set of step sizes used were: 3e 4, 1e 4, 3e 5, 1e 5, 3e 6 . We tune over { − − − − − } β 0.9, 0.99 . We set the perturbation value ξ = 10 10, following the default values in TensorFlow, 2 ∈ { } − except for the experiments in Section 5.6.1. In Section 5.6.1, we show the effect on convergence and generalization properties of ADAM and RMSProp when changing this parameter ξ.

Note that ADAM and RMSProp uses an accumulator for keeping track of decayed squared gradient vt. For ADAM this is recommended to be initialized at v0 = 0. However, we found in the TensorFlow implementation of RMSProp that it sets v0 = 1d. Instead of using this version of the algorithm, we used a modified version where we set v0 = 0. We typically found setting v0 = 0 to lead to faster convergence in our experiments.

ADAM The initial set of step sizes used were: 3e 4, 1e 4, 3e 5, 1e 5, 3e 6 . For ADAM, we { − − − − − } tune over β values of 0.9, 0.99 . For ADAM, We set β = 0.999 for all our experiments as is set as 1 { } 2 8 the default in TensorFlow. Unless otherwise specified we use for the perturbation value ξ = 10− for ADAM, following the default values in TensorFlow.

Contrary to what is the often used values of β1 for ADAM (usually set to 0.9), we found that we often got better results on the autoencoder problem when setting β1 = 0.99.

152 5.E. Effect of the ξ parameter on adaptive gradient algorithms

5.E Effect of the ξ parameter on adaptive gradient algorithms

In Figure 5.E.1, we show the same effect of changing ξ as in Section 5.6.1 on a 1 hidden layer network of 1000 nodes, while keeping all other hyper parameters fixed (such as learning rate, β1, β2). These other hyper-parameter values were fixed at the best values of these parameters for the default values

10 8 of ξ, i.e., ξ = 10− for RMSProp and ξ = 10− for ADAM.

(A) Loss on training set (B) Loss on test set (C) Gradient norm on training set

FIGURE 5.E.1: Fixed parameters with changing ξ values. 1 hidden layer network of 1000 nodes

5.F Additional Experiments

5.F.1 Additional full-batch experiments on 22 22 sized images × In Figures 5.F.1, 5.F.2 and 5.F.3, we show training loss, test loss and gradient norm results for a variety of additional network architectures. Across almost all network architectures, our main results remain consistent. ADAM with β1 = 0.99 consistently reaches lower training loss values as well as better generalization than NAG.

153 Chapter 5. Understanding Adaptive Gradient Algorithms

(A) 1 hidden layer; 1000 nodes (B) 3 hidden layers; 1000 nodes each (C) 5 hidden layers; 1000 nodes each

(D) 3 hidden layers; 300 nodes (E) 3 hidden layer; 3000 nodes (F) 5 hidden layer; 300 nodes FIGURE 5.F.1: Loss on training set; Input image size 22 22 ×

(A) 1 hidden layer; 1000 nodes (B) 3 hidden layers; 1000 nodes each (C) 5 hidden layers; 1000 nodes each

(D) 3 hidden layers; 300 nodes (E) 3 hidden layer; 3000 nodes (F) 5 hidden layer; 300 nodes

FIGURE 5.F.2: Loss on test set; Input image size 22 22 ×

154 5.F. Additional Experiments

(A) 1 hidden layer; 1000 nodes (B) 3 hidden layers; 1000 nodes each (C) 5 hidden layers; 1000 nodes each

(D) 3 hidden layers; 300 nodes (E) 3 hidden layer; 3000 nodes (F) 5 hidden layer; 300 nodes

FIGURE 5.F.3: Norm of gradient on training set; Input image size 22 22 ×

155 Chapter 5. Understanding Adaptive Gradient Algorithms

5.F.2 Are the full-batch results consistent across different input dimensions?

To test whether our conclusions are consistent across different input dimensions, we do two experi- ments where we resize the 22 22 MNIST image to 17 17 and to 12 12. Resizing is done using × × × TensorFlow’s tf.image.resize images method, which uses bilinear interpolation.

Input images of size 17 17 × Figure 5.F.4 shows results on input images of size 17 17 on a 3 layer network with 1000 hidden × nodes in each layer. Our main results extend to this input dimension, where we see ADAM with

β1 = 0.99 both converging the fastest as well as generalizing the best, while NAG does better than

ADAM with β1 = 0.9.

(A) Training loss (B) Test loss (C) Gradient norm

FIGURE 5.F.4: Full-batch experiments with input image size 17 17 ×

Input images of size 12 12 × Figure 5.F.5 shows results on input images of size 12 12 on a 3 layer network with 1000 hidden × nodes in each layer. Our main results extend to this input dimension as well. ADAM with β1 = 0.99 converges the fastest as well as generalizes the best, while NAG does better than ADAM with β1 = 0.9.

(A) Training loss (B) Test loss (C) Gradient norm

FIGURE 5.F.5: Full-batch experiments with input image size 12 12 ×

156 5.F. Additional Experiments

5.F.3 Additional mini-batch experiments on 22 22 sized images × In Figure 5.F.6, we present results on additional neural net architectures on mini-batches of size 100 with an input dimension of 22 22. We see that most of our full-batch results extend to the mini-batch × case.

(A) 1 hidden layer; 1000 nodes (B) 3 hidden layers; 1000 nodes each (C) 9 hidden layers; 1000 nodes each

(D) 1 hidden layer; 1000 nodes (E) 3 hidden layers; 1000 nodes each (F) 9 hidden layers; 1000 nodes each

(G) 1 hidden layer; 1000 nodes (H) 3 hidden layers; 1000 nodes each (I) 9 hidden layers; 1000 nodes each

FIGURE 5.F.6: Experiments on various networks with mini-batch size 100 on full MNIST dataset with input image size 22 22. First row shows the loss on the full × training set, middle row shows the loss on the test set, and bottom row shows the norm of the gradient on the training set.

157 Chapter 6. PAC-Bayesian Risk Bounds for Neural Nets

Chapter 6

PAC-Bayesian Risk Bounds for Neural Nets

6.1 Introduction

At the end of the thesis we finally arrive to explore what is possibly the deepest and the hardest question about neural nets and that is to understand their risk function. A long standing open ques- tion in deep-learning is to be able to theoretically explain as to when and why do neural nets which are massively over-parameterized happen to also minimize the risk even when they fit the training data arbitrarily accurately. Attempts to explain this have led to obtaining of risk bounds which do not scale with the number of parameters being trained on, (Bartlett, 1998; Golowich, Rakhlin, and

Shamir, 2018; Harvey, Liaw, and Mehrabian, 2017). Recently it has been increasingly realized that good risk bounds can be obtained by making the bounds more sensitive to the training algorithm as well as the training data,(Arora et al., 2019a).

The range of available methods to bound risk or generalization error have been beautifully reviewed in Audibert and Bousquet, 2007. Here the authors have grouped the techniques into primarily four categories, (1) “Supremum Bounds” (like generic chaining, Dudley integral, Rademacher complex- ity), (2) “‘Variance Localized Bounds”, (3) “Data-Dependent Bounds” and (4) “Algorithm Depen- dent Complexity”. The last category includes PAC-Bayes bounds which have resurfaced as a promi- nent candidate for a framework to understand risk of neural nets.

Over the years PAC-Bayesian risk bounds have been formulated in many different forms,

(Hinton and Van Camp, 1993; McAllester, 1999; Langford and Seeger, 2001; McAllester, 2003). In the last couple of years, works like Dziugaite and Roy, 2017, Dziugaite and Roy, 2018a, Dziugaite and

Roy, 2018b and Zhou et al., 2018b have shown the power of the PAC-Bayesian form of analysis of risk of neural nets. To the best of our knowledge “computational” bounds as demonstrated in the above reference are the first examples of non-vacuous/non-trivial upperbounds for risks of neural netsof any kind.

158 6.1. Introduction

The above bounds are “computational” in the sense that they are obtained as outputs of an algorith- mic search over hyperparameters on which the posterior distribution used in the bound depends on.

These experiments strongly motivate the current work to search for stronger theoretical basis towards explaining the power of PAC-Bayesian risk bounds in explaining the generalization ability of neural nets. We make progress by identifying certain geometrical properties of the process of training nets which can be leveraged into getting better risk bounds.

6.1.1 A summary of our contributions

In works like Nagarajan and Kolter, 2019b, Nagarajan and Kolter, 2019a, it has been previously un- derstood that risk bound on nets get better if they can appropriately utilize the information about the distance of the trained net from initialization. In this work we take a more careful look at this idea.

We can decompose the distance from initialization into two independent quantities (a) a non-compact part, the change in the norm of the vector of weights of the net (i.e the sum of Frobenius norms of the layer matrices for a net without bias weights) and (b) a compact part, the angular deflection of this vector of weights from initialization to the end of training. Previous PAC-Bayes bounds have used data dependent priors to track the geometric mean of the spectral norms of the layer matrices (de- noted as β) and have thus tracked the first parameter above. In this work we propose a mechanism of choosing priors in a data-dependent way from a two-indexed finite set, tracking both the quantities specified above. The compact angle is tracked by the index called λ in our two-indexed set of priors as specified in Definition 26.

Our second key innovation is that we show how, in the PAC-Bayesian framework, one can leverage more out of the angle parameter by simultaneously training a cluster of nets. In our risk bound in Theorem 6.3.1 (the main theorem) we imagine starting from a net $f_{\mathbf{B}}$ to get to the trained net $f_{\mathbf{A}}$ - the bold-faced letter in the subscript of $f$ will denote the (very high dimensional) vector of weights of the net.

But alongside training $f_B$ to obtain $f_A$, we also obtain a set $\{f_{A_i}\}_{i=1,\dots,k_1}$ of trained nets - of the same architecture as $f_A$ and obtained using the same data set and the same algorithm as was used to obtain $f_A$. This cluster of $k_1$ nets is obtained by training starting from multiple instances of weights initialized at different weight vectors $\{B_{\lambda^*,j}\}_{j=1,\dots,k_1}$, of the same norm as $B$ but within a cone around $B$ whose half-angle $\lambda^*$ is determined in a data-dependent way. The angle index $\lambda$ of the set of priors, that we introduced previously, covers this choice of the conical half-angle.

Because of this use of clusters, compared to previous bounds our dependency on the distance from initialization is also more intricate. Our risk bound as derived in Theorem 6.3.1 can be seen to be scaling with an effective notion of distance between any one of the $A_i$ and the initial cluster of weights $\{B_{\lambda^*,j}\}_{j=1,\dots,k_1}$ around $B$. The bound has the flexibility that it allows us to choose the $A_i$ which has the smallest value of this effective distance and thus we are able to be more sensitive to the average behaviour. That is, for $h$ being the width of the depth-$d$ nets being used, if the $i$th net of the final cluster $\{A_j\}_{j=1,\dots,k_1}$ is closest to the initial cluster $\{B_{\lambda^*,j}\}_{j=1,\dots,k_1}$, then a crude "order" estimate of the risk bound on the stochastic net centered at $f_A$ that is given by Theorem 6.3.1 can be written as,

$$\mathcal{O}\left(\sqrt{\frac{h\log\left(\frac{2dh}{\delta}\right)}{\text{training set size}}}\;\times\;\frac{d\,B\,\prod_{\ell=1}^{d}\lVert A_{i,\ell}\rVert_2}{\gamma}\;\times\;\big(\text{inter-cluster distance between the }B_{\lambda^*,j}\text{'s and the }A_j\text{'s}\big)\right)$$

In the above, $B$ is a bound on the norm of the input vectors at training, the $\ell$th layer matrix corresponding to $A_i$ is denoted as $A_{i,\ell}$, $\gamma$ is the margin value at which the margin-loss is being evaluated and the failure probability for the bound to hold is $\mathcal{O}(\delta)$. The exact formula for the bound given in Theorem 6.3.1 (the main theorem) makes it explicit that we have effectively built into the theory more data-dependent (and hence tunable) parameters which help us improve over existing bounds in multiple conceptual ways.

We would like to emphasize that our ability to exploit the cluster construction hinges crucially on being able to prove novel data-dependent noise resilience theorems for neural nets, as given in Theorem 6.2.1. This theorem is potentially of independent interest and forms the technical core of our theoretical contribution. To the best of our knowledge this is the first such construction of a multi-parameter family of noise distributions on the weights of a neural net with guarantees of stability. In other words, if the net's weights are sampled from any of the noise distributions constructed in the theorem then w.h.p the output on the given data-set is guaranteed to not deviate too much from the given neural function on that architecture.

Summary of the experimental evidence in favour of our bounds We choose to compare our results against the bounds from Neyshabur et al., 2017, which we have in turn restated in subsection 6.1.2 with more accurate tracking of the various parameters therein. There are two ways to see why they are the appropriate baseline for comparison. (a) Firstly, since our primary goal is to advance mathematical techniques to get better PAC-Bayesian risk bounds on nets, we want to compare to other theoretical bounds in the same framework. The results from Neyshabur et al., 2017 are known to be the current state-of-the-art PAC-Bayesian risk bounds on nets. (b) Secondly, to the best of our knowledge, for the range of depths of neural architectures that we are experimenting on, the bounds in Neyshabur et al., 2017 are the state-of-the-art among all risk bounds (PAC-Bayesian or otherwise) on nets, as has been alluded to in Nagarajan and Kolter, 2019a.

Further, when two different theoretical expressions for risk bounds on nets depend on different sets of parameters of the neural net and its training process, we have to rely on empirical comparison. We recall that in a similar situation this was also the adopted method of comparison to baselines for the state-of-the-art compression-based risk bounds in Arora et al., 2018.

In the experiments in Section 6.4 we will show multiple instances of nets trained over synthetic data and CIFAR-10 where we supersede the existing state-of-the-art theoretical PAC-Bayesian bounds of Neyshabur et al., 2017. On these two datasets we probe very different regimes of neural net training in terms of the typical values of the angular deflection seen when obtaining the trained weights $A$ from the initial weights $B$. In both these situations our bounds are lower than those from Neyshabur et al., 2017 as we increase both the depth and the width. Since we plot the bounds on a log scale for comparison, we can conclude from the figures that not only do we have lower/tighter upper bounds, we indeed also have better "rates" of dependencies on the architectural parameters.

In the experiments we also demonstrate different properties of the path travelled by the neural net's weight vector during training, which we report in Section 6.5. For instance, we observe that the $\ell_2$-norm of the weight vector of the net increases during training by a multiplicative factor which varies very little across all the experiments. The factor is between 1 and 3 and is fairly stable over one order of magnitude of change in depths.

6.1.2 Reformulation of the PAC-Bayesian risk bound on neural nets from Neyshabur et al., 2017

We start from Theorem 31.1 in Shalev-Shwartz and Ben-David, 2014 on PAC-Bayesian risk bounds which we re-state below as Theorem 6.1.1.

Definition 21. Let $\mathcal{H}$ be a hypothesis class, let $h \in \mathcal{H}$, let $D$ be a distribution on an instance space $Z$, let $\ell : \mathcal{H}\times Z \to [0,1]$ be a loss function and let $S$ be a finite subset of $Z$. Let $m$ denote the size of $S$, i.e. the training-set size. $z\sim D$ denotes sampling $z$ from $D$ and, with slight abuse of notation, $z\sim S$ denotes sampling $z$ from a uniform distribution over $S$ whenever $S$ is a finite set. Further we define the expected and empirical risks for $h$ as

$$L(h) := \mathbb{E}_{z\sim D}[\ell(h,z)] \quad\text{and}\quad \hat L(h) := \mathbb{E}_{z\sim S}[\ell(h,z)]$$


Theorem 6.1.1 (PAC-Bayesian Bound On Risk). Consider being given $D$, $\mathcal{H}$ and $\ell$ as defined above. Let $\mathcal{H}$ also be equipped with the structure of a probability space. Let $P, Q$ be two distributions over $\mathcal{H}$, called the "prior" and the "posterior" distribution respectively. Let $S\sim D^m(Z)$; then for every $\delta\in(0,1)$ we have the following guarantee:

$$\mathbb{P}_S\left[\forall Q\quad \mathbb{E}_{h\sim Q}[L(h)] \le \mathbb{E}_{h\sim Q}[\hat L(h)] + \sqrt{\frac{\mathrm{KL}(Q\|P)+\log\frac m\delta}{2(m-1)}}\right] \ge 1-\delta$$

The above theorem shows that given a finite sample $S$ from $Z$, we can choose the posterior distribution $Q$ as a function of $S$ and the above bound on the generalization error is still guaranteed to hold w.h.p. The mechanism of choosing this $Q$ in such a data-dependent way is made critical by the trade-off between keeping the expected empirical risk of $Q$ low and the KL divergence between $P$ and $Q$ low. The above theorem applies to a wide class of loss functions, but for getting risk bounds specific to neural nets, henceforth we will focus on the following setup of classification loss.
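As a numeric illustration of the trade-off just described, the following sketch (all values are made up purely for illustration) evaluates the right hand side of Theorem 6.1.1 for two candidate posteriors:

```python
import math

def pac_bayes_rhs(empirical_risk, kl_q_p, m, delta):
    """Right hand side of Theorem 6.1.1: empirical risk of the posterior Q
    plus the KL-dependent complexity term, for a training set of size m."""
    return empirical_risk + math.sqrt((kl_q_p + math.log(m / delta)) / (2 * (m - 1)))

# A posterior tightly concentrated near the trained weights has low empirical
# risk but typically a large KL to a data-independent prior; widening the
# posterior trades a little empirical risk for a much smaller KL term.
print(pac_bayes_rhs(empirical_risk=0.05, kl_q_p=500.0, m=50000, delta=0.05))
print(pac_bayes_rhs(empirical_risk=0.15, kl_q_p=50.0, m=50000, delta=0.05))
```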

Definition 22 (Margin Risk of a Multiclass Classifier). Define $\chi \subseteq \mathbb{R}^n$ to be the input space. Let $k\ge2$ be the number of classes and set $Z = \chi\times\{1,\dots,k\}$ for the rest of the chapter. Let $f_w : \chi\to\mathbb{R}^k$ be a $k$-class classifier parameterized by the weight vector $w$ and by "$f(x)[y]$" we shall mean the $y$th coordinate of the output of $f$ when evaluated on $x$. Let $\gamma > 0$ be the "margin" parameter; then the "$\gamma$-Margin Risk" of $f$ is

$$L_\gamma(f) := \mathbb{P}_{(x,y)\sim D}\left[f(x)[y] \le \gamma + \max_{i\ne y} f(x)[i]\right].$$

Analogously $\hat L_\gamma(f)$ denotes the "$\gamma$-Empirical Margin Risk" of $f$ computed on a finite sample $S\subset Z$. We will use $m := |S|$.
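The empirical margin risk above is a direct transcription into code; a sketch (assuming `outputs` is an $m\times k$ NumPy float array of classifier outputs on the sample $S$):

```python
import numpy as np

def empirical_margin_risk(outputs, labels, gamma):
    """Hat-L_gamma(f): fraction of samples where the true-class output does
    not beat the best wrong-class output by more than gamma."""
    m, k = outputs.shape
    true_class = outputs[np.arange(m), labels]
    masked = outputs.copy()
    masked[np.arange(m), labels] = -np.inf   # exclude the true class
    best_other = masked.max(axis=1)          # max_{i != y} f(x)[i]
    return float(np.mean(true_class <= gamma + best_other))
```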

Given this we can now present Theorem 6.1.2 below, which is a slight variant of Lemma 1 of Neyshabur et al., 2017; for completeness we give its proof in Appendix 6.C.

Theorem 6.1.2 (A special case of the PAC-Bayesian bounds for the margin loss). We follow the notation from Definition 22. Let $\{f_w \mid w\in W\}$ denote a hypothesis class of $k$-class classifiers, where $W$ is a space of parameters. Let $P$ be any distribution (the "data-independent prior") on the space $W$. Then for any $\gamma > 0$, $w\in W$, and finite sample $S\subset Z$, define the family $\mathcal{D}_{\gamma,w,S}$ of distributions on $W$ such that for any $\mu\in\mathcal{D}_{\gamma,w,S}$ we have,

$$\mathbb{P}_{w'\sim\mu}\left[\max_{(x,y)\in S}\|f_{w'}(x) - f_w(x)\|_\infty < \frac\gamma4\right] \ge \frac12$$

Then for any $\gamma > 0$ and $\delta\in[0,1]$ the following holds,

$$\mathbb{P}_S\left[\forall w \text{ and } \mu\in\mathcal{D}_{\gamma,w,S}\;\;\exists\text{ distribution }\tilde\mu\text{ on }W\text{ s.t. }\;\mathbb{E}_{\tilde w\sim\tilde\mu}[L_0(f_{\tilde w})] \le \hat L_{\frac\gamma2}(f_w) + \sqrt{\frac{\mathrm{KL}(\mu\|P)+\log\frac{3m}{\delta}}{m-1}}\right] \ge 1-\delta \quad (6.1)$$

The proof of the above in Appendix 6.C gives an explicit expression for the distribution $\tilde\mu$ in terms of $\mu$ and $w$. As is usual in practice, the PAC-Bayesian bound in Theorem 6.1.2 will typically be used on a predictor whose weights have been obtained after training on a data-set $S$. Now we give a restatement of the risk bound on neural nets that was presented in Neyshabur et al., 2017, and towards that we need the following definition,

Definition 23 (Multiclass Neural-Network classifier). Define $f_A : \mathbb{R}^n\to\mathbb{R}^k$ to be a depth-$d$ neural network with maximum width $h$, whose $\ell$th layer has weight matrix $A_\ell$ (this network does not have any bias weights). The first $d-1$ layers of $f_A$ use the ReLU non-linear activation (in general any 1-Lipschitz activation will do). $A := [\mathrm{vec}(A_1); \dots; \mathrm{vec}(A_d)]$ is the vector of parameters formed by concatenating the vectorized layer matrices and each coordinate of $A$ is a distinct trainable weight in the net. Let $\mathcal{N}(A,\sigma^2)$ denote the isotropic multivariate Gaussian probability distribution with mean $A$ and variance $\sigma^2I$. Define $\beta(A) = \left(\prod_{\ell=1}^d\|A_\ell\|_2\right)^{1/d}$. We will omit the argument $A$ whenever the neural network under consideration is clear from the context. Clearly $\beta^d$ upper bounds the Lipschitz constant of $f_A$.
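The quantity $\beta(A)$ is straightforward to compute from the layer matrices; a minimal sketch (layer matrices as NumPy arrays):

```python
import numpy as np

def beta(layer_matrices):
    """beta(A) = (prod_l ||A_l||_2)^(1/d): the geometric mean of the
    spectral norms of the d layer matrices."""
    d = len(layer_matrices)
    spectral_norms = [np.linalg.norm(A, ord=2) for A in layer_matrices]
    return float(np.prod(spectral_norms)) ** (1.0 / d)

# beta(A)**d then upper bounds the Lipschitz constant of f_A,
# since ReLU is 1-Lipschitz and the net has no bias weights.
```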

Using the definitions above, we now present Theorem 6.1.3 where we give a reformulation of the "spectrally-normalized margin bound" originally given in Neyshabur et al., 2017.

Theorem 6.1.3 (Spectrally-Normalized Margin bound). Let $S$ and $m$ be as defined in Definition 22 and $B$ be a bound on the norm of the input space $\chi$ from Definition 22. Let $f_w$ be a given neural net with parameters as above and let its layer matrices be $\{W_\ell\}_{\ell=1}^d$. Construct a grid $\mathcal{B}$, called the "beta-grid", containing $K = \frac d2\times\left(\frac{\sqrt{m-1}}{2\exp(3-2/(d-1))}\right)^{1/d}$ uniformly spaced points covering the interval

$$\left[\left(\frac{\gamma}{2B}\right)^{1/d},\;\left(\frac{\sqrt{m-1}\,\gamma}{4\exp(3-2/(d-1))B}\right)^{1/d}\right].$$

If $\tilde\beta = \arg\min_{x\in\mathcal{B}}|x-\beta(w)|$ and

$$\sigma(\tilde\beta) := \frac{1}{\sqrt{2h\log(4dh)}}\min\left\{\frac{\gamma}{4e^2Bd\tilde\beta^{d-1}},\;\frac{\tilde\beta}{de^{\frac1{d-1}}}\right\} \quad (6.2)$$

then we have the following guarantee for all $\delta\in\left(0,\frac1K\right)$,

$$\mathbb{P}_{S\sim D^m}\left[\exists\,\tilde\mu_w \text{ s.t }\; \mathbb{E}_{w+\tilde u\sim\tilde\mu_w}[L_0(f_{w+\tilde u})] \le \hat L_{\frac\gamma2}(f_w) + \sqrt{\frac{\sum_{\ell=1}^d\frac{\|W_\ell\|_F^2}{2\sigma(\tilde\beta)^2} + \log\frac{3m}{\delta}}{m-1}}\right] \ge 1-K\delta \quad (6.3)$$

For completeness we have re-derived the above in Appendix 6.D.

Remark. Theorem 6.1.3 above slightly differs from the original statement of Theorem 1 in Neyshabur et al., 2017 because of the following adaptations and improvements that we have made: (a) we tracked the various constants more carefully, (b) we have removed some of their assumptions and have chosen to report the bound as being on a stochastic neural risk, as is most natural in this context, and (c) we used a more refined way to account for the data-dependent priors.
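Putting the pieces of Theorem 6.1.3 together, here is a hedged numerical sketch of the right hand side of equation (6.3) for a given trained net (for simplicity we take $\tilde\beta\approx\beta$, which the beta-grid guarantees up to a factor of $\beta/d$; we assume $d\ge2$ and all inputs are illustrative):

```python
import math
import numpy as np

def sigma_tilde(beta_t, gamma, B, d, h):
    # Equation (6.2): the prior/posterior width used in the NBS bound (d >= 2).
    return (1.0 / math.sqrt(2 * h * math.log(4 * d * h))) * min(
        gamma / (4 * math.e**2 * B * d * beta_t**(d - 1)),
        beta_t / (d * math.e**(1.0 / (d - 1))),
    )

def nbs_bound_rhs(layers, emp_margin_risk, gamma, B, m, delta):
    """RHS of equation (6.3), up to the beta-grid discretization."""
    d = len(layers)
    h = max(max(W.shape) for W in layers)
    beta_t = float(np.prod([np.linalg.norm(W, 2) for W in layers])) ** (1.0 / d)
    s = sigma_tilde(beta_t, gamma, B, d, h)
    kl = sum(np.linalg.norm(W, "fro") ** 2 for W in layers) / (2 * s**2)
    return emp_margin_risk + math.sqrt((kl + math.log(3 * m / delta)) / (m - 1))
```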

6.2 A noise resilience guarantee for a certain class of neural nets

Definition 24 (Mixture Parameters). Let $k_1\ge2$ denote the number of components in a "mixture" distribution. Let $\mathcal{A} = \{A_i\in\mathbb{R}^{\dim(A)} \mid i = 1,\dots,k_1\}$ denote a set of neural net weight vectors on the underlying architecture of $f_A$. Let $\mathcal{P} = \{p_i \mid i=1,\dots,k_1\}$ be a set of non-zero scalars. The mixture weights satisfy $\sum_i p_i = 1$. Define $\beta_i := \beta(A_i)$ as in Definition 23. Further define $A_{i,\ell}$ to be the $\ell$th layer of $A_i$.

Using the above setup we can state our main technical result as follows,

Theorem 6.2.1 (Controlled output perturbation with noisy weights from a mixture of Gaussians). Given $S$ and $\chi$ as in Definition 22, let $B > 0$ be s.t the input space $\chi$ is a subset of the ball of radius $B$ around the origin in $\mathbb{R}^n$. Further let $f_A$ and $\beta$ be as in Definition 23 and $\mathcal{A}$, $\mathcal{P}$ and $\{\beta_i\}_{i=1,\dots,k_1}$ as in Definition 24. We choose any $\epsilon > 0$ s.t the following inequalities hold,

$$\forall i\in\{1,\dots,k_1\},\;\forall x\in S\quad \|f_{A_i}(x) - f_A(x)\| \le \epsilon\|f_A(x)\|$$


Then for every $\gamma > \epsilon\max_{x\in S}\|f_A(x)\|$ and $\delta\in(0,1)$, we have,

$$\mathbb{P}_{A'\sim MG(\text{posterior})}\left[\max_{x\in S}\|f_{A'}(x) - f_A(x)\| > 2\gamma\right] \le \delta$$

where $MG(\text{posterior})(w) = \sum_i p_i\,\mathcal{N}_{(A_i,\sigma^2)}(w)$ and

$$\sigma \le \frac{1}{\sqrt{2h\log\left(\frac{2dhk_1}{\delta}\right)}}\min_{1\le i\le k_1}\min\left\{\frac{\beta_i}{d},\;\frac{\gamma}{k_1edBp_i\beta_i^{d-1}}\right\}.$$

The above theorem has been proven in Section 6.6.
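Operationally, the theorem controls the worst output deviation over $S$ when weights are drawn from the mixture posterior. The following sketch (with an illustrative `forward` for a bias-free ReLU net; names are not part of the formal development) shows the two-step sampling and the quantity being bounded:

```python
import numpy as np

def forward(layers, x):
    # A bias-free ReLU net as in Definition 23: ReLU on the first d-1 layers.
    for W in layers[:-1]:
        x = np.maximum(W @ x, 0.0)
    return layers[-1] @ x

def sample_from_mixture_posterior(centers, probs, sigma, rng):
    """Two-step sampling from MG(posterior): pick a center A_i with
    probability p_i, then add isotropic Gaussian noise of std sigma."""
    i = rng.choice(len(centers), p=probs)
    return [W + sigma * rng.standard_normal(W.shape) for W in centers[i]]

def max_output_deviation(center_layers, noisy_layers, S):
    # The quantity max_{x in S} ||f_{A'}(x) - f_A(x)|| controlled by the theorem.
    return max(np.linalg.norm(forward(noisy_layers, x) - forward(center_layers, x))
               for x in S)
```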

6.3 Our PAC-Bayesian risk bound on neural nets

Now we use Theorem 6.2.1 about controlled perturbation of nets to write the following theorem, which is adapted to the setting of the experiments to be described in Section 6.4. Towards that we define the following notion of a "nice" training dataset, which captures the effect that a set of nets evaluates to almost the same output on some given dataset,

Definition 25 ("Nice" training dataset). Given neural weights $\{A_i\}_{i=1,\dots,k_1}$ and $A$ as in Definition 24, we call a training dataset $S$ $(\epsilon,\gamma)$-nice w.r.t. them if it satisfies the following conditions:

1. $\|f_{A_i}(x) - f_A(x)\| \le \epsilon\|f_A(x)\|$ for all $x\in S$ and all $1\le i\le k_1$

2. $\gamma > \epsilon\max_{x\in S}\|f_A(x)\|$
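In the experiments we will pick the smallest $\epsilon$ and $\gamma$ satisfying these two conditions; a sketch of that computation (self-contained, with an illustrative bias-free ReLU `relu_forward`):

```python
import numpy as np

def relu_forward(layers, x):
    # Bias-free ReLU net, as in Definition 23.
    for W in layers[:-1]:
        x = np.maximum(W @ x, 0.0)
    return layers[-1] @ x

def minimal_nice_eps_gamma(A_layers, cluster, S):
    """Smallest eps, and the induced lower bound on gamma, making the
    dataset S (eps, gamma)-nice w.r.t f_A and the cluster {f_{A_i}}."""
    # Condition 1 fixes eps: the largest relative output deviation over the cluster.
    eps = max(
        np.linalg.norm(relu_forward(Ai, x) - relu_forward(A_layers, x))
        / np.linalg.norm(relu_forward(A_layers, x))
        for Ai in cluster
        for x in S
    )
    # Condition 2 then lower-bounds gamma (any gamma above this works).
    gamma_lb = eps * max(np.linalg.norm(relu_forward(A_layers, x)) for x in S)
    return eps, gamma_lb
```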

Next we define a two-indexed set of priors which will be critical to our PAC-Bayesian bounds.

Definition 26 (Our 2-indexed set of priors). Let $B > 0$ be s.t the input space $\chi$ in Definition 22 is a subset of the ball of radius $B$ around the origin in $\mathbb{R}^n$. Given $S, m$ as in Definition 22 and $d, h$ as in Definition 23, we choose scalars $d_{\min}, \gamma, \delta > 0$ s.t. the following interval $I$ is non-empty,

$$I := \left[\left(\frac{\gamma}{2B}\right)^{\frac1d},\;\left(\frac{\gamma\sqrt{2(m-1)}}{(8Be^3d\,d_{\min})\sqrt{2h\log\frac{2dh}{\delta}}}\right)^{\frac1{d-1}}\right]$$

Let $f_B$ be a neural network. Consider a finite set of indices $\Lambda = \{1,\dots,314\}$. For each $\lambda\in\Lambda$ we are given $k_1$ distinct neural net weights $\{B_{\lambda,j}\}_{j=1}^{k_1}$ within a conical half-angle of $0.01\lambda$ around $B$. For each $\lambda$ we construct a grid $\mathcal{B}_\lambda$, called the "beta-grid", containing at most

$$K_1 = \frac d2\times\frac{\left(\frac{\gamma\sqrt{2(m-1)}}{(8Be^3d\,d_{\min})\sqrt{2h\log\frac{2dh}{\delta}}}\right)^{\frac1{d-1}}}{\left(\frac{\gamma}{2B}\right)^{\frac1d}}$$

points inside the interval $I$ specified above. Now for each $\lambda\in\Lambda$ and $\tilde\sigma\in\mathcal{B}_\lambda$ we consider the following mixture of Gaussians, $\frac1{k_1}\sum_{j=1}^{k_1}\mathcal{N}_{(B_{\lambda,j},\tilde\sigma^2I)}$. Thus we have a grid of priors of total size $K := 314K_1$.

Remark. (a) Note that this set of mixture-of-Gaussian priors above, indexed by $\lambda$ and $\tilde\sigma$, corresponds to the set of distributions that we call $\{\pi_i\}$ in the general Theorem 6.B.1. (b) The specific choice of the set $\Lambda$ given above is only for concreteness and to keep the setup identical to the experiments in Sections 6.4 and 6.5; this choice is not crucial to the main theorem to be given next.

The choice of the parameters $\Lambda$, $\gamma$, $\epsilon$ and $d_{\min}$ The parameter $B$ gets fixed by the assumption of boundedness of the data space $\chi$, and we choose the training data size $m$. The set $\Lambda$ above is a convenient choice that we make, motivated by experiments, as a way to index a grid on the $\pi$ radians of possible deflection that can happen when the net $f_B$ is trained to some final net (which we have been denoting as $f_A$ as in subsection 6.1.1). Also note that in practice, when presented with the neural nets $f_A$ and $\{f_{A_i}\}_{i=1}^{k_1}$ (which in turn fixes the value of the depth $d$ and the width $h$) and the training data set $S$, we would choose the smallest values of $\gamma$ and $\epsilon$ so that the conditions in Definition 25 are satisfied.

Then we choose $\delta$ (typically $\delta = 0.05$) which determines our confidence parameter $1-\delta$. At this point, all parameters that go into determining the interval $I$ above, except $d_{\min}$, are fixed. Now we can just choose $d_{\min}$ low enough so that the interval $I$ is non-empty. Once $d_{\min}$ is chosen, the value of $K_1$ and hence the size of the prior set also gets fixed.

Now we use the above definitions and the notations therein to state our main theorem as follows,


FIGURE 6.3.1: Starting from the weight vector B we get the trained weight vector A. Θ is the angle to which the angle of deflection ∡(A, B) has been discretized.

Theorem 6.3.1 (Gaussian-Mixture PAC-Bayesian Bound). As indicated in Figure 6.3.1, suppose we train using the dataset $S$ to obtain the trained net $f_A$ from an initial neural net $f_B$. Let $\alpha = \arccos\frac{\langle A,B\rangle}{\|A\|\|B\|}$. Let $\lambda^* = \arg\min_{\lambda\in\Lambda}|0.01\lambda - \alpha|$. Further, let the neural weight vectors $\{A_i\}_{i=1,\dots,k_1}$ be obtained by training the nets $\{f_{B_{\lambda^*,j}}\}_{j=1,\dots,k_1}$ on $S$. Further, for each such $i$ define $d_{i,*} = \min_{j=1,\dots,k_1}\|A_i - B_{\lambda^*,j}\|_2$ and $\tilde\beta_i = \arg\min_{x\in\mathcal{B}_{\lambda^*}}|x - \beta(A_i)|$.

Then it follows that for all $\epsilon > 0$ and $\delta\in\left(0,\frac1K\right)$,

$$\mathbb{P}_S\Bigg[\exists\,\tilde\mu_A \text{ s.t. } \forall i \text{ s.t. } d_{i,*}\ge d_{\min},\;\; \mathbb{E}_{A+\tilde u\sim\tilde\mu_A}[L_0(f_{A+\tilde u})] \le \hat L_{\frac\gamma2}(f_A) + \sqrt{\frac{-\log\left(\frac1{k_1}\sum_{j=1}^{k_1}\exp\left(-\frac1{2\tilde\sigma_i^2}\|A_i - B_{\lambda^*,j}\|^2\right)\right) + \log\frac{3m}{\delta}}{m-1}}\;\Bigg|\;\; S \text{ is } (\epsilon,\gamma)\text{-nice w.r.t } A, \{A_i\}_{i=1,\dots,k_1}\Bigg] \ge 1-K\delta \quad (6.4)$$

where

$$\tilde\sigma_i^2 := \frac{1}{2h\log\left(\frac{2dh}{\delta}\right)}\left(\min\left\{\frac{\tilde\beta_i}{de^{\frac1{d-1}}},\;\frac{\gamma/8}{e^2dB\tilde\beta_i^{d-1}}\right\}\right)^2$$

The proof of the above theorem is given in Section 6.7.
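The complexity term in equation (6.4) is explicitly computable from the two clusters; a numerically stable sketch (weight vectors as flat NumPy arrays; names illustrative):

```python
import numpy as np

def our_bound_complexity(A_i, B_cluster, sigma_tilde_i, m, delta):
    """The square-root term of equation (6.4): a mixture-prior KL upper
    bound (via Theorem 6.A.1) plus the usual log(3m/delta) correction."""
    k1 = len(B_cluster)
    sq_dists = np.array([np.linalg.norm(A_i - B_j) ** 2 for B_j in B_cluster])
    # log-sum-exp for stability of -log((1/k1) * sum_j exp(-d_j^2 / (2 sigma^2)))
    a = -sq_dists / (2 * sigma_tilde_i**2)
    kl_upper = -(np.logaddexp.reduce(a) - np.log(k1))
    return np.sqrt((kl_upper + np.log(3 * m / delta)) / (m - 1))
```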


Remark. We emphasize that the structure of the above theorem is the same as that of the general PAC-Bayes bound as stated in Theorem 6.1.1. The distribution $\tilde\mu_A$ above is a choice of the posterior distribution that is called $Q$ in the general theorem. Hence it is w.r.t this $\tilde\mu_A$ that we are bounding, with high probability, the stochastic risk of the neural net $f_A$ under the loss function $L_0$.

Secondly, the distribution $\tilde\mu_A$ here is explicitly constructed such that for the sampled noisy weights $A+\tilde u$ it is ensured that $\max_{x\in S}\|f_{A+\tilde u}(x) - f_A(x)\|_\infty < \frac\gamma4$. Also we emphasize that in the above, $\lambda^*$ as defined is s.t $0.01\times\lambda^*$ is the closest angle in the set $\{0.01, 0.02, \dots, 3.14\}$ to the angle between the initial net's weight vector $B$ and the trained net's weight vector $A$.

We note the following salient points about the above theorem,

• In our experiments the nets $\{A_i\}_{i=1,\dots,k_1}$ will be obtained by training the nets $\{f_{B_{\lambda^*,j}}\}_{j=1,\dots,k_1}$ on the same training data $S$ using the same method by which we obtain $f_A$ from $f_B$.

• We emphasize that the above setup is not tied to any specific method of obtaining the initial and final clusters of nets. For example, there are many successful heuristics known for "compressing" a neural net while approximately preserving the function. One can envisage using the above theorem when such a heuristic is used to get such clusters $\{B_{\lambda^*,j}\}$ from $B$ and $\{A_i\}$ from $A$.

• For each of the different choices amongst $\{A_i\}_{i=1}^{k_1}$ which satisfy the condition $d_{i,*}\ge d_{\min}$ we can get a different upper bound on the risk in the LHS. Hence this theorem gives us the flexibility to choose amongst $\{A_i\}_{i=1}^{k_1}$ those that give the best bound.

• We note that after a training set $S$ has been sampled, we require no niceness condition on the nets which would have to hold over the entire domain of the nets. The upper bound, as well as the confidence on the upper bound, are entirely determined by the chosen data-set $S$.

• Corresponding to the experiments in the next section, we will observe in Section 6.5 that the angular deflection $\alpha$ above is predominantly determined by the data-distribution from which $S$ is being sampled, and at a fixed width it decreases slightly with increasing depth.

The improvements seen in our approach to PAC-Bayesian risk bounds strongly suggest that the consistent patterns of dilation of the weight matrix norms (also reported in Section 6.5) and the angular deflection of the net's weight vector merit further investigation.


6.4 Experimental comparison of our bounds with Neyshabur et al., 2017

Here we present empirical comparisons between OUR PAC-Bayesian risk bound on nets in Theorem 6.3.1 and the result in Theorem 6.1.3 reproduced from Neyshabur et al., 2017, which we denote as the NBS bound.

We compute the two bounds for neural nets without bias weights (as needed by the two theorems stated above). We posit that the fair way to do the comparison is to only choose the initial net, the data-set and the training algorithm (including a stopping criterion) and to compute the different theories on whatever net is obtained at the end of training. We test the theories on the following different classification tasks: (a) binary classification tasks on parametrically varied kinds of synthetic datasets which have two linearly separable clusters and (b) multi-class classification of the CIFAR-10 dataset. In both cases we study the effects of varying the net's width and depth.

It is to be noted that all the following experiments have also been done over varying sizes of the training data set, and the advantage displayed here of OUR bound over NBS's result is robust to this change.

6.4.1 CIFAR-10 Experiments

Here the nets we train are of depths 2, 3, . . . , 8 and 16 and we vary the number of ReLU gates in a hidden layer (h) between 100 and 200. We train the networks to a test-accuracy of approximately 50% which is close to the best known performance of feed-forward networks on the CIFAR-10 dataset.

The neural networks are initialized using the "Glorot Uniform" initialization and we use the ADAM weight update rule on the cross-entropy training loss. In each epoch we use a mini-batch size of 300 and we set $k_1 = 25$ ($k_1$ as defined in Theorem 6.3.1).
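For concreteness, a minimal sketch of this training setup in PyTorch (hyperparameters as just described; the data pipeline and the stopping criterion at roughly 50% test accuracy are elided, and all names are illustrative):

```python
import torch
import torch.nn as nn

def make_net(depth, width, in_dim=3 * 32 * 32, classes=10):
    # Bias-free fully-connected ReLU net with "Glorot Uniform" initialization.
    dims = [in_dim] + [width] * (depth - 1) + [classes]
    layers = []
    for i in range(depth):
        lin = nn.Linear(dims[i], dims[i + 1], bias=False)
        nn.init.xavier_uniform_(lin.weight)  # Glorot Uniform
        layers.append(lin)
        if i < depth - 1:
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)

net = make_net(depth=8, width=100)
opt = torch.optim.Adam(net.parameters())
loss_fn = nn.CrossEntropyLoss()
# for x, y in loader:                      # mini-batches of size 300
#     opt.zero_grad()
#     loss_fn(net(x.flatten(1)), y).backward()
#     opt.step()
```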

Results We test both the theories, OUR bound in equation (6.4) and the NBS bound in equation (6.3), at 95% confidence, i.e. at $K\delta = 0.05$ in the above referenced equations.

Having trained the initial cluster as needed in Theorem 6.3.1, we choose the smallest $\epsilon$ and $\gamma$ that satisfy the "niceness" condition in Definition 25. In experiments we see that often (not always) this minimum $\gamma$ needed to satisfy the condition increases with the depth of the net. At any fixed architecture and dataset we evaluate both the theories at the same value of $\gamma$, chosen as said above. We repeat the experiment with 10 different random seeds (which changes the data-set, the initial cluster choices and the mini-batch sequence in the training).


In Figure 6.4.1 we see examples of how OUR bounds do better than NBS. We plot OUR bound for the $i$th point in the final cluster (as defined in equation (6.4)) that achieves the lowest bound. (We always start from taking a very small value as the choice of $d_{\min}$, as required in Definition 26, s.t in experiments the distance between the clusters was always bigger than that.) Note the log-scale on the $y$-axis in this figure; hence the relative advantage of our bound is a significant multiplicative factor which is increasing with depth. And at large widths our bound seems to essentially flatten out in its depth dependence.

FIGURE 6.4.1: Risk bounds (on the $y$-axis, log scale) predicted by Theorem 6.3.1 and Theorem 6.1.3 for trained nets at different depths (the $x$-axis). We can see the comparative advantage across depths in favour of OUR bound over NBS when tested on CIFAR-10, with the width of the net being 100 for the figure on the left and 400 for the figure on the right.

6.4.2 Synthetic Data Experiments

In this section we show a comparison between OUR and the NBS bounds on a synthetic dataset which allows for probing the theories in a very different regime of parameters than CIFAR-10. Here the classification accuracies of the nets are near perfect, and the margin parameters and the angular deflection of the net during training are significantly lower.

Dataset We randomly sample $m = 1000$ points in $\mathbb{R}^n$ from two different isotropic variance-1 normal distributions centered at $\mathbf{1} = (1,1,\dots,1)$ and $a\mathbf{1}$, for $n = 20$ and $40$ and for $a\in\{2,4,6,8,10\}$. We reject a sample $x$ if $\min(\|x-\mathbf{1}\|_\infty, \|x-a\mathbf{1}\|_\infty) > 1$. Thus the inter-cluster distance varies with $a$. When $n = 20$ we set $B = 50$ and otherwise $B = 100$.

Architecture and training We train fully-connected feed-forward nets of depths 2, 3, . . . , 8 on the cross-entropy loss function. Each hidden layer has 800 ReLU gates. As before, we initialize the neural network layers using the "Glorot Uniform" initialization and train using ADAM (mini-batch size 100). Networks with depth $d < 5$ required 5 epochs and $d\ge5$ required 8 epochs for training to 100% train and test accuracy. Our risk bounds are computed using a cluster size of $k_1 = 25$ as before.


Results The parameters K and δ are set so that the confidence on the bounds is at 95%. For each network depth we compute our bound 10 times using 10 different random seeds. Each trial achieves approximately 100% test accuracy and we plot the bound for the seed which achieves the minimum value of γ. For each network depth we label the value of γ used to compute the bounds. The same value of γ is used for computing both OUR and NBS bounds. We compute OUR bound for the ith cluster point that achieves the lowest bound.

In Figure 6.4.2 we compare OUR bound in Theorem 6.3.1 to the NBS bound in Theorem 6.1.3 at two of the many parameter configurations of the above model where we have tested the theories. Again we find that our bound is consistently lower by a multiplicative factor than the previous PAC-Bayes bound.

FIGURE 6.4.2: Risk bounds (on the $y$-axis) predicted by Theorem 6.3.1 and Theorem 6.1.3 for trained nets at different depths (the $x$-axis). In particular, here we compare the two theories when the synthetic data generation model samples in $n = 20$ dimensions, with the cluster separation parameter $a = 6$ in the left figure and $a = 10$ in the right figure.

6.5 Experimental observations about the geometry of the neural net's training path in weight space

Our approach to PAC-Bayesian risk bounds for neural nets motivates us to keep track of how the norm of the neural net's weight vector changes during training and by how much angle this vector deflects. Here we record our observations about the interesting patterns that these two parameters were observed to have in the two kinds of experiments done in the previous section. The structured behaviours seen here are potentially strong guidelines for directions of future theoretical development.


Data from the experiments on CIFAR-10 in subsection 6.4.1

• We show in Figure 6.5.1 the Gaussian kernel density estimate (Scott, 2015) of the angular deviation ($\alpha$ in Theorem 6.3.1) over the 10 trials, at every depth, at a width of 100. We can see that the angular deflection is fairly stable to architectural changes and is only slightly decreasing with depth. A similar pattern is also observed at higher widths.

FIGURE 6.5.1: Gaussian kernel density estimates over 10 trials of the experiment (for every depth $d$, at width 100) measuring the angular deviation of the weight vector under training.

• In Figure 6.5.2 we show the initial parameter norm vs the final parameter norm for training nets of different depths at width 100 on the CIFAR dataset. This demonstrates that the multiplicative factor by which the norm increases during training is also fairly stable to architectural changes.

FIGURE 6.5.2: Initial parameter norm $\|B\|_2$ vs final parameter norm $\|A\|_2$ with increasing depths (at width 100) on the CIFAR-10 dataset. 10 trials are displayed for each architecture.

Data from the experiments on synthetic data in subsection 6.4.2

• In Figure 6.F.1 in Appendix 6.F we show the KDE of the angular deviation $\theta$ between the initial and final networks for different depths $d$ and cluster separations $a$. The KDE was obtained from $\theta$ values obtained through 10 trials. Here we observe that the mean value of the angular deflection due to training is only slightly affected by the architectural choices or the cluster separation parameter. The variance of the distribution of the angle increases with increasing $d$ and $a$, and the mean value tends to slightly decrease with increasing depths. (Note that the mean angular deflection for CIFAR-10 that we saw in the previous experiment was significantly larger than here.)

• Lastly, in Figure 6.5.3 we show the variation between the initial and the final norms for the inter-cluster separation of $a = 2$. We observe a consistent (and surprising) behaviour that training seems to dilate the sum of Frobenius norms of the net, and the dilation factor is close to 1 at depth 2 and increases to about 2.5 for about an order of magnitude of increase in depths. This behaviour is fairly stable across the different values of $a$ that we tried, and recall that this same phenomenon was also demonstrated on CIFAR-10.

FIGURE 6.5.3: Scatter plot of $\|A\|_2$ versus $\|B\|_2$ for neural networks of depths 2, 3, . . . , 8, for cluster separation $a = 2$ in ambient dimension $n = 20$. For each depth we run 10 trials with different random initializations and mini-batch sequences in the SGD.

6.6 Proof of Theorem 6.2.1

Towards proving the main Theorem 6.2.1 we need the following definition and lemma,

Definition 27 (Random Neural Network). Let us denote by $f_{\mathcal{N}(A_i,\sigma^2)}$ the random neural net function obtained by sampling its weights from the isotropic Gaussian distribution with p.d.f $\mathcal{N}_{(A_i,\sigma^2)}$.

Lemma 6.6.1 (Controlled output perturbation of Gaussian weights). Let us be given a set of neural net weight vectors (for a fixed architecture) $\mathcal{A} = \{A_i\}_{i=1,\dots,k_1}$ s.t $\|A_{i,\ell}\| = \beta_i$ for all $i\in\{1,\dots,k_1\}$ and $\ell\in\{1,\dots,d\}$. If,

$$\sigma \le \frac{1}{\sqrt{2h\log\frac{2dhk_1}{\delta}}}\min_{i\in\{1,\dots,k_1\}}\min\left\{\frac{\beta_i}{d},\;\frac{\gamma}{k_1edBp_i\beta_i^{d-1}}\right\} \quad (6.5)$$

Then,

$$\mathbb{P}\left(\sum_{i=1}^{k_1}p_i\|f_{\mathcal{N}(A_i,\sigma^2I)} - f_{A_i}\| > \gamma\right) < \delta \quad (6.6)$$

Proof of Lemma 6.6.1. Let $\bar U = \{U_{i,\ell} \mid i = 1,\dots,k_1,\; \ell = 1,\dots,d\}$ be a set of size $d\cdot k_1$ containing $h$-dimensional random matrices, such that each matrix has entries $U_{i,\ell}\sim\mathcal{N}(0,\sigma^2)$.

We can define matrices $\{B_p\in\mathbb{R}^{h\times h} \mid p = 1,\dots,h^2\}$ s.t each $B_p$ has $\sigma$ in a unique entry and all other entries 0. Then it follows that, as random matrices, $U_{i,\ell} = \sum_{p=1}^{h^2}\gamma_pB_p$ with $\gamma_p\sim\mathcal{N}(0,1)$. We note that $\|\sum_{p=1}^{h^2}B_pB_p^\top\| = h\sigma^2$, since $h$ is the largest eigenvalue of an all-ones $h$-dimensional square matrix. Now we invoke Corollary 4.2 of Tropp, 2012 to get, for any $t > 0$,

$$\mathbb{P}_{U_{i,\ell}}\left[\|U_{i,\ell}\|_2 > t\right] \le 2he^{-\frac{t^2}{2h\sigma^2}} \quad (6.7)$$

Using the union bound we have $\mathbb{P}_{\bar U}\left[\exists\,(i,\ell) \text{ s.t } \|U_{i,\ell}\| > t_i\right] \le 2dh\sum_{i=1}^{k_1}e^{-\frac{t_i^2}{2h\sigma^2}}$. So we have,

$$1 - 2dh\sum_{i=1}^{k_1}e^{-\frac{t_i^2}{2h\sigma^2}} \le \mathbb{P}_{\bar U}\left[\forall\,(i,\ell)\;\; \|U_{i,\ell}\| \le t_i\right] \quad (6.8)$$
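The tail bound (6.7) can be sanity-checked numerically; a sketch comparing the empirical exceedance frequency against the stated bound (parameters chosen so the bound is non-vacuous):

```python
import numpy as np

def spectral_tail_check(h=100, sigma=0.05, t=2.5, trials=2000, seed=0):
    """Empirical P[||U||_2 > t] for U with i.i.d. N(0, sigma^2) entries,
    versus the matrix tail bound 2h * exp(-t^2 / (2 h sigma^2))."""
    rng = np.random.default_rng(seed)
    exceed = sum(
        np.linalg.norm(sigma * rng.standard_normal((h, h)), 2) > t
        for _ in range(trials)
    )
    bound = 2 * h * np.exp(-t**2 / (2 * h * sigma**2))
    return exceed / trials, bound
```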

Let $A_{i,\ell}$ be the induced matrix in the $\ell$th layer from the neural weight vector $A_i$. Let $U_{i,\ell}$ be the perturbation for $A_{i,\ell}$ and suppose $\|U_{i,\ell}\| \le \frac1d\|A_{i,\ell}\|$ (since the width of the net is assumed to be uniformly $h$, it follows that $U_{i,\ell}$ is of the same dimensions as $A_{i,\ell}$). We have, by Lemma 2 of Neyshabur et al., 2017 and our assumption of uniform spectral norms for the layer matrices,

$$\|f_{A_i+(\mathrm{vec}(U_{i,\ell}))_{\ell=1,\dots,d}} - f_{A_i}\| \le eB\beta_i^{d-1}\sum_{\ell=1}^d\|U_{i,\ell}\|_2 \quad (6.9)$$


From equations (6.8)-(6.9) it follows that if $\forall i\;\; t_i \le \frac{\beta_i}{d}$, then we have,

$$1 - 2dh\sum_{i=1}^{k_1}e^{-\frac{t_i^2}{2h\sigma^2}} \le \mathbb{P}_{\bar U}\left[\forall\,(i,\ell)\;\;\|U_{i,\ell}\| \le t_i\right] \le \mathbb{P}_{\bar U}\left[\forall\,i,\;\;\|f_{A_i+(\mathrm{vec}(U_{i,\ell}))_{\ell=1,\dots,d}} - f_{A_i}\| \le eB\beta_i^{d-1}\left(\sum_{\ell=1}^d t_i\right) = eB\beta_i^{d-1}dt_i\right]$$

$$\le \mathbb{P}\left[\sum_{i=1}^{k_1}p_i\|f_{A_i+(\mathrm{vec}(U_{i,\ell}))_{\ell=1,\dots,d}} - f_{A_i}\| \le edB\sum_{i=1}^{k_1}p_i\beta_i^{d-1}t_i\right]$$

Let $\gamma > 0$ and choose the $t_i$ s.t

$$edB\sum_{i=1}^{k_1}p_i\beta_i^{d-1}t_i \le \gamma \quad (6.10)$$

hence we get,

$$1 - 2dh\sum_{i=1}^{k_1}e^{-\frac{t_i^2}{2h\sigma^2}} \le \mathbb{P}\left[\sum_{i=1}^{k_1}p_i\|f_{A_i+(\mathrm{vec}(U_{i,\ell}))_{\ell=1,\dots,d}} - f_{A_i}\| \le \gamma\right] \quad (6.11)$$

A sufficient condition for (6.10) is $t_i \le \frac{\gamma}{k_1edBp_i\beta_i^{d-1}}$. Combined with the condition that $t_i \le \frac{\beta_i}{d}$, it follows that (6.11) holds if $\forall i\in\{1,\dots,k_1\}$,

$$t_i \le \min\left\{\frac{\beta_i}{d},\;\frac{\gamma}{k_1edBp_i\beta_i^{d-1}}\right\} \quad (6.12)$$

To get (6.6) we choose $\sigma$ s.t $\sum_{i=1}^{k_1}e^{-\frac{t_i^2}{2h\sigma^2}} \le \frac{\delta}{2dh}$. This is ensured if $\max_{i=1,\dots,k_1}e^{-\frac{t_i^2}{2h\sigma^2}} \le \frac{\delta}{2dhk_1} \iff \min_{i=1,\dots,k_1}t_i^2 \ge -2h\sigma^2\log\left(\frac{\delta}{2dhk_1}\right) \iff$

$$\sigma^2 \le \frac{\min_{i=1,\dots,k_1}t_i^2}{2h\log\left(\frac{2dhk_1}{\delta}\right)} \quad (6.13)$$

We can maximize $\sigma^2$ and obey constraints (6.12)-(6.13) by setting

$$\sigma^2 = \frac{1}{2h\log\left(\frac{2dhk_1}{\delta}\right)}\left(\min_{i\in\{1,\dots,k_1\}}\min\left\{\frac{\beta_i}{d},\;\frac{\gamma}{k_1edBp_i\beta_i^{d-1}}\right\}\right)^2 \quad (6.14)$$

Using the above now we can demonstrate the proof of Theorem 6.2.1.

Proof of Theorem 6.2.1. Given the assumption that $\forall i\in\{1,\dots,k_1\}$ and $\forall x\in S$, $\|f_{A_i}(x) - f_A(x)\| \le \epsilon\|f_A(x)\|$, we have the following inequality for all neural weights $A'$,

$$\|f_{A'}(x) - f_A(x)\| \le \|f_{A'}(x) - f_{A_i}(x)\| + \|f_{A_i}(x) - f_A(x)\| \le \|f_{A'}(x) - f_{A_i}(x)\| + \epsilon\|f_A(x)\| \quad (6.15)$$
$$\Rightarrow\;\|f_{A'}(x) - f_A(x)\| \le \epsilon\|f_A(x)\| + \min_{i\in\{1,\dots,k_1\}}\|f_{A'}(x) - f_{A_i}(x)\| \quad (6.16)$$
$$\le \epsilon\|f_A(x)\| + Z \quad (6.17)$$

where in the last line above we have defined $Z = \min_{i\in\{1,\dots,k_1\}}\|f_{A'}(x) - f_{A_i}(x)\|$. Now, for a choice of $\gamma(x)$ s.t $\gamma(x) > \epsilon\|f_A(x)\|$, and using inequality (6.15), we have the following inequalities being true for the given distribution $MG(\text{posterior})$,

$$\mathbb{P}_{A'\sim MG(\text{posterior})}\left(\|f_{A'}(x) - f_A(x)\| > 2\gamma(x)\right) \le \mathbb{P}\left(\epsilon\|f_A(x)\| + Z > 2\gamma(x)\right) \quad (6.18)$$
$$\le \mathbb{P}\left(Z > 2\gamma(x) - \epsilon\|f_A(x)\|\right) \quad (6.19)$$
$$\le \mathbb{P}\left(Z > \gamma(x)\right) \quad (6.20)$$

$A'$ being randomly sampled from the distribution $MG(\text{posterior})$ can be imagined to be done in two distinct steps: (1) first we select a center among the set $\{A_i\}_{i=1,\dots,k_1}$ by sampling a random variable $Y$ valued in the set $\{1,\dots,k_1\}$ with probabilities $\{p_i\}_{i=1,\dots,k_1}$ and then (2) we sample the weights from $\mathcal{N}_{(A_Y,\sigma^2I)}$.

We define a collection of $k_1$ mutually independent random variables $\{Z_j := \min_{i\in\{1,\dots,k_1\}}\|f_P(x) - f_{A_i}(x)\|\}_{j=1,\dots,k_1}$ with $P\sim\mathcal{N}_{(A_j,\sigma^2I)}$. Clearly $Z = Z_j \mid Y = j$. Thus we have the following relationship among the events,

$$\{Z > \gamma(x)\} = \left\{\min_{i\in\{1,\dots,k_1\}}\|f_{A'}(x) - f_{A_i}(x)\| > \gamma(x)\right\} = \bigcup_{j=1}^{k_1}\{(Z > \gamma(x)) \cap (Y = j)\} \quad (6.21)$$
$$\Rightarrow\;\mathbb{P}(Z > \gamma(x)) = \sum_{j=1}^{k_1}\mathbb{P}(Z > \gamma(x) \mid Y = j)\,\mathbb{P}(Y = j) = \sum_{j=1}^{k_1}\mathbb{P}(Z_j > \gamma(x))\,\mathbb{P}(Y = j) \quad (6.22)$$

In the above we invoke the definition that $\mathbb{P}(Y = j) = p_j$ and that, by the definition of $Z_j$, it follows that $Z_j \le \|f_P(x) - f_{A_j}(x)\|$ (where $P\sim\mathcal{N}_{(A_j,\sigma^2I)}$). Then we get,

$$\mathbb{P}(Z > \gamma(x)) \le \sum_{j=1}^{k_1}p_j\,\mathbb{P}_{P\sim\mathcal{N}(A_j,\sigma^2I)}\left(\|f_P(x) - f_{A_j}(x)\| > \gamma(x)\right) \quad (6.23)$$

Now, for the kind of nets we consider, i.e. ones with no bias vectors in any of the layers, it follows from the definition of $\{\beta_i\}_{i=1,\dots,k_1}$ that the function computed by the net remains invariant if the layer weight $A_{i,\ell}$ is replaced by $\frac{\beta_i}{\|A_{i,\ell}\|}A_{i,\ell}$. And we see that the spectral norm is identically $\beta_i$ for each layer in this net with modified weights. So we can assume without loss of generality that $\|A_{i,\ell}\| = \beta_i$ for all $i$ and $\ell$. Hence it follows that the RHS in equation (6.23) is exactly the quantity for which guarantees have been given in Lemma 6.6.1. And by de-homogenizing the definition of $\beta_i$ as given in Lemma 6.6.1, the appropriate value of $\sigma$ can be realized to be the same as given in the theorem statement,

$$\sigma^2 = \frac{1}{2h\log\left(\frac{2dhk_1}{\delta}\right)}\left(\min_{i\in\{1,\dots,k_1\}}\min\left\{\frac{\left(\prod_{\ell=1,\dots,d}\|A_{i,\ell}\|\right)^{\frac1d}}{d},\;\frac{\gamma(x)}{k_1edBp_i\left(\prod_{\ell=1,\dots,d}\|A_{i,\ell}\|\right)^{1-\frac1d}}\right\}\right)^2 \quad (6.24)$$

In the above we replace $\gamma(x)$ with $\gamma := \epsilon\max_{x\in S}\|f_A(x)\|$ and, going back to equation (6.18), we can get a concurrent guarantee for all $x\in S$, as required in the theorem,

$$\mathbb{P}_{A'\sim MG(\text{posterior})}\left(\max_{x\in S}\|f_{A'}(x) - f_A(x)\| > 2\gamma\right) \le \delta$$

6.7 Proof of Theorem 6.3.1

Proof. Given that $S$ is $(\epsilon,\gamma)$-nice w.r.t $A, \{A_i\}_{i=1,\dots,k_1}$ (as defined in Definition 25), we can invoke Theorem 6.2.1 with $k_1 = 1$ between the trained nets $f_A$ and $f_{A_i}$. We recall the definition of $\beta_i$, namely $\beta_i^d = \prod_{\ell=1,\dots,d}\|A_{i,\ell}\|$ for $A_{i,\ell}$ being the $\ell$th layer matrix corresponding to $A_i$.

Let the $\beta$-grid be denoted as $\{\tilde\beta_k\}$, s.t there exists a point $\tilde\beta$ in this grid s.t,

$$|\beta_i - \tilde\beta| \le \frac{\beta_i}{d} \;\Rightarrow\; \frac{\beta_i^{d-1}}{e} \le \tilde\beta^{d-1} \le e\beta_i^{d-1} \quad (6.25)$$


Now we recall that, by invoking Theorem 6.2.1 on the net $f_{A_i}$, the $\sigma$ (in terms of $\beta_i$) obtained from there is a maximal choice that the proof could have given us. Hence a smaller value of $\sigma$ will also give the same guarantees, and we go for the following value of the variance, defined in terms of the $\tilde\beta$ defined above,

$$\tilde\sigma^2 := \frac{1}{2h\log\left(\frac{2dh}{\delta}\right)}\left(\min\left\{\frac{\tilde\beta}{de^{\frac1{d-1}}},\;\frac{\gamma/8}{e^2dB\tilde\beta^{d-1}}\right\}\right)^2 \quad (6.26)$$

In the above we have set the $p_i$ parameter of Theorem 6.2.1 to 1 and have also rescaled the $\gamma$ parameter to $\frac\gamma8$, so that we have from Theorem 6.2.1 that,

$$\mathbb{P}_{A'\sim\mathcal{N}(A_i,\tilde\sigma^2I)}\left[\max_{x\in S}\|f_{A'}(x) - f_A(x)\| > 2\times\frac\gamma8\right] \le \delta \quad (6.27)$$

Since $\max_{x\in S}\|f_{A'}(x) - f_A(x)\| < \frac\gamma4 \Rightarrow \max_{x\in S}\|f_{A'}(x) - f_A(x)\|_\infty < \frac\gamma4$, we have,

$$\mathbb{P}_{A'\sim\mathcal{N}(A_i,\tilde\sigma^2I)}\left[\max_{x\in S}\|f_{A'}(x) - f_A(x)\|_\infty < \frac\gamma4\right] \ge 1-\delta$$

Hence we are in the situation whereby Theorem 6.1.2 can be invoked with $A = w$ and $\mu_w = \mathcal{N}(A_i,\tilde\sigma^2I)$ to guarantee that there exists a distribution $\tilde\mu_w$ s.t the following inequality holds with probability at least $1-\delta$ over sampling $m$-sized data-sets,

$$\mathbb{E}_{A+\tilde u\sim\tilde\mu_A}[L_0(f_{A+\tilde u})] \le \hat L_{\frac\gamma2}(f_A) + \sqrt{\frac{\mathrm{KL}(\mathcal{N}(A_i,\tilde\sigma^2I)\|P) + \log\frac{3m}{\delta}}{m-1}} \quad (6.29)$$

Because the grid $\{\tilde\beta_k\}$ was pre-fixed, we can choose the prior distribution $P$ in a data-dependent way from the grid of priors specified in Definition 26, with their variance being the $\tilde\sigma$ given in equation (6.26), determined by a chosen element from this set $\{\tilde\beta_k\}$! Now we recall the definition of the net weights $\{B_{\lambda^*,j}\}_{j=1,\dots,k_1}$ in the theorem statement and we choose,

$$P := \text{the distribution whose p.d.f is } \frac1{k_1}\sum_{j=1}^{k_1}\mathcal{N}_{(B_{\lambda^*,j},\tilde\sigma^2I)}$$

And this means that for the KL-term above we can invoke Theorem 6.A.1 with $f = \mathcal{N}(A_i,\tilde\sigma^2I)$ and $GM = P$ to get,

$$\mathrm{KL}(\mathcal{N}(A_i,\tilde\sigma^2I)\|P) \le -\log\left[\frac1{k_1}\sum_{j=1}^{k_1}e^{-\frac{1}{2\tilde\sigma^2}\|A_i - B_{\lambda^*,j}\|^2}\right] \quad (6.30)$$

We recall that our angle grid (which determines the value of $\lambda^*$ above, as described in the theorem statement) was of size 314 and $K_1$ was the size of the $\beta$-grid; thus the size of the full grid of priors is $314K_1 =: K$.

Hence it is clear how equation (6.29), holding true for each of the $K$ possible choices of $P$ decided by the mechanism described above, satisfies the hypothesis required for Theorem 6.B.1 to be invoked. The required theorem now follows by further upper bounding the KL-term in equation (6.29) as given in equation (6.30).

Now we are left with having to specify the required grid $\{\tilde\beta_k\}_{k=1,\dots,K_1}$ so that we are always guaranteed to find a $\tilde\beta$ as given in equation (6.25). Towards that we upper bound equation (6.30) as follows,

$$\mathrm{KL}(\mathcal{N}(A_i,\tilde\sigma^2I)\|P) \le -\log\left[\frac1{k_1}\sum_{j=1}^{k_1}e^{-\frac{1}{2\tilde\sigma^2}\|A_i - B_{\lambda^*,j}\|^2}\right] \le -\log\left[\frac1{k_1}\sum_{j=1}^{k_1}e^{-\frac{1}{2\sigma^2}\|A_i - B_{\lambda^*,j}\|^2}\right] \quad (6.31)$$

where we have obtained the second inequality by recalling equations (6.25) and (6.26) and defining,

$$\sigma^2 := \frac{1}{2h\log(2dh/\delta)}\left(\min\left\{\frac{\beta_i}{de^{\frac1{d-1}}},\;\frac{\gamma/8}{e^3dB\beta_i^{d-1}}\right\}\right)^2 = \frac{\beta_i^2\exp\left(-\frac{2}{d-1}\right)}{d^2\,2h\log\left(\frac{2dh}{\delta}\right)}\min\left\{\frac{\gamma^2}{\left(8B\beta_i^d\exp\left(3-\frac{2}{d-1}\right)\right)^2},\,1\right\}$$

Now we observe the following,

1. When $\beta_i \le \left(\frac{\gamma}{2B}\right)^{1/d}$, this implies $\|f_{A_i}(x)\| \le \frac\gamma2$, which implies $\hat L_\gamma = 1$ by definition. Therefore equation (6.29) holds trivially.


2. We can ask when the upper bound on the KL term given in equation (6.31), in terms of this $\sigma$, is such that the resulting upper bound on $\sqrt{\frac{2\mathrm{KL}(\mathcal{N}(A_i,\tilde\sigma^2I)\|P)}{m-1}}$ (when substituted into equation (6.29)) is greater than 1, i.e. the range of $\beta_i$ for which the following inequality holds (thus making inequality (6.29) hold trivially, by ensuring that the ensuing upper bound on its RHS is greater than 1),

$$1 \le \frac{1}{m-1}\left(-\log\left[\frac1{k_1}\sum_{j=1}^{k_1}e^{-\frac{1}{2\sigma^2}\|A_i - B_{\lambda^*,j}\|^2}\right]\right) = \frac{1}{m-1}\log\left[\frac{1}{\frac1{k_1}\sum_{j=1}^{k_1}e^{-\frac{1}{2\sigma^2}\|A_i - B_{\lambda^*,j}\|^2}}\right] \quad (6.32)$$

$$\log\left[\frac{1}{\frac1{k_1}\sum_{j=1}^{k_1}e^{-\frac{1}{2\sigma^2}\|A_i - B_{\lambda^*,j}\|^2}}\right] \ge \log\left[\frac{1}{\frac1{k_1}\sum_{j=1}^{k_1}\max_{i,j}\left\{e^{-\frac{1}{2\sigma^2}\|A_i - B_{\lambda^*,j}\|^2}\right\}}\right] = \log\left[\frac{1}{\frac1{k_1}\sum_{j=1}^{k_1}e^{-\frac{1}{2\sigma^2}\min_{i,j}\{\|A_i - B_{\lambda^*,j}\|^2\}}}\right]$$

Now we invoke the definition of $d_{\min}$ to get,

$$\log\left[\frac{1}{\frac1{k_1}\sum_{j=1}^{k_1}e^{-\frac{1}{2\sigma^2}\min_{i,j}\{\|A_i - B_{\lambda^*,j}\|^2\}}}\right] \ge \frac{d_{\min}^2}{2\sigma^2}$$

Thus, substituting back into equation (6.32), we see that a sufficient condition for (6.32) to be satisfied is,

$$\sigma^2 = \frac{\beta_i^2\exp\left(-\frac{2}{d-1}\right)}{d^2\,2h\log\left(\frac{2dh}{\delta}\right)}\min\left\{\frac{\gamma^2}{\left(8B\beta_i^d\exp\left(3-\frac{2}{d-1}\right)\right)^2},\,1\right\} \le \frac{d_{\min}^2}{2(m-1)}$$

A further sufficient condition for the above to be satisfied is,

$$\frac{\beta_i^2\exp\left(-\frac{2}{d-1}\right)}{d^2\,2h\log\left(\frac{2dh}{\delta}\right)}\cdot\frac{\gamma^2}{\left(8B\beta_i^d\exp\left(3-\frac{2}{d-1}\right)\right)^2} \le \frac{d_{\min}^2}{2(m-1)}$$

The above leads to the constraint, $\beta_i \ge \left(\frac{\gamma\sqrt{2(m-1)}}{(8Be^3d\,d_{\min})\sqrt{2h\log\left(\frac{2dh}{\delta}\right)}}\right)^{\frac1{d-1}}$.


Thus we combine the two points above to see that a relevant interval of $\beta_i$ is,

$$\left[\left(\frac{\gamma}{2B}\right)^{\frac1d},\;\left(\frac{\gamma\sqrt{2(m-1)}}{(8Be^3d\,d_{\min})\sqrt{2h\log\left(\frac{2dh}{\delta}\right)}}\right)^{\frac1{d-1}}\right]$$

We recall that the parameters have been chosen so that the above interval is non-empty.

We note that if we want a grid on the interval $[a,b]$ s.t for every value $x\in[a,b]$ there is a grid-point $g$ s.t $|x-g| \le \frac xd$, then a grid of size $\frac{bd}{2a}$ suffices.⁴ Hence a grid of the following size $K_1$ suffices for us to capture with the needed accuracy all the possible values of $\beta_i$,

$$K_1 = \frac d2\times\frac{\left(\frac{\gamma\sqrt{2(m-1)}}{(8Be^3d\,d_{\min})\sqrt{2h\log\frac{2dh}{\delta}}}\right)^{\frac1{d-1}}}{\left(\frac{\gamma}{2B}\right)^{\frac1d}}$$

And thus we have specified the “beta-grid” as mentioned in Definition 26.

6.8 Conclusion

We conclude by reporting two other observations that have come to light from the experiments above.

Firstly, we have also done experiments (not reported here) where we have used the CIFAR data as a binary classification task, and there we observed that the angular deflection under training is significantly lower than what is reported above for CIFAR-10. In such situations, when this deflection is lower, the relative advantage of our bound over Neyshabur et al., 2017's bound is even greater.

Secondly, we have additionally observed that the maximum angle between $A$ and any of the $A_i$s is typically 40-60% larger than the corresponding angular spread of the initial cluster, i.e. the maximum angle between $B$ and any of the $B_i$s. Note that all the final cluster nets are approximately of the same accuracy. Thus nets initialized close to each other often seem to not end up as close post-training, even when trained to the same accuracy on the same data and using the same algorithm.

We believe that this dispersion behaviour of nets warrants further investigation.

⁴ If $g$ is the grid point which is the required approximation to $x$, i.e. $|x-g| \le \frac xd$, then $x\in\left(\frac{d}{d+1}g, \frac{d}{d-1}g\right)$. Since $a \le g$, $\frac{2da}{d^2-1}$ is the smallest grid spacing that might be needed and hence the maximum number of grid points needed is $\frac{(b-a)(d^2-1)}{2ad} < \frac{(b-a)d}{2a} < \frac{bd}{2a}$.


Given the demonstrated advantages of our PAC-Bayesian bounds on neural nets we believe that these observations deserve further investigation and being able to theoretically explain them and incorporate them into the PAC-Bayesian framework might contribute towards getting even better bounds.


Appendix To Chapter 6

6.A The KL upperbounds

The normal distributions $\mathcal{N}_{A,\sigma^2}$ as stated in Definition 23 can be made explicit as,

$$\log\mathcal{N}_{(A,\sigma^2)}(w) = -\frac12\left[\frac{\|w-A\|^2}{\sigma^2} + \dim(A)\log\left(2\pi\sigma^2\right)\right] \quad (6.33)$$

Theorem 6.A.1. Assume being given distributions $f$ and $GM$ on $\mathbb{R}^n$. $f$ is Gaussian and has mean $\bar\mu = A$ and covariance $\Sigma_f = \sigma_f^2I_n$ for some $\sigma_f > 0$. Meanwhile $GM$ is a Gaussian mixture of $k_1$ Gaussians with weights $\{a_{GM,r}\}_{r=1,\dots,k_1}$, with $a_{GM,i}\ge0$ and $\sum_{i=1}^{k_1}a_{GM,i} = 1$, s.t the $k_1$ components have means $\{\bar\mu_{GM,r} = B_r\}_{r=1,\dots,k_1}$ and the covariance matrix of the components is $\Sigma_{GM} := \sigma_{GM}^2I_n$ for some $\sigma_{GM}^2 > 0$. Then we have the following upper bound,

$$\mathrm{KL}(f\|GM) \le \frac n2\left(\frac{\sigma_f^2}{\sigma_{GM}^2} - 1\right) + n\log\left(\frac{\sigma_{GM}}{\sigma_f}\right) - \log\left[\sum_{r=1}^{k_1}a_{GM,r}e^{-\frac{1}{2\sigma_{GM}^2}\|A - B_r\|^2}\right]$$

Proof. We recall the following theorem,

Theorem 6.A.2 (Durrieu-Thiran-Kelly (ICASSP 2012)). Given the definitions of $f$ and $GM$ as above, but with $\Sigma_f$ and $\Sigma_{GM}$ being generic, we have as an upper bound for the KL divergence between the two distributions,

$$\mathrm{KL}(f\|GM) \le H(f) + \log\left(\frac{\exp(L_f(f))}{\sum_{r=1}^{k_1}a_{GM,r}e^{-D_{KL}(f\|f_{GM,r})}}\right),$$

where $H(f) := \mathbb{E}_f[-\log(f(x))] =: -L_f(f)$.

Now, using the definition of $\bar\mu_{GM,r}$ and $\Sigma_{GM}$ and the known expression for the KL divergence between two Gaussians, the upper bound given above simplifies as,


$$\mathrm{KL}(f\|GM) \le -\log\left[\sum_{r=1}^{k_1}a_{GM,r}e^{-\mathrm{KL}(f\|f_{GM,r})}\right] = -\log\left[\sum_{r=1}^{k_1}a_{GM,r}\sqrt{\frac{e^n\det(\Sigma_f)}{\det(\Sigma_{GM})\det\left(e^{\Sigma_{GM}^{-1}\Sigma_f}\right)}}\;e^{-\frac12(\bar\mu - \bar\mu_{GM,r})^T\Sigma_{GM}^{-1}(\bar\mu - \bar\mu_{GM,r})}\right]$$

Now we recall the definitions of $A$, $\{B_r\}_{r=1,\dots,k_1}$, $\sigma_f$ and $\sigma_{GM}$ to further simplify the above to get,

$$\mathrm{KL}(f\|GM) \le -\log\left[\sum_{r=1}^{k_1}a_{GM,r}\sqrt{\frac{e^n\det(\sigma_f^2I)}{\det(\sigma_{GM}^2I)\det\left(e^{(\sigma_{GM}^2I)^{-1}(\sigma_f^2I)}\right)}}\;e^{-\frac12(A - B_r)^T(\sigma_{GM}^2I)^{-1}(A - B_r)}\right]$$
$$= \log\sqrt{\frac{\det(\sigma_{GM}^2I)\det\left(e^{(\sigma_{GM}^2I)^{-1}(\sigma_f^2I)}\right)}{e^n\det(\sigma_f^2I)}} - \log\left[\sum_{r=1}^{k_1}a_{GM,r}e^{-\frac{1}{2\sigma_{GM}^2}\|A - B_r\|^2}\right]$$
$$\le \frac n2\left(\frac{\sigma_f^2}{\sigma_{GM}^2} - 1\right) + n\log\left(\frac{\sigma_{GM}}{\sigma_f}\right) - \log\left[\sum_{r=1}^{k_1}a_{GM,r}e^{-\frac{1}{2\sigma_{GM}^2}\|A - B_r\|^2}\right]$$

And thus the required theorem is proven.
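A numerically stable sketch of the bound just proven, for isotropic covariances as in the theorem statement (all names illustrative):

```python
import numpy as np

def kl_gauss_to_mixture_upper(A, centers, weights, sigma_f, sigma_gm):
    """Upper bound of Theorem 6.A.1 on KL(N(A, sigma_f^2 I) || Gaussian mixture
    with the given centers, weights and component std sigma_gm)."""
    n = A.size
    ratio = sigma_f**2 / sigma_gm**2
    # log of each term a_r * exp(-||A - B_r||^2 / (2 sigma_gm^2)), summed stably.
    log_terms = np.log(np.asarray(weights)) - np.array(
        [np.linalg.norm(A - B) ** 2 for B in centers]
    ) / (2 * sigma_gm**2)
    return (n / 2) * (ratio - 1) + n * np.log(sigma_gm / sigma_f) \
        - np.logaddexp.reduce(log_terms)
```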

6.B Data-Dependent Priors

We continue in the same notation as used in Theorem 6.1.1 and we prove the following theorem.

Theorem 6.B.1. Suppose we have $K$ prior distributions $\{\pi_i\}_{i=1,\dots,K}$ s.t for some $\delta > 0$ the following inequality holds for each $\pi_i$,

$$\mathbb{P}_{S\sim D^m(Z)}\left[\forall Q\quad \mathbb{E}_{h\sim Q}[L(h)] \le \mathbb{E}_{h\sim Q}[\hat L(h)] + \sqrt{\frac{\mathrm{KL}(Q\|\pi_i)+\log\frac m\delta}{m-1}}\right] \ge 1-\delta. \quad (6.34)$$

Then,

$$\mathbb{P}_{S\sim D^m(Z)}\left[\forall i\in\{1,\dots,K\}\;\forall Q\quad \mathbb{E}_{h\sim Q}[L(h)] \le \mathbb{E}_{h\sim Q}[\hat L(h)] + \sqrt{\frac{\mathrm{KL}(Q\|\pi_i)+\log\frac m\delta}{m-1}}\right] \ge 1-K\delta. \quad (6.35)$$

Proof. Let $f$ be a real-valued function on the space of distributions and data sets. Let $E_Q = \{S \mid f(S,Q) > 0\}$, and note that $\cap_QE_Q = \{S \mid \forall Q\; f(S,Q) > 0\}$. Therefore $\mathbb{P}_S[\forall Q\; f(S,Q) > 0] = \mathbb{P}_S[\cap_QE_Q]$, regardless of the distribution on $S$.

Consider $K$ functions $f_1,\dots,f_K$ s.t for some $\delta\in[0,1]$ we have,


$$\forall i\in\{1,\dots,K\}\quad \mathbb{P}_S[\forall Q\; f_i(S,Q) > 0] \ge 1-\delta$$

Now we define the events $E_{i,Q} = \{S \mid f_i(S,Q) > 0\}$ for each $i\in\{1,\dots,K\}$. Thus we can deduce the following:

$$\begin{aligned}
&\forall i\in\{1,\dots,K\}\quad \mathbb{P}_S[\forall Q\; f_i(S,Q) > 0] \ge 1-\delta\\
&\Rightarrow\;\forall i\in\{1,\dots,K\}\quad \mathbb{P}_S[\cap_QE_{i,Q}] \ge 1-\delta \quad \text{(as discussed in the beginning of the proof)}\\
&\Rightarrow\;\forall i\in\{1,\dots,K\}\quad \mathbb{P}_S[\cup_QE_{i,Q}^c] \le \delta\\
&\Rightarrow\;\mathbb{P}_S\left[\cup_{i=1}^K\left(\cup_QE_{i,Q}^c\right)\right] \le K\delta\\
&\Rightarrow\;\mathbb{P}_S[\cap_{i=1}^K\cap_QE_{i,Q}] \ge 1-K\delta\\
&\Rightarrow\;\mathbb{P}_S[\forall Q\;\forall i\in\{1,\dots,K\}\; f_i(S,Q) > 0] \ge 1-K\delta
\end{aligned} \quad (6.36)$$

Now we use $f_i(S,Q) = \mathbb{E}_{h\sim Q}[\hat L(h)] - \mathbb{E}_{h\sim Q}[L(h)] + \sqrt{\frac{\mathrm{KL}(Q\|\pi_i)+\log\frac m\delta}{m-1}}$ in equation (6.36) and use condition (6.34) to get result (6.35).

The above theorem encapsulates what is conventionally called using a "data-dependent prior". This is because if we can construct a list of $K$ priors and work with $\delta < \frac1K$, then the above theorem lets us choose, in a data (i.e. the training data $S$) dependent way, not just the posterior distribution $Q$ but also the prior $\pi_i$ from the list, and it still assures us a high probability upper bound on the difference between the true risk and the empirical risk of the stochastic classifier.
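Operationally, Theorem 6.B.1 licenses the following pattern (a sketch under the theorem's hypotheses): evaluate the bound for every prior in the pre-fixed grid and keep the best, at the price of the failure probability inflating from $\delta$ to $K\delta$:

```python
import math

def best_prior_bound(emp_risk, kls, m, delta):
    """kls[i] = KL(Q || pi_i) for the K pre-fixed priors. By Theorem 6.B.1
    the bound holds simultaneously for all i with probability >= 1 - K*delta,
    so we may minimize over i after seeing the training data."""
    return min(
        emp_risk + math.sqrt((kl + math.log(m / delta)) / (m - 1))
        for kl in kls
    )
```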

6.C Proof of Theorem 6.1.2

Proof. Given a predictor weight $w$, for explicitness we will denote by $\mu_w$ and $\tilde\mu_w$ what were called $\mu$ and $\tilde\mu$ in the theorem statement.

We will first isolate its set of "good" perturbations, i.e. we define the following set $S_w$ as,

$$S_w = \left\{w+u \;\Big|\; \max_{x\in S}\|f_{w+u}(x) - f_w(x)\|_\infty < \frac\gamma4\right\}$$

Corresponding to the predictor weight $w$, let $\mu'_w$ be a distribution on the weights s.t the required condition holds, i.e.,


[︃ γ ]︃ 1 Pu µ max fw+u(x) fw(x) ∞ < ∼ ′w x S ∥ − ∥ 4 ≥ 2 ∈

Let w + u µ when u µ . ∼ w ∼ w′

Now we define the quantity, Z(µw) := Pw+u µ [w + u Sw]. From the definition it follows that, ∼ w ∈ Z(µ ) 1 . Now let us define another distribution µ˜ over the set of predictor’s weights s.t the p.d.fs w ≥ 2 w are related as follows (in the following we overload the notation of the distribution to also denote the

corresponding p.d.f),

µw(x) µ˜ w(x) = δx Sw Z(µw) ∈

Thus µ˜ w is supported on Sw. From the above definition it follows that if w + u˜ µ˜ w then maxx S fw+u˜ (x) ∼ ∈ ∥ − γ γ fw(x) ∞ < 4 which is equivalent to maxi 1,2,..,k ,x S fw+u˜ (x)[i] fw(x)[i] < 4 . This in turn im- ∥ ∈{ } ∈ ∥ − ∥ plies,

γ max ( fw+u˜ (x)[i] fw+u˜ (x)[j]) ( fw(x)[i] fw(x)[j]) < (6.37) i,j 1,2,..,k ,x S∥ − − − ∥ 2 ∈{ } ∈

Which in turn implies the following inequality, 5

$$\hat L_0(f_{w+\tilde u}) \le \hat L_{\gamma/2}(f_w) \quad (6.38)$$

Similar to $\tilde\mu_w$ we define the following,

$$\tilde\mu_w^c(x) = \frac{1}{1-Z(\mu_w)}\mu_w(x)\delta_{x\in S_w^c}$$

⁵ We give the proof in this footnote for convenience. Let $j$ be arbitrary and let $i = \arg\max_{i\ne j}f_{w+\tilde u}(x)[i]$. Let $S_j = \{x \mid f_w(x)[i] - f_w(x)[j] \ge -\gamma/2\}$ and $S'_j = \{x \mid f_{w+\tilde u}(x)[i] - f_{w+\tilde u}(x)[j] \ge 0\}$. Clearly $S'_j\subseteq S_j$ because $f_{w+\tilde u}(x)[i] - f_{w+\tilde u}(x)[j] \ge 0 \Rightarrow f_w(x)[i] - f_w(x)[j] \ge -\gamma/2$ from (6.37). Now recall that we can write $L_{\gamma/2}(f_w) = \mathbb{E}_y\left[\mathbb{E}_x[\mathbb{1}(f_w(x)[y] \le \gamma/2 + \max_{i\ne y}f_w(x)[i]) \mid y]\right]$ and $L_0(f_{w+\tilde u}) = \mathbb{E}_y\left[\mathbb{E}_x[\mathbb{1}(f_{w+\tilde u}(x)[y] \le \max_{i\ne y}f_{w+\tilde u}(x)[i]) \mid y]\right]$. Hence equation (6.38) follows by noting that the support of the indicator function in $L_0(f_{w+\tilde u})$ is contained in the support of the indicator function in $L_{\gamma/2}(f_w)$, because $S'_j\subseteq S_j$.


Thus, substituting the expressions for $\tilde\mu_w$ and $\tilde\mu_w^c$ for the given prior $P$, we have,

$$\begin{aligned}
\mathrm{KL}(\mu_w\|P) &= \int\mu_w\log\left(\frac{\mu_w}{P}\right) = \int_{S_w}\mu_w\log\left(\frac{\mu_w}{P}\right) + \int_{S_w^c}\mu_w\log\left(\frac{\mu_w}{P}\right)\\
&= \int_{S_w}Z(\mu_w)\tilde\mu_w\log\left(\frac{Z(\mu_w)\tilde\mu_w}{P}\right) + \int_{S_w^c}(1-Z(\mu_w))\tilde\mu_w^c\log\left(\frac{(1-Z(\mu_w))\tilde\mu_w^c}{P}\right)\\
&= Z(\mu_w)\left[\log Z(\mu_w) + \mathrm{KL}(\tilde\mu_w\|P)\right] + (1-Z(\mu_w))\left[\log(1-Z(\mu_w)) + \mathrm{KL}(\tilde\mu_w^c\|P)\right]\\
\Rightarrow\;\mathrm{KL}(\tilde\mu_w\|P) &= \frac{1}{Z(\mu_w)}\Big\{\mathrm{KL}(\mu_w\|P) - (1-Z(\mu_w))\mathrm{KL}(\tilde\mu_w^c\|P)\\
&\qquad\qquad - Z(\mu_w)\log Z(\mu_w) - (1-Z(\mu_w))\log(1-Z(\mu_w))\Big\}
\end{aligned} \quad (6.39)$$

We recall that for any $Z\in[\frac12,1]$ we have $|Z\log Z + (1-Z)\log(1-Z)| \le 1$ and $\mathrm{KL}(\tilde\mu_w^c\|P)\ge0$. Thus we have,

$$\mathrm{KL}(\tilde\mu_w\|P) \le \frac{1}{Z(\mu_w)}\left(\mathrm{KL}(\mu_w\|P) + 1\right) \le 2\left(\mathrm{KL}(\mu_w\|P) + 1\right) \quad (6.40)$$

We remember that above we had sampled the weights of the perturbed net $f_{w+\tilde u}$ as $w+\tilde u\sim\tilde\mu_w$. Now we write the highly likely event guaranteed by the PAC-Bayesian bound in Theorem 6.1.1 for the margin loss $L_{\frac\gamma2}$ evaluated on the predictor $f_w$, for some prior $P$ and a (data-dependent) choice of posterior $\tilde\mu_w$. Further, we invoke equations (6.38) and (6.40) on that, to get the following,

$$\begin{aligned}
\mathbb{E}_{w+\tilde u\sim\tilde\mu_w}[L_0(f_{w+\tilde u})] &\le \mathbb{E}_{w+\tilde u\sim\tilde\mu_w}[\hat L_0(f_{w+\tilde u})] + \sqrt{\frac{\mathrm{KL}(\tilde\mu_w\|P)+\log\frac m\delta}{2(m-1)}}\\
&\le \hat L_{\gamma/2}(f_w) + \sqrt{\frac{\mathrm{KL}(\tilde\mu_w\|P)+\log\frac m\delta}{2(m-1)}}\\
&\le \hat L_{\gamma/2}(f_w) + \sqrt{\frac{2\mathrm{KL}(\mu_w\|P)+2+\log\frac m\delta}{2(m-1)}} = \hat L_{\gamma/2}(f_w) + \sqrt{\frac{\mathrm{KL}(\mu_w\|P)+1+\frac12\log\frac m\delta}{m-1}}\\
&\le \hat L_{\gamma/2}(f_w) + \sqrt{\frac{\mathrm{KL}(\mu_w\|P)+\log\frac{3m}{\delta}}{m-1}}
\end{aligned}$$

where in the last line we have used $1 + \frac12\log\frac m\delta < 1 + \log\frac m\delta < \log 3 + \log\frac m\delta$.

6.D Proof of Theorem 6.1.3

Proof. Firstly we observe the following Theorem 6.D.1, which is a slight variation of a lemma in Neyshabur et al., 2017. The proof of Theorem 6.D.1 follows from exactly the same arguments as were needed to prove Theorem 6.1.2.

Theorem 6.D.1. Let $f_w : \chi\to\mathbb{R}^k$ be any predictor with parameters $w$ and let us use the margin loss as defined in Definition 22. Let $P$ be any distribution (the "data-independent prior") on the space of parameters of $f$ and $D$ be a distribution on $\chi$. Let it be true that for some $\gamma > 0$ we know of distributions $\mu'_w$ and $\mu_w$ (depending on the weight $w$ of the given predictor) on the space of parameters of the predictor s.t,

$$\mathbb{P}_{u\sim\mu'_w}\left[\sup_{x\in\chi}\|f_{w+u}(x) - f_w(x)\|_\infty < \frac\gamma4\right] \ge \frac12 \quad\text{and}\quad w+u\sim\mu_w \quad (6.41)$$

Then for any $\delta\in[0,1]$ the following guarantee holds,

$$\mathbb{P}_{\chi\sim D^m(\chi)}\Bigg[\forall w \text{ and corresponding } \mu_w \text{ s.t condition (6.41) holds, } \exists\,\tilde\mu_w \text{ s.t }\; \mathbb{E}_{w+\tilde u\sim\tilde\mu_w}[L_0(f_{w+\tilde u})] \le \hat L_{\frac\gamma2}(f_w) + \sqrt{\frac{\mathrm{KL}(\mu_w\|P)+\log\frac{3m}{\delta}}{m-1}}\Bigg] \ge 1-\delta \quad (6.42)$$


The following Theorem 6.D.2 from Neyshabur et al., 2017 is a bound for neural net functions under controlled perturbations and we state it without proof.

Theorem 6.D.2 (Neural net perturbation bound of Neyshabur et al., 2017). Let us be given a depth-$d$ neural net $f_w$ with width $h$ and weight vector $w$ which is a mapping $\mathcal{B}_n(B)\to\mathbb{R}^k$, where $\mathcal{B}_n(B)$ is the radius-$B$ ball around the origin in $\mathbb{R}^n$. Now consider a perturbation on the weights given by $u = \mathrm{vec}(\{U_\ell\}_{\ell=1}^d)$ s.t $\|U_\ell\|_2 \le \frac1d\|W_\ell\|_2$. Then we have, for all $x\in\mathcal{B}_n(B)$,

$$\|f_{w+u}(x) - f_w(x)\|_2 \le eB\left(\prod_{\ell=1}^d\|W_\ell\|_2\right)\sum_{\ell=1}^d\frac{\|U_\ell\|_2}{\|W_\ell\|_2} \quad (6.43)$$

Now, for some $\sigma > 0$, consider the random variable $u\sim\mathcal{N}(0,\sigma^2I)$, where $u$ is imagined as the vector of the weights of the neural net $f$ in the above theorem. Let $\{U_\ell\in\mathbb{R}^{h\times h}\}_{\ell=1}^d$ be the matrices of the neural net corresponding to $u$.

We can define matrices $\{B_p\in\mathbb{R}^{h\times h} \mid p = 1,\dots,h^2\}$ s.t each $B_p$ has $\sigma$ in a unique entry and all other entries 0. Then it follows that, as random matrices, $U_\ell = \sum_{p=1}^{h^2}\gamma_pB_p$ with $\gamma_p\sim\mathcal{N}(0,1)$. We note that $\|\sum_{p=1}^{h^2}B_pB_p^\top\| = h\sigma^2$, since $h$ is the largest eigenvalue of an all-ones $h$-dimensional square matrix. Now we invoke Corollary 4.2 of Tropp, 2012 to get, for any $t > 0$,

$$\mathbb{P}_{U_\ell}\left[\|U_\ell\|_2 > t\right] \le 2he^{-\frac{t^2}{2h\sigma^2}} \quad (6.44)$$

Using union bound for the d layer matrices of u we get,

$$\mathbb{P}_{\{U_\ell\sim\mathcal{N}(0,\sigma^2I_{h\times h})\}_{\ell=1,\dots,d}}\left[\exists\,\ell \text{ s.t } \|U_\ell\| > t\right] \le 2dhe^{-\frac{t^2}{2h\sigma^2}}$$

This is equivalent to,

$$\mathbb{P}_{\{U_\ell\sim\mathcal{N}(0,\sigma^2I_{h\times h})\}_{\ell=1,\dots,d}}\left[\forall\,\ell\;\;\|U_\ell\| \le t\right] \ge 1 - 2dhe^{-\frac{t^2}{2h\sigma^2}} \quad (6.45)$$

If $t = \sigma\sqrt{2h\log(4dh)}$ then $1 - 2dhe^{-\frac{t^2}{2h\sigma^2}} = \frac12$, and so we have,


$$\mathbb{P}_{\{U_\ell\sim\mathcal{N}(0,\sigma^2I_{h\times h})\}_{\ell=1,\dots,d}}\left[\forall\,\ell\;\;\|U_\ell\| \le \sigma\sqrt{2h\log(4dh)}\right] \ge \frac12 \quad (6.46)$$

Now, corresponding to the given predictor weight $w$, let $\beta^d = \prod_{\ell=1}^d\|W_\ell\|$. For the kind of nets we consider, i.e. the ones with no bias vectors in any of the layers, it follows from the definition of $\beta$ that the function computed by the net remains invariant if the layer matrices $W_i$ are replaced by $\frac{\beta}{\|W_i\|}W_i$. And we see that the spectral norm is identically $\beta$ for each layer in this net with modified weights. So we can assume without loss of generality that $\forall i\;\|W_i\| = \beta$. Using this uniform norm assumption, if

$$\sigma\sqrt{2h\log(4dh)} \le \frac\beta d \quad (6.47)$$

then we have,

$$\frac12 \le \mathbb{P}_{\{U_\ell\sim\mathcal{N}(0,\sigma^2I_{h\times h})\}_{\ell=1,\dots,d}}\left[\forall\,\ell\;\;\|U_\ell\| \le \sigma\sqrt{2h\log(4dh)}\right] \le \mathbb{P}\left[\|f_{w+u}(x) - f_w(x)\| \le eB\beta^{d-1}\sum_{\ell=1}^d\|U_\ell\|\right] \quad (6.48)$$
$$\le \mathbb{P}\left[\|f_{w+u}(x) - f_w(x)\| \le eBd\beta^{d-1}\sigma\sqrt{2h\log(4dh)}\right] \quad (6.49)$$

Note that the assumption (6.47) is required even in the proof by Neyshabur et al., 2017, even though it is omitted there.

We will choose the prior – used in the PAC-Bayes bound – from a finite set of distributions, π = { i K N 2 ˜ , in a data dependent manner. Given β corresponding to the trained net fw, suppose 0,σ (βi)}i=1 β β βd 1 d 1 β˜ β˜ K such that β β˜ . β β˜ also implies that − β˜ − eβd 1. Furthermore, ∃ ∈ { i}i=1 | − | ≤ d | − | ≤ d e ≤ ≤ − if σ satisfies the inequalities 6.50a then the condition 6.47 will hold.

\begin{align*}
\sigma\sqrt{2h\log(4dh)} \ &\le\ \frac{\tilde{\beta}}{d\, e^{\frac{1}{d-1}}} \tag{6.50a}\\
e^2 B d\, \tilde{\beta}^{d-1}\,\sigma\sqrt{2h\log(4dh)} \ &\le\ \frac{\gamma}{4} \tag{6.50b}
\end{align*}


And from equations (6.48), (6.49) and (6.50b) we get that

\[
\frac{1}{2} \ \le\ \mathbb{P}\left[\|f_{w+u}(x) - f_w(x)\| \le \frac{\gamma}{4}\right].
\]

Therefore the condition (6.41) in Theorem 6.D.1 is satisfied. Finally we deduce from (6.50) that the largest value of $\sigma$ in terms of $\tilde{\beta}$ is,

\[
\sigma(\tilde{\beta}) := \min\left\{\ \frac{\gamma}{4 e^2 B d\, \tilde{\beta}^{d-1}\sqrt{2h\log(4dh)}},\ \ \frac{\tilde{\beta}}{d\, e^{\frac{1}{d-1}}\sqrt{2h\log(4dh)}}\ \right\} \tag{6.51}
\]
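For concreteness, (6.51) can be transcribed directly as follows; this is only a sketch, and the argument values in the example are hypothetical.

```python
import math

def sigma_of_beta_tilde(beta_tilde, gamma, B, d, h):
    # Direct transcription of equation (6.51).
    root = math.sqrt(2 * h * math.log(4 * d * h))
    from_b = gamma / (4 * math.e**2 * B * d * beta_tilde**(d - 1) * root)  # from (6.50b)
    from_a = beta_tilde / (d * math.exp(1 / (d - 1)) * root)               # from (6.50a)
    return min(from_b, from_a)

print(sigma_of_beta_tilde(beta_tilde=1.5, gamma=0.1, B=1.0, d=4, h=64))
```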

Note that for a given neural net weight $w$ (and hence the value $\beta$) the inequality event in (6.42) holds trivially under two conditions:

1. When $\beta \le \left(\frac{\gamma}{2B}\right)^{1/d}$: this implies $\|f_w(x)\| \le \frac{\gamma}{2}$, which implies $\hat{L}_\gamma = 1$ by definition. Therefore (6.42) holds trivially.

2. From the local sensitivity analysis done above it follows that we can invoke the above theorem with $P = N(0, \sigma(\tilde{\beta})^2 I)$ and $\mu_w = N(w, \sigma(\tilde{\beta})^2 I)$, which gives us the bound

\[
\mathrm{KL}(\mu_w \,\|\, P) \ \le\ \frac{\|w\|^2}{2\sigma(\tilde{\beta})^2} \ =\ \frac{\sum_{\ell=1}^{d}\|W_\ell\|_F^2}{2\sigma(\tilde{\beta})^2}. \tag{6.52}
\]

Note that in terms of $\beta$ (which can be directly read off from the given net $f_w$ in Theorem 6.1.3) we have
\[
\sigma(\tilde{\beta}) \ \ge\ \min\left\{\frac{\gamma}{4 e^3 B d\, \beta^{d-1}\sqrt{2h\log(4dh)}},\ \frac{\beta}{d\, e^{\frac{2}{d-1}}\sqrt{2h\log(4dh)}}\right\} \ =\ \frac{\beta\, e^{-\frac{2}{d-1}}}{d\sqrt{2h\log(4dh)}}\ \min\left\{\frac{\gamma}{4\, e^{3-\frac{2}{d-1}} B \beta^{d}},\ 1\right\}.
\]
Therefore
\[
\mathrm{KL}(\mu_w \,\|\, P) \ \le\ \frac{\sum_{\ell=1}^{d}\|W_\ell\|_F^2}{\beta^2}\cdot\frac{d^2 h \log(4dh)}{\exp\left(-\frac{4}{d-1}\right)\min\left\{\frac{\gamma^2}{16\, e^{6-\frac{4}{d-1}} B^2 \beta^{2d}},\ 1\right\}}.
\]
This upper bound on KL leads to the following upper bound on the square-root term in equation (6.42),

\[
\sqrt{\ \frac{\sum_{\ell=1}^{d}\|W_\ell\|_F^2\ d^2 h \log(4dh)}{2(m-1)\,\beta^2 \exp\left(-\frac{4}{d-1}\right)\min\left\{\frac{\gamma^2}{16\, e^{6-\frac{4}{d-1}} B^2 \beta^{2d}},\ 1\right\}}\ +\ \frac{1}{2(m-1)}\log\frac{3m}{\delta}\ }
\]

We note that for any $d, h \ge 1$ we have (a) by the A.M.-G.M. inequality, $\frac{\sum_{\ell=1}^{d}\|W_\ell\|_F^2}{\beta^2} \ge d \ge 1$, and (b) $\exp\left(-\frac{4}{d-1}\right) d^2 h \log(4dh) > 1$ (which is obvious on taking the logarithm of the LHS). Therefore a sufficient condition for the quantity above to be greater than $1$ (whence (6.42) is trivially true, since $L_0 \le 1$) is that we have $\min\left\{1, \left(\frac{\gamma}{4 B \beta^{d} e^{3-\frac{2}{d-1}}}\right)^2\right\} \le \frac{1}{m-1}$. And a sufficient condition for this to be true is that $\beta \ge \left(\frac{\sqrt{m-1}\,\gamma}{4\exp\left(3-\frac{2}{d-1}\right) B}\right)^{1/d}$.
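As a bookkeeping aid (not part of the argument), the KL bound (6.52) and the square-root term of (6.42) that were just estimated can be transcribed as the following sketch; the function and its inputs are our own naming, with sigma standing for $\sigma(\tilde{\beta})$.

```python
import numpy as np

def pac_bayes_sqrt_term(Ws, sigma, m, delta):
    # KL(mu_w || P) <= sum_l ||W_l||_F^2 / (2 sigma^2), as in (6.52),
    # for P = N(0, sigma^2 I) and mu_w = N(w, sigma^2 I).
    kl = sum(np.linalg.norm(W, 'fro')**2 for W in Ws) / (2 * sigma**2)
    # The square-root term of (6.42) with this KL estimate plugged in.
    return np.sqrt((kl + np.log(3 * m / delta)) / (2 * (m - 1)))
```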


From the above two points it follows that it suffices to prove (6.3) for
\[
\beta \in \left[\ \left(\frac{\gamma}{2B}\right)^{1/d},\ \left(\frac{\sqrt{m-1}\,\gamma}{4\exp\left(3-\frac{2}{d-1}\right) B}\right)^{1/d}\ \right].
\]

We note that if we want a grid on the interval $[a,b]$ s.t. for every value $x \in [a,b]$ there is a grid point $g$ s.t. $|x - g| \le \frac{x}{d}$, then a grid of size $\frac{bd}{2a}$ suffices.⁶ Hence a grid of the following size $K$ suffices for us,

\[
K \ =\ \frac{d}{2} \times \left(\frac{\sqrt{m-1}}{2\exp\left(3-\frac{2}{d-1}\right)}\right)^{1/d}
\]
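The grid size $K$ above is directly computable; the following is a minimal sketch (the values of $m$ and $d$ in the example are hypothetical):

```python
import math

def grid_size_K(m, d):
    # Direct transcription of the displayed formula for K.
    return (d / 2) * (math.sqrt(m - 1) / (2 * math.exp(3 - 2 / (d - 1))))**(1 / d)

print(math.ceil(grid_size_K(m=50000, d=4)))  # e.g. a grid of 4 priors suffices here
```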

Thus the theorem we set out to prove follows by invoking Theorem 6.B.1 with the $K$ computed above and recognizing that the set $\{\pi_i\}$ indexed by $i$ there is our set $\{N(0, \sigma(\tilde{\beta})^2 I)\}$ indexed by the grid points $\tilde{\beta}$ here, that $Q$ there is our $\mu_w$ here, and that equation (6.52) above is a bound on the term $\mathrm{KL}(Q \,\|\, \pi_i)$ there.

6.E The $\epsilon$–$\gamma$ lowerbound scatter plots from the experiments

[Figure 6.E.1 here: a grid of scatter-plot panels for depths d = 2, 3, 4, 5, 6, 7, 8 and 16.]

FIGURE 6.E.1: Scatter plots of the lowerbounds on ϵ and γ (as given in definition 25) while varying the depth of the net being trained on CIFAR-10 (10 trials/seeds for each)

⁶If $g$ is the grid point which is the required approximation to $x$, i.e. $|x-g| \le \frac{x}{d}$, then $x \in \left[\frac{d}{d+1}g,\ \frac{d}{d-1}g\right]$. Since $a \le g$ (the grid points $g$ here are our $\tilde{\beta}$'s), the length of this interval is $\left(\frac{d}{d-1} - \frac{d}{d+1}\right)g = \frac{2dg}{d^2-1} \ge \frac{2da}{d^2-1}$. So $\frac{2da}{d^2-1}$ is the smallest grid spacing that might be needed, and hence the maximum number of grid points needed is $\frac{(b-a)(d^2-1)}{2ad} < \frac{(b-a)d}{2a} < \frac{bd}{2a}$.


[Figure 6.E.2 here: a 7 × 5 grid of scatter-plot panels for depths d = 2, …, 8 and cluster separations a = 2, 4, 6, 8, 10.]

FIGURE 6.E.2: Scatter plots of the lowerbounds on ϵ and γ (as given in definition 25) for nets of varying depth d trained on the synthetic dataset for different values of the cluster separation parameter a (10 trials/seeds for each)


6.F KDE of the angular deviation during training on the synthetic dataset

[Figure 6.F.1 here: a 7 × 5 grid of KDE panels for depths d = 2, …, 8 and cluster separations a = 2, 4, 6, 8, 10; horizontal axis: theta.]

FIGURE 6.F.1: Kernel Density Estimates of the Angular Deviations θ between the initial and final networks for the different net depths d and cluster separation parameters a. The Y-axis shows the probability density function of the Gaussian kernel density estimate.


Bibliography

Abadi, Martín et al. (2016). “TensorFlow: A System for Large-Scale Machine Learning.” In: OSDI. Vol. 16, pp. 265–283.
Agarwal, Alekh et al. (2014). “Learning Sparsely Used Overcomplete Dictionaries.” In: COLT, pp. 123–137.
Alain, Guillaume and Yoshua Bengio (2014). “What regularized auto-encoders learn from the data-generating distribution.” In: Journal of Machine Learning Research 15.1, pp. 3563–3593.
Allen-Zhu, Zeyuan (2017). “Natasha 2: Faster Non-Convex Optimization Than SGD”. In: arXiv preprint arXiv:1708.08694.
Allen-Zhu, Zeyuan and Yuanzhi Li (2019). “What Can ResNet Learn Efficiently, Going Beyond Kernels?” In: Advances in Neural Information Processing Systems, pp. 9015–9025.
Allen-Zhu, Zeyuan, Yuanzhi Li, and Yingyu Liang (2019). “Learning and generalization in overparameterized neural networks, going beyond two layers”. In: Advances in neural information processing systems, pp. 6155–6166.
Allen-Zhu, Zeyuan, Yuanzhi Li, and Zhao Song (2019a). “A Convergence Theory for Deep Learning via Over-Parameterization”. In: International Conference on Machine Learning, pp. 242–252.
— (2019b). “On the convergence rate of training recurrent neural networks”. In: Advances in Neural Information Processing Systems, pp. 6673–6685.
Allender, Eric (1998). Complexity Theory Lecture Notes. https://www.cs.rutgers.edu/~allender/lecture.notes/.
Anandkumar, Animashree et al. (2014). “Tensor decompositions for learning latent variable models.” In: Journal of Machine Learning Research 15.1, pp. 2773–2832.
Andreev, Alexander E (1987). About one method of obtaining more than quadratic effective lower bounds of complexity of π-schemes.
Anthony, Martin and Peter L. Bartlett (1999). Neural network learning: Theoretical foundations. Cambridge University Press.
Arora, Sanjeev and Boaz Barak (2009). Computational complexity: a modern approach. Cambridge University Press.


Arora, Sanjeev, Rong Ge, and Ankur Moitra (2014). “New Algorithms for Learning Incoherent and Overcomplete Dictionaries.” In: COLT, pp. 779–806.
Arora, Sanjeev et al. (2014). “More algorithms for provable dictionary learning”. In: arXiv:1401.0579.
Arora, Sanjeev et al. (2015). “Simple, efficient, and neural algorithms for sparse coding.” In: COLT, pp. 113–149.
Arora, Sanjeev et al. (2018). “Stronger generalization bounds for deep nets via a compression approach”. In: arXiv preprint arXiv:1802.05296.
Arora, Sanjeev et al. (2019a). “Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks”. In: International Conference on Machine Learning, pp. 322–332.
Arora, Sanjeev et al. (2019b). “Harnessing the Power of Infinitely Wide Deep Nets on Small-data Tasks”. In: arXiv preprint arXiv:1910.01663.
Arora, Sanjeev et al. (2019c). “On exact computation with an infinitely wide neural net”. In: Advances in Neural Information Processing Systems, pp. 8139–8148.
Arpit, Devansh et al. (2015). “Why regularized auto-encoders learn sparse representation?” In: arXiv preprint arXiv:1505.05561.
— (2016). “Why regularized auto-encoders learn sparse representation?” In: International Conference on Machine Learning, pp. 136–144.
Audibert, Jean-Yves and Olivier Bousquet (2007). “Combining PAC-Bayesian and generic chaining bounds”. In: Journal of Machine Learning Research 8.Apr, pp. 863–889.
Babanezhad, Reza et al. (2015). “Stop Wasting My Gradients: Practical SVRG”. In: arXiv preprint arXiv:1511.01942.
Bahar, Parnia et al. (2017). “Empirical investigation of optimization algorithms in neural machine translation”. In: The Prague Bulletin of Mathematical Linguistics 108.1, pp. 13–25.
Baldi, Pierre (2012). “Autoencoders, unsupervised learning, and deep architectures”. In: Proceedings of ICML Workshop on Unsupervised and Transfer Learning, pp. 37–49.
Bartlett, Peter L (1998). “The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network”. In: IEEE transactions on Information Theory 44.2, pp. 525–536.
Bartlett, Peter L, Dylan J Foster, and Matus J Telgarsky (2017). “Spectrally-normalized margin bounds for neural networks”. In: Advances in Neural Information Processing Systems, pp. 6240–6249.
Bengio, Yoshua et al. (2013). “Generalized denoising auto-encoders as generative models”. In: Advances in Neural Information Processing Systems, pp. 899–907.
Bernstein, Jeremy et al. (2018). “signSGD: compressed optimisation for non-convex problems”. In: arXiv preprint arXiv:1802.04434.


Błasiok, Jarosław and Jelani Nelson (2016). “An improved analysis of the ER-SpUD dictionary learning algorithm”. In: arXiv:1602.05719.
Blum, Avrim L. and Ronald L. Rivest (1992). “Training a 3-node neural network is NP-complete”. In: Neural Networks 5.1, pp. 117–127.
Boob, Digvijay, Santanu S Dey, and Guanghui Lan (2018). “Complexity of training relu neural network”. In: arXiv preprint arXiv:1809.10787.
Bora, Ashish et al. (2017). “Compressed Sensing using Generative Models”. In: arXiv preprint arXiv:1703.03208.
Buhrman, Harry, Nikolay Vereshchagin, and Ronald de Wolf (2007). “On computation and communication with small bias”. In: Computational Complexity, 2007. CCC’07. Twenty-Second Annual IEEE Conference on. IEEE, pp. 24–32.
Bun, Mark and Justin Thaler (2016). “Improved Bounds on the Sign-Rank of AC^0”. In: LIPIcs-Leibniz International Proceedings in Informatics. Vol. 55. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.
Chattopadhyay, Arkadev and Nikhil S Mande (2017). “Weights at the Bottom Matter When the Top is Heavy”. In: Electronic Colloquium on Computational Complexity, Revision 1 of Report No. 83, https://eccc.weizmann.ac.il/report/2017/083/.
Chen, Jinghui and Quanquan Gu (2018). “Closing the generalization gap of adaptive gradient methods in training deep neural networks”. In: arXiv preprint arXiv:1806.06763.
Chen, Ruiwen, Rahul Santhanam, and Srikanth Srinivasan (2016). “Average-case lower bounds and satisfiability algorithms for small threshold circuits”. In: LIPIcs-Leibniz International Proceedings in Informatics. Vol. 50. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.
Chen, Xiangyi et al. (2018). “On the convergence of a class of adam-type algorithms for non-convex optimization”. In: arXiv preprint arXiv:1808.02941.
Chizat, Lenaic and Francis Bach (2018). “On the global convergence of gradient descent for over-parameterized models using optimal transport”. In: Advances in neural information processing systems, pp. 3036–3046.
Coates, Adam, Andrew Ng, and Honglak Lee (2011). “An analysis of single-layer networks in unsupervised feature learning”. In: Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 215–223.
Coates, Adam and Andrew Y Ng (2011). “The importance of encoding versus training with sparse coding and vector quantization”. In: Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 921–928.
Cybenko, George (1989). “Approximation by superpositions of a sigmoidal function”. In: Mathematics of control, signals and systems 2.4, pp. 303–314.


Dahl, George E., Tara N. Sainath, and Geoffrey E. Hinton (2013). “Improving deep neural networks for LVCSR using rectified linear units and dropout”. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, pp. 8609–8613.
Daniely, Amit (2017). “Depth Separation for Neural Networks”. In: arXiv preprint arXiv:1702.08489.
DasGupta, Bhaskar, Hava T. Siegelmann, and Eduardo Sontag (1995). “On the complexity of training neural networks with continuous activation functions”. In: IEEE Transactions on Neural Networks 6.6, pp. 1490–1504.
De, Soham, Anirbit Mukherjee, and Enayat Ullah (2018). “Convergence guarantees for RMSProp and ADAM in non-convex optimization and an empirical comparison to Nesterov acceleration”. In: ICML 2018 Workshop on Modern Trends in Nonconvex Optimization for Machine Learning (arXiv:1807.06766), https://tinyurl.com/y5hw79vx.
De, Soham et al. (2017). “Automated inference with adaptive batches”. In: Artificial Intelligence and Statistics, pp. 1504–1513.
Defazio, Aaron, Francis Bach, and Simon Lacoste-Julien (2014). “SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives”. In: Advances in neural information processing systems, pp. 1646–1654.
Denkowski, Michael and Graham Neubig (2017). “Stronger baselines for trustable results in neural machine translation”. In: arXiv preprint arXiv:1706.09733.
Dey, Santanu S, Guanyi Wang, and Yao Xie (2018). “An Approximation Algorithm for training One-Node ReLU Neural Network”. In: arXiv preprint arXiv:1810.03592.
Du, Simon and Jason Lee (2018). “On the Power of Over-parametrization in Neural Networks with Quadratic Activation”. In: International Conference on Machine Learning, pp. 1329–1338.
Du, Simon S, Jason D Lee, and Yuandong Tian (2017). “When is a Convolutional Filter Easy To Learn?” In: arXiv preprint arXiv:1709.06129.
Du, Simon S et al. (2018). “Gradient Descent Finds Global Minima of Deep Neural Networks”. In:
Duchi, John, Elad Hazan, and Yoram Singer (2011). “Adaptive subgradient methods for online learning and stochastic optimization”. In: Journal of Machine Learning Research 12.Jul, pp. 2121–2159.
Durmus, Alain and Szymon Majewski (2019). “Analysis of Langevin Monte Carlo via Convex Optimization.” In: Journal of Machine Learning Research 20.73, pp. 1–46.
Dziugaite, Gintare Karolina and Daniel Roy (2018a). “Entropy-SGD optimizes the prior of a PAC-Bayes bound: Generalization properties of Entropy-SGD and data-dependent priors”. In: International Conference on Machine Learning, pp. 1376–1385.
Dziugaite, Gintare Karolina and Daniel M Roy (2017). “Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data”. In: arXiv preprint arXiv:1703.11008.


— (2018b). “Data-dependent PAC-Bayes priors via differential privacy”. In: arXiv preprint arXiv:1802.09583.
Eldan, Ronen and Ohad Shamir (2016). “The Power of Depth for Feedforward Neural Networks”. In: 29th Annual Conference on Learning Theory, pp. 907–940.
Forster, Jürgen (2002). “A linear lower bound on the unbounded error probabilistic communication complexity”. In: Journal of Computer and System Sciences 65.4, pp. 612–625.
Forster, Jürgen et al. (2001). “Relations between communication complexity, linear arrangements, and computational complexity”. In: International Conference on Foundations of Software Technology and Theoretical Computer Science. Springer, pp. 171–182.
Freund, Yoav and Robert E Schapire (1999). “Large margin classification using the perceptron algorithm”. In: Machine learning 37.3, pp. 277–296.
Fridman, Lex et al. (2017). “MIT autonomous vehicle technology study: Large-scale deep learning based analysis of driver behavior and interaction with automation”. In: arXiv preprint arXiv:1711.06976.
Gadat, Sébastien, Fabien Panloup, Sofiane Saadane, et al. (2018). “Stochastic heavy ball”. In: Electronic Journal of Statistics 12.1, pp. 461–529.
Ge, Rong, Chi Jin, and Yi Zheng (2017). “No Spurious Local Minima in Nonconvex Low Rank Problems: A Unified Geometric Analysis”. In: arXiv preprint arXiv:1704.00708.
Gilbert, Anna. “CBMS Conference on Sparse Approximation and Signal Recovery Algorithms, May 22-26, 2017 and 16th New Mexico Analysis Seminar, May 21”. In: https://www.math.nmsu.edu/jlakey/cbms2017/cbmslecturenotes.html ().
Gilbert, Anna C et al. (2017). “Towards Understanding the Invertibility of Convolutional Neural Networks”. In: arXiv preprint arXiv:1705.08664.
Glorot, Xavier and Yoshua Bengio (2010). “Understanding the difficulty of training deep feedforward neural networks”. In: Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 249–256.
Goel, Surbhi and Adam Klivans (2017). “Learning depth-three neural networks in polynomial time”. In: arXiv preprint arXiv:1709.06010.
Goel, Surbhi, Adam Klivans, and Raghu Meka (2018). “Learning one convolutional layer with overlapping patches”. In: arXiv preprint arXiv:1802.02547.
Goel, Surbhi et al. (2016). “Reliably Learning the ReLU in Polynomial Time”. In: arXiv preprint arXiv:1611.10258.
Golowich, Noah, Alexander Rakhlin, and Ohad Shamir (2018). “Size-Independent Sample Complexity of Neural Networks”. In: Conference On Learning Theory, pp. 297–299.
Goodfellow, Ian J et al. (2013). “Maxout networks”. In: arXiv preprint arXiv:1302.4389.
Gregor, Karol et al. (2015). “DRAW: A recurrent neural network for image generation”. In: arXiv preprint arXiv:1502.04623.


Haeffele, Benjamin D. and René Vidal (2015). “Global optimality in tensor factorization, deep learning, and beyond”. In: arXiv preprint arXiv:1506.07540.
Hajnal, András et al. (1987). “Threshold circuits of bounded depth”. In: Foundations of Computer Science, 1987., 28th Annual Symposium on. IEEE, pp. 99–110.
Hanin, Boris (2017). “Universal Function Approximation by Deep Neural Nets with Bounded Width and ReLU Activations”. In: arXiv preprint arXiv:1708.02691.
Harvey, Nick, Christopher Liaw, and Abbas Mehrabian (2017). “Nearly-tight VC-dimension bounds for piecewise linear neural networks”. In: Conference on Learning Theory, pp. 1064–1068.
Håstad, Johan (1986). “Almost optimal lower bounds for small depth circuits”. In: Proceedings of the eighteenth annual ACM symposium on Theory of computing. ACM, pp. 6–20.
Hinton, Geoffrey and Drew Van Camp (1993). “Keeping neural networks simple by minimizing the description length of the weights”. In: Proc. of the 6th Ann. ACM Conf. on Computational Learning Theory. Citeseer.
Hinton, Geoffrey et al. (2012). “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups”. In: IEEE Signal Processing Magazine 29.6, pp. 82–97.
Hinton, Geoffrey E., Simon Osindero, and Yee-Whye Teh (2006). “A fast learning algorithm for deep belief nets”. In: Neural computation 18.7, pp. 1527–1554.
Hornik, Kurt (1991). “Approximation capabilities of multilayer feedforward networks”. In: Neural networks 4.2, pp. 251–257.
Huang, Jiaoyang and Horng-Tzer Yau (2019). “Dynamics of deep neural networks and neural tangent hierarchy”. In: arXiv preprint arXiv:1909.08156.
Impagliazzo, Russell, Raghu Meka, and David Zuckerman (2012). “Pseudorandomness from shrinkage”. In: Foundations of Computer Science (FOCS), 2012 IEEE 53rd Annual Symposium on. IEEE, pp. 111–119.
Impagliazzo, Russell and Moni Naor (1988). “Decision trees and downward closures”. In: Structure in Complexity Theory Conference, 1988. Proceedings., Third Annual. IEEE, pp. 29–38.
Impagliazzo, Russell, Ramamohan Paturi, and Michael E Saks (1997). “Size–Depth Tradeoffs for Threshold Circuits”. In: SIAM Journal on Computing 26.3, pp. 693–707.
Ioffe, Sergey and Christian Szegedy (2015). “Batch normalization: Accelerating deep network training by reducing internal covariate shift”. In: arXiv preprint arXiv:1502.03167.
Jacot, Arthur, Franck Gabriel, and Clément Hongler (2018). “Neural tangent kernel: Convergence and generalization in neural networks”. In: Advances in neural information processing systems, pp. 8571–8580.
Janzamin, Majid, Hanie Sedghi, and Anima Anandkumar (2015). “Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods”. In: arXiv preprint arXiv:1506.08473.


Jin, Chi, Praneeth Netrapalli, and Michael I Jordan (2017). “Accelerated Gradient Descent Escapes Saddle Points Faster than Gradient Descent”. In: arXiv preprint arXiv:1711.10456.
Jin, Chi et al. (2018). “On the local minima of the empirical risk”. In: Advances in Neural Information Processing Systems, pp. 4896–4905.
Johnson, Rie and Tong Zhang (2013). “Accelerating stochastic gradient descent using predictive variance reduction”. In: Advances in neural information processing systems, pp. 315–323.
Jukna, Stasys (2012). Boolean function complexity: advances and frontiers. Vol. 27. Springer Science & Business Media.
Kabanets, Valentine, Daniel Kane, and Zhenjian Lu (2017). “A Polynomial Restriction Lemma with Applications.” In: Electronic Colloquium on Computational Complexity (ECCC). Vol. 24, p. 26.
Kakade, Sham M et al. (2011). “Efficient learning of generalized linear and single index models with isotonic regression”. In: Advances in Neural Information Processing Systems, pp. 927–935.
Kalan, Seyed Mohammadreza Mousavi, Mahdi Soltanolkotabi, and A Salman Avestimehr (2019). “Fitting relus via sgd and quantized sgd”. In: 2019 IEEE International Symposium on Information Theory (ISIT). IEEE, pp. 2469–2473.
Kane, Daniel M. and Ryan Williams (2015). “Super-linear gate and super-quadratic wire lower bounds for depth-two and depth-three threshold circuits”. In: arXiv preprint arXiv:1511.07860.
Kane, Daniel M and Ryan Williams (2016). “Super-linear gate and super-quadratic wire lower bounds for depth-two and depth-three threshold circuits”. In: Proceedings of the forty-eighth annual ACM symposium on Theory of Computing. ACM, pp. 633–643.
Kawaguchi, Kenji (2016). “Deep Learning without Poor Local Minima”. In: arXiv preprint arXiv:1605.07110.
Kawaguchi, Kenji and Jiaoyang Huang (2019). “Gradient descent finds global minima for generalizable deep neural networks of practical sizes”. In: 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, pp. 92–99.
Keskar, Nitish Shirish and Richard Socher (2017). “Improving Generalization Performance by Switching from Adam to SGD”. In: arXiv preprint arXiv:1712.07628.
Kidambi, Rahul et al. (2018). “On the insufficiency of existing momentum schemes for Stochastic Optimization”. In: International Conference on Learning Representations. URL: https://openreview.net/forum?id=rJTutzbA-.
Kingma, Diederik P and Jimmy Ba (2014). Adam: A Method for Stochastic Optimization. arXiv.org.
Klivans, Adam and Raghu Meka (2017). “Learning graphical models using multiplicative weights”. In: 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS). IEEE, pp. 343–354.


Krause, Matthias and Pavel Pudlák (1994). “On the computational power of depth 2 circuits with threshold and modulo gates”. In: Proceedings of the twenty-sixth annual ACM symposium on Theory of computing. ACM, pp. 48–57.
Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton (2012). “Imagenet classification with deep convolutional neural networks”. In: Advances in neural information processing systems, pp. 1097–1105.
Kuchaiev, Oleksii and Boris Ginsburg (2017). “Training Deep AutoEncoders for Collaborative Filtering”. In: arXiv preprint arXiv:1708.01715.
Langford, John and Matthias Seeger (2001). Bounds for averaging classifiers. Tech. rep. Carnegie Mellon, Department of Computer Science.
Le, Quoc V. (2013). “Building high-level features using large scale unsupervised learning”. In: 2013 IEEE international conference on acoustics, speech and signal processing. IEEE, pp. 8595–8598.
LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton (2015). “Deep learning”. In: Nature 521.7553, pp. 436–444.
Ledoux, Michel and Michel Talagrand (2013). Probability in Banach Spaces: isoperimetry and processes. Springer Science & Business Media.
Lee, Holden, Oren Mangoubi, and Nisheeth Vishnoi (2019). “Online sampling from log-concave distributions”. In: Advances in Neural Information Processing Systems, pp. 1226–1237.
Lee, Jaehoon et al. (2018). “Deep Neural Networks as Gaussian Processes”. In:
Lee, Troy, Adi Shraibman, et al. (2009). “Lower bounds in communication complexity”. In: Foundations and Trends® in Theoretical Computer Science 3.4, pp. 263–399.
Lehoucq, Richard B, Danny C Sorensen, and Chao Yang (1998). ARPACK users’ guide: solution of large-scale eigenvalue problems with implicitly restarted Arnoldi methods. Vol. 6. Siam.
Li, Jian, Xuanyuan Luo, and Mingda Qiao (2019). “On generalization error bounds of noisy gradient methods for non-convex learning”. In: arXiv preprint arXiv:1902.00621.
Li, Jun et al. (2016). “Sparseness analysis in the pretraining of deep neural networks”. In: IEEE transactions on neural networks and learning systems.
Li, Xiaoyu and Francesco Orabona (2018). “On the Convergence of Stochastic Gradient Descent with Adaptive Stepsizes”. In: arXiv preprint arXiv:1805.08114.
Li, Yuanzhi and Yang Yuan (2017). “Convergence Analysis of Two-layer Neural Networks with ReLU Activation”. In: arXiv preprint arXiv:1705.09886.
Li, Zhiyuan et al. (2019). “Enhanced Convolutional Neural Tangent Kernels”. In: arXiv preprint arXiv:1911.00809.
Liang, Shiyu and R Srikant (2016). “Why Deep Neural Networks for Function Approximation?” In:
Loizou, Nicolas and Peter Richtárik (2017). “Momentum and stochastic momentum for stochastic gradient, Newton, proximal point and subspace descent methods”. In: arXiv preprint arXiv:1712.09677.


Lokam, Satyanarayana V et al. (2009). “Complexity lower bounds using linear algebra”. In: Foundations and Trends® in Theoretical Computer Science 4.1–2, pp. 1–155.
Lucas, James, Richard Zemel, and Roger Grosse (2018). “Aggregated Momentum: Stability Through Passive Damping”. In: arXiv preprint arXiv:1804.00325.
Maass, Wolfgang (1997). “Bounds for the computational power and learning complexity of analog neural nets”. In: SIAM Journal on Computing 26.3, pp. 708–732.
Makhzani, Alireza and Brendan Frey (2013). “K-sparse autoencoders”. In: arXiv preprint arXiv:1312.5663.
Makhzani, Alireza and Brendan J Frey (2015). “Winner-take-all autoencoders”. In: Advances in Neural Information Processing Systems, pp. 2791–2799.
Manurangsi, Pasin and Daniel Reichman (2018). “The computational complexity of training relu(s)”. In: arXiv preprint arXiv:1810.04207.
Martens, James and Roger Grosse (2015). “Optimizing neural networks with kronecker-factored approximate curvature”. In: International conference on machine learning, pp. 2408–2417.
Matousek, Jiri (2002). Lectures on discrete geometry. Vol. 212. Springer Science & Business Media.
McAllester, David (2003). “Simplified PAC-Bayesian margin bounds”. In: Learning theory and Kernel machines. Springer, pp. 203–215.
McAllester, David A (1999). “PAC-Bayesian model averaging”. In: Proceedings of the twelfth annual conference on Computational learning theory. ACM, pp. 164–170.
Mei, Song, Yu Bai, and Andrea Montanari (2016). “The landscape of empirical risk for non-convex losses”. In: arXiv preprint arXiv:1607.06534.
Melis, Gábor, Chris Dyer, and Phil Blunsom (2017). “On the state of the art of evaluation in neural language models”. In: arXiv preprint arXiv:1707.05589.
Moitra, Ankur and Gregory Valiant (2010). “Settling the polynomial learnability of mixtures of gaussians”. In: Foundations of Computer Science (FOCS), 2010 51st Annual IEEE Symposium on. IEEE, pp. 93–102.
Montufar, Guido F. et al. (2014). “On the number of linear regions of deep neural networks”. In: Advances in neural information processing systems, pp. 2924–2932.
Mou, Wenlong et al. (2018). “Generalization Bounds of SGLD for Non-convex Learning: Two Theoretical Viewpoints”. In: Conference On Learning Theory, pp. 605–638.
Mukherjee, Anirbit and Ramchandran Muthukumar (2020a). “A Study of Neural Training with Non-Gradient and Noise Assisted Gradient Methods”. In: arXiv preprint arXiv:2005.04211.
— (2020b). “Guarantees on learning depth-2 neural networks under a data-poisoning attack”. In: arXiv preprint arXiv:2005.01699.
Nagarajan, Vaishnavh and J Zico Kolter (2019a). “Generalization in deep networks: The role of distance from initialization”. In: arXiv preprint arXiv:1901.01672.


Nagarajan, Vaishnavh and Zico Kolter (2019b). “Deterministic PAC-Bayesian generalization bounds for deep networks via generalizing noise-resilience”. In: International Conference on Learning Representations. URL: https://openreview.net/forum?id=Hygn2o0qKX.
Neal, Radford M (1996). “Priors for infinite networks”. In: Bayesian Learning for Neural Networks. Springer, pp. 29–53.
Nesterov, Yurii (1983). “A method of solving a convex programming problem with convergence rate O(1/k^2)”. In: Soviet Mathematics Doklady. Vol. 27. 2, pp. 372–376.
Neyshabur, Behnam et al. (2017). “A pac-bayesian approach to spectrally-normalized margin bounds for neural networks”. In: arXiv preprint arXiv:1707.09564.
Ng, Andrew (2011). “Sparse autoencoder”. In:
Nguyen, Thanh V, Raymond KW Wong, and Chinmay Hegde (2019). “On the dynamics of gradient descent for autoencoders”. In: The 22nd International Conference on Artificial Intelligence and Statistics, pp. 2858–2867.
Ochs, Peter (2016). “Local Convergence of the Heavy-ball Method and iPiano for Non-convex Optimization”. In: arXiv preprint arXiv:1606.09070.
Olshausen, Bruno A and David J Field (1996). “Emergence of simple-cell receptive field properties by learning a sparse code for natural images”. In: Nature 381.6583, p. 607.
— (1997). “Sparse coding with an overcomplete basis set: A strategy employed by V1?” In: Vision research 37.23, pp. 3311–3325.
— (2005). “How close are we to understanding V1?” In: Neural computation 17.8, pp. 1665–1699.
O’Neill, Michael and Stephen J Wright (2017). “Behavior of accelerated gradient methods near critical points of nonconvex problems”. In: arXiv preprint arXiv:1706.07993.
Pal, Sankar K and Sushmita Mitra (1992). “Multilayer perceptron, fuzzy sets, classification”. In:
Pascanu, Razvan, Guido Montufar, and Yoshua Bengio (2013). “On the number of response regions of deep feed forward networks with piece-wise linear activations”. In: arXiv preprint arXiv:1312.6098.
Paterson, Michael S and Uri Zwick (1993). “Shrinkage of de Morgan formulae under restriction”. In: Random Structures & Algorithms 4.2, pp. 135–150.
Polyak, Boris T (1987). “Introduction to optimization. Translations series in mathematics and engineering”. In: Optimization Software.
Radford, Alec, Luke Metz, and Soumith Chintala (2015). “Unsupervised representation learning with deep convolutional generative adversarial networks”. In: arXiv preprint arXiv:1511.06434.
Raghu, Maithra et al. (2016). “On the expressive power of deep neural networks”. In: arXiv preprint arXiv:1606.05336.


Raginsky, Maxim, Alexander Rakhlin, and Matus Telgarsky (2017). “Non-convex learning via Stochastic Gradient Langevin Dynamics: a nonasymptotic analysis”. In: Conference on Learning Theory, pp. 1674–1703.
Rangamani, Akshay et al. (2017). “Critical Points Of An Autoencoder Can Provably Recover Sparsely Used Overcomplete Dictionaries”. In: arXiv preprint arXiv:1708.03735.
Razborov, Alexander A. (1987). “Lower bounds on the size of bounded depth circuits over a complete basis with logical addition”. In: Mathematical Notes 41.4, pp. 333–338.
Razborov, Alexander A (1992). “On small depth threshold circuits”. In: Scandinavian Workshop on Algorithm Theory. Springer, pp. 42–52.
Razborov, Alexander A and Alexander A Sherstov (2010). “The Sign-Rank of AC^0”. In: SIAM Journal on Computing 39.5, pp. 1833–1855.
Reddi, Sashank J, Satyen Kale, and Sanjiv Kumar (2018). “On the convergence of adam and beyond”. In: International Conference on Learning Representations.
Rifai, Salah et al. (2011). “Contractive auto-encoders: Explicit invariance during feature extraction”. In: Proceedings of the 28th international conference on machine learning (ICML-11), pp. 833–840.
Rosenblatt, Frank (1958). “The perceptron: a probabilistic model for information storage and organization in the brain.” In: Psychological review 65.6, p. 386.
Rossman, Benjamin (2008). “On the constant-depth complexity of k-clique”. In: Proceedings of the fortieth annual ACM symposium on Theory of computing. ACM, pp. 721–730.
Rossman, Benjamin, Rocco A. Servedio, and Li-Yang Tan (2015). “An average-case depth hierarchy theorem for Boolean circuits”. In: Foundations of Computer Science (FOCS), 2015 IEEE 56th Annual Symposium on. IEEE, pp. 1030–1048.
Royden, H.L. and P.M. Fitzpatrick (2010). Real Analysis. Prentice Hall.
Safran, Itay and Ohad Shamir (2016). “Depth separation in relu networks for approximating smooth non-linear functions”. In: arXiv preprint arXiv:1610.09887.
— (2017). “Depth-width tradeoffs in approximating natural functions with neural networks”. In: International Conference on Machine Learning, pp. 2979–2987.
Salakhutdinov, Ruslan and Geoffrey E. Hinton (2009). “Deep Boltzmann Machines.” In: International Conference on Artificial Intelligence and Statistics (AISTATS). Vol. 1, p. 3.
Saptharishi, R. (2014). A survey of lower bounds in arithmetic circuit complexity.
Scott, David W (2015). Multivariate density estimation: theory, practice, and visualization. John Wiley & Sons.
Sedghi, Hanie and Anima Anandkumar (2014). “Provable methods for training neural networks with sparse connectivity”. In: arXiv preprint arXiv:1412.2693.


Sermanet, Pierre et al. (2014). “OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks”. In: International Conference on Learning Representations (ICLR 2014). arXiv preprint arXiv:1312.6229.
Serra, Thiago, Christian Tjandraatmadja, and Srikumar Ramalingam (2017). “Bounding and counting linear regions of deep neural networks”. In: arXiv preprint arXiv:1711.02114.
Shalev-Shwartz, Shai and Shai Ben-David (2014). Understanding machine learning: From theory to algorithms. Cambridge university press.
Shamir, Ohad (2016). “Distribution-Specific Hardness of Learning Neural Networks”. In: arXiv preprint arXiv:1609.01037.
Sherstov, Alexander A (2007). “Powering requires threshold depth 3”. In: Information processing letters 102.2-3, pp. 104–107.
— (2009). “Separating AC^0 from Depth-2 Majority Circuits”. In: SIAM Journal on Computing 38.6, pp. 2113–2129.
— (2011). “The unbounded-error communication complexity of symmetric functions”. In: Combinatorica 31.5, pp. 583–614.
Shpilka, Amir and Amir Yehudayoff (2010). “Arithmetic circuits: A survey of recent results and open questions”. In: Foundations and Trends® in Theoretical Computer Science 5.3–4, pp. 207–388.
Silver, David et al. (2017). “Mastering the game of go without human knowledge”. In: Nature 550.7676, p. 354.
Silver, David et al. (2018). “A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play”. In: Science 362.6419, pp. 1140–1144.
Simonyan, Karen and Andrew Zisserman (2014). “Very deep convolutional networks for large-scale image recognition”. In: arXiv preprint arXiv:1409.1556.
Siu, Kai-Yeung, Vwani P Roychowdhury, and Thomas Kailath (1994). “Rational approximation techniques for analysis of neural networks”. In: IEEE Transactions on Information Theory 40.2, pp. 455–466.
Smolensky, Roman (1987). “Algebraic methods in the theory of lower bounds for Boolean circuit complexity”. In: Proceedings of the nineteenth annual ACM symposium on Theory of computing. ACM, pp. 77–82.
Soltanolkotabi, Mahdi (2017). “Learning relus via gradient descent”. In: Advances in neural information processing systems, pp. 2007–2017.
Spielman, Daniel A, Huan Wang, and John Wright (2012). “Exact Recovery of Sparsely-Used Dictionaries.” In: COLT, pp. 37–1.
Srivastava, Nitish et al. (2014). “Dropout: a simple way to prevent neural networks from overfitting.” In: Journal of Machine Learning Research 15.1, pp. 1929–1958.
Håstad, Johan (1998). “The shrinkage exponent of de Morgan formulas is 2”. In: SIAM Journal on Computing 27.1, pp. 48–64.

Staib, Matthew et al. (2019). “Escaping saddle points with adaptive gradient methods”. In: arXiv preprint arXiv:1901.09149.
Su, Lili and Pengkun Yang (2019). “On Learning Over-parameterized Neural Networks: A Functional Approximation Perspective”. In: Advances in Neural Information Processing Systems, pp. 2637–2646.
Subbotovskaya, Bella Abramovna (1961). “Realizations of linear functions by formulas using +”. In: Doklady Akademii Nauk SSSR 136.3, pp. 553–555.
Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le (2014). “Sequence to sequence learning with neural networks”. In: Advances in neural information processing systems, pp. 3104–3112.
Sutskever, Ilya et al. (2013). “On the importance of initialization and momentum in deep learning”. In: International conference on machine learning, pp. 1139–1147.
Tamaki, Suguru (2016). “A Satisfiability Algorithm for Depth Two Circuits with a Sub-Quadratic Number of Symmetric and Threshold Gates.” In: Electronic Colloquium on Computational Complexity (ECCC). Vol. 23. 100, p. 4.
Telgarsky, Matus (2015). “Representation Benefits of Deep Feedforward Networks”. In: arXiv preprint arXiv:1509.08101.
— (2016a). “Benefits of depth in neural networks”. In: arXiv preprint arXiv:1602.04485.
— (2016b). “Benefits of depth in neural networks”. In: 29th Annual Conference on Learning Theory, pp. 1517–1539.
Tian, Yuandong (2017). “An Analytical Formula of Population Gradient for two-layered ReLU network and its Applications in Convergence and Critical Point Analysis”. In: arXiv preprint arXiv:1703.00560.
Tieleman, T. and G. Hinton. “RMSprop Gradient Optimization”. In: (). URL: http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf.
Tieleman, Tijmen and Geoffrey Hinton (2012). “Lecture 6.5-RMSProp, COURSERA: Neural networks for machine learning”. In: University of Toronto, Technical Report.
Tillmann, Andreas M (2015). “On the computational intractability of exact and approximate dictionary learning”. In: IEEE Signal Processing Letters 22.1, pp. 45–49.
Townsend, Jamie (2008). A new trick for calculating Jacobian vector products. https://j-towns.github.io/2017/06/12/A-new-trick.html. [Online; accessed 17-May-2018].
Tropp, Joel A (2012). “User-friendly tail bounds for sums of random matrices”. In: Foundations of computational mathematics 12.4, pp. 389–434.
Vardan, Papyan, Yaniv Romano, and Michael Elad (2016). “Convolutional Neural Networks Analyzed via Convolutional Sparse Coding”. In: arXiv preprint arXiv:1607.08194.


Vaswani, Sharan, Francis Bach, and Mark Schmidt (2018). “Fast and faster convergence of SGD for over-parameterized models and an accelerated perceptron”. In: arXiv preprint arXiv:1810.07288.
Vincent, Pascal et al. (2008). “Extracting and composing robust features with denoising autoencoders”. In: Proceedings of the 25th international conference on Machine learning. ACM, pp. 1096–1103.
Vincent, Pascal et al. (2010). “Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion”. In: Journal of Machine Learning Research 11.Dec, pp. 3371–3408.
Wang, Shuning (2004). “General constructive representations for continuous piecewise-linear functions”. In: IEEE Transactions on Circuits and Systems I: Regular Papers 51.9, pp. 1889–1896.
Wang, Shuning and Xusheng Sun (2005). “Generalization of hinging hyperplanes”. In: IEEE Transactions on Information Theory 51.12, pp. 4425–4431.
Ward, Rachel, Xiaoxia Wu, and Leon Bottou (2019). “AdaGrad stepsizes: sharp convergence over nonconvex landscapes”. In: International Conference on Machine Learning, pp. 6677–6686.
Wei, Colin et al. (2019). “Regularization matters: Generalization and optimization of neural nets vs their induced kernel”. In: Advances in Neural Information Processing Systems, pp. 9709–9721.
Wiegerinck, Wim, Andrzej Komoda, and Tom Heskes (1994). “Stochastic dynamics of learning with momentum in neural networks”. In: Journal of Physics A: Mathematical and General 27.13, p. 4425.
Williams, R Ryan (2018). “Limits on representing Boolean functions by linear combinations of simple functions: thresholds, ReLUs, and low-degree polynomials”. In: arXiv preprint arXiv:1802.09121.
Wilson, Ashia C et al. (2017). “The marginal value of adaptive gradient methods in machine learning”. In: Advances in Neural Information Processing Systems, pp. 4151–4161.
Wu, Lei, Zhanxing Zhu, et al. (2017). “Towards Understanding Generalization of Deep Learning: Perspective of Loss Landscapes”. In: arXiv preprint arXiv:1706.10239.
Wu, Xiaoxia, Simon S Du, and Rachel Ward (2019). “Global convergence of adaptive gradient methods for an over-parameterized neural network”. In: arXiv preprint arXiv:1902.07111.
Xu, Pan et al. (2018). “Global convergence of langevin dynamics based algorithms for nonconvex optimization”. In: Advances in Neural Information Processing Systems, pp. 3122–3133.
Yang, Tianbao, Qihang Lin, and Zhe Li (2016). “Unified convergence analysis of stochastic momentum methods for convex and non-convex optimization”. In: arXiv preprint arXiv:1604.03257.
Yao, Andrew Chi-Chih (1985). “Separating the polynomial-time hierarchy by oracles”. In: Foundations of Computer Science, 1985., 26th Annual Symposium on. IEEE, pp. 1–10.
Yarotsky, Dmitry (2016). “Error bounds for approximations with deep ReLU networks”. In: arXiv preprint arXiv:1610.01145.
Yuan, Kun, Bicheng Ying, and Ali H Sayed (2016). “On the influence of momentum acceleration on online learning”. In: Journal of Machine Learning Research 17.192, pp. 1–66.


Zaheer, Manzil et al. (2018). “Adaptive Methods for Nonconvex Optimization”. In: Advances in Neural Information Processing Systems.
Zavriev, SK and FV Kostyuk (1993). “Heavy-ball method in nonconvex optimization problems”. In: Computational Mathematics and Modeling 4.4, pp. 336–341.
Zhang, Qiuyi et al. (2017). “Electron-Proton Dynamics in Deep Learning”. In: arXiv preprint arXiv:1702.00458.
Zhang, Yuchen, Percy Liang, and Moses Charikar (2017). “A Hitting Time Analysis of Stochastic Gradient Langevin Dynamics”. In: Proceedings of Machine Learning Research vol 65, pp. 1–43.
Zhou, Dongruo et al. (2018a). “On the convergence of adaptive gradient methods for nonconvex optimization”. In: arXiv preprint arXiv:1808.05671.
Zhou, Wenda et al. (2018b). “Non-vacuous generalization bounds at the imagenet scale: a PAC-bayesian compression approach”. In:
Ziegler, Günter M. (1995). Lectures on polytopes. Vol. 152. Springer Science & Business Media.
Zou, Difan and Quanquan Gu (2019). “An improved analysis of training over-parameterized deep neural networks”. In: Advances in Neural Information Processing Systems, pp. 2053–2062.
Zou, Difan et al. (2018a). “Stochastic gradient descent optimizes over-parameterized deep relu networks”. In: arXiv preprint arXiv:1811.08888.
Zou, Fangyu et al. (2018b). “A Sufficient Condition for Convergences of Adam and RMSProp”. In: arXiv preprint arXiv:1811.09358.


Curriculum Vitae

Education

Fall 2020–  Post-Doctoral Researcher at UPenn (Wharton), Statistics, with Prof. Weijie Su
Spring 2016–Spring 2020  Ph.D. student (Research Assistant) at the Department of Applied Mathematics and Statistics, Johns Hopkins University (JHU). Adviser: Prof. Amitabh Basu
Fall 2011–Spring 2015  Graduate student (Physics), University of Illinois at Urbana-Champaign (UIUC). (Starting Fall 2014 I was a TA in the CS department, UIUC.)
2008–2011  Masters of Science (Physics), Tata Institute of Fundamental Research (TIFR)
2005–2008  Bachelor of Science (Hons.), Chennai Mathematical Institute (CMI)

Internships

June–August 2019  Research Intern, Adobe Research, San Jose. Worked with Prof. Sridhar Mahadevan and Anup Rao on distribution-free training of deep-nets.
July–August 2018  Research Intern, Vector Institute For Artificial Intelligence, UToronto. Worked with Prof. Dan Roy on PAC-Bayesian and compression based approaches to proving neural risk bounds.


Research Awards

ITA 2020 “ITA Sea Award” in recognition of the talk delivered at the conference in the “Graduation Day” track.

JHU Walter L. Robb Fellowship for 2018-2019

JHU Mathematical Institute for Data Science (MINDS) Fellowship for 2019-2020

Professional Service

Journal reviewer  IEEE Transactions on Pattern Analysis and Machine Intelligence, IEEE Signal Processing Magazine, Neural Computation
Reviewer  COLT 2019, AISTATS 2018–2019, ICML 2020, NeurIPS 2018–19 (“top 50%” reviewer recognition), ICLR 2019

Program committee  AAAI 2020; ICML 2018: Modern Trends in Nonconvex Optimization for Machine Learning; NeurIPS 2019: Machine Learning with Guarantees
Session chair  INFORMS 2020: Session on Deep Learning Theory and Causal Inference
Societies  Institute sponsored student member for 2018–2020 of the “Association for Women in Mathematics”


Completed Papers

(link) Google Scholar Profile

(link) (arXiv:2005.04211) “A Study of Neural Training with Non-Gradient and Noise Assisted Gradient Methods”, with Ramchandran Muthukumar
(link) (arXiv:2005.01699) “Guarantees on Learning Depth-2 Neural Networks Under a Data Poisoning Attack”, with Ramchandran Muthukumar
(link) “Improving PAC-Bayesian risk bounds on deep nets using geometrical properties of the training algorithm”, ICML 2019 Workshop on Understanding and Improving Generalization in Deep Learning, with Pushpendre Rastogi, Dan Roy (Vector, UToronto) and Jun Yang (Vector, UToronto)
(link) (arXiv:1807.06766) “Convergence guarantees for RMSProp and ADAM in non-convex optimization and an empirical comparison to Nesterov acceleration”, ICML 2018 Workshop on Modern Trends in Nonconvex Optimization for Machine Learning, with Soham De (UMD), Enayat Ullah
(link) (ECCC/TR-17/190) “Lower bounds over Boolean inputs for deep neural networks with ReLU gates”, with Amitabh Basu
(link) (arXiv:1708.03735) “Sparse coding and Autoencoders”, IEEE International Symposium on Information Theory (ISIT), 2018; the NIPS 2017 workshop on “Deep Learning: Bridging Theory and Practice” carries a variation of this [link], a shorter version of this otherwise long calculation. With Akshay Rangamani, Amitabh Basu, Tejaswini Ganapathy (Salesforce, San Francisco), Ashish Arora, Trac D. Tran, Sang (Peter) Chin (BU)
(link) (ECCC/TR-17/098) “Understanding Deep Neural Networks with Rectified Linear Units”, International Conference on Learning Representations (ICLR), 2018, with Raman Arora, Amitabh Basu and Poorya Mianjy
(link) (arXiv:1512.01226) “Renyi entropy of the critical O(N) model”
(link) (arXiv:1307.7714) “n-point correlations of dark matter tracers: Renormalization with univariate biasing and its O(f_NL) terms with bivariate biasing”


Major exploratory projects undertaken before coming to JHU

UIUC  Ramanujan expanders. We developed deterministic polynomial time heuristics to construct optimal spectral expanders. [Link to the research notes and expository surveys that we have written]
TIFR  Supersymmetry. I contributed to a paper on the single particle spectrum of N = 3 superconformal Chern-Simons theories with adjoint and fundamental matter in the large N_c limit.

Invited conference talks

6th May 2020 (postponed) SIAM Conference on Mathematics of Data Science 2020

5th Feb 2020 “Graduation Day” talk at the 2020 Information Theory and Applications Workshop

20th Oct 2019 (2 talks) INFORMS Annual Meeting 2019, Seattle

4th Nov 2018 INFORMS Annual Meeting 2018, Phoenix

9th Jul. 2018 SIAM Annual Meeting 2018, Portland

1st Jul 2018 23rd International Symposium on Mathematical Programming (ISMP) (Bordeaux, France)

17th Aug 2017 “Modeling and Optimization: Theory and Applications” (MOPTA) 2017

Talks delivered at institutes during visits

21st May 2020  University of Utah, School of Computing. Provable training of depth-2 nets.
21st Jan 2020  NVIDIA AI Lab, Toronto. Mathematics of training small neural nets and their PAC-Bayesian risk bounds.
22nd Nov 2019  University of Michigan, Biostatistics. Mathematics of training small neural nets and their PAC-Bayesian risk bounds.
13th Nov 2019  Harvard University, Applied Mathematics. Mathematics of training small neural nets and their PAC-Bayesian risk bounds.
22nd Aug 2019  Google Brain, Mountain View. Mathematics of training small neural nets and their PAC-Bayesian risk bounds. [Slides at this link]
16th Jan 2019  The Institute of Mathematical Sciences. A 2-hour long deep-learning survey talk as a part of their graduate course “Mathematics Of ML”. [Slides at this link]


25th July 2018  Vector Institute For Artificial Intelligence. A 2-hour long survey talk on our theorems on deep learning. [Slides at this link]
14th Nov 2017  Massachusetts Institute of Technology (MIT), Department of Mathematics. A 1-hour survey talk on our deep net theorems. (Contents absorbed into the above survey.)
1st Dec 2016  Indian Institute of Science Education and Research (IISER), Pune. Understanding deep neural networks with ReLU activation.
28th Sep 2015  IBM Research Laboratory, Delhi (Bangalore). Can we construct Ramanujan graphs? [Slides at this link]
2nd Sep 2015  Johns Hopkins University. Possible approaches towards constructing Ramanujan graphs in deterministic polynomial time.
2nd Apr 2015  UIUC. Sum of squares proof system and hypercontractive inequality for characterizing small-set expansion. [Slides at this link]
Jun 2014  Indian Institute of Science (IISc.). Progress on understanding Klebanov’s conjecture about Renyi entropy of the critical O(N)-model.

Funding awarded to attend conferences

29th Oct–2nd Nov 2018  Simons’ Institute (Berkeley) workshop on “Robust and High-Dimensional Statistics”, Duncan Fund
17th–22nd June 2018  IEEE International Symposium on Information Theory (ISIT), 2018. Student Travel Award
30th April–3rd May 2018  International Conference on Learning Representations (ICLR), 2018. Student Travel Award
22nd–23rd June 2016  Chaining Methods and their Applications to Computer Science, Harvard University. I received funding from NSF (National Science Foundation) to attend this workshop.
24th June 2016  Theory of Deep Learning, ICML 2016. I received departmental funding to attend this conference.
12th–15th June 2010  I was selected to attend the International Congress of Mathematicians (ICM) satellite meeting on “Geometric Topology and Riemannian Geometry” at the Indian Institute of Science, Bangalore, India


Major graduate courses completed in mathematics, computer science and physics

Mathematics  Algebraic Geometry: Algebraic curves and Riemann surfaces. Optimization: Introduction to Convexity, Sparse Recovery and Compressed Sensing, Nonlinear Optimization II, Machine Learning. Probability: Probability Theory I, High dimensional probability and statistical learning theory, Stochastic Calculus. Analysis: Introduction to Harmonic Analysis and Its Applications
Computer Science  Theoretical CS: Algorithms, Computational Complexity, Spectral Graph Theory, Advanced Cryptography, Pseudorandomness
Physics  Field Theory: Quantum Field Theory, Particle Physics, Phase Transitions and Critical Phenomena

Undergraduate awards

2008  Lindau conference. Selected as a member of the Indian contingent to The Nobel Laureate Meetings at Lindau (Physics). I delivered a presentation at the DFG Headquarters in Bonn, Germany on behalf of the Indian contingent.
2003–2008  KVPY. I was among the 40 students across India who were awarded the Kishore Vaigyanik Protsahan Yojana (KVPY) scholarship from the Department of Science and Technology, Govt. of India. In India, at the undergraduate level, the KVPY scholarship has the highest monetary value.
2006  CalTech. Selected to the California Institute of Technology as an international transfer student. Selection was through 2 subjective tests, in Physics and Mathematics.
2006  SINP. Selected to the “Undergraduate Associateship” in Physics at the Saha Institute of Nuclear Physics (SINP), which provided me an annual travel and book grant for 2 years.


High-school achievements

2005  ISI. Selected to the prestigious B.Stat program of the Indian Statistical Institute (ISI) through one of the most competitive nationwide mathematics examinations, consisting of two levels of written testing and an hour-long problem solving interview.
2005  Presidency College, Kolkata. Secured ranks 1 and 4 at the undergraduate physics and mathematics entrance tests of the Presidency College, Kolkata.

2002 NSO, Secured All India Rank 35 in the 4th National Science Olympiad (NSO)
