
Bayesian Neural Networks from a Gaussian Process Perspective

Andrew Gordon Wilson

https://cims.nyu.edu/~andrewgw
Courant Institute of Mathematical Sciences
Center for Data Science
New York University

Gaussian Process Summer School September 16, 2020

1 / 47 Last Time... for Econometrics (The Start of My Journey...)

Autoregressive Conditional Heteroscedasticity (ARCH) 2003 Nobel Prize in Economics

y(t) = N(y(t); 0, a0 + a1 y(t − 1)^2)

2 / 47 Autoregressive Conditional Heteroscedasticity (ARCH) 2003 Nobel Prize in Economics

y(t) = N(y(t); 0, a0 + a1 y(t − 1)^2)

Gaussian Copula Process Volatility (GCPV) (My First PhD Project)

y(x) = N(y(x); 0, f(x)^2) ,    f(x) ∼ GP(m(x), k(x, x′))

▶ Can approximate a much greater range of variance functions

▶ Operates on continuous inputs x

▶ Can effortlessly handle missing data

▶ Can effortlessly accommodate multivariate inputs x (covariates other than time)

▶ Observation: performance is extremely sensitive to even small changes in kernel hyperparameters
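To make the contrast concrete, here is a minimal simulation sketch (my own illustration, not from the talk): it draws an ARCH(1) series and a GCPV-style heteroscedastic series, with the latent volatility function f sampled from a GP with an assumed RBF kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200

# ARCH(1): y(t) ~ N(0, a0 + a1 * y(t-1)^2), with illustrative coefficients a0, a1
a0, a1 = 0.2, 0.6
y_arch = np.zeros(T)
for t in range(1, T):
    y_arch[t] = rng.normal(0.0, np.sqrt(a0 + a1 * y_arch[t - 1] ** 2))

# GCPV-style prior draw: f ~ GP(0, k_RBF), y(x) ~ N(0, f(x)^2), on continuous inputs x
x = np.linspace(0, 10, T)[:, None]
lengthscale = 1.5
K = np.exp(-0.5 * (x - x.T) ** 2 / lengthscale**2) + 1e-8 * np.eye(T)
f = np.linalg.cholesky(K) @ rng.standard_normal(T)   # latent volatility function
y_gcpv = rng.normal(0.0, np.abs(f))                  # heteroscedastic observations

print(y_arch.std(), y_gcpv.std())
```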

3 / 47 Heteroscedasticity revisited...

Which of these models do you prefer, and why?

Choice 1

y(x) | f(x), g(x) ∼ N(y(x); f(x), g(x)^2) ,    f(x) ∼ GP, g(x) ∼ GP

Choice 2

y(x) | f(x), g(x) ∼ N(y(x); f(x) g(x), g(x)^2) ,    f(x) ∼ GP, g(x) ∼ GP

4 / 47 Some conclusions...

▶ Flexibility isn’t the whole story; inductive biases are at least as important.

▶ Degenerate model specification can be helpful, rather than necessarily something to avoid.

▶ Asymptotic results often say very little. Rates of convergence, or even intuitions about non-asymptotic behaviour, are more meaningful.

▶ Infinite models (models with unbounded capacity) are almost always desirable, but the details matter.

▶ Releasing good code is crucial.

▶ Try to keep the approach as simple as possible.

▶ Empirical results often provide the most effective argument.

5 / 47 Model Selection

[Figure: airline passengers (thousands) by year, 1949–1961.]

Which model should we choose?

(1): f1(x) = w0 + w1 x    (2): f2(x) = Σ_{j=0}^{3} wj x^j    (3): f3(x) = Σ_{j=0}^{10^4} wj x^j
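One way to answer is the Bayesian evidence. The sketch below (my own, on made-up data, with lower polynomial orders standing in for the three models above) compares the log marginal likelihood of Bayesian linear regression under polynomial bases of increasing order, assuming Gaussian priors on the weights and Gaussian observation noise.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 40)
y = 1.0 + 2.0 * x + 0.5 * x**3 + 0.1 * rng.standard_normal(x.size)   # synthetic data

sigma_w, sigma_n = 1.0, 0.1   # prior std on weights, observation noise std
for degree in [1, 3, 10]:     # stand-ins for models (1), (2), (3)
    Phi = np.vander(x, degree + 1, increasing=True)                  # polynomial features
    cov = sigma_w**2 * Phi @ Phi.T + sigma_n**2 * np.eye(x.size)     # prior predictive covariance
    log_evidence = multivariate_normal(mean=np.zeros(x.size), cov=cov).logpdf(y)
    print(f"degree {degree:2d}: log p(y | M) = {log_evidence:.1f}")
```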

6 / 47 A Function-Space View

Consider the simple linear model,

f (x) = w0 + w1x , (1)

w0, w1 ∼ N (0, 1) . (2)

[Figure: sample functions f(x) drawn from this prior, plotted over inputs x ∈ [−10, 10]. Axes: Input, x; Output, f(x).]
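The figure can be reproduced with a minimal sketch (my own illustration) that simply draws w0, w1 ∼ N(0, 1) and plots the corresponding lines:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = np.linspace(-10, 10, 200)
for _ in range(20):
    w0, w1 = rng.standard_normal(2)       # a draw from the prior over parameters
    plt.plot(x, w0 + w1 * x, color="C0", alpha=0.4)   # the corresponding function f(x)
plt.xlabel("Input, x"); plt.ylabel("Output, f(x)")
plt.show()
```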

7 / 47 Model Construction and Generalization

[Figure: schematic of the marginal likelihood p(D | M) across datasets, from corrupted CIFAR-10 through MNIST and CIFAR-10 to other structured image datasets, for three models: a simple model with poor inductive biases (example: linear function), a complex model with poor inductive biases (example: MLP), and a well-specified model with calibrated inductive biases (example: CNN).]

8 / 47 How do we learn?

▶ The ability for a system to learn is determined by its support (which solutions are a priori possible) and inductive biases (which solutions are a priori likely).

▶ We should not conflate flexibility and complexity.

▶ An influx of new massive datasets provides great opportunities to automatically learn rich statistical structure, leading to new scientific discoveries.

Bayesian Deep Learning and a Probabilistic Perspective of Generalization. Wilson and Izmailov, 2020. arXiv:2002.08791.

9 / 47 What is Bayesian learning?

▶ The key distinguishing property of a Bayesian approach is marginalization instead of optimization.

▶ Rather than use a single setting of parameters w, use all settings weighted by their posterior probabilities in a Bayesian model average.

10 / 47 Why Bayesian Deep Learning?

Recall the Bayesian model average (BMA):

p(y | x∗, D) = ∫ p(y | x∗, w) p(w | D) dw .    (3)

▶ Think of each setting of w as a different model. Eq. (3) is a Bayesian model average over models weighted by their posterior probabilities.

▶ Represents epistemic uncertainty over which f(x, w) fits the data.

▶ Can view classical training as using an approximate posterior q(w | y, X) = δ(w = w_MAP).

▶ The posterior p(w | D) (or loss L = − log p(w | D)) for neural networks is extraordinarily complex, containing many complementary solutions, which is why the BMA is especially significant in deep learning.

▶ Understanding the structure of neural network loss landscapes is crucial for better estimating the BMA.
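In practice the integral in Eq. (3) is approximated by Monte Carlo. A minimal sketch (my own; `posterior_samples` and `predict_probs` are assumed placeholders, not from any particular library):

```python
import numpy as np

def bayesian_model_average(posterior_samples, predict_probs, x_star):
    """p(y | x*, D) ≈ (1/J) * sum_j p(y | x*, w_j),  with w_j ~ p(w | D)."""
    probs = [predict_probs(w, x_star) for w in posterior_samples]
    return np.mean(probs, axis=0)

# toy usage with dummy weight samples and a dummy 3-class predictive distribution
dummy_samples = [np.zeros(5) for _ in range(10)]
dummy_predict = lambda w, x: np.array([0.2, 0.3, 0.5])
print(bayesian_model_average(dummy_samples, dummy_predict, x_star=None))
```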

11 / 47 Mode Connectivity

Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs. T. Garipov, P. Izmailov, D. Podoprikhin, D. Vetrov, A.G. Wilson. NeurIPS 2018. Loss landscape figures in collaboration with Javier Ideami (losslandscape.com).

12 / 47 Mode Connectivity

13 / 47 Mode Connectivity

14 / 47 Mode Connectivity

15 / 47 Mode Connectivity

16 / 47 Better Marginalization

p(y | x∗, D) = ∫ p(y | x∗, w) p(w | D) dw .    (4)

▶ MultiSWAG forms a Gaussian mixture posterior from multiple independent SWAG solutions.

▶ Like deep ensembles, MultiSWAG incorporates multiple basins of attraction in the model average, but it additionally marginalizes within basins of attraction for a better approximation to the BMA (see the sketch below).
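A schematic sketch of the idea (my own, not the authors' implementation): each SWAG run is summarized here by an assumed (mean, diagonal std) pair, and `predict_probs(w, x)` is an assumed helper that returns class probabilities for weights w.

```python
import numpy as np

def multiswag_predict(swag_solutions, predict_probs, x, samples_per_mode=20, seed=0):
    rng = np.random.default_rng(seed)
    preds = []
    for mean, diag_std in swag_solutions:        # mixture over basins of attraction
        for _ in range(samples_per_mode):        # marginalize within each basin
            w = mean + diag_std * rng.standard_normal(mean.shape)
            preds.append(predict_probs(w, x))
    return np.mean(preds, axis=0)                # Monte Carlo estimate of the BMA
```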

17 / 47 Better Marginalization: MultiSWAG

[1] Can you trust your model’s uncertainty? Evaluating predictive uncertainty under dataset shift. Ovadia et al., 2019.
[2] Bayesian Deep Learning and a Probabilistic Perspective of Generalization. Wilson and Izmailov, 2020.

18 / 47 Double Descent

Belkin et al. (2018)

Reconciling modern machine learning practice and the bias-variance trade-off. Belkin et al., 2018.

19 / 47 Double Descent

Should a Bayesian model experience double descent?

20 / 47 Bayesian Model Averaging Alleviates Double Descent

[Figure: test error (%) versus ResNet-18 width on CIFAR-100 with 20% label corruption, comparing SGD, SWAG, and MultiSWAG.]

Bayesian Deep Learning and a Probabilistic Perspective of Generalization. Wilson & Izmailov, 2020

21 / 47 Neural Network Priors

A parameter prior p(w) = N(0, α^2) with a neural network architecture f(x, w) induces a structured distribution over functions p(f(x)).
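A minimal sketch (my own illustration) of this induced distribution over functions: draw weights from p(w) = N(0, α^2) for a small one-hidden-layer network and evaluate the resulting functions.

```python
import numpy as np

def sample_prior_function(x, alpha=1.0, hidden=50, seed=None):
    rng = np.random.default_rng(seed)
    W1 = alpha * rng.standard_normal((hidden, 1)); b1 = alpha * rng.standard_normal(hidden)
    W2 = alpha * rng.standard_normal((1, hidden)); b2 = alpha * rng.standard_normal(1)
    h = np.tanh(x[:, None] @ W1.T + b1)            # hidden layer activations
    return (h @ W2.T + b2).squeeze(-1)             # one sample f(x) from p(f(x))

x = np.linspace(-3, 3, 100)
draws = np.stack([sample_prior_function(x, alpha=1.0, seed=s) for s in range(5)])
print(draws.shape)   # five prior function draws
```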

Deep Image Prior

▶ Randomly initialized CNNs without training provide excellent performance for image denoising, super-resolution, and inpainting: a sample function from p(f(x)) captures low-level image statistics, before any training.

Random Network Features

▶ Pre-processing CIFAR-10 with a randomly initialized untrained CNN dramatically improves the test performance of a Gaussian kernel on pixels from 54% accuracy to 71%, with an additional 2% from ℓ2 regularization.

[1] Deep Image Prior. Ulyanov, D., Vedaldi, A., Lempitsky, V. CVPR 2018.
[2] Understanding Deep Learning Requires Rethinking Generalization. Zhang et al., ICLR 2016.
[3] Bayesian Deep Learning and a Probabilistic Perspective of Generalization. Wilson & Izmailov, 2020.

22 / 47 Tempered Posteriors

In Bayesian deep learning it is typical to consider the tempered posterior:

p_T(w | D) = (1/Z(T)) p(D | w)^{1/T} p(w) ,    (5)

where T is a temperature parameter, and Z(T) is the normalizing constant corresponding to temperature T. The temperature parameter controls how the prior and likelihood interact in the posterior:

▶ T < 1 corresponds to cold posteriors, where the posterior distribution is more concentrated around solutions with high likelihood.

▶ T = 1 corresponds to the standard Bayesian posterior distribution.

▶ T > 1 corresponds to warm posteriors, where the prior effect is stronger and the posterior is more diffuse.
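As a minimal sketch of Eq. (5) (my own; `log_likelihood` and `log_prior` are assumed to be supplied by the model), the unnormalized log tempered posterior is simply:

```python
def log_tempered_posterior(w, log_likelihood, log_prior, T=1.0):
    # log p_T(w | D) = (1/T) * log p(D | w) + log p(w) - log Z(T); Z(T) is dropped here
    return log_likelihood(w) / T + log_prior(w)

# toy one-dimensional example with Gaussian likelihood and prior (up to constants)
ll = lambda w: -0.5 * (w - 1.0) ** 2 / 0.1
lp = lambda w: -0.5 * w ** 2
print(log_tempered_posterior(0.5, ll, lp, T=0.5))   # cold: likelihood weighted more heavily
```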

E.g.: The safe Bayesian. Grunwald, P. COLT 2012.

23 / 47 Cold Posteriors

Wenzel et al. (2020) highlight the result that, for p(w) = N(0, I), cold posteriors with T < 1 often provide improved performance.

How good is the Bayes posterior in deep neural networks really? Wenzel et al., ICML 2020.

24 / 47 Prior Misspecification?

They suggest the result is due to prior misspecification, showing that sample functions from p(f(x)) seem to assign a single class to most data points on CIFAR-10.

25 / 47 Changing the prior variance scale α

Bayesian Deep Learning and a Probabilistic Perspective of Generalization. Wilson & Izmailov, 2020.

26 / 47 The effect of data on the posterior

[Figure: predictive class probabilities over MNIST classes 0–9, shown for (a) the prior (α = √10), and after conditioning on (b) 10 datapoints, (c) 100 datapoints, and (d) 1000 datapoints.]

27 / 47 Neural Networks from a Gaussian Process Perspective

From a Gaussian process perspective, what properties of the prior over functions induced by a Bayesian neural network might you check to see if it seems reasonable?

28 / 47 Prior Class Correlations

[Figure: pairwise prior correlations between BNN predictions on MNIST classes 0, 1, 2, 4, 7 for prior standard deviations (e) α = 0.02, (f) α = 0.1, and (g) α = 1; panel (h) shows NLL as a function of the prior std α.]

Bayesian Deep Learning and a Probabilistic Perspective of Generalization. Wilson & Izmailov, 2020.

29 / 47 Thoughts on Tempering (Part 1)

▶ It would be surprising if T = 1 were the best setting of this hyperparameter.

▶ Our models are certainly misspecified, and we should acknowledge that misspecification in our estimation procedure by learning T. Learning T is not too different from learning other properties of the likelihood, such as noise.

▶ A tempered posterior is a more honest reflection of our prior beliefs than the untempered posterior. Bayesian inference is about honestly reflecting our beliefs in the modelling process.

30 / 47 Thoughts on Tempering (Part 2)

▶ While the prior p(f(x)) is certainly misspecified, the result of assigning one class to most data is a soft prior bias, which (1) doesn’t hurt the predictive distribution, (2) is easily corrected by appropriately setting the prior parameter variance α^2, and (3) is quickly modulated by data.

▶ More important is the induced covariance function (kernel) over images, which appears reasonable. The deep image prior and random network feature results also suggest this prior is largely reasonable.

▶ In addition to not tuning α, the result in Wenzel et al. (2020) could have been exacerbated by the lack of multimodal marginalization.

▶ There are cases where T < 1 will help given a finite number of samples, even if the untempered model is correctly specified. Imagine estimating the mean of N(0, I) in d dimensions from samples, where d ≫ 1. The samples will have norm close to √d.

31 / 47 Rethinking Generalization

[1] Understanding Deep Learning Requires Rethinking Generalization. Zhang et al., ICLR 2016.
[2] Bayesian Deep Learning and a Probabilistic Perspective of Generalization. Wilson & Izmailov, 2020.

32 / 47 Model Construction

[Figure: schematic of the marginal likelihood p(D | M) across datasets, from corrupted CIFAR-10 through MNIST and CIFAR-10 to other structured image datasets, for three models: a simple model with poor inductive biases (example: linear function), a complex model with poor inductive biases (example: MLP), and a well-specified model with calibrated inductive biases (example: CNN).]

33 / 47 Function Space Priors

We should embrace the function space perspective in constructing priors.

▶ However, if we contrive priors over parameters p(w) to induce distributions over functions p(f) that resemble familiar models such as Gaussian processes with RBF kernels, we could be throwing the baby out with the bathwater.

▶ Indeed, neural networks are useful as their own model class precisely because they have different inductive biases from other models.

▶ We should try to gain insights by thinking in function space, but note that architecture design itself is thinking in function space: properties such as equivariance to translations in convolutional architectures imbue the associated distribution over functions with these properties.

34 / 47 PAC-Bayes

PAC-Bayes provides explicit generalization error bounds for stochastic networks with posterior Q, prior P, n training points, and probability 1 − δ, based on the complexity term

√( (KL(Q||P) + log(n/δ)) / (2(n − 1)) ) .    (6)
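A minimal sketch of the complexity term in (6) (my own illustration), assuming Q and P are fully factorized Gaussians specified by mean and standard deviation arrays:

```python
import numpy as np

def kl_diag_gaussians(mu_q, std_q, mu_p, std_p):
    # KL(Q || P) between fully factorized Gaussians
    return np.sum(np.log(std_p / std_q) + (std_q**2 + (mu_q - mu_p) ** 2) / (2 * std_p**2) - 0.5)

def pac_bayes_term(kl, n, delta=0.05):
    # sqrt( (KL(Q||P) + log(n/delta)) / (2(n - 1)) )
    return np.sqrt((kl + np.log(n / delta)) / (2 * (n - 1)))

mu_q, std_q = np.array([0.3, -0.2]), np.array([0.5, 0.5])
mu_p, std_p = np.zeros(2), np.ones(2)
print(pac_bayes_term(kl_diag_gaussians(mu_q, std_q, mu_p, std_p), n=60000))
```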

▶ Non-vacuous bounds can be derived by exploiting flatness in Q (e.g., at least 80% generalization accuracy predicted on binary MNIST).

▶ A very promising framework, but it tends not to be prescriptive about model construction, or informative for understanding why a model generalizes.

▶ Bounds are improved by a compact P and a low-dimensional parameter space. We instead suggest a P with large support and many parameters.

▶ Generalization is significantly improved by a multimodal Q, but PAC-Bayes generalization bounds are not.

Fantastic generalization measures and where to find them. Jiang et al., 2019.
A primer on PAC-Bayesian learning. Guedj, 2019.
Computing nonvacuous generalization bounds for deep (stochastic) neural networks. Dziugaite & Roy, 2017.
A PAC-Bayesian approach to spectrally-normalized bounds for neural networks. Neyshabur et al., 2017.

35 / 47 Rethinking Parameter Counting: Effective Dimension

[Figure: train loss, test loss, and effective dimensionality N_eff(Hessian) as a function of network width; heatmaps of effective dimensionality, test loss, and train loss over width and depth.]

N_eff(H) = Σ_i λ_i / (λ_i + α), where λ_i are the eigenvalues of the Hessian H of the loss and α > 0 is a regularization constant.
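A minimal sketch (my own illustration) of computing the effective dimensionality above from the eigenvalues of an explicitly formed Hessian:

```python
import numpy as np

def effective_dimensionality(hessian, alpha=1.0):
    eigvals = np.linalg.eigvalsh(hessian)       # eigenvalues λ_i of the Hessian
    eigvals = np.clip(eigvals, 0.0, None)       # guard against small negative values
    return np.sum(eigvals / (eigvals + alpha))

H = np.diag([100.0, 10.0, 1.0, 0.01, 0.0])      # toy spectrum: a few large directions
print(effective_dimensionality(H, alpha=1.0))   # ≈ 2.41: only a few directions are effectively determined
```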

Rethinking Parameter Counting in Deep Models: Effective Dimensionality Revisited. W. Maddox, G. Benton, A.G. Wilson, 2020.

36 / 47 Properties in Degenerate Directions

Decision boundaries do not change in directions of little posterior contraction, suggesting a mechanism for subspace inference!

37 / 47 Gaussian Processes and Neural Networks

“How can Gaussian processes possibly replace neural networks? Have we thrown the baby out with the bathwater?” (MacKay, 1998)

Introduction to Gaussian processes. MacKay, D. J. In Bishop, C. M. (ed.), Neural Networks and Machine Learning, Chapter 11, pp. 133-165. Springer-Verlag, 1998.

38 / 47 Deep Kernel Learning Review

Deep kernel learning combines the inductive biases of deep learning architectures with the non-parametric flexibility of Gaussian processes.

[Figure: deep kernel learning architecture: inputs x_1, . . . , x_D pass through hidden layers h^(1), . . . , h^(L), followed by an infinite-basis-function (GP) layer that produces the outputs y_1, . . . , y_P.]

Base kernel hyperparameters θ and deep network hyperparameters w are jointly trained through the marginal likelihood objective.
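A conceptual sketch of this setup (my own, not the authors' code, and without any GP library): an RBF base kernel applied to features from a small network, with the kernel hyperparameters θ and the network weights w trained jointly by the exact GP log marginal likelihood on synthetic data.

```python
import torch

torch.manual_seed(0)
X = torch.linspace(-3, 3, 50).unsqueeze(-1)                    # synthetic 1-D inputs
y = torch.sin(2 * X).squeeze(-1) + 0.1 * torch.randn(50)       # synthetic targets

net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 2))
log_ls, log_sf, log_sn = (torch.zeros(1, requires_grad=True) for _ in range(3))  # base kernel hypers θ

def neg_log_marginal_likelihood():
    Z = net(X)                                                 # learned representation g_w(x)
    K = log_sf.exp() ** 2 * torch.exp(-0.5 * torch.cdist(Z, Z) ** 2 / log_ls.exp() ** 2)
    K = K + (log_sn.exp() ** 2 + 1e-6) * torch.eye(len(X))     # RBF base kernel on features + noise
    L = torch.linalg.cholesky(K)
    alpha = torch.cholesky_solve(y.unsqueeze(-1), L)
    # -log p(y | X) up to an additive constant: 0.5 yᵀ K⁻¹ y + 0.5 log |K|
    return 0.5 * y @ alpha.squeeze(-1) + torch.log(torch.diag(L)).sum()

opt = torch.optim.Adam(list(net.parameters()) + [log_ls, log_sf, log_sn], lr=0.01)
for _ in range(200):
    opt.zero_grad()
    loss = neg_log_marginal_likelihood()
    loss.backward()
    opt.step()
print(float(loss))
```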

Deep Kernel Learning. Wilson, A.G., Hu, Z., Salakhutdinov, R., Xing, E.P. AISTATS, 2016

39 / 47 Face Orientation Extraction


Figure: Top: Randomly sampled examples of the training and test data. Bottom: The two dimensional outputs of the convolutional network on a set of test cases. Each point is shown using a line segment that has the same orientation as the input face.

40 / 47 Learning Flexible Non-Euclidean Similarity Metrics


Figure: Left: The induced covariance using DKL-SM (spectral mixture) kernel on a set of test cases, where the test samples are ordered according to the orientations of the input faces. Middle: The respective covariance matrix using DKL-RBF kernel. Right: The respective covariance matrix using regular RBF kernel. The models are trained with n = 12, 000.

41 / 47 Kernels from Infinite Bayesian Neural Networks

▶ The neural network kernel (Neal, 1996) is famous for triggering research on Gaussian processes in the machine learning community. Consider a neural network with one hidden layer:

f(x) = b + Σ_{i=1}^{J} v_i h(x; u_i) .    (7)

▶ b is a bias, v_i are the hidden-to-output weights, h is any bounded hidden unit transfer function, u_i are the input-to-hidden weights, and J is the number of hidden units. Let b and v_i be independent with zero mean and variances σ_b^2 and σ_v^2/J, respectively, and let the u_i have independent identical distributions. Collecting all free parameters into the weight vector w,

E_w[f(x)] = 0 ,    (8)
cov[f(x), f(x′)] = E_w[f(x) f(x′)] = σ_b^2 + σ_v^2 (1/J) Σ_{i=1}^{J} E_u[h(x; u_i) h(x′; u_i)]    (9)
                 = σ_b^2 + σ_v^2 E_u[h(x; u) h(x′; u)] .    (10)

We can show that any collection of values f(x_1), . . . , f(x_N) must have a joint Gaussian distribution using the central limit theorem as J → ∞.

Bayesian Learning for Neural Networks. Neal, R. Springer, 1996.

42 / 47 Neural Network Kernel

f(x) = b + Σ_{i=1}^{J} v_i h(x; u_i) .    (11)

▶ Let h(x; u) = erf(u_0 + Σ_{j=1}^{P} u_j x_j), where erf(z) = (2/√π) ∫_0^z e^{−t^2} dt

▶ Choose u ∼ N(0, Σ). Then we obtain

k_NN(x, x′) = (2/π) sin^{−1}( 2 x̃^T Σ x̃′ / √((1 + 2 x̃^T Σ x̃)(1 + 2 x̃′^T Σ x̃′)) ) ,    (12)

where x ∈ R^P and x̃ = (1, x^T)^T .
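A minimal sketch (my own illustration) of evaluating this kernel and drawing prior functions from the corresponding GP, assuming one-dimensional inputs and Σ = diag(σ0^2, σ^2):

```python
import numpy as np

def nn_kernel(X1, X2, sigma0=2.0, sigma=5.0):
    # k_NN(x, x') = (2/π) arcsin( 2 x̃ᵀΣx̃' / sqrt((1 + 2 x̃ᵀΣx̃)(1 + 2 x̃'ᵀΣx̃')) )
    X1t = np.column_stack([np.ones(len(X1)), X1])   # augmented inputs x̃ = (1, x)ᵀ
    X2t = np.column_stack([np.ones(len(X2)), X2])
    Sigma = np.diag([sigma0**2, sigma**2])          # assumed Σ = diag(σ0², σ²)
    cross = 2 * X1t @ Sigma @ X2t.T
    d1 = 1 + 2 * np.einsum("ij,jk,ik->i", X1t, Sigma, X1t)
    d2 = 1 + 2 * np.einsum("ij,jk,ik->i", X2t, Sigma, X2t)
    return (2 / np.pi) * np.arcsin(cross / np.sqrt(np.outer(d1, d2)))

x = np.linspace(-3, 3, 200)
K = nn_kernel(x, x) + 1e-6 * np.eye(len(x))
draws = np.linalg.cholesky(K) @ np.random.default_rng(0).standard_normal((len(x), 3))
print(draws.shape)   # three prior function draws from the GP with this kernel
```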

43 / 47 Neural Network Kernel

k_NN(x, x′) = (2/π) sin^{−1}( 2 x̃^T Σ x̃′ / √((1 + 2 x̃^T Σ x̃)(1 + 2 x̃′^T Σ x̃′)) )    (13)

Set Σ = diag(σ0, σ). Draws from a GP with a neural network kernel with varying σ:

Gaussian processes for Machine Learning. Rasmussen, C.E. and Williams, C.K.I. MIT Press, 2006

44 / 47 Neural Network Kernel

k_NN(x, x′) = (2/π) sin^{−1}( 2 x̃^T Σ x̃′ / √((1 + 2 x̃^T Σ x̃)(1 + 2 x̃′^T Σ x̃′)) )    (14)

Set Σ = diag(σ0, σ). Draws from a GP with a neural network kernel with varying σ:

Question: Is a GP with this kernel doing representation learning?

Gaussian processes for Machine Learning. Rasmussen, C.E. and Williams, C.K.I. MIT Press, 2006

45 / 47 NN → GP Limits and Neural Tangent Kernels

▶ Several recent works [e.g., 2–9] have extended Radford Neal’s limits to multilayer nets and other architectures.

▶ Closely related work also derives neural tangent kernels from infinite neural network limits, with promising results.

▶ Note that most kernels from infinite neural network limits have a fixed structure. On the other hand, standard neural networks essentially learn a similarity metric (kernel) for the data. Learning a kernel amounts to representation learning. Bridging this gap is interesting future work.

[1] Bayesian Learning for Neural Networks. Neal, R. Springer, 1996.
[2] Deep Convolutional Networks as Shallow Gaussian Processes. Garriga-Alonso et al., NeurIPS 2018.
[3] Gaussian Process Behaviour in Wide Deep Neural Networks. Matthews et al., ICLR 2018.
[4] Deep neural networks as Gaussian processes. Lee et al., ICLR 2018.
[5] Bayesian Deep CNNs with Many Channels are Gaussian Processes. Novak et al., ICLR 2019.
[6] Scaling limits of wide neural networks with weight sharing. Yang, G. arXiv 2019.
[7] Neural tangent kernel: convergence and generalization in neural networks. Jacot et al., NeurIPS 2018.
[8] On exact computation with an infinitely wide neural net. Arora et al., NeurIPS 2019.
[9] Harnessing the Power of Infinitely Wide Deep Nets on Small-data Tasks. Arora et al., arXiv 2019.

46 / 47 What’s next?

▶ A broader view of deep learning, where we look at deep hierarchical representations, often quite distinct from neural networks.

▶ Much more Bayesian non-parametric function-space representation learning!

▶ Challenges will include non-stationarity, high-dimensional inputs, scalable high-fidelity approximate inference, and accommodating misspecification in Bayesian inference procedures.

▶ Using what we’ve learned about Gaussian processes as a tool to understand the principles of model construction and a wide variety of model classes.

47 / 47