Advances in Bayesian inference and stable optimization for large-scale machine learning problems

Francois Fagan

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Graduate School of Arts and Sciences

COLUMBIA UNIVERSITY
2019

© 2018 Francois Fagan
All Rights Reserved

ABSTRACT

Advances in Bayesian inference and stable optimization for large-scale machine learning problems

Francois Fagan

A core task in machine learning, and the topic of this thesis, is developing faster and more accurate methods of posterior inference in probabilistic models. The thesis has two components. The first explores using deterministic methods to improve the efficiency of Markov Chain Monte Carlo (MCMC) algorithms. We propose new MCMC algorithms that can use deterministic methods as a "prior" to bias MCMC proposals towards areas of high posterior density, leading to highly efficient sampling. In Chapter 2 we develop such methods for continuous distributions, and in Chapter 3 for binary distributions. The resulting methods consistently outperform existing state-of-the-art sampling techniques, sometimes by several orders of magnitude. Chapter 4 uses similar ideas to those in Chapters 2 and 3, but in the context of modeling the performance of left-handed players in one-on-one interactive sports.

The second part of this thesis explores the use of stable stochastic gradient descent (SGD) methods for computing a maximum a posteriori (MAP) estimate in large-scale machine learning problems. In Chapter 5 we propose two such methods for softmax regression. The first is an implementation of Implicit SGD (ISGD), a stable but difficult-to-implement SGD method, and the second is a new SGD method specifically designed for optimizing a double-sum formulation of the softmax. Both methods comprehensively outperform the previous state of the art on seven real-world datasets. Inspired by the success of ISGD on the softmax, we investigate its application to neural networks in Chapter 6. In this chapter we present a novel layer-wise approximation of ISGD that has efficiently computable updates. Experiments show that the resulting method is more robust to high learning rates and generally outperforms standard backpropagation on a variety of tasks.
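To make the first part's central idea more concrete, the following is a minimal sketch of a single elliptical slice sampling update (Murray et al., 2010) for a zero-mean Gaussian prior, written in Python with NumPy. It is not the EPESS code developed in Chapter 2, and the toy likelihood in the usage snippet is purely illustrative; in EPESS the ellipse is instead defined by an expectation propagation approximation of the posterior, with the likelihood term replaced (roughly) by the ratio of the exact posterior to that approximation.

```python
import numpy as np

def elliptical_slice_step(f, chol_sigma, log_lik, rng):
    """One elliptical slice sampling update for a zero-mean Gaussian prior
    N(0, Sigma), where chol_sigma is a Cholesky factor of Sigma."""
    nu = chol_sigma @ rng.standard_normal(f.shape)   # auxiliary draw defining the ellipse
    log_y = log_lik(f) + np.log(rng.uniform())       # slice threshold under the current state
    theta = rng.uniform(0.0, 2.0 * np.pi)            # initial proposal angle
    theta_min, theta_max = theta - 2.0 * np.pi, theta
    while True:
        f_new = f * np.cos(theta) + nu * np.sin(theta)
        if log_lik(f_new) > log_y:                   # accept the first point above the slice
            return f_new
        # shrink the angular bracket towards theta = 0 and propose again
        if theta < 0.0:
            theta_min = theta
        else:
            theta_max = theta
        theta = rng.uniform(theta_min, theta_max)

# Toy usage with a correlated Gaussian prior and a hypothetical log-likelihood.
rng = np.random.default_rng(0)
chol = np.linalg.cholesky(np.array([[1.0, 0.9], [0.9, 1.0]]))
log_lik = lambda f: -0.5 * np.sum((f - 1.0) ** 2)
f = np.zeros(2)
samples = []
for _ in range(1000):
    f = elliptical_slice_step(f, chol, log_lik, rng)
    samples.append(f)
```

Because the angular bracket always shrinks towards the current state, the update terminates with an accepted point and requires no step-size tuning, which is what makes it an attractive base move for the approximation-guided samplers described above.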
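Similarly, the stability of Implicit SGD referred to above can be illustrated on a least-squares example, where the implicit update happens to have a closed form. The sketch below is only a toy illustration under that assumption: the function names and synthetic data are hypothetical, and Chapter 5 derives the corresponding (more involved) updates for the double-sum formulation of the softmax.

```python
import numpy as np

def explicit_sgd_step(theta, x, y, lr):
    """Standard (explicit) SGD: gradient of 0.5 * (x @ theta - y)**2 at the current iterate."""
    return theta - lr * (x @ theta - y) * x

def implicit_sgd_step(theta, x, y, lr):
    """Implicit SGD solves theta_new = theta - lr * grad(theta_new).
    For least squares the fixed point is available in closed form; the
    1 / (1 + lr * ||x||^2) factor damps the step at large learning rates."""
    return theta - lr * (x @ theta - y) / (1.0 + lr * (x @ x)) * x

rng = np.random.default_rng(0)
theta_true = rng.normal(size=5)
theta_exp = np.zeros(5)
theta_imp = np.zeros(5)
lr = 10.0                                  # deliberately too large for explicit SGD
for _ in range(50):
    x = rng.normal(size=5)
    y = x @ theta_true
    theta_exp = explicit_sgd_step(theta_exp, x, y, lr)
    theta_imp = implicit_sgd_step(theta_imp, x, y, lr)

print("explicit SGD error:", np.linalg.norm(theta_exp - theta_true))  # blows up
print("implicit SGD error:", np.linalg.norm(theta_imp - theta_true))  # remains small
```

The appeal noted in the abstract is exactly this damping: the implicit step remains well behaved at learning rates where the explicit step diverges, at the cost of having to solve (or approximate) an implicit equation at every update.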
Table of Contents

List of Figures vii
List of Tables xi

1 Introduction 1

I Advances in Bayesian inference: Leveraging deterministic methods for efficient MCMC inference 5

2 Elliptical Slice Sampling with Expectation Propagation 6
  2.1 Introduction 6
  2.2 Expectation propagation and elliptical slice sampling 8
    2.2.1 Elliptical slice sampling 8
    2.2.2 Expectation propagation 10
    2.2.3 Elliptical Slice Sampling with Expectation Propagation 11
  2.3 Recycled elliptical slice sampling 13
  2.4 Analytic elliptical slice sampling 17
    2.4.1 Analytic EPESS for TMG 19
  2.5 Experimental results 20
    2.5.1 Probit regression 21
    2.5.2 Truncated multivariate Gaussian 22
    2.5.3 Log-Gaussian Cox process 24
  2.6 Discussion and conclusion 24

3 Annular Augmentation Sampling 26
  3.1 Introduction 26
  3.2 Annular augmentation sampling 28
    3.2.1 Annular Augmentations 29
    3.2.2 Generic Sampler 31
    3.2.3 Operator 2: Rao-Blackwellized Gibbs 32
    3.2.4 Choice And Effect Of The Prior p̂ 34
  3.3 Experimental results 35
    3.3.1 2D Ising Model 36
    3.3.2 Heart Disease Risk Factors 39
  3.4 Conclusion 41

4 The Advantage of Lefties in One-On-One Sports 42
  4.1 Introduction and Related Literature 42
  4.2 The Latent Skill and Handedness Model 48
    4.2.1 Medium-Tailed Priors for the Innate Skill Distribution 49
    4.2.2 Large N Inference Using Only Aggregate Handedness Data 51
    4.2.3 Interpreting the Posterior of L 53
  4.3 Including Match-Play and Handedness Data 56
    4.3.1 The Posterior Distribution 57
    4.3.2 Inference Via MCMC 58
  4.4 Using External Rankings and Handedness Data 62
  4.5 Numerical Results 63
    4.5.1 Posterior Distribution of L 64
    4.5.2 On the Importance of Conditioning on Top_{n,N} 64
    4.5.3 Posterior of Skills With and Without the Advantage of Left-handedness 66
  4.6 A Dynamic Model for L 69
  4.7 Conclusions and Further Research 73

II Advances in numerical stability for large-scale machine learning problems 76

5 Stable and scalable softmax optimization with double-sum formulations 77
  5.1 Introduction 77
  5.2 Convex double-sum formulation 80
    5.2.1 Instability of ESGD 81
    5.2.2 Choice of double-sum formulation 82
  5.3 Stable and scalable SGD methods 83
    5.3.1 Implicit SGD 83
    5.3.2 U-max method 86
  5.4 Experiments 88
    5.4.1 Experimental setup 88
    5.4.2 Results 91
  5.5 Conclusion 94
  5.6 Extensions 94

6 Implicit backpropagation 95
  6.1 Introduction 95
  6.2 ISGD and related methods 97
    6.2.1 ISGD method 97
    6.2.2 Related methods 98
  6.3 Implicit Backpropagation 99
  6.4 Implicit Backpropagation updates for various activation functions 101
    6.4.1 Generic updates 101
    6.4.2 Relu update 102
    6.4.3 Arctan update 104
    6.4.4 Piecewise-cubic function update 104
    6.4.5 Relative run time difference of IB vs EB measured in flops 104
  6.5 Experiments 106
  6.6 Conclusion 109

III Bibliography 111

IV Appendices 129

A Elliptical Slice Sampling with Expectation Propagation 130
  A.1 Monte Carlo integration 130
  A.2 Markov Chain Monte Carlo (MCMC) 130
  A.3 Slice sampling 131

B Annular Augmentation Sampling 132
  B.1 Rao-Blackwellization 132
  B.2 2D-Ising model results 132
    B.2.1 LBP accuracy 133
    B.2.2 Benefit of Rao-Blackwellization 133
    B.2.3 RMSE for node and pairwise marginals 133
  B.3 Heart disease dataset 139
  B.4 CMH + LBP sampler 139
  B.5 Simulated data 141
    B.5.1 d = 10 141

C The Advantage of Lefties in One-On-One Sports 143
  C.1 Proof of Proposition 1 143
  C.2 Rate of Convergence for Probability of Left-Handed Top Player Given Normal Innate Skills 146
  C.3 Difference in Skills at Given Quantiles with Laplace Distributed Innate Skills 150
  C.4 Estimating the Kalman Filtering Smoothing Parameter 151

D Stable and scalable softmax optimization with double-sum formulations 153
  D.1 Comparison of double-sum formulations 153
  D.2 Proof of variable bounds and strong convexity 155
  D.3 Proof of convergence of U-max method 158
  D.4 Stochastic Composition Optimization 160
  D.5 Proof of general ISGD gradient bound 161
  D.6 Update equations for ISGD 161
    D.6.1 Single datapoint, single class 162
    D.6.2 Bound on step size 167
    D.6.3 Single datapoint, multiple classes 168
    D.6.4 Multiple datapoints, multiple classes 171
  D.7 Comparison of double-sum formulations 175
  D.8 Results over epochs 177
  D.9 L1 penalty