Thesis and the Enjoyable Discussions During the Viva, and Their Suggestions for Improvement
Total Page:16
File Type:pdf, Size:1020Kb
Sparse Gaussian Process Approximations and Applications Mark van der Wilk Department of Engineering University of Cambridge This dissertation is submitted for the degree of Doctor of Philosophy Jesus College November 2018 Declaration I hereby declare that except where specific reference is made to the work of others, the contents of this dissertation are original and have not been submitted in whole or in part for consideration for any other degree or qualification in this, or any other university. This dissertation is my own work and contains nothing which is the outcome of work done in collaboration with others, except as specified in the text and Acknowledgements. This dissertation contains fewer than 65,000 words including appendices, bibliography, footnotes, tables and equations and has fewer than 150 figures. Mark van der Wilk November 2018 Acknowledgements The journey towards finishing my PhD was influenced (and improved) by a considerable number of people. To start, I would like to thank my supervisor, Carl Rasmussen for all the insight and advice. I always enjoyed his clarity of thought, particularly the many explanations which hit the nail on the head in a minimal (and sometimes entertaining) way. I also very much appreciated working with James Hensman towards the end of my PhD. I learned a great deal from his vast knowledge on how to manipulate Gaussian processes, and from his optimism and knack for identifying where our models would be most practically useful. I would also like to thank Matthias Bauer for a fruitful collaboration, and members of the research group for continuously useful, enjoyable and challenging discussions, particularly Koa Heaukulani, Yarin Gal, David Lopez-Paz, Rowan McAllister, Alessan- dro Ialongo, David Duvenaud, and Roger Frigola. The stimulating environment in the Machine Learning Group would, of course, not have been possible without Zoubin Ghahramani, Carl Rasmussen, and Richard Turner. I would also like Richard Turner and Neil Lawrence for examining my thesis and the enjoyable discussions during the viva, and their suggestions for improvement. This PhD was funded by the UK Engineering and Physics Research Council (EPSRC), and Qualcomm, through the Qualcomm Innovation Fellowship (2015). On a personal note, thanks to Rainbow for her patience and brightness throughout the entire journey. And finally, I would like to thank my parents, Nino and Katy, for spurring on my interest in science by showing me how things work, and for teaching me that great outcomes are accompanied by great effort. Abstract Many tasks in machine learning require learning some kind of input-output relation (function), for example, recognising handwritten digits (from image to number) or learning the motion behaviour of a dynamical system like a pendulum (from positions and velocities now to future positions and velocities). We consider this problem using the Bayesian framework, where we use probability distributions to represent the state of uncertainty that a learning agent is in. In particular, we will investigate methods which use Gaussian processes to represent distributions over functions. Gaussian process models require approximations in order to be practically useful. This thesis focuses on understanding existing approximations and investigating new ones tailored to specific applications. We advance the understanding of existing tech- niques first through a thorough review. We propose desiderata for non-parametric basis function model approximations, which we use to assess the existing approximations. Following this, we perform an in-depth empirical investigation of two popular approxi- mations (VFE and FITC). Based on the insights gained, we propose a new inter-domain Gaussian process approximation, which can be used to increase the sparsity of the approximation, in comparison to regular inducing point approximations. This allows GP models to be stored and communicated more compactly. Next, we show that inter-domain approximations can also allow the use of models which would otherwise be impractical, as opposed to improving existing approximations. We introduce an inter-domain approximation for the Convolutional Gaussian process – a model that makes Gaussian processes suitable to image inputs, and which has strong relations to convolutional neural networks. This same technique is valuable for approximating Gaussian processes with more general invariance properties. Finally, we revisit the derivation of the Gaussian process State Space Model, and discuss some subtleties relating to their approximation. We hope that this thesis illustrates some benefits of non-parametric models and their approximation in a non-parametric fashion, and that it provides models and approximations that prove to be useful for the development of more complex and performant models in the future. Table of contents List of figures xiii List of tables xix Nomenclature xxi Notation xxiii 1 Introduction 1 1.1 Bayesian Machine Learning . 2 1.1.1 Probability theory: a way to reason under uncertainty . 2 1.1.2 Bayesian modelling & inference . 3 1.1.3 Predictions and decisions . 6 1.1.4 Practical considerations . 8 1.2 Learning mappings with Gaussian processes . 9 1.2.1 Basis function models . 9 1.2.2 Notation . 12 1.2.3 From basis functions to function values . 13 1.2.4 Gaussian processes . 15 1.2.5 How many basis functions? . 17 1.3 Why approximate non-parametric models? . 21 1.4 Main contributions . 22 2 Sparse Gaussian process approximations 25 2.1 Desiderata for non-parametric approximations . 26 2.2 Explicit feature representations . 28 2.2.1 Random feature expansions . 28 2.2.2 Optimised features . 28 2.2.3 Desiderata . 29 x Table of contents 2.3 Inducing point approximations . 29 2.3.1 As likelihood approximations . 30 2.3.2 As model approximations . 31 2.3.3 Consistency with an approximate GP . 32 2.3.4 Deterministic Training Conditional (DTC) approximation . 33 2.3.5 Fully Independent Training Conditional (FITC) approximation . 36 2.4 Inducing point posterior approximations . 37 2.4.1 A class of tractable Gaussian process posteriors . 37 2.4.2 Variational inference . 40 2.4.3 Expectation propagation . 44 2.5 Conclusion . 45 3 Understanding the behaviour of FITC and VFE 47 3.1 Common notation . 48 3.2 Comparative behaviour . 48 3.2.1 FITC can severely underestimate the noise variance, VFE over- estimates it . 49 3.2.2 VFE improves with additional inducing inputs, FITC may ignore them . 51 3.2.3 FITC does not recover the true posterior, VFE does . 53 3.2.4 FITC relies on local optima . 54 3.2.5 VFE is hindered by local optima . 56 3.2.6 FITC can violate a marginal likelihood upper bound . 57 3.2.7 Conclusion . 59 3.3 Parametric models, identical bounds . 60 3.3.1 FITC and GP regression share a lower bound . 61 3.3.2 L as an approximation to FITC . 62 3.3.3 Conclusion . 65 4 Inter-domain basis functions for flexible variational posteriors 67 4.1 Related work . 68 4.2 Inter-domain inducing variables . 69 4.3 Inter-domain posteriors . 70 4.4 Basis functions for inter-domain projections . 72 4.4.1 Computational complexity . 73 4.4.2 A motivating example . 74 4.5 Experiments . 76 Table of contents xi 4.5.1 Sparsity of the resulting models . 77 4.5.2 Sparsity for known hyperparameters . 81 4.5.3 Compression of the posterior . 83 4.5.4 Computational comparison . 86 4.6 Discussion . 86 4.7 Conclusion & Outlook . 89 5 Convolutions and Invariances 91 5.1 Improving generalisation through invariances . 92 5.1.1 Model complexity and marginal likelihoods . 93 5.1.2 Invariances help generalisation . 95 5.2 Encoding invariances in kernels . 98 5.2.1 Computational issues . 99 5.2.2 Inter-domain variables for invariant kernels . 100 5.3 Convolutional Gaussian processes . 101 5.3.1 Constructing convolutional kernels . 102 5.3.2 Inducing patch approximations . 103 5.4 Required number of inducing patches . 104 5.4.1 Inducing points . 104 5.4.2 Inducing patches . 106 5.4.3 Comparing inducing points and inducing patches . 108 5.5 Translation invariant convolutional kernels . 110 5.5.1 Toy demo: rectangles . 110 5.5.2 Illustration: Zeros vs ones MNIST . 111 5.5.3 Full MNIST . 112 5.6 Weighted convolutional kernels . 113 5.6.1 Toy demo: Rectangles . 113 5.6.2 Illustration: Zeros vs ones MNIST . 114 5.6.3 Full MNIST . 114 5.7 Is convolution too much invariance? . 115 5.7.1 Convolutional kernels are not universal . 115 5.7.2 Adding a characteristic kernel component . 116 5.7.3 MNIST . 118 5.8 Convolutional kernels for colour images . 120 5.8.1 CIFAR-10 . 121 5.9 Comparison to convolutional neural networks . 123 5.10 Notes on implementation . 124 xii Table of contents 5.11 Conclusions . 125 6 New insights into random-input Gaussian process models 127 6.1 Issues with augmentation . 128 6.2 Conditioning in GP regression . 129 6.3 Adding uncertainty on the inputs . 130 6.3.1 Gibbs sampling . 132 6.3.2 Variational inference . 133 6.4 Gaussian Process State Space Models . 134 6.4.1 Model overview . 135 6.4.2 Graphical model . 136 6.4.3 Variational inference . 137 6.5 Conclusion . 138 7 Discussion 139 7.1 Non-parametrics and modern deep learning . 139 7.2 Summary of contributions . 141 7.3 Future directions . 142 7.4 Conclusion . 146 References 147 Appendix A Inter-domain inducing features 157 A.1 Time-Frequency Inducing Features . 157 Appendix B Inducing point updates for VFE and FITC 159 B.1 Adding a new inducing point . 159 B.2 The VFE objective function always improves when adding an additional inducing input . 160 B.3 The heteroscedastic noise is.