
Scalable Gaussian process inference using variational methods

Alexander Graeme de Garis Matthews

Department of Engineering University of Cambridge

This dissertation is submitted for the degree of Doctor of Philosophy

Darwin College September 2016

Declaration

I hereby declare that except where specific reference is made to the work of others, the contents of this dissertation are original and have not been submitted in whole or in part for consideration for any other degree or qualification in this, or any other university. This dissertation is my own work and contains nothing which is the outcome of work done in collaboration with others, except as specified in the text and Acknowledgements. This dissertation contains fewer than 65,000 words including appendices, bibliography, footnotes, tables and equations and has fewer than 150 figures.

Alexander Graeme de Garis Matthews September 2016

Acknowledgements

First I would like to thank my supervisor Zoubin Ghahramani. When I moved back to Cambridge I hoped that we would write about variational inference and indeed that was how it turned out. It has been a privilege to work so closely with him. He brings the best out in the people around him and has always supported my development. I am very much looking forward to continuing to work with him next year. My work with James Hensman began when he visited Zoubin in the summer of 2014. Since then we have worked together on the four related projects that eventually became this thesis. This working relationship was, in my view, an extremely productive one. I feel fortunate to have worked with James, who I consider to be a world expert in his area. I wish to thank my other collaborators. Richard Turner is a great person to work with. His door is always open and indeed on one visit to his office I had the debate that set chapter 3 in motion. I learnt a great deal about MCMC working with Maurizio Filippone on chapter 5. I will remember our marathon poster session at NIPS for a long time. I would like to thank Mark van der Wilk, Tom Nickson, and all the GPflow contributors who are listed on the project web page, along with Rasmus Munk Larsen who reviewed my contribution to TensorFlow. David MacKay, who sadly passed away this year, was an inspiration to me. Taking his now famous ITILA course changed the course of my career. He will be greatly missed. It is important that I acknowledge the thriving intellectual environment that is CBL in Cambridge. I would like to particularly highlight important conversations with Thang Bui and James Lloyd. This PhD was mostly funded by the UK Engineering and Physical Sciences Research Council. Thank you to them for their support. I am grateful to Carl Rasmussen and Stephen Roberts for examining my thesis. Any remaining inaccuracies are of course my own. Finally, on a personal note, I would like to thank my amazing wife Rachel and my family, particularly my parents who gave me every opportunity.

More detail on the mapping between the conference papers I wrote with my coauthors and the chapters of this thesis can be found in the publications section.

Abstract

Gaussian processes can be used as priors on functions. The need for a flexible, principled, probabilistic model of functional relations is common in practice. Consequently, such an approach is demonstrably useful in a large variety of applications. Two challenges of Gaussian process modelling are often encountered. These are dealing with the adverse scaling with the number of data points and the lack of closed form posteriors when the likelihood is non-Gaussian. In this thesis, we study variational inference as a framework for meeting these challenges. An introductory chapter motivates the use of stochastic processes as priors, with a particular focus on Gaussian process modelling. A section on variational inference reviews the general definition of Kullback-Leibler divergence. The concept of prior conditional matching that is used throughout the thesis is contrasted to classical approaches to obtaining tractable variational approximating families. Various theoretical issues arising from the application of variational inference to the infinite dimensional Gaussian process setting are settled decisively. From this theory we are able to give a new argument for existing approaches to variational regression that settles debate about their applicability. This view on these methods justifies the principled extensions found in the rest of the work. The case of scalable Gaussian process classification is studied, both for its own merits and as a case study for non-Gaussian likelihoods in general. Using the resulting algorithms we find credible results on datasets of a scale and complexity that was not possible before our work. An extension to include Bayesian priors on model hyperparameters is studied alongside a new inference method that combines the benefits of variational sparsity and MCMC methods. The utility of such an approach is shown on a variety of example modelling tasks. We describe GPflow, a new Gaussian process software library that uses TensorFlow. Implementations of the variational algorithms discussed in the rest of the thesis are included as part of the software. We discuss the benefits of GPflow when compared to other similar software. Increased computational speed is demonstrated in relevant, timed, experimental comparisons.

Publications

Chapter 3 expands on:

Alexander G de G Matthews, James Hensman, Richard E. Turner, and Zoubin Ghahramani. On Sparse Variational Methods and the Kullback-Leibler Divergence between Stochastic Processes. In 19th International Conference on Artificial Intelligence and Statistics, Cadiz, Spain, May 2016.

The key theorems are now stated and proved formally. Sections 3.3.4 and 3.5 are novel for this thesis. Section 3.4 has been rewritten.

Chapters 4 and 5 correspond respectively to:

James Hensman, Alexander G de G Matthews, and Zoubin Ghahramani. Scalable Variational Gaussian Process Classification. In 18th International Conference on Artificial Intelligence and Statistics, San Diego, California, USA, May 2015.

James Hensman, Alexander G de G Matthews, Maurizio Filippone, and Zoubin Ghahramani. MCMC for Variationally Sparse Gaussian Processes. In Advances in Neural Information Processing Systems 28, Montreal, Canada, December 2015.

I have personally rewritten these papers for this thesis. In particular, it was felt that ‘Scalable Variational Gaussian Process Classification’ could be made significantly clearer in light of the theoretical innovations of Chapter 3. There are new experiments and some experiments have been moved. Chapter 5 now highlights the concept of variational potentials that did not appear in the conference paper. Both of these chapters benefit from no longer having to meet length constraints and from being presented in a form harmonized with the other chapters.

Chapter 6 describes an open source software project available at: https://github.com/GPflow/gpflow

This is the first time the design and motivation of the project have been discussed in a paper. An up to date list of contributors can be found on the web page.

Table of contents

List of figures

List of tables

1 Stochastic processes as priors
   1.1 Stochastic processes
      1.1.1 The Kolmogorov extension theorem
      1.1.2 Gaussian processes
      1.1.3 Examples
      1.1.4 Limitations of the product σ-algebra
   1.2 Bayesian nonparametrics
      1.2.1 Why Bayesian nonparametrics?
      1.2.2 Bayes’ theorem for infinite dimensional models
      1.2.3 Using Gaussian processes as priors
      1.2.4 Hyperparameter selection
      1.2.5 Examples of Gaussian process regression
      1.2.6 Two challenges of Gaussian process modelling

2 Variational inference
   2.1 A general definition of KL-divergence
   2.2 Defining tractable approximating families Q
      2.2.1 Factorization assumptions
      2.2.2 Working with free form distributions
      2.2.3 Fixed form assumptions
      2.2.4 Prior conditional matching
   2.3 Treatment of hyperparameters
   2.4 An example of classical variational inference approaches
   2.5 An example of prior conditional matching

3 On Sparse Variational Methods and the Kullback-Leibler Divergence between Stochastic Processes
   3.1 Introduction
   3.2 Finite index set case
   3.3 Infinite index set case
      3.3.1 There is no useful infinite dimensional Lebesgue measure
      3.3.2 The KL-divergence between processes
      3.3.3 A general derivation of the sparse inducing point framework
      3.3.4 The effect of increasing the number of inducing points
   3.4 Augmented index sets
      3.4.1 The augmentation argument
      3.4.2 The chain rule for KL-divergences
      3.4.3 The augmentation argument is not correct in general
   3.5 Variational inducing random variables
   3.6 Examples
      3.6.1 Variational interdomain approximations
      3.6.2 Approximations to posteriors
   3.7 Conclusion

4 Scalable Variational Gaussian Process Classification
   4.1 Introduction
      4.1.1 Non-conjugate likelihoods
      4.1.2 Scalability challenges
      4.1.3 A specific realization: classification
      4.1.4 A summary of the challenge
      4.1.5 An outline for the rest of the chapter
   4.2 Background
      4.2.1 Non-conjugate approximations for GPC
      4.2.2 Sparse Gaussian processes
      4.2.3 Subset of data methods and subset of data inducing methods
      4.2.4 A coherent variational framework
      4.2.5 Stochastic variational inference
      4.2.6 Bringing the components together
   4.3 Sparse variational Gaussian process classification
   4.4 Multiclass classification
   4.5 Experiments

      4.5.1 A critical comparison of the FITC methods and VFE methods for regression data
      4.5.2 Demonstrative data and timed comparison
      4.5.3 Binary classification benchmark comparison
      4.5.4 Stochastic variational inference
      4.5.5 Multiclass classification experiments
   4.6 Conclusion

5 MCMC for Variationally Sparse Gaussian Processes
   5.1 Introduction
   5.2 Variational potentials
   5.3 A Bayesian treatment of the hyperparameters
   5.4 Sparse variational Monte Carlo
   5.5 A recipe for using variational MCMC in practice
   5.6 Experiments
      5.6.1 Binary classification
      5.6.2 Multiclass classification
      5.6.3 Log Gaussian Cox processes
   5.7 Conclusion

6 GPflow: A Gaussian process library using TensorFlow
   6.1 Introduction
   6.2 Existing Gaussian process libraries
   6.3 Objectives for a new library
   6.4 Key features for meeting the objectives
   6.5 TensorFlow
      6.5.1 The comparative advantages of using TensorFlow
      6.5.2 Contributing GP linear algebra requirements to TensorFlow
   6.6 GPflow
      6.6.1 Functionality
      6.6.2 Architectural considerations
      6.6.3 Project quality and usability considerations
   6.7 Experiments
   6.8 Conclusion

7 Conclusion
   7.1 Contributions

   7.2 Some general considerations
      7.2.1 A new theoretical beginning
      7.2.2 A third challenge

References

Appendix A Extending measures beyond the product σ-algebra

Appendix B Augmented models do not necessarily have the same variational optimum as unaugmented ones

Appendix C The equivalence between the inducing random variables framework and conditionally deterministic augmentation

Appendix D Properties of the multivariate Gaussian
   D.1 Marginal Gaussian distributions
   D.2 Derivatives of Gaussian integrals

Appendix E The Jacobian of the whitening transformation

Appendix F A C++ and Eigen implementation of Cholesky backpropagation for TensorFlow
   F.1 Comments on code
   F.2 TensorFlow source code

List of figures

1.1 A visualization of the cylinder sets $\pi_{X \to V}^{-1}(E)$. In this case the index set X = [0, 5], the subset of the index set V is {1, 2} and the Borel set E is [0, 1] × [0, 1]. The red and green functions are members of the cylinder set because they have values 0 ≤ f(1) ≤ 1 and 0 ≤ f(2) ≤ 1. The blue function is not a member of the cylinder set in question. Although all the example functions shown are continuous, there is nothing in the definition of a cylinder set that requires this.
1.2 Example draws from Gaussian processes with two different kernels. Top: Five sample functions with an RBF kernel with l = σ = 1. Bottom: Five sample functions with a Wiener kernel with λ = 0.3 and l = 1.
1.3 Gaussian process regression posteriors for the Snelson dataset with a Gaussian likelihood and two different kernels. Kernel and likelihood hyperparameters have been chosen by type-II maximum likelihood (see text). Top row: 3 posterior function samples for the two different kernels. Middle row: posterior mean and 2-σ credible intervals for the posterior function. Third row: posterior mean and 2-σ credible intervals for the posterior function for the output space. Left column: posterior for the RBF kernel. The learnt hyperparameters are σ = 0.87, l = 0.61 and ϵ = 0.27. The optimal log marginal likelihood was −33.8. Right column: posterior for the Wiener kernel. The learnt hyperparameters are λ = 0.92, l = 1.15 and ϵ = 0.25. Note, however, that because of the rescaling symmetry described in section 1.1.3 there is a whole family of equivalent λ and l combinations. The optimal log marginal likelihood was −42.5.
2.1 Left: Dividing the sample space Ω into a finite measurable partition B. Right: Another partition B̂ of Ω that is a refinement of the partition on the left.

2.2 A comparison of three different classical variational approximation methods on a two dimensional toy problem, as described in section 2.4. Top row: contours of the two dimensional density. Bottom row: a one dimensional marginal. This problem has a symmetry which makes both one dimensional marginals identical. Columns from left to right: The exact posterior, which corresponds to a Gaussian latent model with a classification likelihood (see text). The factorized mean field approximation to the posterior. The factorized Gaussian approximation to the posterior. The structured Gaussian approximation to the posterior.
2.3 An example of a sparse variational approximation to a Gaussian process regression posterior. The different panels show the optimal solution with a varying number of inducing points and hyperparameters fixed to the optimal type II maximum likelihood value. The data points are shown in red. The inducing input positions are shown as black markers (the output position of these markers is chosen for visualisation and has no other significance). The green lines show the mean and 2-σ predictive credible intervals given by the approximation. The bottom plot shows the exact posterior for comparison.
2.4 Quantitative metrics of approximation quality for the problem shown in figure 2.3. The top and middle figures show both possible KL-divergences between the approximating process and the posterior process, both with hyperparameters fixed to the optimal type II maximum likelihood value. The bottom panel shows the hold out log likelihood for inducing point approximations obtained when hyperparameters are also optimized. The hold out log likelihood for predictions from the exact posterior is shown as a dashed red line.

4.1 The version of the Snelson dataset used for our experiments. The red points show the training data. The green lines show the posterior mean and two credible regions for the observations using a non-sparse model. This exact model, which is described in the text, serves as a reference point for comparison with two sparse approximations.

4.2 Comparison of FITC and VFE methods to the exact solution on the Snelson dataset in the case M = N. The left column corresponds to VFE and the right column corresponds to FITC. Top row: Predictions of the two approximations after minimization of the approximation objective. Middle row: A perfect solution may be attained by keeping the inducing inputs where they were initialized at the data inputs. This row shows the initial inducing point position against the inducing point position after optimization. Bottom row: The approximation minimization objective and the negative hold out log likelihood as optimization proceeds.
4.3 Hold out negative log probabilities and error rates for the Banana dataset, using KLSp and GFITC, as a function of wall clock time. The effect of varying M, the number of inducing points, is also shown.
4.4 Hold out negative log probabilities and error rates for the Image dataset, using KLSp and GFITC, as a function of wall clock time. The effect of varying M, the number of inducing points, is also shown.
4.5 The effect of increasing the number of inducing points for the Banana dataset. Each row corresponds to a different sparse classification method along with its non-sparse equivalent. The rows respectively correspond to the proposed variational KLSp method and the Generalized FITC method. The different columns correspond to different numbers of inducing points. For each plot, the orange and blue markers indicate the training data by class, the black lines indicate the decision boundary and the black markers indicate the learnt inducing input positions.
4.6 Classification performance on the airline delay dataset as a function of time. The red horizontal line depicts performance of a linear classifier, whilst the blue line shows performance of the proposed stochastically optimized variational KLSp method.
4.7 Visualizing the inducing inputs for MNIST. The figure shows three groups of three inducing images. The left group shows three inducing inputs at their K-means initialization. The middle group shows the same inducing inputs after optimization. The right group shows the vector difference between the initial and final states.

5.1 Testing the variational MCMC method on the Image dataset (Rätsch et al., 2001). Left: the difference in hold out log predictive density for differing numbers of inducing points for the variational MCMC method with and without variational optimization of the inducing points under stage 1 of the approximation recipe given in section 5.5. Right: the difference in hold out log predictive density comparing variational MCMC to using the Gaussian approximation only.
5.2 Comparison of factorized Gaussian and variational MCMC for a synthetic multiclass classification problem. Left: The factorized Gaussian approximation. The coloured points show data points for each class. The coloured lines show posterior probability contours for each class. The contours are chosen to be at [0.3, 0.95, 0.99]. The black points show the location of the inducing points. Middle: The same plot but using free form variational MCMC. Right: Some examples of posterior pairwise correlations of latent functions for different classes but at a fixed point. The posterior shows correlations and non-Gaussianity that would not be captured by a factorized Gaussian approximation.
5.3 Comparison of exact MCMC, variational MCMC, and the Gaussian variational approximation on the coal mining disasters dataset. Left: Approximate posterior rates of the three methods. Vertical bars show data points. Right: A comparison of the posterior samples for the kernel hyperparameters generated by the two MCMC methods. The MAP estimate for the Gaussian approximation was (12.06, 0.55).
5.4 Pine sapling data. From left to right: reported locations of pine saplings; posterior mean intensity on a 32 × 32 grid using full MCMC; posterior mean intensity on a 32 × 32 grid with variational MCMC using 225 inducing points; posterior mean intensity on a 64 × 64 grid using variational MCMC with 225 inducing points.

6.1 A simple TensorFlow implementation of the log marginal likelihood of a Gaussian process with a Gaussian likelihood, as discussed in section 6.5.2. In conjunction with the text of that section, most operations used are self explanatory, except perhaps reduce_sum which sums over all input elements. In GPflow, the equivalent code is distributed across multiple functions for architectural reasons, as will be discussed in section 6.6.2.
6.2 A GPflow implementation of a linear kernel using the Python programming language. The details are discussed in section 6.6.1.

6.3 An extract from the Python GPR class which implements Gaussian process regression in GPflow. The details are discussed in section 6.6.2.
6.4 A comparison of iterations of stochastic variational inference per second on the MNIST dataset for GPflow and GPy. Error bars shown represent one standard deviation computed from five repeats of the experiment.

B.1 Comparing an augmented KL-divergence to an unaugmented one on a simple randomly generated data set. Top: the randomly generated regression data. Bottom: a comparison of the KL-divergence as a function of the single inducing input value. Clearly the two functions have different minima.

List of tables

4.1 A summary of the most relevant existing approaches. The presence of the question mark for (Hensman et al., 2013) is explained in section 4.2.4. SOD stands for ‘subset of data’.
4.2 A comparison of the time complexity of various computational operations on the density of a generic function $f_T$, where T are not inducing points, for the variational approximation, the prior and a Gaussian posterior. |T| is the number of points in T, N is the number of data points and M is the number of inducing points. The prior mean function is assumed to be zero.
4.3 Comparison of the performance of the proposed KLSp method and Generalized FITC on benchmark datasets. In each case we show the median hold out negative log probability and the 2σ confidence intervals.
4.4 A summary of some relevant classification method accuracies for the MNIST dataset. Many of the results can be found on LeCun’s website (LeCun, 2016).

5.1 MNIST memory cost in millions (2 s.f.) of real numbers for representing the distribution of the inducing outputs. We use 500 inducing points. The sampling approximation used 300 samples. The full Gaussian approximation would mean representing a full covariance matrix across all ten latent functions. Note that the sampling approximation uses the factorized approximation as an intermediate step. All of the methods need to store the inducing inputs, which are shared across latent functions and require storage of 0.39 million real numbers.

6.1 A summary of the features possessed by existing Gaussian process libraries at the time of writing. OO stands for object oriented. In the GPU column, GPLVM denotes Gaussian process latent variable model and SVIGPC is the stochastic variational inference Gaussian process classification discussed in chapter 4 and in the experiments section 6.7. For more discussion of the comparison between GPy and GPflow, see the text.
6.2 A table showing the inference classes in GPflow and the model options they support. The details are discussed in the text.

Chapter 1

Stochastic processes as priors

A stochastic process is a distribution on a potentially infinite set of random variables. A Gaussian process is a type of stochastic process. The study of the use of stochastic processes as Bayesian priors is called Bayesian nonparametrics. The aims of this chapter are to understand and motivate the use of Gaussian processes as priors; to understand the challenges that arise when using them in practice; and to set the stage for the solutions we advocate in the rest of this thesis. Our aims have some overlap with previous literature introducing Gaussian processes, for instance the more general text of Rasmussen and Williams (2006). Our exposition emphasizes the mathematics of the Kolmogorov extension theorem and other central ideas that will be needed during this thesis. Along with the variational inference concepts introduced in chapter 2, these are the key foundations to prepare the reader for the main theoretical contributions of the thesis, which begin in earnest in chapter 3. In detail, we start with the relatively technical section 1.1, explaining the Kolmogorov extension theorem and its application to Gaussian processes. Next, in the more conceptual section 1.2.1, we motivate the use of stochastic processes as priors. Section 1.2.2 then discusses some technical issues that arise when applying Bayes’ theorem in this context. Gaussian process modelling is then introduced in sections 1.2.3, 1.2.4 and 1.2.5. We finish by highlighting two common challenges of using Gaussian processes in this way which constitute major themes of this thesis (section 1.2.6). We do not discuss the standard formulation of probability in terms of measure theory, instead leaving this to existing texts (Billingsley, 1995; Capinski and Kopp, 2004). Measure theory is used in the majority of this chapter. However, sections 1.2.1, 1.2.4, 1.2.5 and 1.2.6 do not require measure theory and should be of standalone interest.

1.1 Stochastic processes

In this section we explain one route to formulate probability distributions on infinite dimensional spaces. We will concentrate on the statement of the Kolmogorov extension theorem without proof; the application of the theorem to Gaussian processes; and some limitations of the theorem.

1.1.1 The Kolmogorov extension theorem

The Kolmogorov extension theorem concerns the existence of probability distributions for an infinite dimensional object based on a consistency relation for finite dimensional marginal distributions of that object. In what follows the infinite dimensional objects in question will be functions f. For our exposition of the theorem we were influenced by those of Sengupta (2014) and Billingsley (1995) and our notation is influenced in part by the former. We will first establish some notation. Consider a function f mapping an index set X to the set of real numbers, $f : X \to \mathbb{R}$. The index set X may be finite, countably infinite or uncountably infinite. Entirely equivalently we may write $f \in \mathbb{R}^X$ and this notation leads us to consider functions as vectors with elements indexed by members of X. In the case where X is infinite this view corresponds to an infinite dimensional vector. Again equivalently we can use sequence notation $(f(x))_{x \in X}$. Pursuing this direction, we also define set indexing of the function. If $S \subseteq X$ is some subset of the index set, then $f_S := (f(x))_{x \in S}$ and we may straightforwardly extend this definition to single elements of the index set, $f_x := f_{\{x\}}$. The whole function f may be denoted $f_X$ and we will make use of both versions of this. When the latter is chosen it will often be to emphasize the dependence on the whole index set.

Consider a projection map $\pi_{U \to V}$ that for $V \subset U \subseteq X$ has the following property:

$$\pi_{U \to V} : \mathbb{R}^U \to \mathbb{R}^V : (f(x))_{x \in U} \mapsto (f(x))_{x \in V} \qquad (1.1)$$

We denote by $\mathcal{B}(\mathbb{R}^D)$ the Borel σ-algebra of a space $\mathbb{R}^D$. A cylinder set is the pre-image under the projection $\pi_{X \to V}$ of a Borel set $E \in \mathcal{B}(\mathbb{R}^V)$ for finite V, denoted:

$$\pi_{X \to V}^{-1}(E) = \{ f \in \mathbb{R}^X : \pi_{X \to V}(f) \in E \} \qquad (1.2)$$

In other words, the cylinder set $\pi_{X \to V}^{-1}(E)$ is the set of all functions whose values on the finite set V lie within the Borel set E. We show this schematically using an example in figure 1.1. The σ-algebra on which the probability measure over functions is defined in the Kolmogorov extension theorem is the product σ-algebra. The product σ-algebra is the smallest σ-algebra which contains all the cylinder sets $\pi_{X \to V}^{-1}(E)$ for all finite $V \subset X$ and Borel sets E. The product σ-algebra contains many events of interest but there are some events that are potentially of interest that are not included. This will be discussed further in section 1.1.4. Having set up notations and definitions we are now ready to state the Kolmogorov extension theorem.

Theorem: Kolmogorov. If a family of Borel probability measures, labelled by their corresponding finite index set, $\mu_V$, obeys the following consistency relation for all finite $V \subset X$ and all second finite index sets $U \supset V$

$$\mu_U(\pi_{U \to V}^{-1}(E)) = \mu_V(E) \qquad (1.3)$$

then there is a unique probability measure on the product σ-algebra, $\mu_X$, with the property:

$$\mu_X(\pi_{X \to V}^{-1}(E)) = \mu_V(E) \quad \forall E \in \mathcal{B}(\mathbb{R}^V) \,. \qquad (1.4)$$

The collection of measures $\mu_V$ are called the finite dimensional distributions or finite dimensional marginals of the measure $\mu_X$.

An intuitive way to think about the result is to consider it in reverse. If such a measure $\mu_X$ existed it would necessarily obey equation (1.4) for consistency reasons. The Kolmogorov theorem tells us we can go the other way: if there is a collection of measures that obey the consistency property, we know that there exists a unique probability measure on the product σ-algebra which has those measures as finite dimensional marginals. Note that the result is very general and holds even if X is uncountably infinite.
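To make the cylinder set machinery concrete, the following small Python sketch (our illustration, not part of the thesis) represents functions by their values on a grid over X = [0, 5] and checks membership of the cylinder set used in figure 1.1, where V = {1, 2} and E = [0, 1] × [0, 1].

```python
import numpy as np

# Illustration of cylinder set membership (our sketch). A "function" is
# represented by its values on a grid over X = [0, 5]. Membership of the
# cylinder set with V = {1, 2} and E = [0, 1] x [0, 1] constrains the
# function only at the finitely many index points in V.

def in_cylinder_set(f, grid, V=(1.0, 2.0), box=(0.0, 1.0)):
    lo, hi = box
    # pi_{X->V}: project the function onto its values at the points of V.
    f_V = np.array([f[np.argmin(np.abs(grid - v))] for v in V])
    return bool(np.all((lo <= f_V) & (f_V <= hi)))

grid = np.linspace(0.0, 5.0, 501)
# Three continuous example functions, mimicking figure 1.1: the first two
# lie in the cylinder set, the third does not.
examples = [0.5 * np.sin(grid) + 0.5,
            0.3 * np.sin(grid) + 0.6,
            1.5 * np.sin(grid) - 2.0]
for i, f in enumerate(examples):
    print(f"function {i}: in cylinder set = {in_cylinder_set(f, grid)}")
```

Note that, as in the figure, continuity plays no role: only the two function values at the points of V are ever inspected.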

1.1.2 Gaussian processes

Here we discuss the relationship between Gaussian processes and their finite dimensional marginal distributions. We denote by $\mathcal{N}\{m, \Sigma\}(\cdot)$ a multivariate Gaussian measure which is parameterized by a mean vector $m \in \mathbb{R}^D$ and a D × D positive semi-definite covariance matrix Σ. We can define mean vectors m and matrices Σ for a finite subset $U \subset X$ using a mean function $\lambda : X \to \mathbb{R}$ and kernel function $K : X \times X \to \mathbb{R}$ respectively. The components of the mean vector can be obtained using the mean function:

$$m(U)_i = \lambda(x_i) \quad x_i \in U \,. \qquad (1.5)$$

The components of the matrix Σ are given by

$$\Sigma(U)_{i,j} = K(x_i, x_j) \quad x_i, x_j \in U \times U \,. \qquad (1.6)$$
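As a concrete illustration of equations (1.5) and (1.6), the following sketch of ours builds the mean vector and Gram matrix for a finite subset of the index set; it anticipates the RBF kernel defined in section 1.1.3, and any valid kernel could be substituted.

```python
import numpy as np

# Building m(U) and Sigma(U) of equations (1.5) and (1.6) from a mean
# function and a kernel function (our sketch; the RBF kernel of section
# 1.1.3 is used as the example kernel).

def mean_function(x):
    return 0.0 * x  # the commonly used zero mean function

def rbf_kernel(x1, x2, sigma=1.0, l=1.0):
    return sigma ** 2 * np.exp(-0.5 * (x1 - x2) ** 2 / l ** 2)

U = np.array([0.3, 1.7, 2.2, 4.0])            # a finite subset of the index set
m_U = mean_function(U)                         # equation (1.5)
Sigma_U = rbf_kernel(U[:, None], U[None, :])   # equation (1.6) via broadcasting

# The defining property of a kernel: every Gram matrix is positive semi-definite.
assert np.all(np.linalg.eigvalsh(Sigma_U) > -1e-10)
print(Sigma_U.round(3))
```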


Fig. 1.1 A visualization of the cylinder sets $\pi_{X \to V}^{-1}(E)$. In this case the index set X = [0, 5], the subset of the index set V is {1, 2} and the Borel set E is [0, 1] × [0, 1]. The red and green functions are members of the cylinder set because they have function values 0 ≤ f(1) ≤ 1 and 0 ≤ f(2) ≤ 1. The blue function is not a member of the cylinder set in question. Although all the example functions shown are continuous, there is nothing in the definition of a cylinder set that requires this.

The defining property of a kernel function is that all such matrices be positive semi-definite. The matrices Σ(U), $U \subset X$, are referred to as Gram matrices. Multivariate Gaussian measures obey a marginalization property. For the case of multivariate Gaussian measures specified using a mean and kernel function on an index set it can be stated in the following way. For $V \subset U \subset X$ with U, V finite

$$\mathcal{N}\{m(U), \Sigma(U)\}(\pi_{U \to V}^{-1}(E)) = \mathcal{N}\{m(V), \Sigma(V)\}(E) \quad \forall E \in \mathcal{B}(\mathbb{R}^V) \,. \qquad (1.7)$$

The accumulation of notation is rather dense here, so we restate this marginalization property in words. First we obtain a mean vector and Gram matrix using the mean and covariance functions on some finite subset U of the overall index set X. We can associate with this mean vector and Gram matrix a multivariate Gaussian measure. This multivariate Gaussian measure will be consistent under marginalization with the multivariate Gaussian measure we would have obtained had we used the subset $V \subset U$ in place of U itself. Equation (1.7) tells us that the family of multivariate normal measures associated with the mean function λ and the kernel function K has precisely the consistency property necessary for the Kolmogorov extension theorem. Therefore there is a measure on the set of all functions, with events in the product σ-algebra, which has this family of measures as finite dimensional marginals. This probability distribution is referred to as a Gaussian process.¹ In the context of Gaussian processes, the kernel function is also called the covariance function because it gives the covariance between two of the random function values. We will use these terms interchangeably.
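The consistency property in equation (1.7) is easy to check numerically for Gaussian families: marginalizing the measure built on U down to $V \subset U$ simply selects the corresponding sub-blocks of the mean vector and Gram matrix. A small sketch of ours:

```python
import numpy as np

# Numerical check of the marginalization consistency of equation (1.7).
# For Gaussians, marginalizing over the extra points of U keeps the
# sub-block of Sigma(U) corresponding to V, which must match Sigma(V)
# built directly from the kernel.

def rbf_kernel(x1, x2, sigma=1.0, l=1.0):
    return sigma ** 2 * np.exp(-0.5 * (x1 - x2) ** 2 / l ** 2)

V = np.array([0.5, 2.0])
U = np.concatenate([V, np.array([3.1, 4.7])])  # U contains V as its first block

Sigma_U = rbf_kernel(U[:, None], U[None, :])
Sigma_V = rbf_kernel(V[:, None], V[None, :])

assert np.allclose(Sigma_U[:2, :2], Sigma_V)
print("finite dimensional marginals are consistent under marginalization")
```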

1.1.3 Examples

We here give two examples of covariance functions. Many more examples are discussed by Rasmussen and Williams (2006). First let the index set be the real line $\mathbb{R}$. The following is a valid kernel for that index set:

$$K(x_1, x_2) = \sigma^2 \exp\left( -\frac{1}{2l^2} (x_1 - x_2)^2 \right) \qquad (1.8)$$

The kernel is commonly used and goes by a number of names: the Radial Basis Function (RBF) kernel, the Gaussian kernel, and the Exponentiated Quadratic kernel. The hyperparameter $\sigma^2$ controls the single variable marginal variance of the process, which is constant. The parameter l controls the length scale over which the process is correlated. The kernel is stationary: the implied distribution on functions is translationally invariant. Some draws from the RBF kernel are shown in figure 1.2 (top). Second let the index set be the positive real line [0, ∞). The following is a valid kernel for that index set.

$$K(x_1, x_2) = \lambda^2 \min\left( \frac{x_1}{l}, \frac{x_2}{l} \right) \qquad (1.9)$$

We will call this kernel the Wiener kernel because it has the finite dimensional marginals of a Wiener process. The Wiener process (Wiener and Masani, 1976) is a mathematical model of Brownian motion (Brown, 1828; Einstein, 1905) and arises as the careful formulation of a continuous time random walk. The parameter l > 0 rescales the input parameter and the parameter λ > 0 is a scaling parameter on the output space. Some draws from the Wiener kernel are shown in figure 1.2 (bottom). The marginal variance is $\lambda^2 x$. Even if we extended the index set to the whole real line, the corresponding process would not be stationary. The process does have the Markov property: the future evolution of the process is conditionally independent of the past given the present state. The kernel has a certain transformation invariance property. If we take $\beta \in (0, \infty)$ and use it to create new parameters $\tilde{\lambda} = \sqrt{\beta}\lambda$ and $\tilde{l} = \beta l$ then the kernel is unchanged.

¹The use of the term ‘process’ is a historical artefact of the case where the index set X was a time like variable indexed using the real line, but the terms ‘Gaussian process’ and more generally ‘stochastic process’ are now often used in relation to any index set X and we will follow this convention.


Fig. 1.2 Example draws from Gaussian processes with two different kernels. Top: Five sample functions with an RBF kernel with l = σ = 1. Bottom: Five sample functions with a Wiener kernel with λ = 0.3 and l = 1.

These properties of the kernel mean that a close investigation of any line segment of the sample draws would reveal a ‘random fractal’ like behaviour. Although there would not be exact self similarity, the random function behaviour would look similar at the different scales.
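Draws like those in figure 1.2 can be generated by sampling the finite dimensional marginals on a grid. The sketch below is ours; a small ‘jitter’ term is added to the Gram matrix diagonal for numerical stability, and the grid avoids x = 0 where the Wiener marginal variance vanishes.

```python
import numpy as np

# Sampling finite dimensional marginals of the two Gaussian processes of
# section 1.1.3 on a grid, as in figure 1.2 (our sketch).

def rbf_kernel(x1, x2, sigma=1.0, l=1.0):
    return sigma ** 2 * np.exp(-0.5 * (x1 - x2) ** 2 / l ** 2)

def wiener_kernel(x1, x2, lam=0.3, l=1.0):
    return lam ** 2 * np.minimum(x1 / l, x2 / l)

rng = np.random.default_rng(0)
x = np.linspace(1e-3, 10.0, 200)   # avoid x = 0, where the Wiener variance is 0

for name, kern in [("RBF", rbf_kernel), ("Wiener", wiener_kernel)]:
    K = kern(x[:, None], x[None, :]) + 1e-8 * np.eye(len(x))  # jitter
    L = np.linalg.cholesky(K)
    samples = L @ rng.standard_normal((len(x), 5))  # five sample functions
    print(name, "sample draws shape:", samples.shape)
```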

1.1.4 Limitations of the product σ-algebra

There are certain events that one might want to consider in probabilistic modelling that are not in the product σ-algebra and are therefore not specified for functions defined using the Kolmogorov extension theorem. In this section we will explain this issue and approaches to consistently defining probabilities for the missing events. The presentation we give is influenced by those of Adler (1981) and Billingsley (1995). Consider a simple motivating example. Imagine that we find ourselves playing the role of a physicist using the Wiener process as a model of one dimensional particle diffusion. Suppose we are interested in a physical boundary at f(x) = c. Perhaps there is a change in the diffusion constant here, or a membrane. We might be interested in the question: ‘What is the probability that the particle has not had a position greater than or equal to c at any time up to and including time x = T?’ As a well posed question this has a precise mathematical meaning. We are interested in the probability of the event:

$$E_T = \{ f : f(x) < c \;\; \forall x \in [0, T] \} \qquad (1.10)$$

Taking this event, we might then hope to evaluate its probability using the Gaussian process measure $\mu_X$ resulting from the Kolmogorov extension theorem with zero mean function and a Wiener kernel. The problem that we will encounter, unfortunately, is that the event $E_T$ is not a member of the product σ-algebra and is hence not measurable. Nor is this the only relevant example. The questions: ‘What is the probability the function will be continuous?’ and ‘What is the probability the function will be differentiable?’ also correspond to events that are not covered by the product σ-algebra. We need some way to

‘extend’ the measure $\mu_X$ to a larger σ-algebra in a way that agrees with the original definition when it applies. Technically, however, this is not at all straightforward. More discussion of this point and of separability is given in Appendix A. Using such ideas one can find an extension of the RBF kernel Gaussian process that has sample draws which are almost surely both continuous and everywhere infinitely differentiable, and a version of the Wiener kernel Gaussian process that almost surely has sample draws which are continuous but nowhere differentiable. In this section we have drawn the reader’s attention to some important caveats in the use of the Kolmogorov extension theorem, but for many cases of interest in probabilistic modelling using Gaussian processes, including the majority of cases in this thesis, the product σ-algebra will be sufficient. The exception in this work is the example extensions of the sparse variational inference scheme presented in section 3.6, where we consider random variables based on integrals of Gaussian processes.
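Although $E_T$ itself lies outside the product σ-algebra, its restriction to any finite grid of times is measurable, and that gridded event can be explored by Monte Carlo. The sketch below (ours) does this for a standard Wiener process; note that the gridded event contains $E_T$, so the estimate upper bounds the continuous time probability, and the gap between the two is exactly the subtlety discussed above.

```python
import numpy as np

# Monte Carlo exploration of E_T = {f : f(x) < c for all x in [0, T]} for a
# standard Wiener process on a finite time grid (our sketch). The gridded
# event constrains f at only finitely many points and therefore contains
# E_T: the estimate is an upper bound on the continuous time probability.

rng = np.random.default_rng(1)
c, T, n_steps, n_paths = 1.0, 1.0, 1000, 20000
dt = T / n_steps

increments = rng.standard_normal((n_paths, n_steps)) * np.sqrt(dt)
paths = np.cumsum(increments, axis=1)
stayed_below = np.all(paths < c, axis=1)
print("estimated probability:", stayed_below.mean())
# For reference, the reflection principle gives the continuous time value
# P(max f < c on [0, T]) = 2 * Phi(c / sqrt(T)) - 1, approximately 0.683 here.
```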

1.2 Bayesian nonparametrics

In this section we discuss the motivation of Bayesian nonparametrics (BNP) and then discuss how to apply Bayes’ theorem to infinite dimensional models. Finally we talk about the specific case of Gaussian processes.

1.2.1 Why Bayesian nonparametrics?

The idea of using infinite dimensional objects for priors is the core concept of Bayesian nonparametrics (BNP). The philosophy and motivation of this area has been well discussed by a number of authors (Ghosh and Ramamoorthi, 2003; Hjort et al., 2010; Orbanz and Teh, 2010). For this thesis we would particularly highlight two works because they have had a significant influence on the ideas presented. A review by Ghahramani (2012) contains an approachable discussion of the philosophy behind Bayesian nonparametric models. Schervish’s take on the foundations of statistics (1995) is notable not only for providing a compelling motivation for nonparametric Bayesian thinking, but also for presenting it next to frequentist methods in a unified view. Given the excellent range of discussion on the motivation question we will only highlight some key points for the perspective of this work. The Bayesian/non-Bayesian debate is still alive despite at least a century of discussion. This is despite a strong theoretical basis for Bayesian reasoning, for instance in the early axiomatic derivations of Cox (1946) and the work of de Finetti (1974). The central tenet is that in order to be coherent the only way to represent and reason with uncertainty is to use the rules of probability. The differences between frequentist and Bayesian paradigms are well understood through the lens of decision theory. See for instance chapter 3 of Schervish’s book (1995). An additional consideration that is particularly relevant to this thesis is that of computational constraints. Although the Bayesian framework tells us exactly how to go from prior to posterior in theory, computationally this can sometimes be very difficult. A pragmatic answer, and one that is adopted in this thesis, is to provide, according to some reasonable criteria, the best approximations we can for a useful class of models. The more general question of how to act rationally in the presence of computational constraints from a Bayesian perspective is still, in our opinion, in its infancy. The nonparametric ingredient is perhaps best motivated by comparison with non-Bayesian nonparametric methods such as K-nearest neighbour classification, which is in fact how such methods came about historically. The K-nearest neighbour classification algorithm does not rely on a parametric expression of the discrimination function. Instead it ultimately relies on the fact that with enough data a given test example should be close to other examples of its true class as measured by the chosen distance metric. This weaker assumption about the discrimination function means that the K-nearest neighbour algorithm will perform well asymptotically in a variety of situations. There are however some criticisms of such a classical nonparametric model. It is difficult to incorporate a principled treatment of predictive uncertainty, to generalize the algorithm, or to incorporate prior knowledge. All of these issues motivate a probabilistic analogue of nonparametric methods. It is worth noting that in the probabilistic domain the label nonparametric needs to be treated with care. Bayesian ‘nonparametric’ models actually have infinitely many parameters which are given a Bayesian prior. For instance, by using a Gaussian process we can place a prior on the function values over an infinite index set. This is often much less constraining in terms of model flexibility than a parametric Bayesian approach.
Again, the implications of limited computational resources for the nonparametric argument have to be considered. This is true even in the case of the K-nearest neighbour classifier. With the simplest implementation, as new data arrives we have to store all of it and the cost of prediction grows as we need to compare against many different examples. It seems desirable to compress or simplify the prediction process with as little distortion as possible. This view is certainly relevant to what follows. The inducing point approximations we study use pseudo data that is tuned to bring the approximation close to the full nonparametric posterior.

1.2.2 Bayes’ theorem for infinite dimensional models

In BNP the priors used are sufficiently complex mathematically that we will often need to use measure theory. The equations of Bayesian inference can look somewhat different at this level of abstraction and for this reason we review them here. To do this we will broadly follow the presentation of Schervish (1995) which is a good starting point for more detail. In this thesis we will usually be interested in regression models or more generally discriminative models where we are presented with input and output data and wish to represent the mapping between the two. Adapting Bayes’ theorem to this purpose from Schervish’s expression for a dominated model (Schervish, 1995) we obtain:

$$\frac{d\hat{P}}{dP}(f) = \frac{p_X(Y|f)}{p(Y)} \,. \qquad (1.11)$$

Here P is the prior measure over the functional mapping f from inputs to outputs, $\hat{P}$ is the posterior over the functional mapping and the derivative on the left hand side is a Radon-Nikodym derivative. Y is the output data. The function $p_X(Y|\cdot)$ is the likelihood, which maps to the non-negative reals. $p(Y) := \int p_X(Y|f)\, dP(f)$ is the marginal likelihood. The dominating measure in this expression is the prior. If we have a σ-finite third measure µ which dominates the prior then the chain rule for Radon-Nikodym derivatives implies that

$$\frac{d\hat{P}}{d\mu}(f) = \frac{d\hat{P}}{dP}(f)\,\frac{dP}{d\mu}(f) \qquad (1.12)$$

$$\frac{d\hat{P}}{d\mu}(f) = \frac{p_X(Y|f)}{p(Y)}\,\frac{dP}{d\mu}(f) \,. \qquad (1.13)$$

In the case where f is finite dimensional and the prior has density with respect to Lebesgue measure m we can take µ = m and recover the familiar expression for Bayes’ theorem with densities. As discussed in section 3.3.1, there is no sensible infinite dimensional Lebesgue measure. We therefore need to use equation (1.11) when working with dominated infinite dimensional models. By the defining property of Radon-Nikodym derivatives, we may entirely equivalently write equation (1.11) in integral form, whereupon it gives the posterior probability of an event E.

$$\hat{P}(E) = \int_E \frac{p_X(Y|f)}{p(Y)}\, dP(f) \qquad (1.14)$$

In many cases the likelihood only actually depends on the function values $f_D$ at a finite subset $D \subseteq X$ of the index set, with |D| = N. We choose the symbol D because this subset often corresponds to the data inputs. $f_D$ is a finite vector as opposed to the potentially infinite function f. This can be expressed as:

$$p_X(Y|f) = p(Y|\pi_{X \to D}(f)) \qquad (1.15)$$

where $\pi_{X \to D} : \mathbb{R}^X \to \mathbb{R}^D$ is a projection as before and $p(Y|\cdot)$ is a likelihood function that depends only on $f_D$, a finite vector.² Under this assumption the discriminative Bayes’ theorem becomes

$$\frac{d\hat{P}}{dP}(f) = \frac{p(Y|\pi_{X \to D}(f))}{p(Y)} \qquad (1.16)$$

where, in this case, the marginal likelihood has a simplified expression

$$p(Y) = \int p(Y|f_D)\, dP_D(f_D) \,. \qquad (1.17)$$

²We distinguish the likelihood $p_X(Y|f)$ for the full function $f_X$ from the likelihood $p(Y|f_D)$ for the data function $f_D$ only. These are different mathematical objects. The notation for $p(Y|f_D)$ has been chosen so as to match standard notation as closely as possible.

Here $P_D$ is the prior marginal distribution of the data function values. In general $\mu_C$ with $C \subset X$ will denote the marginal distribution of measure µ for the function values $f_C$. We could also, if required, find an integral representation for equation (1.16). Now we consider a special case of equation (1.16). Define a finite dimensional vector $f_*$ of predictive function points that are not data points. Consequently $f_{* \cup D}$ is the finite vector corresponding to the function at the predictive points and the data points. If we assume that the subset E has the specific structure of a cylinder set $\pi_{X \to * \cup D}^{-1}(C)$ then

$$\hat{P}(\pi_{X \to * \cup D}^{-1}(C)) = \hat{P}_{* \cup D}(C) = \int_{\pi_{X \to * \cup D}^{-1}(C)} \frac{p(Y|\pi_{X \to D}(f))}{p(Y)}\, dP(f) \qquad (1.18)$$

$$\hat{P}_{* \cup D}(C) = \int_C \frac{p(Y|\pi_{D \cup * \to D}(f_{D \cup *}))}{p(Y)}\, dP_{D \cup *}(f_{D \cup *}) \,. \qquad (1.19)$$

This probability therefore only depends on a finite dimensional integral.

1.2.3 Using Gaussian processes as priors

We now take the case where the prior measure P is a Gaussian process with zero mean and covariance function K. Here we will assume that the likelihood only depends on the function values $f_D$. Under this assumption the discriminative Bayes’ theorem of equation (1.16) applies. We assume in this section that the Gaussian prior has nondegenerate marginal distributions, which means the prior finite dimensional marginals have density with respect to Lebesgue measure. The posterior process $\hat{P}$ is uniquely specified on the product σ-algebra by its finite dimensional marginals, which also have density with respect to Lebesgue measure. The joint posterior density of $f_D$ and $f_*$ is given by

$$p(f_D, f_*|Y) = \frac{p(Y|f_D)\, p(f_*, f_D)}{p(Y)} \qquad (1.20)$$

$$p(f_D, f_*|Y) = \frac{p(Y|f_D)\, p(f_D)\, p(f_*|f_D)}{p(Y)} \,. \qquad (1.21)$$

We have introduced some new notation. $p(f_D)$ and $p(f_*, f_D)$ are the densities with respect to Lebesgue measure of the Gaussian process prior marginal distributions $P_D$ and $P_{D \cup *}$ respectively. $p(f_D, f_*|Y)$ is the density of the posterior marginal distribution $\hat{P}_{D \cup *}$ with respect to Lebesgue measure. $p(f_*|f_D)$ is the prior conditional density of $f_*$ conditioned on $f_D$.

It is informative to understand the relationship between equations (1.18) and (1.20). Equation (1.18), which applies to cases other than Gaussian processes, is an integrated version of equation (1.20) and it does not assume densities with respect to Lebesgue measure.

We use the notation $K_{A,B}$ to denote a matrix whose elements correspond to the kernel function evaluated on each pair of elements in the set $A \times B$. For example, the Gram matrix Σ(U) in equation (1.6) would be written in this shorthand as $K_{U,U}$. As the prior density of the marginal distribution for $f_D$, the term $p(f_D)$ is a normal density given by

$$p(f_D) = \mathcal{N}(f_D \,|\, 0, K_{D,D}) \,. \qquad (1.22)$$

The prior conditional density $p(f_*|f_D)$ is also normal:

$$p(f_*|f_D) = \mathcal{N}(f_* \,|\, K_{*,D} K_{D,D}^{-1} f_D,\; K_{*,*} - K_{*,D} K_{D,D}^{-1} K_{D,*}) \,. \qquad (1.23)$$

In the special case where the likelihood $p(Y|f_D)$ is a Gaussian density $\mathcal{N}(Y|f_D, \epsilon^2 I)$ centred on $f_D$ and with isotropic covariance matrix $\epsilon^2 I$, the posterior distribution is again normal, which can be very convenient in applications. The density in this case is given by

$$p(f_D, f_*|Y) = \mathcal{N}(f_D, f_* \,|\, K_{D \cup *, D} \Delta^{-1} Y,\; K_{D \cup *, D \cup *} - K_{D \cup *, D} \Delta^{-1} K_{D, D \cup *}) \qquad (1.24)$$

where $\Delta = K_{D,D} + \epsilon^2 I$. If the likelihood is not Gaussian there is not usually a closed form expression for the posterior density. This is a consequence of the lack of a closed form for the marginal likelihood integral p(Y).
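A direct transcription of the Gaussian likelihood posterior is straightforward; the sketch below (ours, with synthetic data and the RBF kernel, neither of which come from the thesis) computes the predictive mean and covariance of $f_*$ implied by equation (1.24), using a Cholesky factorization of Δ.

```python
import numpy as np

# The Gaussian likelihood posterior of equation (1.24), restricted to the
# predictive function values f_* (our sketch with synthetic data).

def rbf_kernel(x1, x2, sigma=1.0, l=1.0):
    return sigma ** 2 * np.exp(-0.5 * (x1 - x2) ** 2 / l ** 2)

rng = np.random.default_rng(2)
eps = 0.3                                  # likelihood noise standard deviation
X_D = np.sort(rng.uniform(0.0, 6.0, 30))   # data inputs D
Y = np.sin(X_D) + eps * rng.standard_normal(30)
X_star = np.linspace(0.0, 6.0, 100)        # predictive inputs *

K_DD = rbf_kernel(X_D[:, None], X_D[None, :])
K_sD = rbf_kernel(X_star[:, None], X_D[None, :])
K_ss = rbf_kernel(X_star[:, None], X_star[None, :])

Delta = K_DD + eps ** 2 * np.eye(len(X_D))       # Delta = K_{D,D} + eps^2 I
L = np.linalg.cholesky(Delta)                    # the O(N^3) step of section 1.2.6
alpha = np.linalg.solve(L.T, np.linalg.solve(L, Y))
mean = K_sD @ alpha                              # K_{*,D} Delta^{-1} Y
V = np.linalg.solve(L, K_sD.T)
cov = K_ss - V.T @ V                             # K_{*,*} - K_{*,D} Delta^{-1} K_{D,*}
print(mean.shape, cov.shape)
```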

1.2.4 Hyperparameter selection

A given Gaussian process model has hyperparameters associated with it. For instance with an RBF kernel and a Gaussian likelihood the hyperparameters would be the noise variance $\epsilon^2$, the signal variance $\sigma^2$ and the kernel lengthscale l. It will often be difficult to choose fixed hyperparameters in practice and different values can give very different posteriors. Therefore a hyperparameter selection method is often necessary. The strict Bayesian approach here is to put a prior on the hyperparameters. This usually introduces additional intractability. Approximations which handle the fully Bayesian case are discussed in chapter 5. It is often the case in practice that a point estimate is chosen by maximizing p(Y) as a function of the hyperparameters. This can be viewed as a delta function approximation to the full Bayesian solution. This is because, for many standard cases, as the number of data points increases the posterior over hyperparameters concentrates in a tight ball around the maximum likelihood

1.2.5 Examples of Gaussian process regression

We now show examples of Gaussian process priors applied to a simple example dataset. A readily visualized example dataset was supplied by Snelson and Ghahramani (2005) in their paper and (with apologies to Ghahramani), we will adopt the commonly used name ‘Snelson dataset’. We use a Gaussian likelihood with shared noise variance ϵ2. We compare regression with an RBF and Wiener kernel. The hyperparameters are chosen using type-II maximum likelihood. The results are shown in figure 1.3. For the RBF kernel the posterior samples are very smooth. The predictions revert to the stationary prior away from the data. For the Wiener kernel the posterior samples are very rough. Away from the data the posterior predictions are more influenced by the nonstationary prior which has 0 variance at x = 0. As such, for many regression problems the Wiener kernel would be an unnatural choice of prior. Note in both cases the output space predictions are more uncertain than the function space predictions, because they account for the noise variance in the likelihood.

1.2.6 Two challenges of Gaussian process modelling

We highlight two issues that are often encountered when using Gaussian processes as pri- ors. The first is that, unless there is some special structure in the kernel matrices, densities like p(fD), p(f fD) and in the Gaussian likelihood case p(fD, f Y ) all need to use com- ∗| ∗| putational operations that require (N 3) flops. This is because we need the inverse and O determinant of the matrices KD,D or ∆. This cubic scaling means that such Gaussian process methods require approximation of the exact posterior to scale to large datasets. The second challenge is the lack of a closed form for posterior marginal densities like p(fD, f Y ) when ∗| the likelihood is not Gaussian. A major theme of this thesis is that both the first issue alone and the two issues simulta- neously, can be addressed within the framework of variational inference. We introduce the necessary variational inference concepts in the next chapter. 14 Stochastic processes as priors

RBF kernel Wiener kernel 4 4

2 2

0 0

2 2 − − 4 4 − 0 2 4 6 8 10 12 − 0 2 4 6 8 10 12 4 4

2 2

0 0

2 2 − − 4 4 − 0 2 4 6 8 10 12 − 0 2 4 6 8 10 12 4 4

2 2

0 0

2 2 − − 4 4 − 0 2 4 6 8 10 12 − 0 2 4 6 8 10 12

Fig. 1.3 Gaussian process regression posteriors for the Snelson dataset with a Gaussian likelihood and two different kernels. Kernel and likelihood hyperparameters have been chosen by type-II maximum likelihood (see text). Top row: 3 posterior function samples for the two different kernels. Middle row: posterior mean and 2-σ credible intervals for the posterior function. Third row: posterior mean and 2-σ credible intervals for the posterior function for the output space. Left column: posterior for the RBF kernel. The learnt hyperparameters are σ = 0.87, l = 0.61 and ϵ = 0.27. The optimal log marginal likelihood was −33.8. Right column: posterior for the Wiener kernel. The learnt hyperparameters are λ = 0.92, l = 1.15 and ϵ = 0.25. Note, however, that because of the rescaling symmetry described in section 1.1.3 there is a whole family of equivalent λ and l combinations. The optimal log marginal likelihood was −42.5.

Chapter 2

Variational inference

This introductory chapter emphasizes the variational inference ideas that are most relevant for the rest of this thesis. There are several good traditional introductions to variational inference available, for instance that of Jordan et al. (1999). Variational inference is a method for approximating a distribution $\hat{P}$ with a tractable distribution Q from a family of distributions $\mathcal{Q}$, using KL-divergence as a measure of similarity. Since ultimately the distributions we wish to work with are infinite dimensional, we will require a sufficiently general definition of KL-divergence. This will be explained in the next section. After that, we will review various methods for choosing approximating families $\mathcal{Q}$. This section will mostly only require the more familiar definition of KL-divergence given in equation (2.8). Finally we will give some simple empirical examples of the various types of variational inference.

2.1 A general definition of KL-divergence

In this section we review the rigorous definition of the KL-divergence between stochastic processes. We adapt the exposition of Gray (2011), who in turn credits this formulation of KL-divergence to “Kolmogorov and his colleagues Gelfand, Yaglom, Dobrushin and Pinsker” (Kolmogorov et al., 1992). This intuitive general definition of KL-divergence is, we argue, under utilized in literature where it is directly relevant. We therefore give a relatively brief, self-contained exposition of the points most relevant to variational inference. The common definition of the KL-divergence for the simpler case of two C element discrete probability distributions $u = (u_i)_{i=1}^C$ and $v = (v_i)_{i=1}^C$ is

$$\mathrm{KL}[u \,\|\, v] = \sum_{i=1}^{C} u_i \log \frac{u_i}{v_i} \,. \qquad (2.1)$$

We take $x \log x$, for x = 0, to be the corresponding limit, which is 0. The KL-divergence may be infinite. This will occur if there is some j such that $v_j = 0$ and $u_j \neq 0$. KL-divergence is 0 if and only if u = v. In all other cases $\mathrm{KL}[u \,\|\, v] > 0$. KL-divergence is convex in both arguments. It is not symmetric in its arguments and does not fulfil the formal mathematical requirements of a metric. Now consider the general case where we have two measures µ and η for a regular measurable space (Ω, Σ). We are interested in dividing the space Ω into a finite measurable partition $B = \{B_i\}_{i=1}^{|B|}$. See figure 2.1 (left) for a schematic. Let $\mathcal{B}$ be the set of all such finite measurable partitions. To each such partition B there is a natural corresponding discrete distribution u from the measure µ. We evaluate the measure on the partition sets

$$u_i = \mu(B_i) \quad i = 1, \ldots, |B| \qquad (2.2)$$

and similarly we can obtain a discrete distribution v using η. We can then define a discrete KL-divergence for the partition by

$$\mathrm{KL}_B[\mu \,\|\, \eta] = \sum_{i=1}^{|B|} \mu(B_i) \log \frac{\mu(B_i)}{\eta(B_i)} \,. \qquad (2.3)$$

For a given partition B, a refinement of B is any partition $\hat{B}$ that can be generated by further subdividing the elements of B. See figure 2.1 (right) for a schematic. In such a circumstance the discrete KL-divergence has the property

$$\mathrm{KL}_{\hat{B}}[\mu \,\|\, \eta] \geq \mathrm{KL}_B[\mu \,\|\, \eta] \,. \qquad (2.4)$$

We define the general KL-divergence as the supremum of the corresponding discrete KL-divergence over the set of all finite measurable partitions

$$\mathrm{KL}[\mu \,\|\, \eta] := \sup_{B \in \mathcal{B}} \left\{ \mathrm{KL}_B[\mu \,\|\, \eta] \right\} . \qquad (2.5)$$

Intuitively, the general KL-divergence is defined in terms of a limiting operation on the measure space where we take the simple KL-divergence between more and more finely grained discretizations. A measure $\mu$ is absolutely continuous with respect to the measure $\eta$ if

$$\eta(E) = 0 \implies \mu(E) = 0 \quad \forall E \in \Sigma . \qquad (2.6)$$

We say that $\eta$ dominates $\mu$ and denote this $\mu \ll \eta$. It is a theorem (Gray, 2011) that


Fig. 2.1 Left: Dividing the sample space Ω into a finite measurable partition B. Right: Another partition B̂ of Ω that is a refinement of the partition on the left.

$$\mathrm{KL}[\mu \,\|\, \eta] = \begin{cases} \mathbb{E}_{\mu}\left[ \log \frac{d\mu}{d\eta} \right], & \text{if } \mu \ll \eta \\ \infty, & \text{otherwise.} \end{cases} \qquad (2.7)$$

The quantity $\frac{d\mu}{d\eta}$ is the Radon-Nikodym derivative, which is guaranteed to exist in the case $\mu \ll \eta$ by the Radon-Nikodym theorem (Billingsley, 1995). The second, undominated, case in equation (2.7) can be straightforwardly verified, since the fact that $\eta$ does not dominate $\mu$ implies that there exists some $C \in \Sigma$ such that $\eta(C) = 0$ and $\mu(C) \neq 0$. Taking the partition $B = \{C, \Omega \setminus C\}$ the result follows. The reader is referred to Gray (2011) for the rigorous proof of the relation in the dominated case.

The general partition definition of KL-divergence inherits a number of the properties from the discrete definition. Specifically $\mathrm{KL}[\mu \,\|\, \eta] \geq 0$ with equality if and only if $\mu = \eta$; the KL-divergence is jointly convex in the two measures; and it is still neither symmetric nor a metric. It is instructive to consider two special cases. In the case where the measures $\mu$ and $\eta$ correspond to discrete distributions the original definition is recovered. In the case where the sample space is $\mathbb{R}^D$ for some finite $D$ and both measures are dominated by Lebesgue measure $m$ this reduces to the more familiar definition:

$$\mathrm{KL}[\mu \,\|\, \eta] = \int_{x \in \Omega} u(x) \log \frac{u(x)}{v(x)} \, dm(x) \qquad (2.8)$$

where $u(x)$ and $v(x)$ are the respective densities with respect to Lebesgue measure. The partition definition is more general. As we discuss in section 3.3.1, there is no sensible Lebesgue measure for infinite dimensional spaces, meaning that the definition in equation (2.8) does not apply. The full definition in equation (2.7), by contrast, still applies.
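As a quick numerical illustration of the partition definition, the following sketch (assuming NumPy and SciPy; the two Gaussian measures and the grid are our own illustrative choices) computes the discrete KL-divergence (2.3) under successively refined partitions. By the refinement property (2.4) the values increase monotonically, approaching the closed form Gaussian KL-divergence that the supremum in (2.5) attains.

```python
import numpy as np
from scipy.stats import norm

mu, eta = norm(0.0, 1.0), norm(1.0, 2.0)  # two Gaussian measures on the real line

def discrete_kl(k, lo=-10.0, hi=10.0):
    """The discrete KL-divergence (2.3) for a partition with 2**k + 2 cells.

    Doubling the interior grid with k means each partition refines the last,
    and the two unbounded tail cells make B a genuine partition of R.
    """
    edges = np.concatenate([[-np.inf], np.linspace(lo, hi, 2**k + 1), [np.inf]])
    u = np.diff(mu.cdf(edges))   # u_i = mu(B_i)
    v = np.diff(eta.cdf(edges))  # v_i = eta(B_i)
    keep = u > 0                 # the x log x = 0 convention at x = 0
    return np.sum(u[keep] * np.log(u[keep] / v[keep]))

# Closed form KL[N(0,1) || N(1,4)] that the supremum (2.5) recovers.
kl_exact = np.log(2.0) + (1.0 + 1.0) / (2.0 * 4.0) - 0.5

for k in [1, 3, 5, 7, 9]:
    print(f"{2**k + 2:5d} cells: {discrete_kl(k):.6f}")
print(f"closed form: {kl_exact:.6f}")
```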

2.2 Defining tractable approximating families $\mathcal{Q}$

The basis of variational inference is to choose a distribution Q∗ to approximate an intractable distribution Pˆ using the following criterion:

$$Q^* = \underset{Q \in \mathcal{Q}}{\arg\min}\; \mathrm{KL}[Q \,\|\, \hat{P}] \qquad (2.9)$$

We will not always make a strict notational distinction between $Q^*$ and $Q$; the meaning will be apparent from the context. We should also be aware that the term ‘intractable’ can take on at least three meanings in this context which are not mutually exclusive. Firstly, it may mean that we do not have an analytic expression for $\hat{P}$; often in the context of Bayesian inference this is because we do not have a closed form expression for the marginal likelihood. Another possible meaning is that it is intractable to evaluate $\hat{P}$ and related expectations, in that this is provably NP-hard in the worst case. Undirected graphical models can fall into this category. Finally, it may be that although a closed form expression is available for $\hat{P}$ and relevant expectations, and it is possible to evaluate these expressions in polynomial time, the exponent of the polynomial may be too high. For instance, without simplifying assumptions, exact Gaussian process regression can require $\mathcal{O}(N^3)$ computation time. If $N$ is large then this scaling can be prohibitive.

The quality of the variational approximation will depend on there being some good approximations in the variational family $\mathcal{Q}$. If we are not careful, though, enlarging the family of distributions can mean that variational inference will no longer correspond to a tractable algorithm. How, then, does one choose a useful approximating family? In the now extensive literature on variational inference we wish to highlight and explain some key concepts that have emerged. These concepts are not necessarily mutually exclusive and in some cases are used together. For the most part we will use the simpler special case KL definition (2.8) to make this illustration. For the rest of this chapter we adopt the following minimal notation. The latent variables $f \in \mathbb{R}^D$ will have a prior density $p(f)$ and the data $Y$ will have a likelihood $p(Y|f)$. The joint density $p(Y, f)$ will be assumed tractable to evaluate, whereas the exact posterior density $p(f|Y)$ and marginal likelihood $p(Y)$ will be assumed intractable to evaluate.

2.2.1 Factorization assumptions

The early use of variational methods for Bayesian inference emerged from their use in physics. The method for obtaining a simple approximating distribution was primarily to introduce factorization assumptions. The distribution over the $D$-dimensional space is factorized into distributions over smaller groupings of the dimensions $c \in \mathcal{C}$:

$$q(f) = \prod_{c \in \mathcal{C}} q(f_c) \qquad (2.10)$$

This early type of variational method is still the picture that many researchers have of it today. Under this approximating distribution, any covariance between two variables in different factorization blocks will be zero, which may not be true of the exact posterior. Factorized methods often underestimate the variance of marginal distributions, as demonstrated in section 2.4. Despite the strong assumptions of factorized approximations, there are examples in the literature where, due to the difficulty of the problem, such methods perform better than commonly used alternatives. For example, it has been shown that the variational Bayesian approach (MacKay, 1997) can empirically (Beal, 2003) give better answers than those obtained using maximum likelihood point estimates of the model parameters.

2.2.2 Working with free form distributions

Without additional simplifying assumptions beyond those of equation (2.10), the optimal solution in terms of KL-divergence may be analysed. If we consider the approximating factor $q(f_a)$ with all other distributions fixed, the KL-divergence can be written as

$$\mathrm{KL}\left[ \prod_{c \in \mathcal{C}} q(f_c) \,\Big\|\, p(f|Y) \right] = \mathrm{KL}\left[ q(f_a) \,\Big\|\, \frac{\exp\left\{ \mathbb{E}_{q(f_{\mathcal{C} \setminus a})}[\log p(Y, f)] \right\}}{\int \exp\left\{ \mathbb{E}_{q(f_{\mathcal{C} \setminus a})}[\log p(Y, f)] \right\} df_a} \right] + k \qquad (2.11)$$

where $k$ is a constant that does not depend on $q(f_a)$, and $q(f_{\mathcal{C} \setminus a})$ denotes the current approximating distribution for all variables that are not $f_a$. The denominator $\int \exp\{ \mathbb{E}_{q(f_{\mathcal{C} \setminus a})}[\log p(Y, f)] \} \, df_a$ ensures that the KL-divergence on the right hand side contains only properly normalized distributions. Since KL-divergences are minimized when both sides are equal, the optimal solution for $q(f_a)$ is

$$q^*(f_a) = \frac{\exp\left\{ \mathbb{E}_{q(f_{\mathcal{C} \setminus a})}[\log p(Y, f)] \right\}}{\int \exp\left\{ \mathbb{E}_{q(f_{\mathcal{C} \setminus a})}[\log p(Y, f)] \right\} df_a} \qquad (2.12)$$

In some cases (Ghahramani and Beal, 2000) this results in a tractable approximation. The fact that we did not set out to parameterize the variational distribution in terms of some parametric family means that these approximations are sometimes referred to as ‘free form’.

Equation (2.12) can be derived using the calculus of variations with Lagrange multipliers to enforce the normalization constraints. This is why these approximations are called ‘variational’. However, the derivation we have used shows this is unnecessary in this case. Often in contemporary ‘variational inference’ no calculus of variations is used whatsoever, but the original name has continued to be used. Generally these approximations are optimized with a fixed point iteration where we alternately update the component variational distributions. This procedure is guaranteed to improve the KL-divergence but can be slow because it corresponds to coordinate ascent. Note, however, it is often possible to parameterize the distributions implied by the updates, resubstitute into the KL-divergence formula and use direct optimization. Since this allows joint updates it can be faster.
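To make the fixed point iteration concrete, here is a minimal sketch (assuming only NumPy, with an illustrative bivariate Gaussian target rather than anything from this thesis) in which the free form update (2.12) is available in closed form. It also previews the variance underestimation of factorized methods demonstrated in section 2.4.

```python
import numpy as np

m = np.array([0.0, 1.0])                   # target mean
S = np.array([[1.0, 0.85], [0.85, 1.0]])   # strongly correlated target covariance
L = np.linalg.inv(S)                       # precision matrix

mu = np.zeros(2)                           # current means of the two factors
for sweep in range(50):                    # coordinate ascent sweeps
    for i in range(2):
        j = 1 - i
        # E_{q(f_j)}[log p(f)] is quadratic in f_i; completing the square gives
        # a Gaussian factor with precision L[i, i] and the mean below.
        mu[i] = m[i] - (L[i, j] / L[i, i]) * (mu[j] - m[j])

print("factor means:    ", mu)                 # converge to the true mean m
print("factor variances:", 1.0 / np.diag(L))   # 0.2775 here
print("true marginals:  ", np.diag(S))         # 1.0 - variance is underestimated
```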

2.2.3 Fixed form assumptions

By contrast to free form variational inference it is possible to use ‘fixed form’ variational inference. The idea here is that the functional form for the approximating distribution is not the consequence of some free form derivation as above. Instead we simply assume that $Q$ lies in some directly specified parametric family. Consider a latent Gaussian model for $f \in \mathbb{R}^D$ with $D$ finite

$$f \sim \mathcal{N}(0, K) \qquad (2.13)$$

where the likelihood is a product of terms $p(Y_i|f_i)$, one for each $i = 1, \ldots, D$. Of particular relevance to this thesis is the case where the approximating distribution is in the family of multivariate normal distributions. We obtain the following KL-divergence

$$\mathrm{KL}[\, q(f) \,\|\, p(f|Y) \,] = \mathrm{KL}[\, q(f) \,\|\, p(f) \,] - \sum_{i=1}^{D} \int q(f_i) \log p(Y_i|f_i) \, df_i + \log p(Y) \qquad (2.14)$$

where $p(f)$ corresponds to the prior. As a KL-divergence between two normal distributions, the KL term on the right hand side can be computed in $\mathcal{O}(D^3)$. Each term in the sum on the right hand side is a one dimensional quadrature under a marginal distribution of the variational approximation. There is the question of the parameterization of the covariance of the approximating distribution, which could in general require $\mathcal{O}(D^2)$ parameters. However, Opper and Archambeau (2009) showed that by exploiting specific features of this problem, one can use only $D$ covariance parameters without loss of approximation quality.

An obvious advantage of such a correlated, structured approach relative to the corresponding fully factored method is that it is possible to capture posterior covariance in the approximation.
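The following sketch shows one way the two tractable terms of equation (2.14) can be evaluated, assuming SciPy; the probit likelihood anticipates the classification example of section 2.4, and the function names and parameter values are our own illustrative choices. The Gaussian-to-Gaussian KL term is closed form, and each likelihood term is a one dimensional Gauss-Hermite quadrature under the marginal $q(f_i)$. Minimizing the returned value over $(\mu, \Sigma)$ with a standard optimizer yields the fixed form approximation.

```python
import numpy as np
from scipy.stats import norm
from numpy.polynomial.hermite_e import hermegauss  # probabilists' Hermite rules

def gauss_kl(mu, Sigma, K):
    """KL[N(mu, Sigma) || N(0, K)] in closed form, O(D^3)."""
    D = len(mu)
    Kinv = np.linalg.inv(K)
    return 0.5 * (np.trace(Kinv @ Sigma) + mu @ Kinv @ mu - D
                  + np.linalg.slogdet(K)[1] - np.linalg.slogdet(Sigma)[1])

def neg_elbo(mu, Sigma, K, Y, gamma=3.0, n_quad=20):
    """The objective (2.14), up to the constant log p(Y), for a probit likelihood."""
    x, w = hermegauss(n_quad)
    w = w / np.sqrt(2.0 * np.pi)              # weights for E_{N(0,1)}, summing to 1
    expected_loglik = 0.0
    for i in range(len(Y)):
        f = mu[i] + np.sqrt(Sigma[i, i]) * x  # quadrature nodes under q(f_i)
        expected_loglik += w @ norm.logcdf(gamma * Y[i] * f)
    return gauss_kl(mu, Sigma, K) - expected_loglik

K = np.array([[1.0, 0.85], [0.85, 1.0]])
print(neg_elbo(np.zeros(2), 0.5 * np.eye(2), K, Y=np.ones(2)))
```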

2.2.4 Prior conditional matching

The prior conditional matching approach to variational inference and its application to Gaussian processes will be the main topic of this thesis. It is most relevant to high dimensional problems. Instead of assuming that the approximating distribution factorizes as in section 2.2.1, we assume a different tractable structure. In particular, partition the latent vector $f$ into inducing outputs $f_Z$ and the remainder $f_{\setminus Z}$. For finite dimensional latents the assumption can then be expressed as

$$q(f) = p(f_{\setminus Z}|f_Z)\, q(f_Z) . \qquad (2.15)$$

where $p(f_{\setminus Z}|f_Z)$ is the corresponding prior conditional distribution. The variational distribution is specified by the choice of inducing outputs $f_Z$ and the distribution $q(f_Z)$. For the latent Gaussian model of section 2.2.3, a later section (3.2) will show (using a different notation) that

$$\mathrm{KL}[\, q(f) \,\|\, p(f|Y) \,] = \mathrm{KL}[\, q(f_Z) \,\|\, p(f_Z) \,] - \sum_{i=1}^{D} \int q(f_i) \log p(Y_i|f_i) \, df_i + \log p(Y) . \qquad (2.16)$$

A key step in this derivation is the cancellation of the prior conditionals $p(f_{\setminus Z}|f_Z)$ in the ratio of the approximating and posterior densities; hence the term prior conditional matching. The full infinite dimensional derivation, necessary for a general Gaussian process, is somewhat more technically involved and is given in section 3.3. The resulting expression for the KL-divergence is very similar.

In the Gaussian process case, the KL-divergence on the right hand side is tractable in $\mathcal{O}(M^3)$, where $M$ is the dimensionality of $f_Z$. Evaluating the sum on the right hand side requires $\mathcal{O}(NM^2)$ computation. This therefore can represent a substantial speed up over methods which require $\mathcal{O}(N^3)$. It is informative to compare equation (2.14) and equation (2.16). They are equivalent when $M = D$.

It turns out that the prior conditional matching scheme for a Gaussian process with an infinite index set is functionally equivalent to the work of Titsias (2009a), which was originally derived using an ‘augmentation argument’. We offer a critique of the augmentation argument in chapter 3. We would argue that as well as putting the basic method on a firmer logical footing, it is also much clearer how to correctly generalize it, and we give examples of this in section 3.6. Although they have different starting points, the two different viewpoints both agree that we should use the right hand side of equation (2.16). Titsias showed that for a Gaussian process prior with a Gaussian likelihood we need make no further assumption on the nature of $q(f_Z)$. In the language of this chapter, an analytic free form solution $q^*(f_Z)$ is available and it is a Gaussian distribution. For the non-conjugate case there is no such property and instead in chapter 4 a fixed form Gaussian distribution is assumed. Although in the non-conjugate case the free form solution is not available in closed form, we can characterize it up to a constant. This means we can draw samples from the optimal free form distribution using MCMC. This approach, with an extension to fully Bayesian models, is investigated in chapter 5.
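To make the computational claims concrete, here is a minimal sketch of evaluating the bound implied by equation (2.16) for a Gaussian likelihood, that is, $\log p(Y)$ minus the KL-divergence. It assumes only NumPy, and the RBF kernel, the synthetic data and all names are our own illustrative choices. No operation exceeds $\mathcal{O}(M^3)$ or $\mathcal{O}(NM^2)$.

```python
import numpy as np

def rbf(A, B, variance=1.0, lengthscale=1.0):
    """Squared exponential kernel matrix between one dimensional inputs."""
    return variance * np.exp(-0.5 * (A[:, None] - B[None, :])**2 / lengthscale**2)

def sparse_elbo(X, Y, Z, m, S, noise=0.1):
    """log p(Y) minus the KL-divergence of (2.16), with q(f_Z) = N(m, S)."""
    N, M = len(X), len(Z)
    Kzz = rbf(Z, Z) + 1e-8 * np.eye(M)        # jitter for numerical stability
    Kxz = rbf(X, Z)
    A = Kxz @ np.linalg.inv(Kzz)              # O(M^3) + O(N M^2)
    mean = A @ m                              # marginal means of q(f_i)
    kxx_diag = np.full(N, 1.0)                # diag of Kxx: the signal variance
    var = (kxx_diag
           - np.einsum('nm,nm->n', A, Kxz)    # diag(Kxz Kzz^-1 Kzx)
           + np.einsum('nm,mk,nk->n', A, S, A))
    # E_{q(f_i)}[log N(Y_i | f_i, noise)] is closed form in the conjugate case.
    ell = np.sum(-0.5 * np.log(2 * np.pi * noise)
                 - 0.5 * ((Y - mean)**2 + var) / noise)
    # KL[q(f_Z) || p(f_Z)] between two M-dimensional Gaussians, O(M^3).
    Kinv = np.linalg.inv(Kzz)
    kl = 0.5 * (np.trace(Kinv @ S) + m @ Kinv @ m - M
                + np.linalg.slogdet(Kzz)[1] - np.linalg.slogdet(S)[1])
    return ell - kl

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 6.0, 500)
Y = np.sin(X) + 0.3 * rng.standard_normal(500)
Z = np.linspace(0.0, 6.0, 15)                 # M = 15 inducing inputs
print(sparse_elbo(X, Y, Z, m=np.zeros(15), S=0.1 * np.eye(15)))
```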

2.3 Treatment of hyperparameters

As discussed in section 1.2.3, fixed values of the model hyperparameters can be difficult to choose a priori and can have a significant effect on the learnt posterior. We need an approach to hyperparameter selection that does not require access to the marginal likelihood $p(Y)$, which is by assumption intractable. We consider a general pattern that arises for variational inference. The KL-divergence always takes the following form

$$\mathrm{KL}[Q \,\|\, \hat{P}] = -\mathcal{L} + \log p(Y) . \qquad (2.17)$$

Here $\mathcal{L}$ will need to be tractable for variational inference to be applicable. The marginal likelihood $p(Y)$ is not a function of $Q$, so we can minimize $-\mathcal{L}$ instead of the KL-divergence. This has the effect that we can only know the numerical value of the KL-divergence up to an unknown constant. Rearranging the above equation gives

$$\mathcal{L} = -\mathrm{KL}[Q \,\|\, \hat{P}] + \log p(Y) . \qquad (2.18)$$

which implies, since KL-divergences are non-negative, that $\mathcal{L}$ is a lower bound on the log marginal likelihood. Another name for the marginal likelihood is the model evidence, so the quantity $\mathcal{L}$ is sometimes called the evidence lower bound or ELBO. If we succeed in making the KL term small then the value of $\mathcal{L}$ should be representative of the value of $\log p(Y)$. One way, then, to choose hyperparameters is to treat $-\mathcal{L}$ as a surrogate for $-\log p(Y)$ and minimize it over the hyperparameters jointly with the variational parameters, to give an approximate empirical Bayes procedure.


Fig. 2.2 A comparison of three different classical variational approximation methods on a two dimensional toy problem, as described in section 2.4. Top row: contours of the two dimensional density. Bottom row: a one dimensional marginal. This problem has a symmetry which makes both one dimensional marginals identical. Columns from left to right: The exact posterior which corresponds to a Gaussian latent model with a classification likelihood (see text). The factorized mean field approximation to the posterior. The factorized Gaussian approximation to the posterior. The structured Gaussian approximation to the posterior.

The slack in the bound (2.18) can be different for different values of the hyperparameters. This can bias the hyperparameter estimates towards settings with less slack (Turner and Sahani, 2011). A useful intuition is that the bias will be towards hyperparameter settings where variational inference works relatively well in terms of achieving a lower KL-divergence. We shall see plenty of examples in this thesis where, despite this potential issue, the variational methods applied give useful answers. As discussed in section 1.2.3, instead of a point estimate we can also use a fully Bayesian model and then make a variational approximation that includes the hyperparameter posterior. Chapter 5 adopts such an approach.

2.4 An example of classical variational inference approaches

In this section we exhibit three of the classical approaches to variational inference that we have discussed, by showing the approximation to the posterior of a low dimensional toy problem. The problem in question has two latent variables which are given a Gaussian prior

$$\begin{bmatrix} f_1 \\ f_2 \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & \beta \\ \beta & 1 \end{bmatrix} \right) . \qquad (2.19)$$

The parameter β is taken to be 0.85. The likelihood is a product of two terms given by

$$p(Y_i|f_i) = \Phi(\gamma Y_i f_i) \qquad (2.20)$$

where $\Phi$ corresponds to the Gaussian cumulative density function and $Y_i \in \{-1, 1\}$. This likelihood is used as a binary classification likelihood. We set the parameter $\gamma$ to be 3.0 and

$Y_i$ to be 1 for all $i$. The resulting posterior is symmetric between $f_1$ and $f_2$. The optimal factorized mean field solution obtained from using equation (2.12) must have the form

$$q_i^*(f_i) = \frac{p(Y_i|f_i)\, \mathcal{N}(f_i \,|\, a_i, \sigma_i^2)}{Z_i} \qquad (2.21)$$

for some setting of the variables $a_1$ and $a_2$. The term $\sigma_i^2$ denotes the prior variance of $f_i$ conditioned on all the other latent Gaussian variables $f_{\setminus i}$. It is a property of multivariate Gaussian distributions that $\sigma_i^2$ is a function of the prior covariance only and is independent of the value of $f_{\setminus i}$. $Z_i$ is a normalizing constant. The optimal values of the parameters $a_i$ may be recovered using optimization. Since the likelihood in our example is non-Gaussian, the free form mean field marginals will also be non-Gaussian. The marginals of this approximation can be viewed as ‘adapted’ to the likelihood $p(Y_i|f_i)$, since it appears as a factor in the optimal functional form of equation (2.21). We implement this approximation using a similar approach to that of Nickisch and Rasmussen (2008). We also test the fixed form correlated Gaussian method of Opper and Archambeau (2009), which is described in section 2.2.3. Finally we also investigate the factorized Gaussian approximation to this posterior. This will do strictly worse in terms of $\mathrm{KL}[Q \,\|\, \hat{P}]$ than either free form mean field or a fixed form correlated Gaussian, because the variational family is a strict subset of either of these families. Nevertheless this factorized Gaussian approach is one of the most commonly used in practice due to the ease with which it may be implemented.

The results are shown in figure 2.2. There is correlation between latent variables in the posterior which both factorized approximations are unable to capture. Further, the factorized methods underestimate the marginal variance. The correlated Gaussian method does a better job of capturing the correlation in the joint distribution and the variance of the marginal distribution. However, since the approximating marginal is Gaussian it cannot perfectly recover the shape of the true marginal.
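For readers wishing to reproduce something like the structured Gaussian column of figure 2.2, the sketch below (assuming SciPy; the Cholesky parameterization and the derivative-free optimizer are our own illustrative choices) minimizes the fixed form objective of section 2.2.3 for the toy posterior defined by equations (2.19) and (2.20).

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize
from numpy.polynomial.hermite_e import hermegauss

K = np.array([[1.0, 0.85], [0.85, 1.0]])      # prior covariance (2.19), beta = 0.85
gamma, Y = 3.0, np.ones(2)                    # likelihood (2.20) with Y_i = 1
xq, wq = hermegauss(30)
wq = wq / np.sqrt(2.0 * np.pi)                # Gauss-Hermite weights for N(0, 1)

def neg_elbo(theta):
    mu = theta[:2]
    Lq = np.array([[theta[2], 0.0], [theta[3], theta[4]]])  # Cholesky factor
    Sigma = Lq @ Lq.T
    Kinv = np.linalg.inv(K)
    kl = 0.5 * (np.trace(Kinv @ Sigma) + mu @ Kinv @ mu - 2
                + np.linalg.slogdet(K)[1] - np.linalg.slogdet(Sigma)[1])
    ell = sum(wq @ norm.logcdf(gamma * Y[i] * (mu[i] + np.sqrt(Sigma[i, i]) * xq))
              for i in range(2))
    return kl - ell

res = minimize(neg_elbo, x0=np.array([0.0, 0.0, 1.0, 0.0, 1.0]),
               method="Nelder-Mead")
print("optimal mean:", res.x[:2])             # symmetric, like the posterior
```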

2.5 An example of prior conditional matching

In this section we show a simple example of variational inference for Gaussian process regression with a Gaussian likelihood. We use the Snelson dataset with 100 training points and 100 hold out testing points. First we show variational inference with fixed hyperparameters set to the exact type II maximum likelihood values. The predictions made by the optimal variational solution are shown in figure 2.3 for a variety of inducing point numbers. The exact posterior is also shown. We see that for very small numbers of inducing points there is a clear difference between the approximate posterior and the exact posterior, but that this decreases gradually until the difference is virtually imperceptible at $M = 16$. Since $N = 100$ in this case, this represents a modest computational saving. We will encounter examples of much larger savings during this thesis. Note that the optimized inducing point positions are spread across the range of the data and give little coverage to the region outside. This makes sense as this is where the posterior differs most from the prior. In general the variational inducing point placement is more subtle than simply choosing the areas of highest input point density. For instance, as we shall see in sections 4.5.2 and 4.5.5, in classification problems the variational inducing point placement is often near the decision boundary.

Next, again with fixed hyperparameters, we quantify the similarity between the posterior and the approximate posterior. As metrics of similarity we use $\mathrm{KL}[Q \,\|\, \hat{P}]$, which is the approximating objective, and $\mathrm{KL}[\hat{P} \,\|\, Q]$, which has rather different properties. Recall that in both cases the KL-divergence is 0 if and only if the two input measures are equal. Since computing $\mathrm{KL}[\hat{P} \,\|\, Q]$ involves integration under the exact posterior, it requires $\mathcal{O}(N^3)$ computation and is thus only possible for small examples like this one. Similarly, although we can efficiently compute $\mathrm{KL}[Q \,\|\, \hat{P}]$ up to an additive constant $\log p(Y)$, computation of the absolute value requires $\mathcal{O}(N^3)$. In both of these cases the KL-divergence used is the full KL-divergence between general measures introduced in section 2.1 and discussed in detail for variational inference with Gaussian processes in chapter 3. The results are shown in figure 2.4 (top and middle). We see that $\mathrm{KL}[Q \,\|\, \hat{P}]$ monotonically decreases as we increase $M$. In fact this is provably the case, as we show in section 3.3.4. By the time $M = 12$ the approximation has negligible distortion at the scale of the plot according to this metric. $\mathrm{KL}[\hat{P} \,\|\, Q]$ would generally be considered a more challenging metric for this type of variational inference. Indeed we see that, initially, adding additional inducing points causes an increase in $\mathrm{KL}[\hat{P} \,\|\, Q]$. However, beyond $M = 4$ this KL-divergence also decreases and it is negligible at the scale of the plot by $M = 14$.

Finally, we show that including hyperparameter learning using the approximate empirical Bayes method described in section 2.3 still gives a good approximation for this example. Since we don’t have a posterior over hyperparameters in this case, there isn’t a sensible KL-divergence measure to use. Instead, we compare the quality of predictions with the exact model, in terms of predictive likelihood on hold out data. We do this for a range of inducing point numbers. As was the case for $\mathrm{KL}[\hat{P} \,\|\, Q]$, the hold out negative log likelihood initially increases. As we increase the number of inducing points beyond $M = 4$ the quality improves again, and once $M = 14$ the approximation gives indistinguishable prediction quality from the exact model.

This example investigated the variational approximation of a posterior Gaussian process with another Gaussian process that was chosen to be less computationally demanding to work with. The approximation of the non-Gaussian posterior processes that arise from non-Gaussian likelihoods is one of the main topics of this thesis. It is studied in chapter 4 using a Gaussian process as the approximating distribution. Complexity is added to the model by using a Bayesian hyperprior in chapter 5, where a more flexible approximating distribution is also used.
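The quantity in the top panel of figure 2.4 can be sketched compactly because, for a Gaussian likelihood, the slack in the collapsed bound of Titsias (2009a) is exactly $\mathrm{KL}[Q \,\|\, \hat{P}]$ for the optimal $q(f_Z)$, so the divergence is the difference between the exact log marginal likelihood and the bound. The code below assumes SciPy and substitutes synthetic data for the Snelson dataset, with unoptimized, evenly spaced inducing inputs; the exact term is the one that costs the full $\mathcal{O}(N^3)$.

```python
import numpy as np
from scipy.stats import multivariate_normal

def rbf(A, B, variance=1.0, lengthscale=0.6):
    return variance * np.exp(-0.5 * (A[:, None] - B[None, :])**2 / lengthscale**2)

def kl_q_to_posterior(X, Y, Z, noise=0.1):
    N = len(X)
    Knn = rbf(X, X)
    Kzz = rbf(Z, Z) + 1e-8 * np.eye(len(Z))
    Kxz = rbf(X, Z)
    Qnn = Kxz @ np.linalg.solve(Kzz, Kxz.T)            # Nystrom matrix
    log_py = multivariate_normal(np.zeros(N), Knn + noise * np.eye(N)).logpdf(Y)
    collapsed_bound = (multivariate_normal(np.zeros(N),
                                           Qnn + noise * np.eye(N)).logpdf(Y)
                       - 0.5 * np.trace(Knn - Qnn) / noise)
    return log_py - collapsed_bound                    # = KL[Q || P_hat] >= 0

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 6.0, 100)
Y = np.sin(X) + 0.3 * rng.standard_normal(100)
for M in [2, 4, 8, 16]:
    Z = np.linspace(0.0, 6.0, M)                       # fixed, unoptimized inputs
    print(M, kl_q_to_posterior(X, Y, Z))               # shrinks as M grows
```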

[Figure 2.3: panels for M = 2, M = 4, M = 8, M = 16, and the exact posterior.]

Fig. 2.3 An example of a sparse variational approximation to a Gaussian process regression posterior. The different panels show the optimal solution with a varying number of inducing points and hyperparameters fixed to the optimal type II maximum likelihood value. The data points are shown in red. The inducing input positions are shown as black markers (the output position of these markers is chosen for visualisation and has no other significance). The green lines show the mean and 2-σ predictive credible intervals given by the approximation. The bottom plot shows the exact posterior for comparison.


Fig. 2.4 Quantitative metrics of approximation quality for the problem shown in figure 2.3. The top and middle figures show both possible KL-divergences between the approximating process and the posterior process, both with hyperparameters fixed to the optimal type II maximum likelihood value. The bottom panel shows the hold out log likelihood for inducing point approximations obtained when hyperparameters are also optimized. The hold out log likelihood for predictions from the exact posterior is shown as a dashed red line.

Chapter 3

On Sparse Variational Methods and the Kullback-Leibler Divergence between Stochastic Processes

3.1 Introduction

The variational approach to inducing point selection of Titsias (2009a) has been highly influential in the active research area of scalable Gaussian process approximations. The chief advantage of this particular framework is that the inducing point positions are variational parameters rather than model parameters and as such are protected from overfitting. In this chapter we argue that whilst this is true, it may not be for exactly the reasons previously thought. The original framework is applied to conjugate likelihoods and has been extended to non-conjugate likelihoods (Chai, 2012; chapter 4). An important advance in the use of variational methods was their combination with stochastic gradient descent (Hoffman et al., 2013), and the variational inducing point framework has been combined with such methods in the conjugate (Hensman et al., 2013) and non-conjugate cases (chapter 4). The approach has also been successfully used to perform scalable inference in more complex models such as the Gaussian process latent variable model (Damianou et al., 2015; Titsias and Lawrence, 2010) and the related Deep Gaussian process (Damianou and Lawrence, 2013; Hensman and Lawrence, 2014). We will adopt the notation used in section 1.1.1 and section 1.2.2. For simplicity, we will initially assume that we have one, possibly noisy, possibly non-conjugate observation $y \in Y$ per input data point $d \in D$.

Gaussian processes allow us to define a prior over functions $f$. After we observe the data we will have some posterior which we wish to approximate with a sparse distribution. At the heart of the variational inducing point approximation is the idea of ‘augmentation’ that appears in the original paper and many subsequent ones. We choose to monitor a set $Z \subseteq X$ of size $M$. These points may have some overlap with the input data points $D$, but to give a computational speed up $M$ will need to be smaller than the number of data points $N$. The Kullback-Leibler divergence given as an optimization criterion in Titsias’ original paper is

$$\mathrm{KL}[q(f_{D \setminus Z}, f_Z) \,\|\, p(f_{D \setminus Z}, f_Z | Y)] = \int q(f_{D \setminus Z}, f_Z) \log \left( \frac{q(f_{D \setminus Z}, f_Z)}{p(f_{D \setminus Z}, f_Z | Y)} \right) df_{D \setminus Z} \, df_Z . \qquad (3.1)$$

The variational distribution at those data points which are not also inducing points is taken to have the form:

$$q(f_{D \setminus Z}, f_Z) := p(f_{D \setminus Z} | f_Z)\, q(f_Z) \qquad (3.2)$$

where $p(f_{D \setminus Z} | f_Z)$ is the prior conditional and $q(f_Z)$ is a variational distribution on the inducing points only. Under this factorization, for a conjugate likelihood, the optimal $q(f_Z)$ has an analytic Gaussian solution (Titsias, 2009a). The non-conjugate case was then studied in subsequent work (Chai, 2012; chapter 4). In both cases the sparse approximation requires only $\mathcal{O}(NM^2)$ computation, rather than the $\mathcal{O}(N^3)$ required by exact methods in the conjugate case, or by many commonly used non-conjugate approximations that do not assume sparsity.

The augmentation argument, which is discussed in detail in section 3.4.1, is justified by arguing that the model remains marginally the same when the inducing points are added. It is therefore suggested that variational inference in the augmented model, including for the parameters of said augmentation, is equivalent to variational inference in the original model, i.e. that the inducing point positions can be considered to be variational parameters and are consequently protected from overfitting. For example see Titsias’ original conference paper (Titsias, 2009a), section 3, or the longer technical report version (Titsias, 2009b), section 3.1. In many cases in the literature, sparse variational methods are derived by the application of Jensen’s inequality to the marginal likelihood. In some cases, such as Hensman et al. (2015b), equations (6) and (17), the slack in the bound on the marginal likelihood is precisely the KL-divergence (3.1). Therefore, for fixed covariance hyperparameters, maximizing such a bound is exactly equivalent to minimizing this objective and the considerations that follow all apply.

In fact, whilst we applaud the excellent prior work, we will show that variational inference in an augmented model is not equivalent to variational inference in the original model. Without this justification, the KL-divergence in equation (3.1) could seem to be a strange optimization target. The KL-divergence has the inducing variables on both sides, so it might seem that in optimizing the inducing point positions we are trying to hit a ‘moving target’. It is desirable to rigorously formulate a ‘one sided’ KL-divergence that leads to Titsias’ formulation. Such a derivation could be viewed as putting these important and popular methods on a firmer foundation and is the topic of this chapter. As we shall show, this cements the framework for sparse interdomain inducing approximations and sparse variational inference in Cox processes. We wish to re-emphasize our respect for the previous work and, for the avoidance of suspense, we will find that much of the existing work carries over mutatis mutandis. Nevertheless, we feel that most readers at the end of the chapter will agree that a precise treatment of the topic should be of benefit going forward.

In terms of prior work for the theoretical aspect, the major other references are the early works of Seeger (2003a; 2003b). In particular Seeger identifies the KL-divergence between processes (more commonly referred to as a relative entropy in those texts) as a measure of similarity and applies it to PAC-Bayes and to subset of data sparse methods. Crucially, Seeger outlines the rigorous formulation of such a KL-divergence, which is a large technical obstacle. Here we give a shorter, more general, and intuitive proof of the key theorem. We extend the stochastic process formulation to inducing points which are not necessarily selected from the data and show that this is equivalent to Titsias’ formulation. In so far as we are aware, this relationship has not previously been noted in the literature. The idea of using the KL-divergence between processes is also mentioned in the early work of Csató and Opper (Csató, 2002; Csató and Opper, 2002), but the transition from finite dimensional multivariate Gaussians to infinite dimensional Gaussian processes is not covered at the level of detail discussed here. An optimization target that in intent seems to be similar to a KL-divergence between stochastic processes is briefly mentioned in the work of Alvarez (2011). The notation used suggests that the integration is with respect to an ‘infinite dimensional Lebesgue measure’, which as we shall see is an argument that arrives at the right answer via a mathematically flawed route. Whilst it is rigorous, the work of Pinski et al. (2015a; 2015b) is tailored for the specific requirements of Bayesian inverse problems and has a distinct, non-sparse, variational family relying on the eigenfunctions of the inverse covariance operator. Chai (2012) seems to have been at least partly aware of Seeger’s KL-divergence theorems (Seeger, 2003b) but instead uses them to bound the finite joint predictive probability of a non-sparse process.

This chapter proceeds by first discussing the finite dimensional version of the full argument. This requires considerably less mathematical machinery and much of the intuition can be gained from this case. We then proceed to give the full measure theoretic formulation, giving a new proof that allows inducing points that are not data points and allows the likelihood to depend on infinitely many function values. These theorems enable us to show that a variety of sparse variational approximations are guaranteed to improve in terms of KL-divergence as additional inducing points are added. Next we discuss augmentation of the original index set, using the crucial chain rule for KL-divergences. This gives us a framework to discuss marginal consistency and how variational inference in augmented models is not necessarily equivalent to variational inference in the original model. Titsias’ original argument, which will be considered in detail, is a special case of this augmentation argument. We then describe approximation in terms of variational inducing random variables, which allows a larger family of approximating distributions whilst maintaining the posterior process as its optimization objective. We apply our results to sparse variational interdomain approximations and to posterior inference in Cox processes. Finally we conclude and highlight avenues for further research.

3.2 Finite index set case

This section is in fact a less general case of what follows. It is included for the benefit of those familiar with the previous work on variational sparse approximations and as an important special case. Consider the case where $X$ is finite.¹ We introduce a new set $* := X \setminus (D \cup Z)$; in words, all points that are in the index set that are not inducing points or data points. These points might be of practical interest, for instance when making predictions on hold out data. We extend the variational distribution to include these points:

$$q(f_*, f_{D \setminus Z}, f_Z) := p(f_*, f_{D \setminus Z} | f_Z)\, q(f_Z) . \qquad (3.3)$$

We can then consider the KL-divergence between this extended variational distribution and the full posterior distribution $p(f|Y)$.

¹We will assume in this section that all relevant distributions have densities with respect to Lebesgue measure and well behaved conditional distributions.

$$\mathrm{KL}[q(f) \,\|\, p(f|Y)] = \mathrm{KL}[q(f_*, f_{D \setminus Z}, f_Z) \,\|\, p(f_*, f_{D \setminus Z}, f_Z | Y)] = \int q(f_*, f_{D \setminus Z}, f_Z) \log \left( \frac{q(f_*, f_{D \setminus Z}, f_Z)}{p(f_*, f_{D \setminus Z}, f_Z | Y)} \right) df_* \, df_{D \setminus Z} \, df_Z . \qquad (3.4)$$

With these definitions we are now ready to state our first theorem.

Theorem 1. For finite index sets $X$, define the variational approximation as in equation (3.3) and the KL-divergence as in equation (3.4). Then the following two properties hold:

(i) The KL-divergence across the whole index set $X$ may be decomposed in the following way

$$\mathrm{KL}[q(f) \,\|\, p(f|Y)] = \mathrm{KL}[q(f_Z) \,\|\, p(f_Z)] - \mathbb{E}_{Q_D}[\log p(Y|f_D)] + \log p(Y)$$

where $Q_D$ is the marginal distribution of the variational approximation at the data inputs $D$.

(ii) The KL-divergence across the whole index set $X$ is equal to Titsias’ KL-divergence described in equation (3.1):

$$\mathrm{KL}[q(f) \,\|\, p(f|Y)] = \mathrm{KL}[q(f_{D \setminus Z}, f_Z) \,\|\, p(f_{D \setminus Z}, f_Z | Y)] .$$

Proof. To prove part (i) of the theorem we start from equation (3.4). Substituting the definition for the approximating density given in equation (3.3) and the definition of the posterior using Bayes’ theorem and the chain rule we obtain

$$\mathrm{KL}[q(f) \,\|\, p(f|Y)] = \int p(f_*, f_{D \setminus Z} | f_Z)\, q(f_Z) \log \left( \frac{p(f_* | f_{D \setminus Z}, f_Z)\, p(f_{D \setminus Z} | f_Z)\, q(f_Z)\, p(Y)}{p(f_* | f_{D \setminus Z}, f_Z)\, p(f_{D \setminus Z} | f_Z)\, p(f_Z)\, p(Y | f_D)} \right) df_* \, df_{D \setminus Z} \, df_Z$$
$$= \int p(f_*, f_{D \setminus Z} | f_Z)\, q(f_Z) \log \left( \frac{q(f_Z)\, p(Y)}{p(f_Z)\, p(Y | f_D)} \right) df_* \, df_{D \setminus Z} \, df_Z . \qquad (3.5)$$

Note how the choice of approximating distribution has enabled us to cancel some terms inside the logarithm. Specifically the prior conditional densities $p(f_* | f_{D \setminus Z}, f_Z)$ and $p(f_{D \setminus Z} | f_Z)$ have cancelled. It is for this reason that we call such an approximation assumption prior conditional matching (see section 2.2.4). Next we split the integral into a sum of three terms using the multiplication property of the logarithm and then apply the marginalization property to $p(f_*, f_{D \setminus Z} | f_Z)$:

$$\mathrm{KL}[q(f) \,\|\, p(f|Y)] = \int q(f_Z) \log \frac{q(f_Z)}{p(f_Z)} \, df_Z - \int p(f_{D \setminus Z} | f_Z)\, q(f_Z) \log p(Y | f_D) \, df_{D \setminus Z} \, df_Z + \log p(Y) . \qquad (3.6)$$

Identifying the first two terms as $\mathrm{KL}[q(f_Z) \,\|\, p(f_Z)]$ and $-\mathbb{E}_{Q_D}[\log p(Y|f_D)]$ respectively, part (i) of the theorem is proved. To show part (ii) of the theorem it is sufficient to show that

$$\mathrm{KL}[q(f_{D \setminus Z}, f_Z) \,\|\, p(f_{D \setminus Z}, f_Z | Y)] = \mathrm{KL}[q(f_Z) \,\|\, p(f_Z)] - \mathbb{E}_{Q_D}[\log p(Y|f_D)] + \log p(Y) \qquad (3.7)$$

since this can be combined with (i) to give the desired result. The equality in equation (3.7) follows from substituting the necessary densities.

Theorem 1.ii shows us that with this interpretation, the appearance of the inducing values on both sides of Titsias’ KL-divergence in equation (3.1) is just a question of ‘accounting’. That is to say, whilst we are in fact optimizing the KL-divergence between the full distributions, we only need to keep track of the distribution over the function values $f_Z$ and $f_{D \setminus Z}$. All the other function values $f_*$ marginalize. We might choose to optimize our choice of the $M$ inducing inputs $Z$ by selecting them from the $|X|$ points in the index set. For different choices of inducing points we will need to keep track of different function values and be able to safely ignore different values $f_*$. It will be useful in what follows to note that we can use our notation to write

$$\mathrm{KL}[q(f_Z) \,\|\, p(f_Z)] = \mathrm{KL}[Q_Z \,\|\, P_Z] \qquad (3.8)$$

and

$$\mathrm{KL}[q(f_{D \setminus Z}, f_Z) \,\|\, p(f_{D \setminus Z}, f_Z | Y)] = \mathrm{KL}[Q_{Z \cup D} \,\|\, \hat{P}_{Z \cup D}] . \qquad (3.9)$$

The benefit is that when we relax the assumption of this section that densities like $q(f_Z)$ exist, we will still be able to use the more general notation on the right hand side of equations (3.8) and (3.9).
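Theorem 1.ii is easy to verify numerically in the conjugate case, where every distribution involved is multivariate Gaussian. The sketch below assumes only NumPy; the index set, kernel, data and the arbitrary choice of $q(f_Z)$ are all illustrative. It checks that the KL-divergence over the whole index set coincides with Titsias’ KL-divergence over the $D \cup Z$ block alone, the remaining values $f_*$ having marginalized out.

```python
import numpy as np

def gauss_kl(m0, S0, m1, S1):
    """KL[N(m0, S0) || N(m1, S1)] between multivariate Gaussians."""
    d = m0 - m1
    S1inv = np.linalg.inv(S1)
    return 0.5 * (np.trace(S1inv @ S0) + d @ S1inv @ d - len(m0)
                  + np.linalg.slogdet(S1)[1] - np.linalg.slogdet(S0)[1])

x = np.linspace(0.0, 6.0, 7)                         # index set X with |X| = 7
K = np.exp(-0.5 * (x[:, None] - x[None, :])**2) + 1e-8 * np.eye(7)
D, Z = [0, 1, 2], [4, 6]                             # data inputs, inducing inputs
y, noise = np.array([0.5, -0.3, 0.8]), 0.1

# Exact Gaussian posterior p(f | Y) over all 7 function values.
H = np.eye(7)[D]                                     # selects f_D
S_post = np.linalg.inv(np.linalg.inv(K) + H.T @ H / noise)
m_post = S_post @ H.T @ y / noise

# q(f) = p(f_*, f_{D \ Z} | f_Z) q(f_Z) with an arbitrary Gaussian q(f_Z).
mz = np.array([0.3, -0.2])
Sz = np.array([[0.2, 0.05], [0.05, 0.3]])
R = [i for i in range(7) if i not in Z]              # everything except Z
A = K[np.ix_(R, Z)] @ np.linalg.inv(K[np.ix_(Z, Z)])
m_q, S_q = np.zeros(7), np.zeros((7, 7))
m_q[Z], m_q[R] = mz, A @ mz
S_q[np.ix_(Z, Z)] = Sz
S_q[np.ix_(R, Z)] = A @ Sz
S_q[np.ix_(Z, R)] = (A @ Sz).T
S_q[np.ix_(R, R)] = K[np.ix_(R, R)] - A @ K[np.ix_(Z, R)] + A @ Sz @ A.T

idx = D + Z                                          # the D union Z block
kl_full = gauss_kl(m_q, S_q, m_post, S_post)
kl_titsias = gauss_kl(m_q[idx], S_q[np.ix_(idx, idx)],
                      m_post[idx], S_post[np.ix_(idx, idx)])
print(kl_full, kl_titsias)                           # agree to rounding error
```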

3.3 Infinite index set case

3.3.1 There is no useful infinite dimensional Lebesgue measure

One might hope to cope with not only finite index sets but also infinite index sets in the way discussed in section 3.2. Unfortunately when $X$, and hence $f_*$, are infinite sets we cannot integrate with respect to an ‘infinite dimensional vector’. That is to say, the notation $\int (\cdot) \, df_*$ can no longer be correctly used. For a discussion of this see, for example, the introduction of Hunt et al. (1992). We briefly summarize the essentials of the argument here. In finite dimensions, one of the key properties that makes Lebesgue measure convenient to work with is that it is translation invariant. The translation invariance in turn implies a scaling property of $K$ dimensional Lebesgue measure $m$. Suppose, for instance, in two dimensions that we have a square with sides of length $\epsilon$ and area $a$. A square with sides of length $2\epsilon$ must necessarily have area $4a$ because, by translation invariance, we can tile the larger square with four copies of the smaller one. For general dimension $K$, by a similar argument, doubling the side of a hypercube of hypervolume $a$ will give a hypercube of hypervolume $2^K a$. We can now see the issue with infinite dimensional Lebesgue measure. If $a$ is non-zero the scaling relationship cannot hold. But $a$ would need to be non-zero for a usable dominating measure. Thus we see that it will be necessary to rethink our approach to a KL-divergence between stochastic processes. It will turn out that a reasonable definition will require the apparatus of measure theory. Readers looking for some background on these issues may wish to consult a larger text (Billingsley, 1995; Capinski and Kopp, 2004).

3.3.2 The KL-divergence between processes

We need to recall the rigorous definition of the KL-divergence between stochastic processes, which is discussed in section 2.1 and in the text of Gray (2011). Suppose we have two measures $\mu$ and $\eta$ for $(\Omega, \Sigma)$ and that $\mu$ is absolutely continuous with respect to $\eta$. Then there exists a Radon-Nikodym derivative $\frac{d\mu}{d\eta}$ and the correct definition of the KL-divergence between these measures is:

$$\mathrm{KL}[\mu \,\|\, \eta] = \int_{\Omega} \log \frac{d\mu}{d\eta} \, d\mu . \qquad (3.10)$$

In the case where $\mu$ is not absolutely continuous with respect to $\eta$ we let $\mathrm{KL}[\mu \,\|\, \eta] = \infty$. In the case where the sample space is $\mathbb{R}^K$ for some finite $K$ and both measures are dominated by Lebesgue measure $m$ this reduces to the more familiar definition:

$$\mathrm{KL}[\mu \,\|\, \eta] = \int_{\Omega} u \log \left\{ \frac{u}{v} \right\} dm \qquad (3.11)$$

where $u$ and $v$ are the respective densities with respect to Lebesgue measure. The first definition (3.10) is more general and allows us to deal with the problem of there being no sensible infinite dimensional Lebesgue measure by instead integrating with respect to the measure $\mu$.

3.3.3 A general derivation of the sparse inducing point framework

In this section we derive the key equations of the sparse inducing point framework. The derivation is more general than that of Seeger (2003a; 2003b), since it does not require that the inducing points are selected from the data points. Nor does it assume that the relevant finite dimensional marginal distributions have densities with respect to Lebesgue measure. Finally, since the dependence on the elegant properties of Radon-Nikodym derivatives has been made more explicit, we believe it is clearer why the derivation works and how one would generalize it, which will be useful in extensions. We are now interested in three types of probability measure on sets of functions $f : X \to \mathbb{R}$. The first is the prior measure $P$, which will be assumed to be a Gaussian process. The second is the approximating measure $Q$, which will be assumed to be a sparse Gaussian process, and the third is the posterior process $\hat{P}$, which may be Gaussian or non-Gaussian depending on whether we have a conjugate likelihood. We start with a measure theoretic version of Bayes’ theorem for a discriminative, dominated model, which is discussed in section 1.2.2 and by Schervish (1995). It specifies the Radon-Nikodym derivative of the posterior with respect to the prior

$$\frac{d\hat{P}}{dP}(f) = \frac{p_X(Y|f)}{p(Y)} , \qquad (3.12)$$

with $p_X(Y|f)$ being the likelihood and $p(Y) = \int_{\mathbb{R}^X} p_X(Y|f) \, dP(f)$ the marginal likelihood. As we have assumed in previous sections, we will initially restrict the likelihood to only depend on the finite data subset of the index set. As in section 1.1.1, we denote by $\pi_C : \mathbb{R}^X \to \mathbb{R}^C$ a projection function, which takes the whole function as an argument and returns the function at some set of points $C$. In this case we have:

$$\frac{d\hat{P}}{dP}(f) = \frac{d\hat{P}_D}{dP_D}(\pi_D(f)) = \frac{p(Y|\pi_D(f))}{p(Y)} \qquad (3.13)$$

and similarly the marginal likelihood only depends on the function values on the data set: $p(Y) = \int_{\mathbb{R}^D} p(Y|f_D) \, dP_D(f_D)$. In fact, we will relax the assumption that the data set is finite in section 3.6.2, and the ability to do so is one of the benefits of this framework. Next we specify the approximating distribution $Q$ by assuming it has density with respect to the posterior and thus the prior. We assume that the density with respect to the prior depends on some set of points $Z$:

$$\frac{dQ}{dP}(f) = \gamma(\pi_Z(f)) . \qquad (3.14)$$

Here $\gamma$ is some measurable function $\gamma : \mathbb{R}^M \to [0, \infty)$. Note that the inducing output projection function $\pi_Z$, like the data projection function $\pi_D$, is a measurable function when using the product σ-algebra for $f$, because the pre-images of the Borel sets are cylinder sets (see section 1.1.1). The requirement that the marginals of the distribution $Q$ must correctly normalize constrains the function $\gamma$. In fact, we can identify it as the Radon-Nikodym derivative $\frac{dQ_Z}{dP_Z}$ of the marginal distribution $Q_Z$ for the inducing outputs with respect to the corresponding marginal distribution $P_Z$ in the prior. Thus the density of the approximating distribution with respect to the prior takes the form

$$\frac{dQ}{dP}(f) = \frac{dQ_Z}{dP_Z}(\pi_Z(f)) . \qquad (3.15)$$

Under this assumption $Q$ is fully specified if we know $P$, $Z$ and $\frac{dQ_Z}{dP_Z}$. To gain some intuition for this assumption, we can compare equations (3.15) and (3.13). We see that in the approximating distribution the set $Z$ is playing a role similar to the one played by $D$ in the true posterior distribution. Next we give a theorem which is the generalization of theorem 1 to infinite index sets and does not require Lebesgue densities.

Theorem 2. Let $\hat{P}$ be the posterior distribution that arises from a dominated discriminative model, as in equation (3.13). Let $Q$ be the approximating distribution defined by (3.15). Define KL-divergence as in equation (3.10). Then:

(i) The KL-divergence between the approximating process and the posterior process $\mathrm{KL}[Q \,\|\, \hat{P}]$ obeys the equality

$$\mathrm{KL}[Q \,\|\, \hat{P}] = \mathrm{KL}[Q_Z \,\|\, P_Z] - \mathbb{E}_{Q_D}[\log p(Y|f_D)] + \log p(Y) .$$

(ii) The KL-divergence between processes and Titsias’ KL-divergence are equal:

$$\mathrm{KL}[Q \,\|\, \hat{P}] = \mathrm{KL}[Q_{Z \cup D} \,\|\, \hat{P}_{Z \cup D}] .$$

Proof. To prove part (i) we start by applying the chain rule for Radon-Nikodym derivatives and a standard property of logarithms (see for example Gray (2011), corollary 7.1):

$$\mathrm{KL}[Q \,\|\, \hat{P}] = \int_{\mathbb{R}^X} \log \left\{ \frac{dQ}{dP}(f) \right\} dQ(f) - \int_{\mathbb{R}^X} \log \left\{ \frac{d\hat{P}}{dP}(f) \right\} dQ(f) . \qquad (3.16)$$

Taking the first term alone, we exploit the sparsity assumption (3.15) for the approximating distribution and the usual marginalization property of integration:

$$\int_{\mathbb{R}^X} \log \left\{ \frac{dQ}{dP}(f) \right\} dQ(f) = \int_{\mathbb{R}^Z} \log \left\{ \frac{dQ_Z}{dP_Z}(f_Z) \right\} dQ_Z(f_Z) = \mathrm{KL}[Q_Z \,\|\, P_Z] \qquad (3.17)$$

Taking the second term in the last line of equation (3.16), then exploiting the measure theoretic Bayes’ theorem (3.13) and marginalization, we obtain:

$$\int_{\mathbb{R}^X} \log \left\{ \frac{d\hat{P}}{dP}(f) \right\} dQ(f) = \int_{\mathbb{R}^D} \log \left\{ \frac{d\hat{P}_D}{dP_D}(f_D) \right\} dQ_D(f_D) = \mathbb{E}_{Q_D}[\log p(Y|f_D)] - \log p(Y) . \qquad (3.18)$$

Finally, combining equations (3.17) and (3.18) we prove this part of the theorem. To prove part (ii) of the theorem, if the relevant densities exist we can again use equation (3.7), which in the more general notation is written as

$$\mathrm{KL}[Q_{Z \cup D} \,\|\, \hat{P}_{Z \cup D}] = \mathrm{KL}[Q_Z \,\|\, P_Z] - \mathbb{E}_{Q_D}[\log p(Y|f_D)] + \log p(Y) . \qquad (3.19)$$

Combining this equality with part (i) we may consequently recover the result in part (ii). To remove the dependence of equation (3.19) on densities with respect to Lebesgue measure we can note that:

$$\frac{dQ_{Z \cup D}}{dP_{Z \cup D}}(f_{Z \cup D}) = \frac{dQ_Z}{dP_Z}(\pi_{Z \cup D \to Z}(f_{Z \cup D})) \qquad (3.20)$$

and that

$$\frac{d\hat{P}_{Z \cup D}}{dP_{Z \cup D}}(f_{Z \cup D}) = \frac{p(Y | \pi_{Z \cup D \to D}(f_{Z \cup D}))}{p(Y)} . \qquad (3.21)$$

Substituting these definitions into $\mathrm{KL}[Q_{Z \cup D} \,\|\, \hat{P}_{Z \cup D}]$ and following a similar argument to the proof of part (i), we obtain the result (3.19) and hence part (ii) is proven in the general case.

Theorem 2.i is used throughout the rest of this thesis. As is common with variational approximations, in most cases of interest the marginal likelihood will be intractable. However, since it is an additive constant, independent of $Q$, it can be safely ignored. The final equation shows that we need to be able to compute the KL-divergence between the inducing point marginals of the approximating distribution and the prior for all $Z \subset X$, and the expectation under the data marginal distribution of $Q$ of the log likelihood. In the case where the likelihood factorizes across data terms this will give a sum of one dimensional expectations. Note the similarity of theorem 2.i with Hensman et al. (2015b), equation (17), where a less general expression is motivated from a ‘model augmentation’ view.

Theorem 2.ii is important because the right hand side of the equation is used in Titsias’ paper (2009a). This means that derivations in Titsias’ paper that use this KL-divergence as a starting point also apply to the left hand side of the equation, namely the KL-divergence between the approximating and posterior processes. In other words, Titsias’ approximation framework may be interpreted as minimizing a rigorously defined KL-divergence between the approximating and posterior processes. This provides an alternative motivation to the augmentation argument that Titsias used to justify the use of $\mathrm{KL}[Q_{Z \cup D} \,\|\, \hat{P}_{Z \cup D}]$ in his original paper. In section 3.4 we will discuss some problems with these augmentation style arguments. Notice that at no point in our derivation did we try to invoke the pathological ‘infinite dimensional Lebesgue measure’, which is important for the reasons discussed in section 3.3.1. The ease of derivation suggests that Radon-Nikodym derivatives and measure theory provide the most natural and general way to think about such approximations.

3.3.4 The effect of increasing the number of inducing points

We now discuss circumstances under which increasing the number of inducing points will give an approximation to the posterior which has a monotonically decreasing KL-divergence between the approximating process and the posterior process. Monotonicity is a desirable property, because loosely it tells us that expending extra computational resources in the form of extra inducing points can only improve or maintain the quality of approximation. Titsias (2009a, section 3.1) discusses the related case of greedy selection of inducing points from the training set for regression. Recently and independently, Bauer et al. (2016) provide a proof for regression with general inducing points. The proof we give is more general and does not require detailed linear algebra considerations. We would also argue that it is clearer in our proof why the inequality must hold. Thus this section can be seen as leveraging the theory of the current work to characterize and understand a broad set of circumstances under which monotonic improvement can be guaranteed.

To prove our monotonicity result, we will need to set up some additional notation. Let $\mathcal{Q}^{(M)}$ be the set of possible $M$ inducing point variational approximating distributions. Let $\mathrm{KL}^{(M)}_{\mathrm{opt}}$ be the optimal KL-divergence attainable within the variational family $\mathcal{Q}^{(M)}$, that is to say:

$$\mathrm{KL}^{(M)}_{\mathrm{opt}} := \min_{Q^{(M)} \in \mathcal{Q}^{(M)}} \mathrm{KL}[Q^{(M)} \,\|\, \hat{P}] \qquad (3.22)$$

Having a rigorously defined, one sided optimization objective allows us to use the following lemma:

Lemma 1. A sequence of variational approximations that has the property

$$\mathcal{Q}^{(M+1)} \supseteq \mathcal{Q}^{(M)}$$

will have optimal KL-divergences that are a monotonically decreasing function of $M$:

$$\mathrm{KL}^{(M+1)}_{\mathrm{opt}} \leq \mathrm{KL}^{(M)}_{\mathrm{opt}} .$$

Proof. The strictly larger variational family must always give at least as small an optimal KL-divergence, since $\mathcal{Q}^{(M+1)}$ must necessarily contain the optimal distribution from $\mathcal{Q}^{(M)}$.

The condition of the lemma is in fact stronger than strictly necessary, since in fact we only require that $\mathcal{Q}^{(M+1)}$ contain the optimal member of $\mathcal{Q}^{(M)}$. The stronger condition is the one applied in what follows. We have reduced the question of when the bound holds to a question about the sequence of variational approximating families as we increase the number of inducing points. Recall that the variational approximation is fully specified by two objects. Firstly, we require the positions of the inducing points $Z \in X^M$. Second, we require a probability measure on the function values at the inducing points, $Q_Z^{(M)} \in \mathcal{Q}^{(M)}_{\mathrm{ind}}$, where $\mathcal{Q}^{(M)}_{\mathrm{ind}}$ is the corresponding variational family for $M$ inducing points. The next theorem gives some conditions under which such a sequence of variational distributions will have optimal KL-divergences that are a monotonically decreasing function of $M$.

Theorem 3. Suppose that for all numbers of inducing points $M$, sets of $M$ inducing inputs $Z \in X^M$ and corresponding inducing output distributions $Q_Z^{(M)} \in \mathcal{Q}^{(M)}_{\mathrm{ind}}$, there is some additional inducing input $\beta \in X$ and $M+1$ inducing output distribution $Q^{(M+1)}_{\beta \cup Z} \in \mathcal{Q}^{(M+1)}_{\mathrm{ind}}$, such that the following condition holds:

$$Q^{(M+1)}_{Z \cup \beta} = Q^{(M)}_{Z \cup \beta} .$$

That is to say, the $M+1$ inducing output distribution is equal to the corresponding marginal of $Q^{(M)}$. Then the optimal KL-divergence will be a monotonically decreasing function of $M$:

$$\mathrm{KL}^{(M+1)}_{\mathrm{opt}} \leq \mathrm{KL}^{(M)}_{\mathrm{opt}} .$$

We remark that in the case where the relevant densities with respect to Lebesgue measure exist, the two conditions of the theorem correspond to there being some $\beta \in X$ such that the inducing output families obey the relation

$$q^{(M+1)}(f_\beta, f_Z) = p(f_\beta | f_Z)\, q^{(M)}(f_Z) . \qquad (3.23)$$

Proof. The proof relies on showing that the conditions of lemma 1 are met. This will be the case if for every member of $\mathcal{Q}^{(M)}$ there corresponds at least one element of $\mathcal{Q}^{(M+1)}$ which gives an equal variational distribution. The conditions of theorem 3 ensure precisely such a correspondence. Consider the definitions of the two distributions, specifically

$$\frac{dQ^{(M)}}{dP}(f) = \frac{dQ^{(M)}_Z}{dP_Z}(\pi_Z(f)) \qquad (3.24)$$

and

$$\frac{dQ^{(M+1)}}{dP}(f) = \frac{dQ^{(M+1)}_{\beta \cup Z}}{dP_{\beta \cup Z}}(\pi_{\beta \cup Z}(f)) . \qquad (3.25)$$

Clearly the measures $Q^{(M)}$ and $Q^{(M+1)}$ will be equal if the following relation holds

$$\frac{dQ^{(M)}_Z}{dP_Z}(\pi_Z(f)) = \frac{dQ^{(M+1)}_{\beta \cup Z}}{dP_{\beta \cup Z}}(\pi_{\beta \cup Z}(f)) . \qquad (3.26)$$

Applying the condition of the theorem to the right hand side of this equation we obtain

$$\frac{dQ^{(M+1)}_{\beta \cup Z}}{dP_{\beta \cup Z}}(\pi_{\beta \cup Z}(f)) = \frac{dQ^{(M)}_{\beta \cup Z}}{dP_{\beta \cup Z}}(\pi_{\beta \cup Z}(f)) . \qquad (3.27)$$

But for index set elements such as $\beta$ which are not inducing inputs $Z$, $Q^{(M)}$ has the same conditionals as the prior, so

$$\frac{dQ^{(M)}_{\beta \cup Z}}{dP_{\beta \cup Z}}(\pi_{\beta \cup Z}(f)) = \frac{dQ^{(M)}_Z}{dP_Z}(\pi_Z(f)) . \qquad (3.28)$$

The relation (3.26) therefore holds and consequently the theorem is proved. In the case where the relevant densities exist, the relation (3.23) would allow us to prove this result without using as much formal measure theory.
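The correspondence used in this proof can be checked numerically on a finite index set, where all the measures involved are multivariate Gaussians. The sketch below assumes only NumPy, with an illustrative kernel, inducing set and $q(f_Z)$; it extends a two point inducing approximation by a third point $\beta$ using the prior conditional, as in relation (3.23), and confirms that the implied distribution over the whole index set is unchanged.

```python
import numpy as np

x = np.linspace(0.0, 5.0, 6)                           # a finite index set
K = np.exp(-0.5 * (x[:, None] - x[None, :])**2) + 1e-8 * np.eye(6)

def q_joint(Z, mz, Sz):
    """The Gaussian over all 6 points implied by the prior conditional construction."""
    R = [i for i in range(6) if i not in Z]
    A = K[np.ix_(R, Z)] @ np.linalg.inv(K[np.ix_(Z, Z)])
    m, S = np.zeros(6), np.zeros((6, 6))
    m[Z], m[R] = mz, A @ mz
    S[np.ix_(Z, Z)] = Sz
    S[np.ix_(R, Z)] = A @ Sz
    S[np.ix_(Z, R)] = (A @ Sz).T
    S[np.ix_(R, R)] = K[np.ix_(R, R)] - A @ K[np.ix_(Z, R)] + A @ Sz @ A.T
    return m, S

# An M = 2 approximation with an arbitrary Gaussian q(f_Z) ...
Z = [1, 4]
mz = np.array([0.5, -0.5])
Sz = np.array([[0.3, 0.1], [0.1, 0.2]])
m2, S2 = q_joint(Z, mz, Sz)

# ... extended to M = 3 with beta = 3 via the prior conditional, as in (3.23).
beta = 3
Azb = K[np.ix_([beta], Z)] @ np.linalg.inv(K[np.ix_(Z, Z)])  # 1 x 2
cross = Azb @ Sz                                             # Cov(f_beta, f_Z)
vbeta = K[beta, beta] - Azb @ K[np.ix_(Z, [beta])] + Azb @ Sz @ Azb.T
mz3 = np.concatenate([mz, Azb @ mz])
Sz3 = np.block([[Sz, cross.T], [cross, vbeta]])
m3, S3 = q_joint(Z + [beta], mz3, Sz3)

print(np.allclose(m2, m3), np.allclose(S2, S3))              # True True
```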

This theorem has a number of implications for the literature which we list as corollaries.

Corollary 3.1. For fixed hyperparameters, the uncollapsed Gaussian likelihood regression approximation of Hensman et al. (2013) and the earlier collapsed Gaussian likelihood regression approximation of Titsias (2009a) have a monotonically decreasing optimal KL-divergence as we add inducing points.

Proof. As we have shown in this chapter, both of these frameworks can be viewed as minimizing the KL-divergence between the approximating process $Q$ and the posterior process $\hat{P}$. Hensman et al. (2013) use an explicitly parameterized family of multivariate normal distributions for $\mathcal{Q}^{(M)}_{\mathrm{ind}}$, with a full covariance matrix. These variational families are expressive enough to meet the conditions of the theorem. For the Titsias framework the variational family for the inducing outputs is free form. Since the free form solution is necessarily Gaussian, we can apply the same logic as we do for the method of Hensman et al. (2013). The only difference for our current purposes is that the optimization of the inducing output distribution is done analytically.

As already discussed, the special case of corollary 3.1 where the inducing inputs are selected from the data inputs was proved by Titsias (2009b), and the general result for Titsias’ framework was recently proved independently using linear algebra by Bauer et al. (2016). The next two corollaries are for future reference, since they apply to the contributions of the rest of this thesis.

Corollary 3.2. For fixed hyperparameters, the non-Gaussian likelihood approximation of chapter 4 has a monotonically decreasing optimal KL-divergence as we add inducing points.

Proof. Since the inducing output distributions are again full covariance Gaussian, this follows from similar considerations to those applied to the work of Hensman et al. (2013) in the proof of corollary 3.1.

Corollary 3.3. The non-Gaussian likelihood approximation of chapter 5 has a monotonically decreasing optimal KL-divergence as we add inducing points.

Proof. Since the hyperparameters are given a Bayesian prior and then approximated variationally in this framework, they do not need to be fixed to apply the theorem. It is necessary to modify the conditions of the theorem to allow for a joint variational distribution of inducing outputs and hyperparameters, but this goes through straightforwardly.

It is worth noting (as we discuss in section 4.2.2) that many other sparse approximation methods currently in use do not have guarantees on the approximation quality as the number of inducing points is increased. We also note that if there exists some number of inducing points $M$ (usually this is when $M = N$) such that the true posterior distribution is a member of $\mathcal{Q}^{(M)}$, then the KL-divergence will be zero and adding further inducing points will have no utility. An empirical study of the optimal KL-divergence as a function of the number of inducing points $M$ was made for a simple regression example in section 2.5.

3.4 Augmented index sets

3.4.1 The augmentation argument.

The most important example of what we term the ‘augmentation argument’ is illustrated by this quote from Titsias’ original paper:

A principled procedure to specify $q(f_Z)$ and the inducing inputs $Z$ is to form the variational distribution $q(f_D)$ and the exact posterior $p(f_D \mid y)$ on the training function values $f_D$, and then minimize a distance between these two distributions. Equivalently, we can minimize a distance between the augmented true posterior $p(f_{D \setminus Z}, f_Z \mid y)$ and the augmented variational posterior $q(f_{D \setminus Z}, f_Z)$ where clearly from eq. (5) $q(f_{D \setminus Z}, f_Z) = p(f_{D \setminus Z} \mid f_Z)\, q(f_Z)$

— Michalis K. Titsias. AISTATS 2009.

The only changes we have made to this quote are to adapt the notation to our own and to allow for the possibility that there may be some intersection between the set of inducing inputs $Z$ and the set of data inputs $D$. The latter change is not essential to our argument.

The quote is not accompanied by an equation, so it has a degree of ambiguity. The range of possibilities seems well covered by the following two propositions.

Proposition 1. Using the sparse variational framework, the following equality between KL-divergences always holds:

\[
\mathrm{KL}\big[Q^{(\theta)}_{D} \,\big\|\, \hat{P}_{D}\big] = \mathrm{KL}\big[Q^{(\theta)}_{D \cup Z} \,\big\|\, \hat{P}^{(\theta)}_{D \cup Z}\big]
\]

Proposition 2. Using the sparse variational framework, the following equality between optima of KL-divergences always holds:

\[
\arg\min_{\theta} \Big\{ \mathrm{KL}\big[Q^{(\theta)}_{D} \,\big\|\, \hat{P}_{D}\big] \Big\} = \arg\min_{\theta} \Big\{ \mathrm{KL}\big[Q^{(\theta)}_{D \cup Z} \,\big\|\, \hat{P}^{(\theta)}_{D \cup Z}\big] \Big\}
\]

We will show conclusively that neither of these propositions is true. Note that for this section only, we have explicitly shown the dependence of the distributions on those parameters $\theta$ which are treated as variational parameters. As already noted, such parameters appear on the right hand side of Titsias' KL-divergence $\mathrm{KL}[Q^{(\theta)}_{D \cup Z} \,\|\, \hat{P}^{(\theta)}_{D \cup Z}]$. Before addressing the falsehood of the propositions, we will formulate a slightly more general version of the augmentation argument, of which Titsias' argument will be a special case. The model we have been working with up until this point is:

\[
f_X \sim P_X \tag{3.29}
\]
\[
Y \mid f_D \sim p(Y \mid f_D) \,. \tag{3.30}
\]

The idea of sparse variational augmentation is to consider some extra function values $f_I$ generated from a distribution that may depend on some parameters $\theta$ that are alleged to be variational parameters:

\[
f_I \mid f_X, \theta \sim P^{(\theta)}_{I \mid X} \,. \tag{3.31}
\]

Whatever the conditional distribution $P^{(\theta)}_{I \mid X}$ is, the augmented model will be consistent with the original model under marginalization of $f_I$. The approximating measure $Q_{X \cup I}$ is defined by specifying its density with respect to the augmented prior $P_{X \cup I}$. This density takes the form of a projection onto the augmented function values:

\[
\frac{dQ^{(\theta)}_{X \cup I}}{dP^{(\theta)}_{X \cup I}}(f_{X \cup I}) = \frac{dQ^{(\theta)}_{I}}{dP^{(\theta)}_{I}}(\pi_I(f_{X \cup I})) \,. \tag{3.32}
\]

This corresponds to the original augmentation argument given by Titsias (2009a) when $X = D$ and $I = Z$. It will be seen that this is also very much in the spirit of the 'variational compression' framework of Hensman and Lawrence (2014). Acting as if the augmented set were the original index set, we obtain by a similar argument to section 3.3.3:

\[
\mathrm{KL}\big[Q^{(\theta)}_{X \cup I} \,\big\|\, \hat{P}^{(\theta)}_{X \cup I}\big] = \mathrm{KL}\big[Q^{(\theta)}_{I} \,\big\|\, P^{(\theta)}_{I}\big] - \mathbb{E}_{Q_D}[\log p(Y \mid f_D)] + \log p(Y) \,. \tag{3.33}
\]

Readers will note, however, the appearance of the parameters $\theta$ on both sides of the left hand KL-divergence. As with the Titsias argument, which is a special case, the question arises as to whether this is equivalent to minimizing the KL-divergence on the original $X$ space. To resolve this question, we will use the chain rule for KL-divergences, which we now pause to describe.

3.4.2 The chain rule for KL-divergences

Let $U$ and $V$ be two Polish spaces endowed with their standard Borel $\sigma$-algebras and let $U \times V$ be the Cartesian product of these spaces endowed with the corresponding product $\sigma$-algebra.

Consider two probability measures $\mu_{U \times V}$, $\eta_{U \times V}$ on this product space and let $\mu_{U \mid V}$, $\eta_{U \mid V}$ be the corresponding regular conditional measures. Assume that $\mu_{U \times V}$ is dominated by $\eta_{U \times V}$. The chain rule for KL-divergences says that:

\[
\mathrm{KL}[\mu_{U \times V} \,\|\, \eta_{U \times V}] = \mathbb{E}_{\mu_V}\big[\mathrm{KL}[\mu_{U \mid V} \,\|\, \eta_{U \mid V}]\big] + \mathrm{KL}[\mu_V \,\|\, \eta_V] \,. \tag{3.34}
\]

The first term on the right hand side is referred to as the 'conditional relative entropy'. Further detail on the chain rule for KL-divergences can be found in the text of Gray (2011).
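The chain rule is easy to verify numerically in the Gaussian case. The following sketch (our own illustrative code, not part of the thesis; all names and parameter values are our choices) checks (3.34) for a pair of bivariate Gaussian measures, computing the joint and marginal divergences analytically and the conditional relative entropy by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_kl(m0, S0, m1, S1):
    """KL[N(m0, S0) || N(m1, S1)] for full covariance Gaussians."""
    d = len(m0)
    S1_inv = np.linalg.inv(S1)
    diff = m1 - m0
    return 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff - d
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

def conditional(m, S, v):
    """Mean and variance of U | V = v when (U, V) is bivariate Gaussian."""
    mean = m[0] + S[0, 1] / S[1, 1] * (v - m[1])
    var = S[0, 0] - S[0, 1] ** 2 / S[1, 1]
    return np.array([mean]), np.array([[var]])

# Two measures on U x V, with U indexed first and V second.
m_mu, S_mu = np.array([0.5, -0.3]), np.array([[1.0, 0.4], [0.4, 2.0]])
m_eta, S_eta = np.array([0.0, 0.0]), np.array([[1.5, -0.2], [-0.2, 1.0]])

lhs = gaussian_kl(m_mu, S_mu, m_eta, S_eta)  # joint KL

# Marginal KL on V plus Monte Carlo estimate of the conditional relative entropy.
kl_v = gaussian_kl(m_mu[1:], S_mu[1:, 1:], m_eta[1:], S_eta[1:, 1:])
vs = rng.normal(m_mu[1], np.sqrt(S_mu[1, 1]), size=20000)  # v ~ mu_V
cond = np.mean([gaussian_kl(*conditional(m_mu, S_mu, v),
                            *conditional(m_eta, S_eta, v)) for v in vs])

print(lhs, cond + kl_v)  # the two agree up to Monte Carlo error
```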

3.4.3 The augmentation argument is not correct in general

We are now ready to state the key result of this section as a theorem.

Theorem 4. Minimizing the augmented KL-divergence $\mathrm{KL}[Q^{(\theta)}_{X \cup I} \,\|\, \hat{P}^{(\theta)}_{X \cup I}]$ is not equivalent to minimizing the KL-divergence on the original index set, in the sense that:

(i) The two do not have the same numerical value. This is because

\[
\mathrm{KL}\big[Q^{(\theta)}_{X \cup I} \,\big\|\, \hat{P}^{(\theta)}_{X \cup I}\big] = \mathbb{E}_{Q^{(\theta)}_X}\Big[\mathrm{KL}\big[Q^{(\theta)}_{I \mid X} \,\big\|\, P^{(\theta)}_{I \mid X}\big]\Big] + \mathrm{KL}\big[Q^{(\theta)}_{X} \,\big\|\, \hat{P}_{X}\big]
\]

and $\mathbb{E}_{Q^{(\theta)}_X}\big[\mathrm{KL}[Q^{(\theta)}_{I \mid X} \,\|\, P^{(\theta)}_{I \mid X}]\big]$ is not necessarily zero.

(ii) The two do not necessarily have the same optimal parameters, in the sense that

\[
\arg\min_{\theta} \Big\{ \mathrm{KL}\big[Q^{(\theta)}_{X \cup I} \,\big\|\, \hat{P}^{(\theta)}_{X \cup I}\big] \Big\} \neq \arg\min_{\theta} \Big\{ \mathrm{KL}\big[Q^{(\theta)}_{X} \,\big\|\, \hat{P}_{X}\big] \Big\} \,.
\]

Proof. To prove part (i) we apply the chain rule for KL-divergences to $\mathrm{KL}[Q^{(\theta)}_{X \cup I} \,\|\, \hat{P}^{(\theta)}_{X \cup I}]$ and then note that conditional independence implies that $\hat{P}^{(\theta)}_{I \mid X} = P^{(\theta)}_{I \mid X}$. To show that the conditional relative entropy term is not necessarily zero, we can take as an example the case where $I$ and $X$ are finite and all relevant densities exist. Applying Bayes' theorem we have

\[
q(f_I \mid f_X) = \frac{p(f_X \mid f_I)\, q(f_I)}{p(f_X)} \,.
\]

Therefore it is clear by inspection that

\[
q(f_I \mid f_X) \neq p(f_I \mid f_X)
\]

in general. Thus the KL-divergence inside the expectation is not necessarily 0.

Given the truth of part (i) it would seem unlikely that (ii) is false, but this does not constitute a proof. It suffices to show a single example where the minima are different, which we do in appendix B.

Since they are important special cases, we detail the consequences of the theorem for the original Titsias argument as corollaries. They can be immediately recovered from the theorem by taking $X = D$ and $I = Z$.

Corollary 4.1. Proposition 1 is false. That is to say, in general:

\[
\mathrm{KL}\big[Q^{(\theta)}_{D} \,\big\|\, \hat{P}_{D}\big] \neq \mathrm{KL}\big[Q^{(\theta)}_{D \cup Z} \,\big\|\, \hat{P}^{(\theta)}_{D \cup Z}\big]
\]

Corollary 4.2. Proposition 2 is false. That is to say, in general:

\[
\arg\min_{\theta} \Big\{ \mathrm{KL}\big[Q^{(\theta)}_{D} \,\big\|\, \hat{P}_{D}\big] \Big\} \neq \arg\min_{\theta} \Big\{ \mathrm{KL}\big[Q^{(\theta)}_{D \cup Z} \,\big\|\, \hat{P}^{(\theta)}_{D \cup Z}\big] \Big\}
\]

Given the issues with the original motivation in Titsias' paper, where does this leave the method? The answer is that it is still valuable. Although neither proposition 1 nor proposition 2 can be used to motivate the use of the KL-divergence (3.1), as already remarked, theorem 2.ii does. Thus everything that follows from the use of the KL-divergence (3.1) still applies. This includes Titsias' (2009a) elegant solution for the optimal inducing output distribution in the Gaussian likelihood case. Another advantage of using the full KL-divergence $\mathrm{KL}[Q_X \,\|\, \hat{P}_X]$ as a starting point is that it includes the posterior over new function values that we may wish to predict as part of the optimization objective.

There is a generalization of the sparse variational framework which allows a more expressive variational distribution, whilst maintaining $\mathrm{KL}[Q_X \,\|\, \hat{P}_X]$ as its optimization objective. Although it can be understood as a special case of augmentation where the model augmentation is conditionally deterministic, it is clearer to introduce it as a standalone method, which we will do in the next section.

3.5 Variational inducing random variables

In this section, we consider a generalization of the variational inducing point framework derived in section 3.3.3. We call this generalization the variational inducing random variables framework. Currently, the main application of this framework is to the rigorous treatment of the variational interdomain approximations given in section 3.6.1.

Our starting point is the definition of the variational approximation given in equation 3.14. We replace the projection $\pi_Z : \mathbb{R}^X \mapsto \mathbb{R}^M$ with a general measurable transformation function $h_\theta : \mathbb{R}^X \mapsto \mathbb{R}^M$, with parameters $\theta$. The definition of the variational approximation therefore becomes

\[
\frac{dQ}{dP}(f) = \gamma(h_\theta(f)) \,. \tag{3.35}
\]

We define the inducing random variables $f_R$ by the equation

\[
f_R = h_\theta(f) \,. \tag{3.36}
\]

Although the inducing random variables depend on the parameters $\theta$, we will not show this explicitly in order to keep the notation concise. For any measure $\mu$ on the function space of the index set $X$, the measure $\mu_R$ will denote the distribution of the inducing random variables, defined for any Borel set $A$ in the standard way:

\[
\mu_R(A) = \mu_X(h_\theta^{-1}(A)) \,. \tag{3.37}
\]

By a similar reasoning to that followed in section 3.3.3, $\gamma$ may be replaced with the Radon-Nikodym derivative $\frac{dQ_R}{dP_R}$ to give

\[
\frac{dQ}{dP}(f) = \frac{dQ_R}{dP_R}(h_\theta(f)) \,. \tag{3.38}
\]

The variational approximations of section 3.3.3 are a special case of this approximation with $h_\theta = \pi_Z$. We now give a theorem on the KL-divergence between this approximating process and the posterior process $\hat{P}$.

Theorem 5. Let the approximating measure be defined by (3.38) and the posterior measure be given by (3.13). Then the KL-divergence $\mathrm{KL}[Q \,\|\, \hat{P}]$ between these processes obeys the relation

\[
\mathrm{KL}[Q \,\|\, \hat{P}] = \mathrm{KL}[Q_R \,\|\, P_R] - \mathbb{E}_{Q_D}[\log p(Y \mid f_D)] + \log p(Y) \,.
\]

Proof. The proof of this result follows a similar route to that of theorem 2.i, replacing $f_Z$ with $f_R$.

The main caveat on the application of this theorem is the need to show the measurability of the transformation $h_\theta$. By contrast to the special case $h_\theta = \pi_Z$, which was discussed in section 3.3.3, showing measurability in the general case can be somewhat involved.

In the language of section 3.4, this approximation may be viewed as conditionally deterministic augmentation. That is to say, the inducing random variables framework corresponds to an augmentation framework where the augmentation variables are deterministic conditional on the whole function $f_X$. This result is proven in Appendix C. By contrast to the augmentation framework without the 'deterministic conditional' caveat, theorem 5 tells us that the inducing random variable approximation has $\mathrm{KL}[Q \,\|\, \hat{P}]$ as its objective function.

For the inducing random variable approximation to be practical, the resulting approximating distribution $Q$ must be tractable along with $\mathrm{KL}[Q_R \,\|\, P_R]$ and $\mathbb{E}_{Q_D}[\log p(Y \mid f_D)]$. It is easier to ensure this if the transformation function $h_\theta$ is a linear functional, because then $f_R$ will have a Gaussian joint distribution with $f_X$. An important example of such a linear transformation is the family of interdomain approximations, where $h_\theta$ is defined in terms of an integral. This will be the first example in the next section.

3.6 Examples

3.6.1 Variational interdomain approximations

Here we consider the sparse variational interdomain approximation, which was suggested but not realized in Figueiras-Vidal and Lázaro-Gredilla (2009) and appeared, justified by a marginal consistency argument, in Alvarez et al. (2010). As already mentioned, interdomain random variables are a special case of the inducing random variables framework of the previous section. In particular, the transformation $h_\theta$, which in this case is a linear functional, is defined in terms of a set of integrals over the function $f$. Let $r \in R$ index the inducing random variables $f_R$ so that each value $f_r$ is a real number. Associate with each $r \in R$ a set of variational parameters $\theta_r$. We may then define $f_r$ as

\[
f_r = \int_X g(x, \theta_r) f_x \, d\lambda(x) \tag{3.39}
\]

where $\lambda$ is a measure on $X$ with some appropriate $\sigma$-algebra. $g$, as a function of its first argument, is a $\lambda$-integrable function from $X$ to $\mathbb{R}$ for all values of its second argument $\theta_r$. As a specific example of an interdomain variable (Figueiras-Vidal and Lázaro-Gredilla, 2009), consider the case where the index set $X$ is the real line and $g$ takes the form of a wavelet:

\[
g(x, \omega_r, t_r, b_r) = \cos(\omega_r (x - t_r)) \exp\left(-\frac{(x - t_r)^2}{2 b_r^2}\right) \tag{3.40}
\]

where in this instance $\theta_r = (\omega_r, t_r, b_r)$ is the set of variational parameters. Since, as $\omega_r, b_r \to 0$, the function $g$ becomes a Dirac delta function centred on $t_r$, such a variational approximation generalizes the standard sparse method.

Since the intention here is to put the general variational interdomain framework on a firm logical footing, we should also consider the thorny issue of the measurability of the associated transformation $h_\theta$. The existence of separable, measurable versions of stochastic processes, including most commonly used Gaussian processes, was settled in the work of Doob (1953). It also discusses the conditions necessary to apply Fubini's theorem to expectations of the random variable defined by equation (3.39). The application of Fubini's theorem is essential to the utility of such methods in practice (Figueiras-Vidal and Lázaro-Gredilla, 2009). Section 3.5 tells us we may correctly optimize the parameters $\theta$ of interdomain inducing points, safe in the knowledge that this decision is variationally protected from overfitting and optimizes a well defined KL-divergence objective. The potential for a wide variety of improved sparse approximations in this direction is thus, in our opinion, significant.
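To make the interdomain construction concrete, the following sketch (our own illustration, not thesis code; the grid, kernel and parameter values are assumptions) approximates the integral (3.39) for the wavelet (3.40) with a simple rectangle rule, and checks on prior samples that $\mathrm{cov}(f_r, f(s))$ matches the corresponding integral of the kernel:

```python
import numpy as np

def g(x, omega, t, b):
    """Wavelet integrand of equation (3.40)."""
    return np.cos(omega * (x - t)) * np.exp(-(x - t) ** 2 / (2.0 * b ** 2))

def rbf_kernel(x1, x2, variance=1.0, lengthscale=0.5):
    return variance * np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / lengthscale ** 2)

# A dense grid standing in for the index set X = [0, 5].
xs = np.linspace(0.0, 5.0, 2000)
dx = xs[1] - xs[0]
K = rbf_kernel(xs, xs)
omega, t, b = 2.0, 2.5, 0.4  # an example setting of theta_r

# Prior samples of f on the grid, and the induced samples of f_r (rectangle rule).
L = np.linalg.cholesky(K + 1e-10 * np.eye(len(xs)))
f = L @ np.random.default_rng(1).standard_normal((len(xs), 5000))
f_r = dx * (g(xs, omega, t, b)[:, None] * f).sum(axis=0)

# cov(f_r, f(s)) should equal int g(x, theta_r) k(x, s) dx for a zero mean prior.
s = 800  # grid index of a test location
print(np.mean(f_r * f[s]), dx * (g(xs, omega, t, b) * K[:, s]).sum())
```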

3.6.2 Approximations to Cox process posteriors

In this section, we relax the assumption that the data set $D$ is finite, which is necessary to consider Gaussian process based Cox processes. Sparse variational inference for one specific case of this model is considered by Lloyd et al. (2015). A Gaussian process based Cox process has the following generative scheme:

\[
f \sim \mathcal{GP}(m, K)
\]
\[
h = \rho(f)
\]
\[
Y \mid h \sim \mathcal{PP}(h) \,. \tag{3.41}
\]

Here $\mathcal{GP}(m, K)$ denotes a Gaussian process with mean $m$ and kernel $K$. $\rho : \mathbb{R} \mapsto (0, \infty)$ is an inverse link function which is applied pointwise to the function $f$. $\mathcal{PP}(h)$ is a Poisson process with intensity $h$, and $D$ is a set of points in the original index set $X$. For example, in a geographical spatial statistics application we might take $X$ to be some bounded subset of $\mathbb{R}^2$.

The key issue with the Poisson process likelihood is that it depends not just on those members of $X$ where points were observed but in fact on all points in $X$. Intuitively, the absence of points in an area suggests that the intensity is lower there. Thus $D = X$. The likelihood in question is:

\[
p(Y \mid f_D) = \left( \prod_{y \in Y} h(y) \right) \exp\left( -\int_X h(x) \, dm(x) \right) \tag{3.42}
\]
\[
= \left( \prod_{y \in Y} \rho(f(y)) \right) \exp\left( -\int_X \rho(f(x)) \, dm(x) \right)
\]

where $m$ denotes, for instance, Lebesgue measure on $X$. The full $X$ dependence manifests itself through the integral on the right hand side. We will require that the integral exists almost surely.

In Lloyd et al. (2015) equation (3), the application of Bayes' theorem appears to require a density with respect to infinite dimensional Lebesgue measure, as does the integral in equation (9) of that paper. As pointed out in section 3.3.1, such a notion is pathological. This, however, can be fixed, because the more general form of Bayes' theorem in equation (3.12) of this chapter can still be applied:

\[
\frac{d\hat{P}}{dP}(f) = \frac{\prod_{y \in Y} \rho(f(y)) \exp\left( -\int_X \rho(f(x)) \, dm(x) \right)}{\int \prod_{y \in Y} \rho(f(y)) \exp\left( -\int_X \rho(f(x)) \, dm(x) \right) dP(f)} \,. \tag{3.43}
\]

Note that it is necessary to consider the measurability of the right hand side of this equation, just as it was for the integral transformation (3.39). One route here is to use an almost surely continuous modification of $f$ as discussed by Møller et al. (1998). Further, we need to use the KL-divergence (3.10) that is valid in infinite dimensional cases. Defining the approximating distribution $Q$ as in equation (3.15), we can apply the results of section 3.3.3 to obtain:

\[
\mathrm{KL}[Q \,\|\, \hat{P}] = \mathrm{KL}[Q_Z \,\|\, P_Z] - \sum_{y \in Y} \mathbb{E}_{Q_y}[\log h(y)] + \mathbb{E}_{Q_X}\left[ \int_X h(x) \, dm(x) \right] + \log p(Y) \,. \tag{3.44}
\]

As in section 3.6.1, we will need to check that the conditions for Fubini's theorem apply (Doob, 1953), which gives:

\[
\mathrm{KL}[Q \,\|\, \hat{P}] = \mathrm{KL}[Q_Z \,\|\, P_Z] - \sum_{y \in Y} \mathbb{E}_{Q_y}[\log h(y)] + \int_X \mathbb{E}_{Q_x}[h(x)] \, dm(x) + \log p(Y) \,. \tag{3.45}
\]

For the specific case of $\rho$ used in Lloyd et al. (2015), the working then continues as from the result of equation (9) in that paper, and the elegant results that follow all still apply. Note that one could combine these Cox process approximations with the interdomain framework, and this could be a fruitful direction for further work.
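As a concrete illustration (our own sketch, not code from the thesis or from Lloyd et al.), for the square inverse link $\rho(f) = f^2$ used by Lloyd et al. (2015), the marginal expectation in (3.45) is available in closed form, $\mathbb{E}_{Q_x}[f(x)^2] = \mu(x)^2 + \sigma^2(x)$, so the integral over $X$ can be approximated on a grid:

```python
import numpy as np

def expected_intensity_integral(mu, var, xs):
    """Grid approximation of int_X E_{Q_x}[h(x)] dm(x) for h = f^2."""
    dx = xs[1] - xs[0]
    return dx * np.sum(mu ** 2 + var)  # E[f(x)^2] = mu(x)^2 + sigma^2(x)

# Stand-in marginal moments of the approximating process on X = [0, 1].
xs = np.linspace(0.0, 1.0, 500)
mu = np.sin(2.0 * np.pi * xs)
var = 0.1 * np.ones_like(xs)
print(expected_intensity_integral(mu, var, xs))
```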

3.7 Conclusion

In this chapter we have elucidated the connection between the variational inducing point framework (Titsias, 2009a) and a rigorously defined KL-divergence between stochastic processes. Early use of the rigorous formulation of KL-divergence in the Gaussian process literature was made by Seeger (2003a; 2003b). Here we have increased the domain of applicability of those proofs by allowing for inducing points that are not data points, and removing unnecessary dependence on Lebesgue measure. We would argue that our proof clarifies the central and elegant role played by Radon-Nikodym derivatives.

We apply these theorems to categorize a broad set of cases in the literature where increasing the number of inducing points is guaranteed to improve the approximation quality in terms of KL-divergence. We then consider the case where additional variables are added solely for the purpose of variational inference. We show that marginal consistency is not enough to guarantee that variational inference in an augmented space is equivalent to variational inference in the unaugmented space. This has negative implications for Titsias' original augmentation argument. Instead, we advocate the variational inducing random variable framework, which generalizes sparse approximations whilst maintaining the posterior process as its optimization target. We then show how the extended theory allows us to correctly handle principled interdomain sparse approximations and how we can cope correctly with the important case of Cox processes, where the likelihood depends on an infinite set of function points.

It seems reasonable to hope that elucidating the measure theoretic roots of the formulation will help the community to generalise the framework and lead to even better practical results. In particular, it seems that since interdomain inducing points are linear functionals, the theory of Hilbert spaces might profitably be applied here. It also seems reasonable to think, given the generality of sections 3.3.3 and 3.5, that other Bayesian and Bayesian nonparametric models might be amenable to such a treatment.

Chapter 4

Scalable Variational Gaussian Process Classification

4.1 Introduction

The Gaussian process framework (Rasmussen and Williams, 2006) has much to recommend it. As Bayesian nonparametric models (Orbanz and Teh, 2010), Gaussian processes simultaneously provide an extremely flexible model whilst allowing principled Bayesian regularization. In the case of regression with the conjugate Gaussian likelihood, many of the quantities of interest have analytic expressions in terms of basic linear algebra. Prior knowledge can be incorporated via the kernel. Unknown hyperparameters can be chosen using type-II likelihood maximization or a fully Bayesian treatment.

4.1.1 Non-conjugate likelihoods

A challenge is encountered when we try to go beyond the Gaussian likelihood to a non-conjugate likelihood. To be concrete, consider the class of models

\[
f \sim \mathcal{GP}(0, K(\theta)) \tag{4.1}
\]
\[
y \mid d, f \sim h(f(d)), \quad y, d \in Y, D \ \text{i.i.d.} \tag{4.2}
\]

Here $f : X \mapsto \mathbb{R}$ is a function from the index set $X$ to the set of real numbers $\mathbb{R}$. $K$ is the covariance function $K : X \times X \mapsto \mathbb{R}$, which for marginal consistency must be a positive semi-definite Mercer kernel. We follow the common convention of assuming a zero prior mean function for ease of exposition, but everything that follows can be straightforwardly generalized to the non-zero mean case. $\theta$ denotes the covariance hyperparameters, such as signal variance and characteristic length scale. In this chapter we will take point estimates of the hyperparameters, chosen using approximate type-II maximum likelihood. The extension to a Bayesian treatment of these parameters will be discussed in chapter 5. The data comes in $N$ pairs $y, d \in Y, D$. $h$ is some distribution over the possible outputs which depends on the value of the function at the input point. It is easy to include likelihood parameters, but these are omitted for brevity. (Note that some authors reserve the use of the term conjugate for priors only. We prefer to think of conjugacy as a relation that applies, or does not apply, to a likelihood/prior pair.)

In this non-conjugate case almost all required quantities require approximate integration, and there has been considerable literature devoted to this question. Despite these challenges, the elegant mathematical properties of the underlying prior on functions mean that the Gaussian process framework is still a compelling one in this domain.

4.1.2 Scalability challenges

In the case of regression, for a generic covariance function and input data set, the linear algebra necessary for posterior inference requires $\mathcal{O}(N^3)$ computation. Up to about 2000 data points this is not prohibitive with a modern desktop CPU, and there are plenty of interesting datasets that fall into this category. However, in modern machine learning, datasets are frequently much larger than this, and then the adverse computational scaling becomes prohibitive. Many of the approximation methods that are used for non-conjugate likelihoods inherit this adverse scaling, since the underlying prior on functions remains the same.

4.1.3 A specific realization: classification

Many of the insights offered in this chapter apply to a broad range of non-conjugate likelihoods; however, it will be helpful to work with a specific example, namely Gaussian process classification (GPC). Classification, particularly at scale, is very heavily studied in statistics and machine learning because the basic setup is so common in applications. For binary classification the canonical way to use a Gaussian process prior is first to map the function through an inverse link function $g$:

\[
g : \mathbb{R} \mapsto [0, 1] \tag{4.3}
\]

This has the effect of mapping the function from the real line to the unit interval so that it can represent a probability. The next stage is to sample a Bernoulli random variable corresponding to the class conditioned on the value of this contracted function:

\[
y \mid d, f \sim \mathcal{B}(g(f(d))) \tag{4.4}
\]

The importance of classification as an application has led to a significant body of research on $\mathcal{O}(N^3)$ approximate inference for Gaussian processes. See (Nickisch and Rasmussen, 2008) for an excellent review. The wealth of available comparison points makes it a natural starting point for any systematic study of scalable non-conjugate Gaussian process inference. The extension of our work to multiclass classification, which is important in applications, is discussed in section 4.4.
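For concreteness, here is a minimal sketch of the generative model (4.1)-(4.4) using the probit inverse link; the kernel settings and input locations are illustrative choices of ours, not values from the thesis:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def rbf_kernel(x1, x2, variance=1.0, lengthscale=0.3):
    return variance * np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / lengthscale ** 2)

# Draw f at the data inputs from the GP prior ...
d = np.sort(rng.uniform(0.0, 1.0, size=50))
K = rbf_kernel(d, d) + 1e-8 * np.eye(len(d))
f = np.linalg.cholesky(K) @ rng.standard_normal(len(d))

# ... map through g = Phi so the result lies in [0, 1], then draw Bernoulli labels.
p = norm.cdf(f)
y = rng.binomial(1, p)
```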

4.1.4 A summary of the challenge

This chapter addresses two challenges in GP inference:

1. Approximate integration in non-conjugate Gaussian process modelling, using classification as our main focus.

2. The adverse $\mathcal{O}(N^3)$ scaling with the number of input data points.

4.1.5 An outline for the rest of the chapter

This chapter will next situate our work by reviewing some background literature on addressing the two challenges highlighted (section 4.2). This review will be structured around the key features that we argue a successful method needs. We will then describe our proposed solution to this problem (section 4.3), which has the desired characteristics. This method follows naturally from the considerations of chapter 3, but we have written this chapter to require less technical mathematics. As we will discuss in section 4.4, the method has a natural extension to multiclass classification. Next (section 4.5), we will thoroughly evaluate our proposed method in a series of empirical experiments. The main comparison point is the popular Generalized FITC algorithm of Naish-Guzman and Holden (2008). In these comparison experiments, we find that our method has competitive performance. We then demonstrate that using stochastic variational inference enables us to obtain good results on the Airline delay dataset (which has roughly 5.8 million datapoints) and the MNIST digit classification dataset. Finally, we conclude (section 4.6).

4.2 Background

Here we describe the key components of our proposed solution to the above two challenges, giving a critical review of the history of each aspect.

4.2.1 Non-conjugate approximations for GPC

There is a large literature on approximating the non-Gaussian posteriors associated with Gaussian process classification. See (Nickisch and Rasmussen, 2008) for an excellent review. That paper finds that of the broad range of methods reviewed, Expectation Propagation (EP) (Minka, 2001), when applied to Gaussian process classification (Csató and Opper, 2002; Kuss and Rasmussen, 2005) gives the most accurate posterior mean and covariance predictions as well as the best proxy to the marginal likelihood for hyperparameter optimization. In their review, Nickisch and Rasmussen (2008) also reported that the variational method of Opper and Archambeau (2009) was similar but gave slightly less accurate estimates for both the posterior moments and marginal likelihood. In the implementation used in their comparison paper, the authors also found the variational approximation algorithm required significantly longer to terminate. There is no suggestion in the paper that this optimization issue is insurmountable. From a theoretical standpoint proving that the variational method will always converge is straightforward since it is constructed using a bounded objective function which is then optimized. By contrast whilst EP for GPC is usually observed to converge, a proof that this will always occur has so far eluded researchers. To summarize, Expectation Propagation for this problem can exhibit good empirical performance but some of the desirable theoretical foundation is currently lacking. A lack of theory can obviously make principled generalizations more difficult.

4.2.2 Sparse Gaussian processes

The sparse Gaussian process framework is a family of methods for approximating computationally expensive full GP inference. One informative way of understanding many of these methods is via the 'Unifying View' advocated by Quiñonero-Candela and Rasmussen (2005). Their starting point was the following exact relation for the GP prior:

\[
p(f_*, f_D) = \int p(f_*, f_D \mid f_Z)\, p(f_Z)\, df_Z \tag{4.5}
\]

where $f_*$ is a finite set of test points and $f_D$ corresponds to the function values at the data points. This is approximated by the density $r(f_*, f_D)$, which is defined in terms of some inducing outputs $f_Z$:

\[
p(f_*, f_D) \approx r(f_*, f_D) = \int r(f_* \mid f_Z)\, r(f_D \mid f_Z)\, p(f_Z)\, df_Z \,. \tag{4.6}
\]

Many sparse Gaussian process approximations correspond to using $r(f_*, f_D, f_Z)$ as an approximation to the prior and then proceeding as if this were the actual prior. These approximations differ in their choice of $r(f_* \mid f_Z)$ and $r(f_D \mid f_Z)$. They are all similar in that the additional assumed factorization corresponds to a conditional independence assumption.

Under the approximation, the function values at the data points $f_D$ and the function values at the test points $f_*$ are independent conditioned on $f_Z$. It is this assumption that gives a reduction in the complexity of inference from $\mathcal{O}(N^3)$ to $\mathcal{O}(NM^2)$. Two special cases will be particularly relevant here. Using the nomenclature and insight of Quiñonero-Candela and Rasmussen (2005), the Deterministic Training Conditional (DTC) approximation (Seeger, 2003a) corresponds to taking

\[
r_{\text{DTC}}(f_D \mid f_Z) = \mathcal{N}(f_D \mid K_{D,Z} K_{Z,Z}^{-1} f_Z, 0) \tag{4.7}
\]
\[
r_{\text{DTC}}(f_* \mid f_Z) = p(f_* \mid f_Z) \tag{4.8}
\]

and the Fully Independent Training Conditional (FITC) approximation of Snelson and Ghahramani (2005) corresponds to taking

\[
r_{\text{FITC}}(f_D \mid f_Z) = \mathcal{N}(f_D \mid K_{D,Z} K_{Z,Z}^{-1} f_Z, \operatorname{Diag}[K_{D,D} - K_{D,Z} K_{Z,Z}^{-1} K_{Z,D}]) \tag{4.9}
\]
\[
r_{\text{FITC}}(f_* \mid f_Z) = p(f_* \mid f_Z) \,. \tag{4.10}
\]

Here $\operatorname{Diag}$ is an operator that takes a square matrix and returns a square matrix with only the diagonal elements of the input. We see that DTC and FITC differ in their training conditional variance.
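The two training conditionals are easy to write down explicitly. The sketch below (our own illustration, not library or thesis code) builds the shared conditional mean matrix and the two conditional covariances of (4.7) and (4.9):

```python
import numpy as np

def training_conditionals(K_DD, K_DZ, K_ZZ):
    """Return the shared mean map and the DTC/FITC conditional covariances."""
    A = np.linalg.solve(K_ZZ, K_DZ.T).T           # K_DZ K_ZZ^{-1}; mean is A @ f_Z
    nystrom = A @ K_DZ.T                          # K_DZ K_ZZ^{-1} K_ZD
    cov_dtc = np.zeros_like(K_DD)                 # DTC: zero conditional covariance
    cov_fitc = np.diag(np.diag(K_DD - nystrom))   # FITC: exact marginal variances kept
    return A, cov_dtc, cov_fitc
```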

4.2.3 Subset of data methods and subset of data inducing methods

A subset of data method chooses some subset, of size $M$, of the data points and uses this cheaper posterior as a proxy for the real posterior. The simplest method is to choose the points at random. There are more principled methods for choosing data points: for instance, in the case of the Informative Vector Machine (Lawrence et al., 2003), the points are chosen using an information theoretic criterion. Clearly the point selection criterion itself needs to be tractable. However sophisticated a selection method is employed, Quiñonero-Candela and Rasmussen (2005), section 3, point out that such methods will tend to overestimate uncertainty relative to the exact posterior. This is because with fewer data points we will tend to be more uncertain about the underlying function. We would add that, in a similar way, they will tend to give marginal likelihood estimates that are less peaked around the maximum likelihood hyperparameter values than they should be.

A subset of data inducing method selects points from the input data set and treats them as inducing points rather than data points. Deferring discussion of the exact selection criterion to the next section, all such methods have in common that they require the solution or approximate solution of a difficult combinatoric optimization problem. As described by Snelson and Ghahramani (2005), section 1, this leads to difficulty in simultaneous optimization of continuous parameters such as kernel hyperparameters. The solution proposed by these authors is that the inducing points should be chosen from the continuous underlying index set $X$. This raises the question as to what happens if the underlying index set is in fact discrete (e.g. graphs). This issue is still a challenging one. For the most recent progress on this problem see (Cao et al., 2013). In the majority of applications, however, the covariates are continuous or have a reasonable continuous relaxation. In such applications continuous inducing point optimization has proved essential to good results.

Notable for our discussion is the work of Seeger (2003b), Titsias (2009a) section 3.1, and Chai (2012), who employ the variational criterion for discrete inducing point selection. In the last of these papers it is not clear why the authors did not use continuous optimization. We speculate this may be a symptom of the residual confusion, resolved in chapter 3, around what the optimization objective actually was and which inducing points were admissible.

4.2.4 A coherent variational framework

In 2009 an alternative view emerged that did not fit into the prior approximation framework proposed by Quiñonero-Candela and Rasmussen (2005). Titsias (2009a) proposed fitting the posterior approximation using a KL-divergence criterion. In chapter 3 we showed that this criterion can be interpreted as the KL-divergence between processes, and argued that this has some benefits over Titsias' original 'augmentation' argument. The approach may therefore be interpreted as defining a variational family $\mathcal{Q}$ of approximating processes $Q$ and choosing one according to KL-divergence minimization. For fixed hyperparameters and inducing points, regression with a Gaussian likelihood in this framework has the same predictive mean and variance as the DTC approximation (Titsias, 2009a). We postpone a more detailed discussion of this to section 5.2 on 'variational potentials'.

Those who adopt the prior approximation view typically treat the inducing point positions as covariance hyperparameters and optimize them using type-II maximum likelihood. The DTC prior approximation has the following marginal likelihood

\[
\log r_{\text{DTC}}(Y \mid \theta) = \log \mathcal{N}(Y \mid 0, \sigma^2 I + K_{D,Z} K_{Z,Z}^{-1} K_{Z,D}) \tag{4.11}
\]

where $\sigma^2$ is the noise variance associated with the Gaussian likelihood and the notation convention for kernel matrices is the same as the one introduced in section 1.2.3. For regression, Titsias showed that the variational lower bound on the exact marginal likelihood takes the form

\[
\mathcal{L} = \log \mathcal{N}(Y \mid 0, \sigma^2 I + K_{D,Z} K_{Z,Z}^{-1} K_{Z,D}) - \frac{1}{2\sigma^2} \operatorname{Tr}\left( K_{D,D} - K_{D,Z} K_{Z,Z}^{-1} K_{Z,D} \right) \tag{4.12}
\]

which differs from the DTC approximation to the marginal likelihood in that it contains a second 'trace' term. It is precisely this additional term that changes the inducing point positions to be variational parameters instead of hyperparameters. Put another way, in the variational framework the inducing point positions will be chosen to move the approximation closer to the posterior. In the prior approximation view the inducing point positions are chosen to maximize the likelihood of the data under the approximate model. In the latter case there is no guarantee that increasing the number of inducing points will bring the approximation closer to the posterior, and one might expect problems as the number of inducing points increases. In fact Snelson and Ghahramani (2005) and later Naish-Guzman and Holden (2008) observe that FITC has heteroscedastic behaviour which is not present in the original model. This has sometimes even been argued to be a benefit of such an approach, citing better test likelihoods on certain datasets. This seems, in our opinion, to have been making a virtue of a necessity. Of course, if this was indeed still a necessity then making a virtue of it would be a sensible and pragmatic thing to do! However, now that methods exist which do ensure that extra computation will bring the approximation closer to the posterior, we would argue that the right way to introduce extra modelling assumptions, if they are genuinely wanted, would be to put them into the prior and likelihood in the standard Bayesian way.

In this thesis we have generally avoided the 'lower bound' style derivations that are commonly encountered in other works. This is deliberate because we feel that such arguments can obscure the crucial question of the form that the slack in the bound takes. For all the applications we consider we want a method that not only gives a tight bound $\mathcal{L}$ on the marginal likelihood but also, for fixed hyperparameters, gives a good approximation to the posterior. This second criterion is difficult to guarantee unless the slack in the bound takes the form of a KL-divergence from an approximating distribution to the posterior:

\[
\mathcal{L} + \mathrm{KL}[Q \,\|\, \hat{P}] = \log p(Y) \,. \tag{4.13}
\]

Some of the bounds used in the literature are known to have this property and some of them are not. As we have discussed in some detail, the bound (4.12) does have this property. Although it would be interesting to consider necessary theoretical conditions for such a relation existing, we have not yet pursued this avenue. In the current context, two important examples of methods that are not known to obey the relation in equation (4.13), and that I conjecture do not, are the non-Gaussian likelihood approximation scheme of Hensman et al. (2013), section 3.4, and the 'two stage' mean field approach of Hensman et al. (2015b), section 3. The approximation of Hensman et al. (2013), section 3.4, was suggested as a generalization of the main regression method of the paper to non-Gaussian likelihoods, but not implemented. It has since been found to give poor empirical performance (Owen Thomas, personal communication). The 'two stage' mean field approach of Hensman et al. (2015b), section 3, was found in that paper to be poorly calibrated relative to other relevant methods.

Of course, an easy way to guarantee that a relation like (4.13) is obeyed is to use $\mathrm{KL}[Q \,\|\, \hat{P}]$ as the starting point. It is this approach that we investigate in section 4.3, where the method has a relatively clear derivation and good empirical performance.
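To make the comparison of (4.11) and (4.12) concrete, the following sketch (our own illustration, not thesis code) computes both objectives for given kernel matrices under a zero mean prior; the only difference is the trace penalty, which is what makes $Z$ a variational parameter rather than a hyperparameter:

```python
import numpy as np
from scipy.stats import multivariate_normal

def dtc_and_vfe(Y, K_DD, K_DZ, K_ZZ, noise_var):
    """Return (log r_DTC(Y | theta), Titsias bound L) for a zero mean prior."""
    Qff = K_DZ @ np.linalg.solve(K_ZZ, K_DZ.T)    # Nystrom matrix K_DZ K_ZZ^{-1} K_ZD
    cov = noise_var * np.eye(len(Y)) + Qff
    log_dtc = multivariate_normal(mean=np.zeros(len(Y)), cov=cov).logpdf(Y)
    trace_term = np.trace(K_DD - Qff) / (2.0 * noise_var)
    return log_dtc, log_dtc - trace_term          # equations (4.11) and (4.12)
```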

4.2.5 Stochastic variational inference

For the largest datasets, even reducing the computational intensity to $\mathcal{O}(N)$ per iteration is still prohibitively slow. Stochastic variational inference (SVI) (Hoffman et al., 2013) has proved to be a particularly impactful and successful approach to scalable Bayesian inference. The essential idea behind SVI is that, in following the gradient of the variational ELBO, a noisy but unbiased estimate obtained from minibatches is often sufficient to give a good update of the variational parameters. Robbins and Monro (1951) characterized conditions on the stepsize under which a first order stochastic gradient method would converge, and there has been a significant literature on this topic since. These elegant convergence properties are essential to the success of the algorithm. For machine learning algorithms in general, one could envisage many ways to use a minibatch as an approximation for a full pass over the data, but relatively few of them would have this desired convergence guarantee.

The coherent variational inference framework provided by Titsias (2009a) raised the possibility that all free parameters could be updated in such a way whilst maintaining convergence. An obstacle existed, however, in the sense that Titsias' variational approximation could not be written as a sum over data terms, a requirement necessary to apply minibatches. This challenge was solved by Hensman et al. (2013), who recast the problem in a way amenable to stochastic inference.

Table 4.1 A summary of the most relevant existing approaches. The presence of the question mark for (Hensman et al., 2013) is explained in section 4.2.4. SOD stands for 'subset of data'.

Reference                             Non-conjugate  Sparse  Variational  Beyond SOD  SVI
Snelson and Ghahramani (2005)         ✗              ✓       ✗            ✓           ✗
Titsias (2009a)                       ✗              ✓       ✓            ✓           ✗
Hensman et al. (2013)                 ✗              ✓       ✓            ✓           ✓
Hensman et al. (2013) (suggested)     ✓              ✓       ?            ✓           ✓
Lawrence et al. (2003)                ✓              ✓       ✗            ✓           ✗
Chai (2012)                           ✓              ✓       ✓            ✗           ✗
Naish-Guzman and Holden (2008)        ✓              ✓       ✗            ✓           ✗
Seeger (2003a)                        ✓              ✓       ✓            ✗           ✗
This Work                             ✓              ✓       ✓            ✓           ✓

4.2.6 Bringing the components together

In order to help the reader keep track of the relevant existing work, we have summarized which papers possess which of the features in table 4.1. Whilst we think that, in light of the complexity of the existing literature, such a summary is helpful, we would caution the reader against the conclusion that achieving such an ideal combination of features is a matter of easily patching together existing approaches.

Prior to this work, the most commonly adopted method for approximating Gaussian process classification at scale was to take the approximate prior of FITC and then perform EP to deal with the non-conjugate observations. Careful bookkeeping keeps the computational scaling to $\mathcal{O}(NM^2)$. This is essentially the approach taken by Naish-Guzman and Holden (2008) with their Generalized FITC approximation (GFITC). Whilst one would hope to inherit the good classification performance of EP using this method, one will also inherit the less principled FITC approach to sparsity. By contrast, as we shall see, the approach we adopt solves the inducing point selection problem, the non-conjugacy and the minibatch requirement all using the same unifying variational principle. Taking a step back from all this technical detail, our approach has enabled us to give good results for approximate GPC on much larger and more challenging classification datasets than all previous published work on the topic.

4.3 Sparse variational Gaussian process classification

In this section we describe our approach to the sparse GPC problem. The derivation we give here is similar to that given in chapter 3. It is sufficiently important for understanding our approach in this chapter that we risk some repetition. The exposition here is less formal and general, but is hopefully more intuitive. We assume that the data points $D$, the inducing points $Z$ and the test set $*$ are disjoint for ease of exposition, but this can be relaxed as in chapter 3.

The Kolmogorov extension theorem, introduced in section 1.1.1, tells us that specifying the finite dimensional marginal distributions of a process is sufficient to specify most relevant questions about the whole process. Here we work in terms of the densities of these finite marginal distributions. The prior process can be decomposed in several ways using the chain rule:

\[
p(f_*, f_D, f_Z) = p(f_* \mid f_D, f_Z)\, p(f_D \mid f_Z)\, p(f_Z) \tag{4.14}
\]
\[
= p(f_* \mid f_D, f_Z)\, p(f_Z \mid f_D)\, p(f_D) \tag{4.15}
\]

The posterior process has similar decompositions:

\[
p(f_*, f_D, f_Z \mid Y) = \frac{p(Y \mid f_D)\, p(f_* \mid f_D, f_Z)\, p(f_D \mid f_Z)\, p(f_Z)}{p(Y)} \tag{4.16}
\]
\[
= \frac{p(Y \mid f_D)\, p(f_* \mid f_D, f_Z)\, p(f_Z \mid f_D)\, p(f_D)}{p(Y)} \tag{4.17}
\]

Any function values for inputs that are not data inputs are conditionally independent of the output data $Y$ given the data function values $f_D$. This means Bayes' theorem applies 'locally' to the data function values $f_D$, and the distribution for the rest of the function values is then obtained by multiplying by the prior conditional distributions:

\[
p(f_*, f_D, f_Z \mid Y) = p(f_* \mid f_D, f_Z)\, p(f_Z \mid f_D)\, p(f_D \mid Y) \,. \tag{4.18}
\]

Now we define our approximating family $\mathcal{Q}$. A member of the family is fully specified by the choice of inducing points $Z$ and a distribution over the inducing point function values $q(f_Z)$, which is taken from some family of distributions $\mathcal{Q}_Z$. The distribution is given by:

\[
q(f_*, f_D, f_Z) = p(f_* \mid f_D, f_Z)\, p(f_D \mid f_Z)\, q(f_Z) \tag{4.19}
\]

Note that we multiply by the prior conditionals to obtain the full joint distribution from $q(f_Z)$. This is why we use the term 'prior conditional matching' to describe this type of variational inference, as in section 2.2.4. The relationship between the variational approximating distribution and the prior distribution can best be seen by comparing equations (4.14) and (4.19): the prior density $p(f_Z)$ has been replaced by $q(f_Z)$. The relationship between the variational approximating distribution and the posterior distribution is made clear by comparing equations (4.18) and (4.19). We see the similarity if we both interchange the roles of $f_Z$ and $f_D$ and substitute $q(f_Z)$ for $p(f_D \mid Y)$. In this sense $q(f_Z)$ may be thought of as a local 'pseudo posterior' where both the 'pseudo inputs' and the pseudo posterior itself will be tuned variationally to match the exact posterior as closely as possible.

In chapter 3, theorem 2.i, we showed that the KL-divergence $\mathcal{K}$ from the approximating process to the posterior process is given by

\[
\mathcal{K} = \mathrm{KL}[q(f_Z) \,\|\, p(f_Z)] - \mathbb{E}_{q(f_D)}[\log p(Y \mid f_D)] + \log p(Y) \,. \tag{4.20}
\]

Since for classification the likelihood factorizes, the expression simplifies:

\[
\mathcal{K} = \mathrm{KL}[q(f_Z) \,\|\, p(f_Z)] - \sum_{y,d \in Y,D} \mathbb{E}_{q(f_d)}[\log p(y \mid f_d)] + \log p(Y) \,. \tag{4.21}
\]

In passing, we note for those readers who find it more intuitive that this equation can also be rearranged to give an evidence lower bound (ELBO):

\[
\log p(Y) = \mathcal{K} + \sum_{y,d \in Y,D} \mathbb{E}_{q(f_d)}[\log p(y \mid f_d)] - \mathrm{KL}[q(f_Z) \,\|\, p(f_Z)]
\]
\[
\geq \sum_{y,d \in Y,D} \mathbb{E}_{q(f_d)}[\log p(y \mid f_d)] - \mathrm{KL}[q(f_Z) \,\|\, p(f_Z)] \tag{4.22}
\]

where we have applied the non-negativity of KL-divergences. We also note that, as variational methods, the proposed training objectives of Seeger (2003b) and Chai (2012) are similar to the bound (4.22). We refer the reader to section 4.2 and table 4.1 for a comparison with our own method.

In this chapter, we will take the variational family of inducing point distributions $\mathcal{Q}_Z$ to be the set of all non-degenerate multivariate normal distributions on $\mathbb{R}^M$:

\[
q(f_Z) = \mathcal{N}(f_Z \mid m, \Sigma) \,. \tag{4.23}
\]

Table 4.2 A comparison of the time complexity of various computational operations on the density of a generic function $f_T$, where $T$ are not inducing points, for the variational approximation, the prior and a Gaussian posterior. $|T|$ is the number of points in $T$, $N$ is the number of data points and $M$ is the number of inducing points. The prior mean function is assumed to be zero.

Target      $q(f_T)$                                        $p(f_T)$              $p(f_T \mid Y)$
Mean        $\mathcal{O}(|T| M + M^3)$                      $0$                   $\mathcal{O}(|T| N + N^3)$
Covariance  $\mathcal{O}(|T|^2 M + |T| M^2 + M^3)$          $\mathcal{O}(|T|^2)$  $\mathcal{O}(|T|^2 N + |T| N^2 + N^3)$
Sampling    $\mathcal{O}(|T|^3 + |T|^2 M + |T| M^2 + M^3)$  $\mathcal{O}(|T|^3)$  $\mathcal{O}(|T|^3 + |T|^2 N + |T| N^2 + N^3)$

This is a subset of all possible distributions on $\mathbb{R}^M$ and is not in general optimal in the sense of minimizing the KL-divergence. The efficacy of this Gaussian assumption will be probed in chapter 5, where we investigate a method that relaxes it.

It will be constructive to consider the approximating density on a general set $T$ of size $|T|$, with the only assumption being that $T$ is finite and disjoint from the inducing input set. We can use the general insights gained to consider the case where $T$ is the data set $D$ or a test set $*$, etc. Using the expression $q(f_T) = \int_{\mathbb{R}^M} p(f_T \mid f_Z)\, q(f_Z)\, df_Z$ and a common Gaussian identity (Appendix D.1), we can marginalize to obtain the implied joint approximating density for $f_T$:

\[
q(f_T) = \mathcal{N}(f_T \mid K_{T,Z} K_{Z,Z}^{-1} m,\; K_{T,T} - K_{T,Z} K_{Z,Z}^{-1} K_{Z,T} + K_{T,Z} K_{Z,Z}^{-1} \Sigma K_{Z,Z}^{-1} K_{Z,T}) \,. \tag{4.24}
\]

From this density we can inspect the linear algebra to ascertain the time complexity. The results are summarized in table 4.2, along with the corresponding requirements for the prior and a Gaussian posterior.
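A direct implementation of (4.24) makes the complexity claims in table 4.2 transparent: every inverse involves only the $M \times M$ matrix $K_{Z,Z}$. The sketch below is our own illustration, not thesis code:

```python
import numpy as np

def marginal_q(K_TZ, K_TT, K_ZZ, m, Sigma):
    """Mean and covariance of q(f_T) in equation (4.24)."""
    A = np.linalg.solve(K_ZZ, K_TZ.T).T         # K_TZ K_ZZ^{-1}: O(M^3 + |T| M^2)
    mean = A @ m
    cov = K_TT - A @ K_TZ.T + A @ Sigma @ A.T   # O(|T|^2 M + |T| M^2)
    return mean, cov
```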

We need to parameterize the normal variational distribution $q(f_Z)$. We parameterize the variational mean $m$ directly; the challenge arises in parameterizing $\Sigma$. Ideally we would like a parameterization that is unconstrained and that guarantees $\Sigma$ will be positive definite. Both criteria are fulfilled by taking the variational parameters to be a lower triangular matrix $L$ where

\[
L L^{T} = \Sigma \,. \tag{4.25}
\]

$\Sigma$ is then positive semi-definite by construction. The degenerate cases with zero eigenvalues will be avoided in practice because they have an infinite KL-divergence. Thus we can assume that our parameterization is effectively positive definite. This is of course closely related to the Cholesky decomposition of the positive definite matrix $\Sigma$. Generally, however, the Cholesky decomposition is assumed to have strictly positive diagonal entries, which makes it unique. Here we will relax this assumption, gaining an unconstrained optimization at the expense of uniqueness. This method is found to work well in practice.

We will now discuss in more detail each term on the right hand side of (4.21). The log marginal likelihood $\log p(Y)$ is considered intractable. In the non-conjugate case this would be because it is a difficult, non-analytic, high dimensional integral. In the Gaussian regression case the integral would be analytically tractable in terms of common linear algebra quantities but would require $\mathcal{O}(N^3)$ computation. Either way, as is common in variational inference, we exploit the fact that this term is constant with respect to the variational parameters and thus can simply ignore it for the purpose of choosing the variational approximation.

Next, we consider the prior KL-divergence term. Since the prior and approximating distributions over $f_Z$ are both multivariate normal, the KL-divergence between them is analytically tractable:

\[
\mathrm{KL}[q(f_Z) \,\|\, p(f_Z)] = \frac{1}{2} \left( \operatorname{Tr}(K_{Z,Z}^{-1} \Sigma) + m^{T} K_{Z,Z}^{-1} m - M + \log \frac{\det K_{Z,Z}}{\det \Sigma} \right) \tag{4.26}
\]
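A sketch of how (4.25) and (4.26) combine in practice (our own illustrative code, not from the thesis): the unconstrained lower triangular parameterization makes $\log \det \Sigma$ cheap, since $\det \Sigma = (\prod_i L_{ii})^2$:

```python
import numpy as np

def prior_kl(m, L, K_ZZ):
    """KL[q(f_Z) || p(f_Z)] of (4.26), with Sigma parameterized as L L^T."""
    M = len(m)
    Sigma = L @ L.T
    trace = np.trace(np.linalg.solve(K_ZZ, Sigma))
    mahal = m @ np.linalg.solve(K_ZZ, m)
    # abs() because we do not constrain diag(L) to be positive.
    logdet_Sigma = 2.0 * np.sum(np.log(np.abs(np.diag(L))))
    logdet_K = 2.0 * np.sum(np.log(np.diag(np.linalg.cholesky(K_ZZ))))
    return 0.5 * (trace + mahal - M + logdet_K - logdet_Sigma)
```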

The prior KL term (4.26) and its relevant gradients can be evaluated in $\mathcal{O}(M^3)$.

Finally we deal with the sum over likelihood terms. First we need the one dimensional marginal distribution $q(f_d)$ for each $d \in D$. This is a special case of equation (4.24), and computing the mean and variance would thus individually require $\mathcal{O}(M^3)$ computation. However, we can share the $\mathcal{O}(M^3)$ between data points, so computing all $N$ means and variances actually requires $\mathcal{O}(M^3 + NM^2)$. Once we have computed the moments of the marginal distributions we need to evaluate the expectation

\[
\int_{\mathbb{R}} q(f_d) \log p(y \mid f_d)\, df_d \tag{4.27}
\]

for each data point. As a one dimensional quadrature, this is much easier than the high dimensional integrals encountered in non-analytic marginal likelihoods. There has been a recent fashion for approximating such integrals and their gradients with unbiased stochastic estimates. The simplest such method is a naïve Monte Carlo estimate that samples from $q(f_d)$. A good reference among many is that of Titsias and Lázaro-Gredilla (2014). The Robbins and Monro (1951) convergence theorem will tolerate variation in the sampled gradients, as long as they are unbiased. However, in our experiments we found that Gauss-Hermite quadrature performed better for the same number of function evaluations. Technically, since our functions do not necessarily fulfil the requirements necessary for zero quadrature error, this may introduce bias into the optimization. In reality, though, the bias is negligible, in that orders of magnitude more Monte Carlo function evaluations than Gauss-Hermite evaluations would be required to observe the bias beneath the Monte Carlo noise. This insight was later extended to two dimensions and quantified in the supplementary material of Saul et al. (2016). We also need the gradients of the likelihood quadrature. We find that the most numerically stable method is to exploit the identities (D.6) and (D.7) described in Appendix D.2.

Bringing all the terms in (4.21) together, evaluation of the optimization objective and its gradients requires $\mathcal{O}(M^3 + NM^2)$ per iteration, which is less than the $\mathcal{O}(N^3)$ of non-sparse approximations when $M \ll N$. However, if $N$ is sufficiently large, then even this reduction may not be sufficient, since each optimization step requires evaluating the gradient of a likelihood quadrature for each data point. However, there will often be significant redundancy in computing these gradient terms, and in these circumstances a minibatch sample of the sum will give a representative and unbiased estimate of the gradient. The convergence of simple stochastic gradient descent is then again covered by the theory of Robbins and Monro (1951), although we use the more modern AdaDelta algorithm (Zeiler, 2012), which maintains an online estimate of the second order curvature of the problem. The ease with which it is possible to perform this stochastic variational inference (Hoffman et al., 2013), once the approximation problem has been cast in terms of equation (4.21), is a big advantage of this method relative to other possible approaches to scaling approximate inference in GPs.

Once the training phase is complete we will want to make predictions on hold out data. If we consider a single test input $*$ and its corresponding function value $f_*$, then the necessary predictive density is given by:

\[
q(y_*) = \int_{\mathbb{R}} p(y_* \mid f_*)\, q(f_*)\, df_* \,. \tag{4.28}
\]
Note that, in contrast to the quadrature in equation (4.27), we do not take the log of the likelihood inside the expectation. For the probit classification likelihood this integral is tractable. For other likelihoods we may need to again use Gauss-Hermite quadrature. If we have $l$ different input points in the test set it can be costly to obtain a correlated density $q(f_*)$, since this is again a special case of equation (4.24), with time complexity described by table 4.2. Consequently, computing a correlated predictive density $q(Y_*)$ also scales badly as $l$ increases. In all experiments for the classification case we therefore simplify predictions by treating them independently. The time complexity then scales as $\mathcal{O}(M^3 + lM^2)$.
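A sketch of the Gauss-Hermite evaluation of the one dimensional expectation (4.27) (our own illustration, not thesis code): with the change of variable $f = \mu + \sqrt{2\sigma^2}\, x$, the expectation becomes a weighted sum over the quadrature points. We use the probit Bernoulli likelihood with labels in $\{-1, +1\}$ as the example:

```python
import numpy as np
from scipy.stats import norm

def expected_log_lik(y, mu, var, n_points=20):
    """Gauss-Hermite estimate of E_{N(f | mu, var)}[log p(y | f)]."""
    x, w = np.polynomial.hermite.hermgauss(n_points)
    f = mu + np.sqrt(2.0 * var) * x
    log_p = norm.logcdf(y * f)  # probit likelihood, y in {-1, +1}
    return np.sum(w * log_p) / np.sqrt(np.pi)

print(expected_log_lik(y=1, mu=0.5, var=2.0))
```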

4.4 Multiclass classification

We now consider the extension to multiclass classification. Consider the task of classifying J different classes. The model we adopt here is:

\[
f^{(j)} \sim \mathcal{GP}(0, K) \quad j = 1, \ldots, J \ \text{i.i.d.} \tag{4.29}
\]
\[
y \mid d, f_d \sim \mathrm{Cat}(\mathcal{S}\{f_d\}) \quad y, d \in Y, D \tag{4.30}
\]

Here $\mathrm{Cat}$ is a categorical distribution. $f_d = (f_d^{(j)})_{j=1}^{J}$ is the vector of relevant function values. The function $\mathcal{S}$ is a mapping from $\mathbb{R}^J$ to the relevant probability simplex.

We assume that the variational approximation factorizes between classes. The efficacy of this assumption will be discussed in chapter 5, when we exhibit a method that relaxes it. We can follow a derivation similar to the one in section 4.3 to obtain the KL-divergence:

\[
\mathcal{K} = \sum_{j=1}^{J} \mathrm{KL}[q(f_Z^{(j)}) \,\|\, p(f_Z^{(j)})] - \sum_{y,d \in Y,D} \mathbb{E}_{q(f_d)}[\log p(y \mid f_d)] + \log p(Y) \,. \tag{4.31}
\]

The prior KL-divergence terms can be evaluated much as before. A new challenge is encountered in evaluating the variational expectations of the log likelihood, because the quadrature is now multidimensional. Consider for instance choosing the 'softmax' function in the role of the function $\mathcal{S}$:

\[
\mathcal{S}_{\text{softmax}}(f_d)_i = \frac{\exp(f^{(i)})}{\sum_{j=1}^{J} \exp(f^{(j)})} \tag{4.32}
\]

We could also call this the multiclass generalization of the logit likelihood. Chai (2012) goes in this direction and uses a complex set of local variational bounds to approximate the integral. We found it much simpler algorithmically to take a different approach, using instead the robust max function (Girolami and Rogers, 2006):

\[
\mathcal{S}_{\text{robust}}^{(\epsilon)}(f_d)_i =
\begin{cases}
1 - \epsilon, & \text{if } i = \arg\max(f_d) \\
\frac{\epsilon}{J - 1}, & \text{otherwise}
\end{cases} \tag{4.33}
\]

This likelihood corresponds to first taking the class to be the class label of the maximal function value, but then with probability $\epsilon$ picking one of the other classes uniformly at random. The parameter $\epsilon$ serves two purposes. Firstly, if it were zero then the variational expectations would be infinite, involving as they do the logarithm of zero. Secondly, as the name suggests, the parameter gives a degree of robustness to outliers. The addition of noise to the latent function results in multinomial probit like behaviour (Girolami and Rogers, 2006). A big benefit of choosing the robust max function is that the requisite variational expectation is analytically tractable in terms of the normal cdf and a one dimensional quadrature. We have the following relation:

\[
\int_{\mathbb{R}} q(f_d) \log(\mathcal{S}_{\text{robust}}^{(\epsilon)}(f_d)_y)\, df_d = \log(1 - \epsilon)\, S + \log\left( \frac{\epsilon}{J - 1} \right) (1 - S) \tag{4.34}
\]

where $S$ is interpretable as the probability that the function value corresponding to the observed class $y$ is larger than the other function values at that point. Utilizing some basic manipulation and the independence of the variational distribution, the corresponding integral can be simplified:

\[
S = \mathbb{E}_{f_d \sim \mathcal{N}(f_d \mid \mu, C)}\left[ \mathbb{1}(f_d^{(y)} > f_d^{(i)} \ \forall i \neq y) \right] \tag{4.35}
\]
\[
= \mathbb{E}_{f_d \sim \mathcal{N}(f_d \mid \mu, C)}\left[ \prod_{i \neq y} \mathbb{1}(f_d^{(y)} > f_d^{(i)}) \right] \tag{4.36}
\]
\[
= \mathbb{E}_{f_d^{(y)} \sim \mathcal{N}(f_d^{(y)} \mid \mu^{(y)}, C^{(y)})}\left[ \prod_{i \neq y} \mathbb{E}_{\mathcal{N}(f_d^{(i)} \mid \mu^{(i)}, C^{(i)})}\left[ \mathbb{1}(f_d^{(y)} > f_d^{(i)}) \right] \right] \tag{4.37}
\]
\[
= \mathbb{E}_{f_d^{(y)} \sim \mathcal{N}(f_d^{(y)} \mid \mu^{(y)}, C^{(y)})}\left[ \prod_{i \neq y} \phi\left( \frac{f_d^{(y)} - \mu^{(i)}}{\sqrt{C^{(i)}}} \right) \right] \tag{4.38}
\]

Here, $\mu$ and $C$ correspond to the variational mean and variance of the data point function vector $f_d$. $\phi$ corresponds to the standard normal cumulative density function. This one dimensional integral can be evaluated using Gauss-Hermite quadrature.
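The whole robust max expectation is therefore a few lines of code. The sketch below (our own illustration, not thesis code) evaluates $S$ via (4.38) with Gauss-Hermite quadrature and then returns the expected log likelihood (4.34):

```python
import numpy as np
from scipy.stats import norm

def robustmax_expected_log_lik(y, mu, var, eps=1e-3, n_points=20):
    """mu, var: per class marginal means/variances of q(f_d); y: observed class."""
    x, w = np.polynomial.hermite.hermgauss(n_points)
    f_y = mu[y] + np.sqrt(2.0 * var[y]) * x        # quadrature points for f^(y)
    others = [i for i in range(len(mu)) if i != y]
    # Product over i != y of Phi((f^(y) - mu_i) / sqrt(C_i)), equation (4.38).
    cdfs = np.prod([norm.cdf((f_y - mu[i]) / np.sqrt(var[i])) for i in others], axis=0)
    S = np.sum(w * cdfs) / np.sqrt(np.pi)
    J = len(mu)
    return np.log(1.0 - eps) * S + np.log(eps / (J - 1)) * (1.0 - S)  # (4.34)

print(robustmax_expected_log_lik(0, np.array([1.0, -0.5, 0.2]),
                                 np.array([0.3, 0.4, 0.5])))
```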

4.5 Experiments

4.5.1 A critical comparison of the FITC and VFE methods for regression data

In this section we give experimental evidence for the benefits of a coherent variational framework, which we advocated in section 4.2.4. We do this by making a critical comparison of the Titsias variational free energy (VFE) method (2009a) with the FITC method of Snelson and Ghahramani (2005). Before moving on to non-conjugate approximations, we choose to investigate the case of regression, to remove one layer of complexity from the analysis.

In section 3.3.4 we showed, for a general class of variational methods with fixed kernel hyperparameters that includes the Titsias method for regression, that increasing the number of inducing points would not increase the KL-divergence. The Titsias regression method has a zero KL-divergence solution when the number of inducing points equals the number of data points. Further, this optimal solution will give predictions that exactly match the true posterior. The situation is a little more complicated if, as is usually done in practice, the kernel hyperparameters are allowed to vary. The KL-divergence result tells us that the evidence lower bound becomes increasingly tight as we increase the number of inducing points until there is no slack in the bound.

Let us contrast this to the FITC method for regression of Snelson and Ghahramani (2005). As pointed out in the original paper, the FITC approximation to the marginal likelihood is correct when the inducing inputs are equal to the true inputs. Further, when the inducing points are in these positions the predictions will be exactly those of the true posterior. We shall refer to this as the 'perfect' solution. But is this solution stable? That is, is it a global or even local minimum of the minimization objective? Here we show empirically that the perfect solution is not a stable stationary point of the FITC optimization objective. That is to say, in practice optimizing the FITC objective function can move us away from the perfect solution even if we initialize there. To do this we initialize both the VFE and FITC methods at the 'true posterior' solution and then optimize them for a considerable length of time. Initializing at the perfect solution deals with the common critique of such experiments that bad performance is due to a 'poor local optimum' that could allegedly be overcome with better initialization and optimization. Here the poor solutions attained for FITC will have demonstrably lower values of the FITC minimization objective than the perfect solution, and the pathology will therefore not be a result of poor initialization.

The experiments in this section were all done using GPflow (chapter 6). We use the Snelson dataset (2005), also discussed in section 1.2.3. We take a quarter of the original points as training data. We use an RBF kernel. We use point estimates of the kernel variance, kernel length scale and noise variance, chosen by type-II maximum likelihood. The predictions from this exact model are shown in figure 4.1. Figure 4.2 shows the results of the sparse experiments. The Variational Free Energy method stays at the perfect solution, as is required by theory. We attribute any small deviation to numerical considerations in the code. The FITC algorithm initially stays near the perfect solution but, proceeding in relatively sudden jumps, it moves away from this solution.


Fig. 4.1 The version of the Snelson dataset used for our experiments. The red points show the training data. The green lines show the posterior mean and two standard deviation credible regions for the observations using a non-sparse model. This exact model, which is described in the text, serves as a reference point for comparison with the two sparse approximations.

The predictive plot shows that this solution gives a fairly reasonable value for the predictive mean but a distorted predictive variance. The optimization plot shows that this poor predictive variance has a significant negative effect on the hold out negative log likelihood. Looking at the middle row, it appears that the effect of following the FITC objective is to cause the inducing inputs to be placed on top of one another. The initial gradient that carries the FITC approximation away from the perfect solution is very small; we speculate that this is why the effect has not always been fully observed or fully understood in the past.
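For reference, the two minimization objectives being compared can be written down compactly from their standard forms. The following is a deliberately naive O(N³) numpy sketch for illustration only, assuming a zero mean function and a callable kern(A, B) returning cross-covariance matrices; it is not the GPflow code used for the experiments.

```python
import numpy as np

def sparse_regression_objectives(X, y, Z, kern, sigma2, jitter=1e-8):
    """Negative VFE bound and FITC approximate negative log marginal likelihood."""
    N = X.shape[0]
    Knn_diag = np.diag(kern(X, X))
    Kmm = kern(Z, Z) + jitter * np.eye(Z.shape[0])
    Kmn = kern(Z, X)
    # Q_nn = K_nm K_mm^{-1} K_mn, the Nystrom approximation to K_nn.
    L = np.linalg.cholesky(Kmm)
    A = np.linalg.solve(L, Kmn)          # A^T A = Q_nn
    Qnn = A.T @ A

    def neg_log_gauss(cov):
        # Negative log density of N(y | 0, cov).
        _, logdet = np.linalg.slogdet(cov)
        return 0.5 * (logdet + y @ np.linalg.solve(cov, y) + N * np.log(2 * np.pi))

    # Titsias (2009a) bound: DTC marginal likelihood plus a trace penalty.
    neg_vfe = neg_log_gauss(Qnn + sigma2 * np.eye(N)) \
        + 0.5 / sigma2 * np.sum(Knn_diag - np.diag(Qnn))
    # FITC: Q_nn with its diagonal corrected back to the exact diagonal.
    fitc_cov = Qnn + np.diag(Knn_diag - np.diag(Qnn)) + sigma2 * np.eye(N)
    neg_fitc = neg_log_gauss(fitc_cov)
    return neg_vfe, neg_fitc
```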

4.5.2 Demonstrative data and timed comparison

In this experimental section we performed timed binary classification experiments on the Image and Banana datasets (Rätsch et al., 2001), comparing the proposed variational algorithm to the Generalized FITC method with a variety of inducing point numbers. The Image dataset has 1300 training points, 1000 test points and 18 dimensional covariates. The Banana dataset has 400 training points, 4900 test points and only two dimensional covariates, which makes it easy to visualize the results.

For experiments using the proposed variational method, which we refer to as KLSp, we used code which extended GPy (2012). The inducing inputs were initialized using K-means clustering on the input data. We used the L-BFGS-B optimizer (Zhu et al., 1997). We found that freezing the covariance hyperparameters during an initial period of optimization helped this method find a better optimum.
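The K-means initialization of the inducing inputs used here (and in several later experiments) amounts to the following sketch, using scikit-learn; the function name is illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def init_inducing_kmeans(X, M, seed=0):
    """Initialize M inducing inputs as K-means centroids of the training inputs X."""
    km = KMeans(n_clusters=M, random_state=seed).fit(X)
    return km.cluster_centers_
```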


Fig. 4.2 Comparison of FITC and VFE methods to the exact solution on the Snelson dataset in the case M = N. The left column corresponds to VFE and the right column corresponds to FITC. Top row: Predictions of the two approximations after minimization of the approximation objective. Middle row: A perfect solution may be attained by keeping the inducing inputs where they were initialized at the data inputs. This row shows the initial inducing point position against the inducing point position after optimization. Bottom row: The approximation minimization objective and the negative hold out log likelihood as optimization proceeds.

The popular Generalized FITC (GFITC) method (Naish-Guzman and Holden, 2008), which we compared to, combines the FITC approximation of the covariance with EP to deal with the non-conjugate likelihood. For the code we used the GPML toolbox (Rasmussen and Nickisch, 2010), which is generally regarded as amongst the best implementations of this algorithm. Again we used K-means clustering to initialize the inducing points and optimized using L-BFGS-B.

For both algorithms on both datasets we recorded the hold out log likelihood and classification accuracy as optimization progressed. We also recorded the corresponding CPU and wall clock times. We report only wall clock time because we found no significant qualitative difference between these two timing metrics. It is difficult to perfectly account for constant factors and levels of software optimization when comparing wall clock time of algorithms, particularly since software packages are being continually improved. The relative times should be viewed as a snapshot but can still be useful.

The benefit of recording the whole timing curve, rather than recording simply the value of these accuracy metrics and compute time at the end of optimization, is that the latter experimental setup can be very sensitive to optimization termination criteria. In many cases performance very close to the totally optimized performance can be achieved in a small fraction of the time. Using the entire time/accuracy curve allows us to characterize the full trade off. In particular, we are able to find which algorithms are on the 'efficient frontier': the curve traced out by the best accuracy achievable for any of the studied algorithms at a given time.

For the Banana dataset we ran the optimization for a day. Since the Banana dataset is two dimensional it was possible to visualize the final classifiers by recording their learnt parameters. As we shall see, there is some evidence that GFITC gets worse for very long runs. Giving the benefit of the doubt to GFITC, we plotted an 'early stopped' classifier state in the visualizations that follow.

The results for the timed experiments are shown in figures 4.3 and 4.4. In both of the timed plots, the effect of the hyperparameter freezing period on KLSp can be seen as an early decrease in the gradient of the hold out metrics, which then suddenly start changing again. In both timed experiments the efficient frontier is entirely made up of the variational KLSp algorithms. For both timed experiments, there is a marked difference in the performance between the two algorithms for small numbers of inducing points (M = 4). In the long timed run Banana experiment (figure 4.3), for larger numbers of inducing points it can be seen that the hold out negative log probability comes back up for GFITC but not KLSp. We suggest that this is explained by similar effects to those observed in section 4.5.1. For the Image dataset (figure 4.4) the large M behaviour is broadly similar for GFITC and KLSp.
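The 'efficient frontier' described above can be computed mechanically from the recorded curves. A small numpy sketch for a metric where lower is better, such as hold out negative log probability (for accuracy one would instead accumulate a running maximum):

```python
import numpy as np

def efficient_frontier(curves):
    """Best (lowest) hold out metric achievable by any algorithm up to each time.

    curves: dict mapping algorithm name to an (n, 2) array of
            (wall clock time, hold out metric) pairs.
    """
    # Pool all recorded (time, metric) points and sort by time.
    pooled = np.vstack(list(curves.values()))
    order = np.argsort(pooled[:, 0])
    times, metric = pooled[order, 0], pooled[order, 1]
    # The frontier at time t is the best metric seen at any time <= t.
    return times, np.minimum.accumulate(metric)
```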

Table 4.3 Comparison of the performance of the proposed KLSp method and Generalized FITC on benchmark datasets. In each case we show the median hold out negative log probability and the 2σ confidence intervals.

Dataset      N\P      KL        EP        EPFitc M=8   EPFitc M=30%   KLSp M=8   KLSp M=30%
thyroid      140\5    .11±.05   .11±.05   .15±.04      .13±.04        .13±.06    .09±.05
heart        170\13   .46±.14   .43±.11   .42±.08      .44±.08        .42±.11    .47±.18
twonorm      400\20   .17±.43   .08±.01   Large        .08±.01        .08±.01    .09±.02
ringnorm     400\20   .21±.22   .18±.02   .34±.01      .20±.01        .41±.09    .15±.03
german       700\20   .51±.09   .48±.06   .49±.04      .50±.04        .49±.05    .51±.08
waveform     400\21   .23±.09   .23±.02   .22±.01      .22±.01        .23±.01    .25±.04
cancer       192\9    .56±.09   .55±.08   .58±.06      .56±.07        .56±.07    .58±.13
flare solar  108\9    .59±.02   .60±.02   .57±.02      .57±.02        .59±.02    .58±.02
diabetes     468\8    .48±.04   .48±.03   .50±.02      .51±.02        .47±.02    .51±.04

The visualization of the Banana results is shown in figure 4.5. It is interesting to note that the variational classifier tends to place the inducing points at either side of the decision boundary. As the number of inducing points increases, the 'resolution' with which the inducing points define the decision boundary is improved.

4.5.3 Binary classification benchmark comparison

We compared the final hold out log probability for KLSp and GFITC on a variety of standard classification benchmarks. Using the same algorithmic setup described in section 4.5.2, we studied ten random train/test partitions of the data. The results are shown in table 4.3. The final behaviour of the two algorithms is broadly similar across the range of datasets.

4.5.4 Stochastic variational inference

The computational challenge of very large datasets can lead practitioners to choose statistical models that are simpler than they would otherwise choose, or to weaken the statistical motivation of their models. In this section we show that the stochastic minibatch version of our non-linear classifier, following from a principled Bayesian nonparametric derivation, can outperform a linear classifier on a real large dataset, and that the training time is modest on a modern CPU.

Our example dataset consists of recorded features of all commercial flights in the United States between January 2008 and April 2008. There are roughly 5.8 million data points in this dataset. Thus any algorithm that requires a full order N operation per iteration, with several iterations, is infeasible without significant distributed computing effort.
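The key device that makes this scale is an unbiased minibatch estimate of the sum of expected log likelihood terms in the bound of chapter 4. A minimal sketch of the estimator follows; the function and argument names are illustrative.

```python
import numpy as np

def minibatch_data_term(expected_log_lik, X, y, n_batch, rng):
    """Unbiased minibatch estimate of sum_n E_q[log p(y_n | f_n)].

    expected_log_lik(Xb, yb) should return the per-point expected log
    likelihood terms for a batch. Rescaling by N / n_batch makes the
    subsampled sum an unbiased estimate of the full-data sum.
    """
    N = X.shape[0]
    idx = rng.choice(N, size=n_batch, replace=False)
    return (N / n_batch) * np.sum(expected_log_lik(X[idx], y[idx]))

# Usage: rng = np.random.default_rng(0)
#        term = minibatch_data_term(ell, X, y, n_batch=50, rng=rng)
```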


Fig. 4.3 Hold out negative log probabilities and error rates for the Banana dataset, using KLSp and GFITC, as a function of wall clock time. The effect of varying M, the number of inducing points, is also shown.


Fig. 4.4 Hold out negative log probabilities and error rates for the Image dataset, using KLSp and GFITC, as a function of wall clock time. The effect of varying M, the number of inducing points, is also shown.


Fig. 4.5 The effect of increasing the number of inducing points for the Banana dataset. Each row corresponds to a different sparse classification method along with its non-sparse equivalent. The rows respectively correspond to the proposed variational KLSp method and the Generalized FITC method. The different columns correspond to different numbers of inducing points. For each plot, the orange and blue markers indicate the training data by class, the black lines indicate the decision boundary and the black markers indicate the learnt inducing input positions.

The same dataset was used by Hensman et al. (2013) for their regression experiments. Recall that if we change the probit likelihood in our method for a Gaussian likelihood we recover the same bound as that method. The regression task in that paper was to predict the real valued average delay of the flight, with negative values denoting that the flight was early. In this chapter we study an analogous classification problem, which is to predict whether or not the flight was early or late. This data is produced by thresholding the real valued delay at 0. As in (Hensman et al., 2013), the features are the age of the aircraft (number of years since deployment), the distance that needs to be covered, airtime, departure time, arrival time, day of the week, day of the month and month.

As a baseline we used the LIBLINEAR package (Fan et al., 2008), which finds a linear classifier in less than a minute. The Gaussian process model we used had a zero mean function and the sum of a linear covariance function and a Matérn-3/2 covariance function. We used one length scale per input dimension in each case. This style of kernel is known as automatic relevance determination (ARD) and provides a convenient way of assessing the contribution of a feature. The parameters of the kernel were chosen by approximate type-II maximum likelihood, simultaneously with the variational approximation, except that we found fixing the hyperparameters at the beginning of optimization helped reach a good solution. We used 150 inducing points. For optimization we used the AdaDelta method (Zeiler, 2012) as implemented in the library Climin (Bayer et al., 2016). We used a batch size of 50 and a step rate of 0.05 with a momentum of 0.9. The inducing point positions were initialized using K-means clustering.


Fig. 4.6 Classification performance on the airline delay dataset as a function of time. The red horizontal line depicts the performance of a linear classifier, whilst the blue line shows the performance of the proposed stochastically optimized variational KLSp method.

The results are shown in figure 4.6. We see that it took approximately 1000 seconds or roughly 17 minutes for the non-linear Gaussian process classifier to outperform the linear one. The learnt kernel had only small weight for the additive linear component suggesting that the data strongly favour a non-linear classifier. The most important kernel features were found to be time of day and time of year.

4.5.5 Multiclass classification experiments

We investigated multiclass classification on the MNIST dataset (LeCun et al., 1998), using the extension described in section 4.4. This dataset is heavily used in the field of machine learning. Whilst it is arguable that too much weight has been put on results on this dataset historically, the popularity of the data does have the benefit that it now enables comparison of many different classification algorithms on a high dimensional dataset.

For this experiment we used the standard train/test partitions, which consist of 60000 and 10000 points respectively. The data are 784 dimensional, with each input data point consisting of the gray scale pixel values of the digits. There are 10 classes corresponding to the digits 0 to 9 inclusive. We use 500 inducing points which are shared between the latent functions. Note that this gives 392000 variational parameters for the inducing point positions. There are 500 × 10 = 5000 variational parameters corresponding to the means of the latent functions at the inducing points and 125250 × 10 = 1252500 variational parameters corresponding to the covariances of the latent functions at the inducing points, each covariance being specified by a 500 × 500 lower triangular matrix with 500 × 501/2 = 125250 free entries. This is similar to the sort of challenging stochastic optimization problems that can be encountered with very large neural networks. We again used AdaDelta, which has been found to be successful in this sort of challenging optimization problem. We used a minibatch size of 1000 and annealed the step size during optimization. The inducing point positions were initialized using K-means clustering.

We used an RBF kernel plus a white noise term. We did not use ARD. It is generally accepted that the RBF kernel is a poor one for this sort of high dimensional problem, because the Euclidean distances tend to become increasingly similar as the number of dimensions increases, an instance of the curse of dimensionality (Marimont and Shapiro, 1979). However, for this example we wanted to take a very simple kernel to allow comparison with other kernel methods that have been applied to the same dataset.

Table 4.4 shows our result along with some other relevant methods. Unsurprisingly, the results for GPC with an isotropic RBF kernel do not give state of the art classification accuracy. At present, state of the art results on this dataset tend for the most part to be deep neural network based, and the best error rates are so low that in that subcommunity the dataset is no longer considered sufficiently challenging. By comparison, there are very few results for Bayesian methods published on this dataset and even fewer Bayesian nonparametric ones. Presumably this is due to the challenges of scaling probabilistic inference to datasets of this size. Given this paucity of results within the Bayesian subcommunity, we would argue that any sensible result is of considerable scientific interest.

One pre-existing result that does meet the criterion of being both Bayesian and nonparametric is that of Gal et al. (2014), who give an error rate of 5.95%. We would argue that this result is surprisingly inaccurate, given that one would hope that the inductive bias of a Gaussian process latent variable model towards low dimensional manifolds would give it an advantage over Gaussian process classification. We would speculate that the approximate inference algorithm used in that paper is not well suited to the classification task. Another relevant comparison is to the Support Vector Machine (SVM) with the same kernel that we used.

Method                                  Reference                Permutation invariant?   Error rate
Linear classifier                       (LeCun et al., 1998)     ✓                        12.0%
GP Latent variable model                (Gal et al., 2014)       ✓                        5.95%
One Nearest Neighbour                   Various                  ✓                        3.09%
GP Classifier RBF kernel                This work                ✓                        1.96%
SVM RBF kernel                          Various                  ✓                        1.4%
Committee of 7 convolutional networks   (Ciresan et al., 2011)   ✗                        0.27%

Table 4.4 A summary of some relevant classification method accuracies for the MNIST dataset. Many of the results can be found on LeCun's website (LeCun, 2016).

Fig. 4.7 Visualizing the inducing inputs for MNIST. The figure shows three groups of three inducing images. The left group shows three inducing inputs at their K-means initialization. The middle group shows the same inducing inputs after optimization. The right group shows the vector difference between the initial and final states.

Given the simplicity of the method and its strong theoretical grounding, this result is impressive. We note that in general some potential benefits of GPC over an SVM are calibrated probabilistic predictions and the potential to scalably learn the parameters of a more complex covariance function.

Since the inducing points in our example lie in the input space and the input space in this case is a space of images, the inducing points can be themselves visualized usefully as images. Figure 4.7 shows some examples of what happens to the inducing points during optimization. They are initialized using the K-means algorithm and as such they will start out looking like digits, as for example, in the left group where we have three sixes. However the inducing points move away from their initialization towards the classification decision boundary. The middle group shows that the same inducing points in the first two cases look like ‘hybrids’ of different digits; respectively a 6/5 hybrid and a 6/4 hybrid. Recall that very similar decision boundary behaviour was observed in the simple example of section 4.5.2. The third inducing point looks like a rarer example from the input distribution of sixes.

4.6 Conclusion

There are two crucial challenges that often emerge in Gaussian process inference: namely the challenge of the approximation required for working with non-Gaussian likelihoods and the challenge of scaling these methods to large datasets. In this chapter we have introduced a method to address these challenges simultaneously, concentrating on the important special case of classification. We argued that the method should use sparsity, present an integrated variational approach, leverage continuous inducing point optimization, and have a scalable and convergent extension using minibatches (section 4.2). We then contributed a method that had all these features and studied it empirically. Without stochastic variational inference, we showed that the method has competitive performance with the popular Generalized FITC algorithm (section 4.5.2). Using SVI, we were able to give good quality results on important large datasets like MNIST for the first time (section 4.5.5). In the next chapter we will see another example of a non-Gaussian likelihood, namely the Poisson likelihood, and investigate a method that is able to relax the strong Gaussian assumption on q(f_Z) by using an MCMC/variational hybrid algorithm.

Chapter 5

MCMC for Variationally Sparse Gaussian Processes

5.1 Introduction

In chapter 4 we considered the application of a Gaussian approximation for the variational distribution q(f_Z) over inducing outputs to the task of scalable inference with non-Gaussian likelihoods. Using the language introduced in section 2.2, we made a fixed form assumption. Whilst a free form assumption could give a smaller KL-divergence, it only has a closed form solution in the case of a Gaussian likelihood. In this chapter we investigate a method of sidestepping this obstacle by using Markov chain Monte Carlo (MCMC) to draw samples from the variationally optimal q(f_Z) distribution.

This may initially seem like an odd direction to go in, because one of the traditional benefits of variational methods was that they provided fast deterministic alternatives to sampling methods. A key theme of this thesis, however, is that the variational framework can be used to make principled sparse approximations to Gaussian process posteriors, reducing O(N³) computations to O(NM² + M³). As we shall see, this benefit can be achieved independently of the strong distributional assumptions of the previous chapter. In short, we aim for a 'hybrid' approximation that inherits a principled approach to sparsity from variational methods and improved distributional flexibility from MCMC.

This improved flexibility allows us to increase the complexity of the models used, in particular by giving a fully Bayesian treatment of the kernel and likelihood hyperparameters (section 5.3). This means our approximation target is the full Bayesian posterior rather than the type-II maximum likelihood solution, the former being more philosophically pure from a Bayesian perspective. From a pragmatic modelling perspective, it is clearly sometimes beneficial to be able to coherently incorporate domain knowledge about process hyperparameters.

There is a risk that the mathematics around these generalizations can become complex and opaque. We highlight the concept of variational potentials (section 5.2) as a method for understanding sparse variational approximations. This provides us with interpretable analogies to familiar graphical model concepts, which have in turn guided this research. The specific realization of MCMC we investigate (section 5.4) is Hamiltonian Monte Carlo (HMC) (Duane et al., 1987; Neal, 2010) across both hyperparameters and whitened latent function values. The full approximation procedure is detailed in section 5.5.

Empirically (section 5.6), by studying a variety of examples, we find some important cases where the Gaussian approximation works well relative to the more general free form approximation. In these situations, relaxing some of the strongest assumptions of the simpler approximation has allowed us to gain confidence in their efficacy. In other cases, we find that the approximation of this chapter successfully models qualitative aspects of the posterior that would be difficult or impossible with the Gaussian assumption and a point estimate of the hyperparameters.

5.2 Variational potentials

We return to our definition of the KL-divergence between processes, which was derived in section 3.3.3 and further discussed in section 4.3. We repeat it here for ease of exposition:

$$\mathcal{K} = \mathrm{KL}\left[q(f_Z)\,\middle\|\,p(f_Z)\right] - \sum_{y,d \in Y,D} \mathbb{E}_{q(f_d)}\left[\log p(y \mid f_d)\right] + \log p(Y)\,. \tag{5.1}$$

For conciseness, we have suppressed some dependencies on the inducing points Z in the likelihood terms. Let us have a closer look at the likelihood terms:

$$\mathbb{E}_{q(f_d)}\left[\log p(y \mid f_d)\right] = \int_{\mathbb{R}^M}\!\int_{\mathbb{R}} p(f_d \mid f_Z)\,q(f_Z)\,\log p(y \mid f_d)\,\mathrm{d}f_d\,\mathrm{d}f_Z\,. \tag{5.2}$$

With this in mind, we define a variational potential¹ ϕ(y | d, f_Z) in terms of its logarithm:

$$\log \phi(y \mid d, f_Z) := \int_{\mathbb{R}} p(f_d \mid f_Z)\,\log p(y \mid f_d)\,\mathrm{d}f_d\,. \tag{5.3}$$
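As a concrete example of this definition, consider a Gaussian likelihood p(y | f_d) = N(y | f_d, σ²), and write the prior conditional as p(f_d | f_Z) = N(f_d | μ_d(f_Z), v_d); this mean/variance notation is introduced here only for illustration, and μ_d is linear in f_Z. The integral in equation (5.3) is then a Gaussian expectation of a quadratic:

$$\log \phi(y \mid d, f_Z) = \log \mathcal{N}\!\left(y \mid \mu_d(f_Z), \sigma^2\right) - \frac{v_d}{2\sigma^2}\,,$$

so each potential is an unnormalized Gaussian function of f_Z. This is the fact used below to recover the closed form regression solution.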

¹We avoid the term variational likelihood because the variational potentials do not necessarily normalize as a function of the data point y.

Using this new definition we can rearrange equation (5.1) as follows:

$$\mathcal{K} = \mathrm{KL}\left[\,q(f_Z)\,\middle\|\;\frac{p(f_Z)\prod_{y,d \in Y,D}\phi(y \mid d, f_Z)}{C(Y)}\,\right] - \log C(Y) + \log p(Y) \tag{5.4}$$

where

$$C(Y) = \int_{\mathbb{R}^M} p(f_Z) \prod_{y \in Y} \phi(y \mid d, f_Z)\,\mathrm{d}f_Z\,. \tag{5.5}$$

To prove this relation we manipulate equation (5.1):

$$\mathcal{K} = \mathrm{KL}\left[q(f_Z)\,\middle\|\,p(f_Z)\right] - \sum_{y,d \in Y,D} \mathbb{E}_{q(f_d)}\left[\log p(y \mid f_d)\right] + \log p(Y) \tag{5.6}$$
$$= \int q(f_Z)\,\log\frac{q(f_Z)}{p(f_Z)}\,\mathrm{d}f_Z - \sum_{y,d \in Y,D} \int q(f_Z)\,\log\phi(y \mid d, f_Z)\,\mathrm{d}f_Z + \log p(Y) \tag{5.7}$$
$$= \int q(f_Z)\,\log\frac{q(f_Z)\,C(Y)}{p(f_Z)\prod_{y,d \in Y,D}\phi(y \mid d, f_Z)}\,\mathrm{d}f_Z - \log C(Y) + \log p(Y)\,. \tag{5.8}$$

The result then follows. Note that the variational potentials, unlike the original likelihood, do not factorize across data points. Also note that we pulled the constant C(Y) into the ratio inside the logarithm in order to ensure that the right hand entry of the KL-divergence in equation (5.4) was a normalized probability distribution in f_Z. None of the terms outside the KL-divergence depend on q(f_Z), so for fixed Z we need only consider this KL-divergence term. The KL-divergence reaches its minimal value of zero when the two input probability distributions are equal. Therefore we can write down the optimal form for q(f_Z) by inspection of equation (5.4):

$$q^*(f_Z) = \frac{p(f_Z)\prod_{y,d \in Y,D}\phi(y \mid d, f_Z)}{C(Y)}\,. \tag{5.9}$$

Thus, similarly to the case of free form factorized approximations considered in section 2.2.2, we can completely sidestep the need to use variational calculus and gain more insight into the result.

In the case of regression with a Gaussian likelihood and noise variance σ², each variational potential is an unnormalized Gaussian function. The inducing output prior p(f_Z) is also a Gaussian distribution. As a product of Gaussian terms which must normalize, the optimal distribution in this case must necessarily be a Gaussian distribution. The precise form was derived by Titsias (2009a):

$$q^*_{\mathrm{Gauss}}(f_Z) = \mathcal{N}\left(f_Z \mid m_{\mathrm{Gauss}}, \Sigma_{\mathrm{Gauss}}\right) \tag{5.10}$$

with

$$m_{\mathrm{Gauss}} = \sigma^{-2} K_{Z,Z}\,\Delta\,K_{Z,D}\,y_D \tag{5.11}$$
$$\Sigma_{\mathrm{Gauss}} = K_{Z,Z}\,\Delta\,K_{Z,Z} \tag{5.12}$$
$$\Delta = \left(K_{Z,Z} + \sigma^{-2} K_{Z,D} K_{D,Z}\right)^{-1}. \tag{5.13}$$

Note that whilst the interpretation in terms of variational potentials is important for understanding optimization of the KL-divergence, the analogy breaks down somewhat when we consider prediction. That is to say, predictions at test points (y*, x*) are not computed using a variational potential. Rather, they come from the definition of the approximating process in equation (4.19) of section 4.3. Returning to the general case, if we substitute the optimal variational distribution for the inducing points we get a collapsed variational bound:

$$\mathcal{K}^* = \log p(Y) - \log C(Y)\,. \tag{5.14}$$

Therefore the remaining KL-divergence between the processes has the elegant interpretation of being the difference between the marginal likelihood of the true model and the normalizing constant associated with the variational potentials. The true marginal likelihood is not a function of the inducing point positions and can be ignored as an additive constant in inducing point selection. In the case where (with no computational gain) we set the inducing points Z to equal the data points D, the variational potentials are equal to the real likelihoods and C(Y) = p(Y), giving a KL-divergence of 0. This must be true since this case corresponds to exact inference. We can contrast this variational potential view to the Quiñonero-Candela and Rasmussen (2005) view of sparse Gaussian processes, which involves prior approximation rather than likelihood approximation. The latter has the benefit that we know additional computation in the form of additional inducing points will bring us closer to the true posterior, emerging as it does as a logical consequence of the variational framework. Titsias showed (Titsias, 2009a), using the augmentation argument, that for the case of a Gaussian likelihood with noise variance σ² the KL-divergence reduces to

$$\mathcal{K}^*_{\mathrm{Gauss}} = \log p(Y) - \log \mathcal{N}\!\left(y_D \mid 0,\; \sigma^2 I + K_{D,Z} K_{Z,Z}^{-1} K_{Z,D}\right) + \frac{1}{2\sigma^2}\,\mathrm{Tr}\!\left(K_{D,D} - K_{D,Z} K_{Z,Z}^{-1} K_{Z,D}\right) \tag{5.15}$$

As Titsias noted, the second term is precisely the marginal likelihood of the DTC covariance approximation (see also section 4.2.2) and the third ‘trace term’ differentiates this approximation from the DTC approximation. In the rest of this chapter we will be interested in the non-conjugate likelihood case.

Whilst there is not an analytic q(fZ ) in this case, we can still utilize the insight that comes from the variational potential view of the sparse variational approximation. But before we further discuss non-conjugate inference, we discuss the extension of the variational framework to fully Bayesian models.

5.3 A Bayesian treatment of the hyperparameters

Consider the extension of the model in equation (4.1) to include a prior on hyperparameters.

$$\theta \sim p(\theta) \tag{5.16}$$
$$f \mid \theta \sim \mathcal{GP}\left(0, K(\theta)\right) \tag{5.17}$$
$$y \mid d, f \sim h(f(d)), \quad y,d \in Y,D \;\text{ i.i.d.} \tag{5.18}$$

The approximating process in this case is taken to have the following finite dimensional marginals:

$$q(f_*, f_D, f_Z, \theta) = p(f_* \mid f_D, f_Z, \theta)\,p(f_D \mid f_Z, \theta)\,q(f_Z, \theta) \tag{5.19}$$

Contrasting this to the approximating distribution given in equation (4.19), we see that we now maintain a joint variational distribution over the inducing function values and the hyperparameters, rather than over the inducing function values alone. We do not assume that q(f_Z, θ) factorizes. The KL-divergence from the approximating process to the posterior process in this case is given by

$$\mathcal{K} = \mathrm{KL}\left[q(f_Z, \theta)\,\middle\|\,p(f_Z, \theta)\right] - \sum_{y,d \in Y,D} \mathbb{E}_{q(f_d)}\left[\log p(y \mid f_d)\right] + \log p(Y) \tag{5.20}$$

where q(f_d) is now the corresponding marginal distribution of the process defined by (5.19). The variational potential ϕ(y | d, f_Z, θ) in this case is given by

$$\log \phi(y \mid d, f_Z, \theta) = \int_{\mathbb{R}} p(f_d \mid f_Z, \theta)\,\log p(y \mid f_d)\,\mathrm{d}f_d\,, \tag{5.21}$$

where we have reintroduced the dependence on θ. Following a similar argument to before, the KL-divergence between processes can be rearranged to give

$$\mathcal{K} = \mathrm{KL}\left[\,q(f_Z, \theta)\,\middle\|\;\frac{p(f_Z, \theta)\prod_{y,d \in Y,D}\phi(y \mid d, f_Z, \theta)}{C(Y)}\,\right] - \log C(Y) + \log p(Y) \tag{5.22}$$

where the new definition of the normalizing constant is

$$C(Y) = \int p(f_Z, \theta) \prod_{y,d \in Y,D} \phi(y \mid d, f_Z, \theta)\,\mathrm{d}f_Z\,\mathrm{d}\theta\,. \tag{5.23}$$

The collapsed variational bound has the same form as in equation (5.14), substituting this new normalizing constant in place of the previous definition. By extending our model to a fully Bayesian treatment we have introduced further intractability into the variational objective (5.4), which is already challenging for non-Gaussian likelihoods. In the next section we detail an approximation method capable of tackling even this harder case.

5.4 Sparse variational Markov chain Monte Carlo

We need to consider the fact that relatively few likelihoods result in a tractable objective for equation (5.4), and even fewer have a tractable objective for the Bayesian extension in equation (5.22). In chapter 4 we considered restricting the variational family for the inducing points to a family of Gaussians Q_ind, which is tractable when a point estimate of the hyperparameters is used. This family will not, however, usually contain the optimal variational distribution. In this chapter we will consider taking the analogy with an approximating model seriously, drawing samples from the optimal variational distribution for f_Z using Markov chain Monte Carlo (MCMC) methods (Metropolis et al., 1953), and concentrating in particular on Hamiltonian Monte Carlo (Duane et al., 1987; Neal, 2010). An idea in a similar spirit was suggested in (Titsias et al., 2011) but was dismissed before implementation because 'prediction in sparse GP models typically involves some additional approximations'. Our interpretation in terms of a KL-divergence between the approximating and posterior processes clarifies that no such further approximation is required.

To use MCMC, we only need to know the desired probability distribution up to a multiplicative constant. That is to say, we do not need to know the constant C(Y), which in the non-conjugate case relies on a difficult integral. If we are intending to do MCMC anyway, why consider a variational approximation? By making the assumption that the approximating distribution has the form in equation (4.19) or equation (5.19), we find that evaluating the necessary function for MCMC (and derivatives in the case of HMC) requires O(NM² + M³) operations rather than the O(N³) required for MCMC in the original model. Thus our proposed method makes a minimal variational assumption to render the approximation tractable, but inherits many of the advantages of MCMC. Note that the method we suggest is not equivalent to performing MCMC with a sparse prior covariance and a non-Gaussian likelihood. Instead we are sampling from the variational distribution in the approximating family that is optimal in the sense of minimizing KL-divergence to the posterior.

The variational potentials mean that the MCMC problem looks very much like that of a hierarchical latent Gaussian model, albeit with a non-factorizing set of data potential terms. MCMC in hierarchical latent Gaussian models is a challenging problem because typically there is a strong correlation between the values of the hyperparameters and the latent Gaussian values themselves (Filippone et al., 2013; Murray et al., 2010). Here, motivated by the 'whitening' reparameterization of (Murray et al., 2010), we propose a method that is able to update the whitened variables and the hyperparameters at the same time. We proceed to explain in more detail.

Consider the log density H(f_Z, θ) of the unnormalized target distribution in the fully Bayesian case:

$$H(f_Z, \theta) = \log p(\theta) + \log p(f_Z \mid \theta) + \sum_{y,d \in Y,D} \int_{\mathbb{R}} p(f_d \mid f_Z, \theta)\,\log p(y \mid f_d)\,\mathrm{d}f_d\,. \tag{5.24}$$

Here we have substituted the definition of the variational potentials. Let Λ be the Cholesky factor of the prior covariance K_{Z,Z} of the inducing function values f_Z, so that ΛΛᵀ = K_{Z,Z} and Λ is lower triangular. We consider the transformation to whitened variables v defined by

$$v = \Lambda^{-1}(\theta)\,f_Z\,. \tag{5.25}$$

These variables are a priori Gaussian with mean zero and isotropic unit variance, hence the name 'whitened'. If we were attempting to draw samples from the prior, this would reduce correlation and ill-conditioning, which would improve mixing. In the case where the likelihood does not strongly constrain the latent function, then presumably the posterior will not be too different from the prior and the whitened representation will still improve mixing. In fact, even when the likelihood does more strongly constrain the function values, we still often observe an improvement in mixing using the whitened representation (5.25), relative to the unwhitened representation. The Jacobian of the change of variables (5.25) is det Λ^{-1}(θ), as discussed in Appendix E. The unnormalized log density in the whitened representation is given by:

$$H(v, \theta) = \log p(\theta) + \log \mathcal{N}\left(v \mid 0, I\right) + \sum_{y,d \in Y,D} \int_{\mathbb{R}} p(f_d \mid \Lambda(\theta)v, \theta)\,\log p(y \mid f_d)\,\mathrm{d}f_d\,. \tag{5.26}$$

Since we are interested in Hamiltonian Monte Carlo for the log density defined in equation (5.26), we will need to compute the derivatives of H(v, θ) with respect to the whitened variables v and the covariance hyperparameters θ. The target H(v, θ) is defined in terms of the Cholesky matrix Λ. Therefore, to implement the derivatives, we use reverse mode differentiation of the Cholesky decomposition, as first detailed by Smith (1995) and discussed further in section 6.5.2.
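To make equation (5.26) concrete, the following is a minimal numpy/scipy sketch of the whitened unnormalized log density. It is not the implementation used for the experiments; kernel, log_p_theta and data_term are assumed callables standing in for the model components.

```python
import numpy as np
from scipy.linalg import cholesky

def whitened_log_density(v, theta, log_p_theta, kernel, Z, data_term):
    """Unnormalized log density H(v, theta) of equation (5.26) -- a sketch.

    v          : whitened inducing outputs.
    theta      : kernel hyperparameters.
    log_p_theta: log prior density of the hyperparameters.
    kernel     : function (theta, Z) -> prior covariance K_ZZ.
    data_term  : function (f_Z, theta) -> sum over data of the integrated
                 expected log likelihoods (the variational potentials).
    """
    K_zz = kernel(theta, Z) + 1e-6 * np.eye(len(Z))  # jitter for stability
    L = cholesky(K_zz, lower=True)                    # Lambda(theta)
    f_Z = L @ v                                       # unwhiten: f_Z = Lambda v
    # The whitened prior is standard normal; additive constants do not
    # matter for MCMC, so they are dropped here.
    log_prior_v = -0.5 * v @ v
    return log_p_theta(theta) + log_prior_v + data_term(f_Z, theta)
```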

We also consider how to choose the inducing inputs with this approximation. Since these are variational parameters, we will want to minimize the variational objective with respect to them rather than sample them. Define h(f_Z, θ) such that

$$h(f_Z, \theta) := \exp\left\{H(f_Z, \theta)\right\} = p(f_Z, \theta) \prod_{y,d \in Y,D} \phi(y \mid d, f_Z, \theta) \tag{5.27}$$

is the unnormalized version of the optimal variational distribution q*(f_Z, θ), with C(Y) as its normalizing constant. Taking derivatives of equation (5.14) with respect to Z gives us

$$\frac{\partial \mathcal{K}^*}{\partial Z} = \frac{\partial}{\partial Z}\left\{-\log C(Y)\right\} \tag{5.28}$$
$$= -\frac{1}{C(Y)}\,\frac{\partial}{\partial Z}\int h(f_Z, \theta)\,\mathrm{d}f_Z\,\mathrm{d}\theta \tag{5.29}$$
$$= -\frac{1}{C(Y)}\int h(f_Z, \theta)\,\frac{\partial}{\partial Z}\left[\log h(f_Z, \theta)\right]\mathrm{d}f_Z\,\mathrm{d}\theta \tag{5.30}$$
$$= -\mathbb{E}_{q^*(f_Z, \theta)}\left[\frac{\partial H(f_Z, \theta)}{\partial Z}\right]. \tag{5.31}$$

Since drawing samples from q*(f_Z, θ) is our goal with Hamiltonian Monte Carlo, evaluating a Monte Carlo estimate of this gradient seems possible but taxing. Instead, in the next section we detail a pragmatic approximation scheme that leverages the Gaussian approximation.

5.5 A recipe for using variational MCMC in practice

Having described the theory behind our approximation strategy, we now outline the practical algorithmic steps we take to obtain our approximate posterior.

1. We first use the Gaussian approximation described in chapter 4. In more detail, we choose a set of optimal parameters by maximizing the evidence lower bound (4.22), substituting the desired likelihood. The optimization parameters are the model hyperparameters θ, the inducing point positions Z, and the parameters of the assumed Gaussian distribution q(f_Z) = N(f_Z | m, Σ) on the inducing outputs. Of the two Gaussian distribution parameters, the variational mean m is used directly, whereas the covariance Σ is specified using a lower triangular matrix L such that LLᵀ = Σ. The inducing inputs Z are initialized using K-means clustering. For smaller datasets, we use the L-BFGS-B optimization method (Zhu et al., 1997) on the variational parameters and the covariance parameters. For larger datasets, we use stochastic gradients as described in chapter 4, specifically using the AdaDelta optimizer (Zeiler, 2012). We fix the inducing point positions Z early in optimization because this often results in a better optimum. The optimal values for θ, Z, m and L are then used in the next stage of the approximation process.

2. We use the result of the Gaussian approximation to the posterior to initialize the sampling approximation to the posterior. We do not try to learn the optimal inducing point positions for the free form variational distribution. Instead, we fix the inducing point positions Z to equal the optimal Gaussian ones from the previous algorithmic step. We initialize the model hyperparameters θ to the approximate maximum a posteriori (MAP) values obtained from the Gaussian approximation. The whitened inducing outputs v, defined in equation (5.25), are initialized by exploiting the Gaussian inducing output distribution q(f_Z) used in the previous algorithmic stage. In particular, we draw a sample for f_Z from the Gaussian distribution and then transform it to the whitened space using the relation in equation (5.25); a sketch of this initialization appears after this list.

3. We next tune the sampling parameters of the HMC algorithm using Bayesian optimization. The sampling parameters that need tuning are the number of leapfrog steps and the step length. To do this, we broadly use the approach of (Wang et al., 2013), applying Bayesian optimization to the expected squared jump distance, penalized by the square root of the maximal number of leapfrog steps. We do not, however, follow their adaptive prescription, instead simply using 30 different runs of 30 iterations each.

4. Finally, we run the tuned HMC to obtain predictions. In more detail, we apply the joint HMC transition operator to the model hyperparameters θ and the whitened inducing outputs v, with the log density given in equation (5.26). We then transform the samples of the whitened variable v to samples of f_Z using the relation in equation (5.25). Taking the samples for θ and f_Z obtained, we make predictions by replacing the expression for q(f_Z, θ) in equation (5.19) with the corresponding Monte Carlo average.
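A sketch of the step 2 initialization above, under the assumption that the Gaussian stage returns the mean m and Cholesky factor L_q of q(f_Z); function and argument names are illustrative.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def init_whitened_state(m, L_q, K_zz, rng):
    """Initialize the whitened HMC state from the Gaussian approximation.

    m, L_q : mean and Cholesky factor of the fitted q(f_Z) = N(m, L_q L_q^T).
    K_zz   : prior covariance of the inducing outputs at the chosen Z.
    """
    f_Z = m + L_q @ rng.standard_normal(m.shape)   # sample f_Z ~ q(f_Z)
    Lam = cholesky(K_zz, lower=True)               # Lambda, Lambda Lambda^T = K_ZZ
    return solve_triangular(Lam, f_Z, lower=True)  # v = Lambda^{-1} f_Z
```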

5.6 Experiments

5.6.1 Binary classification

We again use the Image dataset, which is a binary classification benchmark (Rätsch et al., 2001). The input data is 18 dimensional. We use an RBF kernel with one automatic relevance determination parameter per input dimension. The kernel lengthscales and variance are given a Gamma(1, 1) prior. The data is partitioned into 10 random train/test splits with 1000 and 1019 members respectively. We follow the general recipe for inference described in section 5.5, with some modifications which we now describe.

We first investigate the efficacy of variational inducing point optimization. To do this, we compare the full procedure in section 5.5 to doing the same procedure, but with the inducing points clamped throughout at their K-means initialization. The hold out log predictive densities are shown in the left hand panel of figure 5.1. For lower inducing point numbers, there is a significant gain to be found from using the variational criterion to find the inducing point positions. It is not surprising that the absolute gains are smaller for a large number of inducing points, though these gains might still be important if a high predictive accuracy was required.

Next we compared the full MCMC approximation to the Gaussian approximation step only, including the variational inducing point optimization. This gives a measure of how much the MCMC approximation improves the hold out log predictive density for this dataset. The results are shown in the right hand panel of figure 5.1. It can be seen that for small numbers of inducing points there is negligible improvement. This suggests that, in this case, uncertainty in the kernel hyperparameters and non-Gaussianity in the posterior over the inducing outputs are not the dominant source of approximation error. For larger numbers of inducing points there is a consistent improvement over the Gaussian approximation, suggesting the qualitative factors introduced in the MCMC approximation do play some role in this case. However, the improvement is small in magnitude when compared to the absolute effect of varying the number of inducing points that can be observed in the left hand panel of figure 5.1. This suggests that for practical purposes the Gaussian approximation works relatively well for this usage case, particularly given that it gives an analytic lower bound on the model evidence and works faster. Relaxing some of the strong assumptions of the Gaussian method of chapter 4 in this manner allows us to gain confidence in the quality of the approximation.


Fig. 5.1 Testing the variational MCMC method on the Image dataset (Rätsch et al., 2001). Left: the difference in hold out log predictive density for differing numbers of inducing points for the variational MCMC method with and without variational optimization of the inducing points under stage 1 of the approximation recipe given in section 5.5. Right: the difference in hold out log predictive density comparing variational MCMC to using the Gaussian approximation only.

5.6.2 Multiclass Classification

In section 4.4 we discussed multiclass classification using a Gaussian approximation. Additionally, we made an assumption that the variational distribution factorizes across the latent functions corresponding to each class. This had two main motivations. First, the factorization meant that the robust max likelihood quadrature was tractable using a single one dimensional numerical quadrature. Second, the memory requirements of representing a covariance that does not factorize across the different functions grow very quickly. Variational MCMC thus has some potential advantages here. The necessary likelihood quadratures are similar to the factorized Gaussian case but do not rely on the strong factorizing assumption. The memory usage may also scale better, given that the covariance is not explicitly represented. Instead we store samples, which can be cheaper². Thus it is of interest to compare the variational MCMC approach to the factorized Gaussian approach in this instance empirically.

First we compare the two methods on a synthetic dataset to allow visualization in a controlled simple situation, as shown in figure 5.2. We can see that on the whole the variational MCMC method gives a more conservative estimate of the predictive probabilities.

²It is worth noting that this might not be the only way to address the memory scaling issue. In the case where the optimal variational distribution has a low rank we may be able to get away with using fewer variational parameters, but we have not yet investigated this direction.

Fig. 5.2 Comparison of factorized Gaussian and variational MCMC for a synthetic multiclass classification problem. Left: The factorized Gaussian approximation. The coloured points show data points for each class. The coloured lines show posterior probability contours for each class. The contours are chosen to be at [0.3,0.95,0.99]. The black points show the location of the inducing points. Middle: The same plot but using free form variational MCMC. Right: Some examples of posterior pairwise correlations of latent functions for different classes but at a fixed point. The posterior shows correlations and non-Gaussianity that would not be captured by a factorized Gaussian approximation.

This could make a significant difference in applications where the loss associated with an overconfident prediction is high. We also compared the multiclass methods on the MNIST example given in section 4.4, which has significantly more input dimensions and data points than the synthetic example. The variational MCMC method was very slow in this case because the Metropolis correction involves a full linear pass over 60000 data points. The classification accuracy did not improve in this case. The hold out log predictive probability gave a marginal improvement from −0.068 to −0.064. Again we would conclude that, for this application, if these are the metrics of interest then the factorized Gaussian approximation gives similar performance with a lower compute time. The memory cost of the different possible approximations to the inducing output distribution is shown in table 5.1. Since in this case the factorized Gaussian approximation gives a similar performance to the sampling approximation, it also represents a good choice in terms of memory usage.
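The counts reported in table 5.1 can be reproduced directly from the parameter shapes, assuming a lower triangular parameterization of each covariance as in section 5.5:

$$\underbrace{500 \times 10}_{\text{means}} + \underbrace{\tfrac{1}{2}\,500 \times 501 \times 10}_{\text{per-function Cholesky factors}} = 1{,}257{,}500 \approx 1.3 \text{ million},$$
$$\underbrace{5000}_{\text{means}} + \underbrace{\tfrac{1}{2}\,5000 \times 5001}_{\text{joint Cholesky factor}} = 12{,}507{,}500 \approx 13 \text{ million},$$
$$\underbrace{300 \times 5000}_{\text{samples}} = 1{,}500{,}000 = 1.5 \text{ million},$$

with the shared inducing inputs costing 500 × 784 = 392000 ≈ 0.39 million real numbers in every case.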

5.6.3 Log Gaussian Cox processes

We investigate the application of our sparse MCMC approach to inference in models from spatial statistics. The general area of spatial statistics is one where the qualitative difference of having reasonable uncertainty estimates for the covariance hyperparameters, as opposed to solely a point estimate, is often relatively easy to motivate. Often in spatial statistics one is interested in interpreting these hyperparameters. For instance, one might wish to decide whether a given dataset shows evidence of spatial correlation and what characteristics that correlation has, rather than only being interested in the predictive intensity, as is often the case in machine learning applications. Since spatial statistical models are often computationally demanding, any solution needs also to be scalable.

Approximation       Factorized Gaussian   Full Gaussian (not implemented)   Sampling
Memory (millions)   1.3                   13                                1.5

Table 5.1 MNIST memory cost in millions (2 s.f.) of real numbers for representing the distribution of the inducing outputs. We use 500 inducing points. The sampling approximation used 300 samples. The full Gaussian approximation would mean representing a covariance across all ten latent functions. Note that the sampling approximation uses the factorized approximation as an intermediate step. All of the methods need to store the inducing inputs, which are shared across latent functions and require storage of 0.39 million real numbers.

In this section we study a discretized version of the Log Gaussian Cox process (Møller et al., 1998). As discussed in section 3.6.2, the general form for any Gaussian process based Cox process is

$$f \sim \mathcal{GP}(m, K)$$
$$h = \rho(f)$$
$$Y \mid h \sim \mathcal{PP}(h)\,. \tag{5.32}$$

The notation is the same as in section 3.6.2. For the purposes of this section we assume that the inverse link function ρ is the exponential function, so that:

$$h = \exp(f)\,. \tag{5.33}$$

It is this additional assumption that characterizes the Log Gaussian Cox process (LGCP) of Møller et al. (1998). Unfortunately, the likelihood involves an integral (3.42) over the whole latent function, which tends to add complications to approximate inference. Lloyd et al. (2015) study sparse variational inference for another Gaussian process based Cox process, specifically one with a squared inverse link function, so that h = f². Under this assumption and a restricted class of kernel functions, the variational inference is tractable even with the difficult integral over the function. Here we adopt an approach that is able to tolerate a variety of inverse link functions, although the likelihood quadrature is particularly convenient for the LGCP case since it can be done analytically. The approach makes relatively few assumptions on the choice of covariance function, which allows this to be a modelling choice.
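To make the analytic quadrature claim concrete: for a Poisson count y with rate V eᶠ, where V is a known constant (in the binned model introduced below, the hypervolume of a bin), and a Gaussian marginal q(f) = N(f | μ, v), the standard identity E[eᶠ] = e^{μ + v/2} gives the expected log likelihood in closed form:

$$\mathbb{E}_{q(f)}\left[\log \mathrm{Poisson}\!\left(y \mid V e^{f}\right)\right] = y\left(\log V + \mu\right) - V\,e^{\mu + v/2} - \log y!\,.$$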

For this section we introduce a 'binning' on the space. To each bin there then corresponds a Poisson likelihood, rather than the Poisson process we had for the unbinned space. The Poisson rate for a given bin is equal to the hypervolume of that bin multiplied by the intensity function value at the center of the bin. This is a common assumption in applied work and as such provides a good comparison point for investigating the quality of variational MCMC approximations. Intuitively at least, for smooth covariance functions the approximation recovers the full Cox process as the bins become small. We consider the effect of different bin sizes in some of the experiments that follow.

We studied the quality of the approximation on the one dimensional coal mining disaster data. We chose an RBF kernel and placed Gamma priors on the variance and lengthscale. Three methods were considered:

1. Gold standard exact MCMC using the non-sparse analogue of our Hamiltonian Monte Carlo algorithm.

2. The Gaussian approximation with a MAP estimate of the covariance hyperparameters.

3. Our variational MCMC approximation.

For both sparse approximations we used 30 evenly spaced inducing points, which we kept fixed rather than optimizing their position. The results are shown in figure 5.3. The approximate posteriors over intensities for the full MCMC and the variational MCMC in the left panel show very close agreement. The Gaussian approximation by comparison exhibits clear distortion. The kernel hyperparameter estimates from the variational sampling algorithm in the right panel show good agreement with the full MCMC in general, except that for both the variance and the lengthscale there is a slight overestimate of the upper tail.

We investigated the use of variational MCMC on the Pine sapling data (Møller et al., 1998), shown in the top left panel of figure 5.4. We compared against the full MCMC method and also used this opportunity to investigate the effect of bin size. For the sparse methods, we used 225 inducing points, which we initialized in a grid and then variationally optimized during the Gaussian approximation phase. We investigated the methods for a 32 × 32 grid and a 64 × 64 grid. To give a measure of the computational demands of each method we computed the time for both the variational MCMC and exact MCMC methods to obtain one effective MCMC sample.

The results are shown in figure 5.4. On the 32 × 32 bin grid it can be seen that the variational MCMC agrees well with the exact MCMC. The variational MCMC was considerably faster, taking 3.4 seconds to obtain an effective sample as opposed to 554 seconds for the exact method.


Fig. 5.3 Comparison of exact MCMC, variational MCMC, and the Gaussian variational approximation on the coal mining disasters dataset. Left: Approximate posterior rates of the three methods. Vertical bars show data points. Right: A comparison of the posterior samples for the kernel hyperparameters generated by the two MCMC methods. The MAP estimate for the Gaussian approximation was (12.06, 0.55).

It can be seen that the 64 × 64 grid variational MCMC provides a much higher resolution approximation to the posterior intensity. On this finer grid, the variational method took 4.7 seconds per effective sample. By comparison, the exact MCMC was prohibitively slow on this data because the linear algebra requires O(N³) operations with N = 4096.

5.7 Conclusion

In the previous chapter we studied a fixed form, Gaussian, variational approximation to q(f_Z) with a point estimate of the hyperparameters. In this chapter we studied the generalization to a Bayesian treatment of the hyperparameters with a free form joint variational distribution q(f_Z, θ). In both cases the variational approximations were extended to the other unknown variables using prior conditional matching (section 2.2.4). By using MCMC to draw samples from q(f_Z, θ), we were able to circumvent the lack of a tractable closed form for the optimal free form distribution. Thus we proposed an algorithm that was a variational/MCMC 'hybrid': it inherited a reduction in training complexity from O(N³) to O(NM² + M³) from the variational sparsity, and a more flexible approximating distribution from MCMC. We highlighted the concept of variational potentials as a means to map these problems onto more familiar graphical model inference problems. A practical approximation scheme that used Hamiltonian Monte Carlo on a transformed space was proposed and then studied empirically.


Fig. 5.4 Pine sapling data. From left to right: reported locations of pine saplings; posterior mean intensity on a 32 × 32 grid using full MCMC; posterior mean intensity on a 32 × 32 grid with variational MCMC using 225 inducing points; posterior mean intensity on a 64 × 64 grid using variational MCMC with 225 inducing points.

It was found that in some cases little improvement was observed over the Gaussian approximation of the previous chapter. In other cases we demonstrated useful qualitative differences from the Gaussian approximations. For example, the application of these ideas in spatial statistics seems promising. The tight analogy between these variational approximations and inference in hierarchical latent Gaussian models means that any future progress on the latter applies to the former. We look forward to further progress on minibatch MCMC methods, as we believe this will be a particularly compelling use case. Similarly, although HMC gave us satisfactory approximations, one could also investigate the use of pseudo marginal MCMC in its place (Filippone and Girolami, 2014). Taking a step back, we would reflect that contemporary uses of variational inference can now look quite different from some of the traditional approaches discussed in section 2.2. Going forward, we hope this will allow us to exploit probabilistic models of still further sophistication.

Chapter 6

GPflow: A Gaussian process library using TensorFlow

Available at https://github.com/gpflow/gpflow

6.1 Introduction

GPflow is a Gaussian process library that uses TensorFlow for its core computations and Python for its front end. The distinguishing features of GPflow are that it uses variational inference as the primary approximation method, provides concise code through the use of automatic differentiation, is able to exploit GPU hardware, and has been engineered with a particular emphasis on software testing. Both GPflow and a version of TensorFlow are available as open source software under the Apache 2.0 license. GPflow is managed by Alexander G. de G. Matthews and James Hensman, whilst TensorFlow is managed by Google.

In this chapter, we will first outline our broad goals for the GPflow project. We will then systematically explain how the features of the software help us to achieve these goals and how our contribution differs from that of other Gaussian process libraries. To implement these key features there are significant gains to be made by using neural network software libraries; further, we argue that of the available similar packages, TensorFlow is the best choice. We then discuss work we undertook to adapt TensorFlow to the particular needs of Gaussian process algorithms. Having motivated the overall design of GPflow, we then explain the software functionality and architecture in more detail. We also outline the quality control practices we use in the project. In timed experiments on a practical example, we show that GPflow gives meaningful speed gains. Finally, we conclude.

6.2 Existing Gaussian process libraries

There are now many publicly available Gaussian process libraries, ranging in scale from personal projects to major community tools. We will therefore only consider a relevant subset of the existing libraries. A very influential Gaussian process software package is the GPML toolbox originally created by Nickisch and Rasmussen (2010). The software works in MATLAB or OCTAVE. It has concentrated on core functionality, giving a clean implementation of the most fundamental features. It has been widely augmented and forked by third parties, often on an informal basis, for research on Gaussian processes. A key reference for our particular contribution is the GPy library (GPy, 2012), which is written primarily using Python and Numeric Python (NumPy). GPy has an intuitive object oriented interface. Another relevant Gaussian process library is GPstuff (Vanhatalo et al., 2013), which is also a MATLAB and OCTAVE library. There is considerably more functionality in GPstuff than GPML.

6.3 Objectives for a new library

GPflow is motivated by the following goals:

1. Support for a variety of likelihoods and kernel functions.

2. Accurate approximation of intractable quantities of interest.

3. Speed, particularly on large datasets.

4. Verifiably correct implementation of its component algorithms.

5. An intuitive user interface.

6. Easily extensible code.

We argue that these simultaneous objectives can be met better than they are by existing packages, and that this is realised in GPflow. We proceed by explaining the reasoning we followed in designing our software to meet these goals.

6.4 Key features for meeting the objectives

To best meet the key goals of our library, we were led to a project that had all of the following distinguishing features:

(i) The use of variational inference as the primary approximation method to meet the twin challenges of non-conjugacy and scale.

(ii) Relatively concise code which uses automatic differentiation to remove the burden of gradient implementations.

(iii) The ability to leverage GPU hardware for fast computation.

(iv) A clean object oriented Python front end.

(v) A dedication to testing and open source software principles.

We now explain how each of these features relates to our stated objectives in turn.

The assertion that the use of variational inference allows us to meet the challenges of non-conjugacy and scale is one of the main claims explained and supported in the rest of this thesis. For instance, see chapter 4 for more discussion of this point. With regard to the specific stated goals of this section, this means that variational inference is a good algorithmic framework for meeting objectives 1, 2 and 3.

In the past, much developer time in the variational inference and Gaussian process communities has been spent implementing gradients on a case by case basis. This is because in both cases the algorithms used typically rely heavily on gradient based optimization of an objective function. The development is time consuming and leads to relatively long and complex code. By shortening the code we can make it easier to extend the software (objective 6) and to deliver verifiably correct code (objective 4) through good test coverage. Automatic differentiation is therefore extremely helpful in advancing our objectives.

The utility of commodity graphics processing units (GPUs) for speeding up general scientific applications and other areas of machine learning is well established. In the experiments of section 6.7, we will demonstrate that using GPUs can significantly speed up a Gaussian process approximation for a large dataset. Thus using GPUs is important for creating fast software (objective 3).

We argue that a clean object oriented front end is the most natural way to interface with a Gaussian process library. In this paradigm, the progress of a typical Gaussian process modelling task becomes the life cycle of a corresponding Model object; a sketch of this life cycle is given after the next paragraph. The object can then naturally be given attributes such as data and model parameters. Methods correspond to things we do with or to the Model object during the modelling process, such as making predictions or training the parameters. Python is a popular and elegant language for object oriented programming. It is relatively lightweight and approachable. By contrast, we argue that MATLAB is not well suited to object oriented programming because object oriented concepts are not well implemented.

Table 6.1 A summary of the features possessed by existing Gaussian process libraries at the time of writing. OO stands for object oriented. In the GPU column, GPLVM denotes the Gaussian process latent variable model and SVIGPC the stochastic variational inference Gaussian process classification discussed in chapter 4 and in the experiments of section 6.7. For more discussion of the comparison between GPy and GPflow, see the text.

Library    Sparse variational inference    Automatic differentiation    GPU demonstrated    OO Python front end    Stated test coverage
GPML       ✗                               ✗                            ✗                   ✗                      N/A
GPstuff    Partial                         ✗                            ✗                   ✗                      N/A
GPy        ✓                               ✗                            GPLVM               ✓                      49%
GPflow     ✓                               ✓                            SVIGPC              ✓                      99%

As a result, most MATLAB libraries use a procedural paradigm. This proclivity towards procedural design disadvantages Gaussian process libraries that use MATLAB, because of the aforementioned advantages of object oriented structure. To summarize, an object oriented Python interface helps our goals of an intuitive user interface (objective 5) and ease of extensibility (objective 6).

Good test coverage helps both the quality and speed of development on a project. The latter point is counter-intuitive, but we argue that coding on an unsteady foundation is often held up by impenetrable bugs. Research software will be most impactful when the whole process is transparent and verifiable. This means having open source code, publicly verifiable test coverage, and ideally a public record of code reviews. We believe that our dedication to these principles is shared by the developers of other Gaussian process libraries. A big component of our relative success in providing high test coverage is the code reduction we obtained using automatic differentiation (feature (ii)). To reiterate, we view testing and adherence to open source software principles as essential procedures in providing verifiably correct algorithms (objective 4).
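To make the Model object life cycle described above concrete, the following is a minimal usage sketch for a regression task. The class and method names resemble those introduced later in this chapter (table 6.2 and section 6.6), but the exact import paths and calls are illustrative assumptions rather than a definitive description of the API.

    import numpy as np
    import gpflow  # import path assumed for illustration

    # Data for a toy one dimensional regression task.
    X = np.random.rand(100, 1)
    Y = np.sin(12. * X) + 0.1 * np.random.randn(100, 1)
    X_new = np.linspace(0., 1., 50)[:, None]

    # The modelling task is the life cycle of a Model object: construct it
    # with data and a kernel, train the parameters, then make predictions.
    model = gpflow.gpr.GPR(X, Y, kern=gpflow.kernels.RBF(1))
    model.optimize()
    mean, var = model.predict_y(X_new)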

We now give a summary, in table 6.1, of which Gaussian process libraries possess the distinguishing features discussed. Looking at the table we can see that the closest library to our own is GPy. We therefore now discuss the relationship of GPflow to GPy in more detail. The GPflow interface is based on the GPy interface because, as we have already discussed, we believe an object oriented Python interface to be the best way to interface with a Gaussian process library. Further, we believe that, as an example of such an interface, the GPy implementation is essentially correct. The object oriented GPy architecture influenced that of GPflow but, as we shall see later, an important difference between GPflow and GPy is that GPflow uses TensorFlow for its core computations rather than numeric Python. This difference significantly affects the general requirements of the architecture. The use of GPUs to speed up training for the Gaussian process latent variable model (GPLVM) was discussed by Dai et al. (2014). The solution there was targeted towards that specific model and is essentially a CUDA implementation of the kernel expectation computations that constitute the bottleneck of the corresponding training algorithm. It is this code that constitutes the current GPU functionality in GPy. By contrast, the GPflow implementation targets a broad variety of GPU capability. Although GPU coverage is not total, much of the key functionality has a GPU implementation. We show considerable speed improvements for a realistic use case in section 6.7, and outline the next steps for further reductions in compute time. At the time of writing, GPy supports some functionality that GPflow does not. For example, the GPLVM is not yet implemented in GPflow and GPy has a larger variety of kernels. In this sense the core functionality we have at the moment is closer in spirit to GPML, which as already stated is often extended by the research community. We hope that the emphasis we have placed on extensibility (objective 6) will mean that GPflow is also used in this way. Having established a desirable set of key design features, the question arises as to how best to engineer a Gaussian process library to achieve them. A central insight here is that:

Many of the features we highlight are well supported in neural network libraries.

By ‘neural network libraries’ we mean packages like TensorFlow, Theano and Torch. Backpropagation is one of the standard training tools in the neural network community. Neural network software has made working with neural networks easier by using automatic differentiation (feature (ii)) to reduce the coding overhead for a user. The utility of GPUs for speeding up neural network training is well established and thus the best neural network software packages have sophisticated code for interfacing with such hardware (feature (iii)). The next section argues that of the available neural network software, TensorFlow is the right choice for our needs.

6.5 TensorFlow

In the last section, we were led to the conclusion that a Gaussian process library could benefit from the use of a neural network library. Our contention is that TensorFlow has a claim to being the best such software in general, and is also best suited to the specific needs of Gaussian process software. In this section, we first explain TensorFlow in more depth and then detail our argument for its general and specific utility. Although using TensorFlow led to large benefits in our project, we did have some requirements that were not shared by the neural network community.

To this end it was also necessary to contribute new functionality to TensorFlow, as we shall also relate.

The TensorFlow white paper (Abadi et al., 2015), which is the main reference for this section, describes TensorFlow as “an interface for expressing machine learning algorithms, and an implementation for executing such algorithms”. This delineation is made particularly cleanly in TensorFlow. In TensorFlow a computation is described as a “stateful dataflow graph”. The graph is a directed graph where the nodes represent Operations (Ops for short) and the edges represent Tensors. As a directed edge, a Tensor represents the flow of some data between computations. Ops are recognisable mathematical functions such as addition, multiplication and so on. Kernels (in an unfortunate collision in terminology with the Gaussian process literature) are implementations of a given Op on a specific device such as a CPU or GPU. Further details of the interface can be found in the TensorFlow white paper (Abadi et al., 2015).

The clear separation between interface and implementation means that TensorFlow is very flexible and portable. The Google internal implementation of TensorFlow is not identical to the open source one, although they share an interface. TensorFlow implementations have been run on large multi-server GPU clusters and on mobile devices.

Like almost all modern neural network software, TensorFlow comes with the ability to automatically compute the gradient of an objective function with respect to some parameters. In TensorFlow this is neatly viewed as a function that takes a computational graph corresponding to a forward computation and extends it to perform the desired gradient computation.

The computational graph abstraction is also useful for parallelization. Parallelization can often be visualized as a partitioning of the computational graph into subgraphs and then ‘fixing’ the breaks with special messaging Ops, which may for instance represent communication between servers. TensorFlow has heuristics to perform this partitioning automatically and allocate the subgraphs to the available hardware. The TensorFlow open source implementation comes with a wealth of GPU kernels for the majority of Ops. This code represents many months of developer activity and is an important resource for the larger community. The combination of device level parallelization and GPU functionality allows TensorFlow graphs to scale to hundreds of GPUs.

Since Google released an open source version of TensorFlow there has been widespread uptake. The TensorFlow open source release has been the most asked about package in its category on Stack Overflow since its release (Rao, 2016). TensorFlow was the most forked of any repository on GitHub in 2015, despite being released in November of that year (Rao, 2016).

Google has continued to manage and contribute to the repository. This has the effect of ‘anchoring’ the codebase and enforces exacting coding standards.
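As a small illustration of the Ops, Tensors and gradient extension described above, the following sketch builds a tiny dataflow graph and asks TensorFlow to extend it with gradient Ops. The example is our own and uses the session-based TensorFlow API of the time, so some function names may differ in later releases.

    import tensorflow as tf

    # Two Variables holding graph state.
    a = tf.Variable(1.0)
    b = tf.Variable(2.0)

    # Ops (multiply, add, square) connected by Tensors.
    y = tf.square(a * b + 3.0)

    # tf.gradients extends the forward graph with Ops computing dy/da and dy/db.
    grads = tf.gradients(y, [a, b])

    with tf.Session() as sess:
        sess.run(tf.initialize_all_variables())  # initializer name in the 0.x API
        print(sess.run(y))      # 25.0
        print(sess.run(grads))  # [20.0, 10.0]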

6.5.1 The comparative advantages of using TensorFlow

Having argued in section 6.4 that it was beneficial to use neural network libraries within Gaussian process software, we here compare TensorFlow to other similar software. We will consider the main alternative packages in turn.

Theano (Theano Development Team, 2016) is, in many ways, similar to TensorFlow. Historically Theano came first, and in fact some of the original developers of Theano moved to Google to develop TensorFlow. It is possible that projects will continue to use Theano for legacy reasons. For us the crucial factors in choosing TensorFlow over Theano were the shorter compile times and the excellent TensorFlow distributed code.

Torch (Collobert et al., 2002) is a strong competitor to TensorFlow and has strong industrial support from the likes of Facebook and Twitter. For us the crucial factor in choosing TensorFlow over Torch was that we felt Python would be a better language for the user interface than Lua.

Although not generally classified as a neural network package, the Stan Math library (Carpenter et al., 2015) has much in common with TensorFlow, since it implements reverse mode differentiation and is written in C++ and Eigen. The greater distributed and GPU computing functionality was what led us to choose TensorFlow over this library.

HIPS Autograd is a library for “differentiating native Python and Numpy code” (Maclaurin et al., 2016). Whilst working in native Python is appealing, distributed and GPU computing functionality was again the key differentiator.

6.5.2 Contributing GP linear algebra requirements to TensorFlow

Although we gained significantly from using TensorFlow within GPflow, there were some linear algebra capabilities required for our purposes that were not yet present in the software. We therefore added this functionality to TensorFlow. In this section, we explain the linear algebra requirements and give an overview of the solution.

Gaussian process software needs the ability to solve systems of linear equations using common linear algebra algorithms. Take, as a simple example, the exact log marginal likelihood log p(y | θ, ϵ²) of a Gaussian process model with kernel K, zero mean function and a Gaussian likelihood

log p(y | θ, ϵ²) = log N(y | 0, K_{D,D}(θ) + ϵ²I)    (6.1)
                 = −(1/2) yᵀ(K_{D,D}(θ) + ϵ²I)⁻¹y − (1/2) log det(K_{D,D}(θ) + ϵ²I) − (N/2) log(2π).    (6.2)

Here y is a vector of observed outputs, θ represents the kernel hyperparameters, ϵ² is the variance of the Gaussian likelihood and N is the number of data points. Using the notation

of section 1.2.3, K_{D,D} is a square kernel Gram matrix for the data inputs D, and as such is a function of the covariance hyperparameters θ. In what follows we will not always show the functional dependence on the noise variance ϵ² and the covariance hyperparameters θ explicitly. This will allow us to keep the equations compact.

A basic requirement of a Gaussian process library is the ability to maximize the log marginal likelihood log p(y | θ, ϵ²) with respect to θ and ϵ². This is known as type-II likelihood maximization and is discussed in section 1.2.4. To do this we need to be able to compute the log marginal likelihood and its derivatives with respect to the likelihood and covariance hyperparameters.

To compute the log marginal likelihood in a numerically stable way, a common method uses the Cholesky decomposition of the symmetric positive definite matrix K = K_{D,D} + ϵ²I, which satisfies the relation

LLᵀ = K.    (6.3)

We then define the vector α by

α = L⁻¹y    (6.4)

and the scalar β by

β = log det K.    (6.5)

The log marginal likelihood is then related to these quantities by the equation

log p(y | θ, ϵ²) = −(1/2) αᵀα − (1/2) β − (N/2) log(2π).    (6.6)

The vector α may be stably computed by solving the linear equation

Lα = y    (6.7)

    import numpy as np
    import tensorflow as tf

    # Inputs:
    #   y        the observed data, an N x 1 vector.
    #   Kdd      the data Gram matrix K_{D,D}.
    #   N        the number of data points.
    #   variance the Gaussian likelihood variance.
    # Outputs:
    #   logP     the log marginal likelihood.

    # eye(N) denotes the N x N identity matrix (a library helper here;
    # tf.eye in recent versions of TensorFlow).
    K = Kdd + variance * eye(N)
    L = tf.cholesky(K)
    alpha = tf.matrix_triangular_solve(L, y, lower=True)
    # log det K computed from the Cholesky factor, equation (6.8).
    beta = 2. * tf.reduce_sum(tf.log(tf.diag_part(L)))
    logP = -0.5 * tf.reduce_sum(tf.square(alpha)) - 0.5 * beta - 0.5 * N * np.log(2. * np.pi)

Fig. 6.1 A simple TensorFlow implementation of the log marginal likelihood of a Gaussian process with a Gaussian likelihood, as discussed in section 6.5.2. In conjunction with the text of that section, most operations used are self explanatory, except perhaps reduce_sum which sums over all input elements. In GPflow, the equivalent code is distributed across multiple functions for architectural reasons, as will be discussed in section 6.6.2.

for α using forward substitution. The scalar β can also be computed using the Cholesky matrix by using its definition (6.5) and noting that

log det K = 2 ∑_{i=1}^{N} log L_{i,i}.    (6.8)

Simple Python TensorFlow code for specifying the log marginal likelihood computational graph in this way is shown in figure 6.1. As we will discuss later in section 6.6.2, the equivalent GPflow source code for this example is distributed across multiple functions for architectural reasons. This example demonstrates that the ability to perform linear algebra operations, such as computing the Cholesky decomposition and solving triangular systems of linear equations by forward substitution, is an important component of Gaussian process libraries.

If each Op in the TensorFlow code for the log marginal likelihood (figure 6.1) has gradient code, then we can compute the parameter derivatives of the objective automatically. TensorFlow was open sourced with many common linear algebra Ops, but without the ability to differentiate computational graphs using them in all cases. Much of the necessary functionality can be implemented efficiently using the existing Ops and was contributed by the community over the following months. However, the differentiation of computational graphs that use the Cholesky decomposition required a new Op, which we contributed with the help of Rasmus Munk Larsen at Google, who reviewed the code.
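For instance, with gradients registered for every Op, the type-II maximum likelihood derivatives discussed above reduce to a single call. A sketch, assuming the graph of figure 6.1 has been built with variance (and any kernel hyperparameters inside Kdd) created as TensorFlow Variables:

    # Extends the figure 6.1 graph with Ops for d(logP)/d(variance),
    # backpropagating through the Cholesky decomposition.
    dlogP_dvariance = tf.gradients(logP, [variance])[0]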

We briefly discuss the Cholesky gradient algorithm at a high level. Whilst there are analytic linear algebra expressions for the backpropagated derivatives of a Cholesky decomposition, for matrices of a reasonable size they are slower than an alternative family of approaches (Murray, 2016b). Starting with the work of Smith (1995), the idea is to take an implementation of the Cholesky decomposition and differentiate it line by line. It turns out this can be done efficiently, in place in memory, and using vectorized operations. Initially, Smith differentiated the unblocked Cholesky algorithm. The blocked Cholesky decomposition, which is used in most modern linear algebra implementations, can be much faster on a modern CPU. Murray (2016b), who also gives an extensive literature review, suggested differentiating this more complex algorithm, which leads to a faster backpropagation algorithm. It is this algorithm that we implemented in C++ and Eigen, and then integrated into TensorFlow. The code, along with some comments, is included in Appendix F.

The inclusion of Cholesky backpropagation in TensorFlow means that the Cholesky decomposition Op can be composed with other Ops which have gradients, to make a computational graph that is differentiable overall. As we have already noted, the Cholesky decomposition is a necessary step in the preferred method for solving linear systems in Gaussian process libraries, so this removes one of the main barriers to Gaussian process inference in TensorFlow.
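To give a flavour of what the backpropagated Cholesky sensitivities look like, the following NumPy sketch implements the compact analytic expression (see Murray, 2016b) rather than the blocked algorithm: given Σ = LLᵀ and the sensitivities L̄ = ∂f/∂L of some objective f, it returns ∂f/∂Σ. We stress that this is for illustration only; the Op we contributed implements the faster blocked, in-place algorithm described above.

    import numpy as np
    from scipy.linalg import solve_triangular

    def Phi(A):
        """The lower triangle of A with its diagonal halved."""
        out = np.tril(A)
        out[np.diag_indices_from(out)] *= 0.5
        return out

    def cholesky_backprop(L, Lbar):
        """Given L = chol(Sigma) and Lbar = df/dL, return df/dSigma."""
        P = Phi(L.T @ Lbar)
        S = solve_triangular(L, P, lower=True, trans='T')      # L^{-T} P
        S = solve_triangular(L, S.T, lower=True, trans='T').T  # (L^{-T} P) L^{-1}
        return 0.5 * (S + S.T)                                 # symmetrize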

6.6 GPflow

Having discussed the motivation and high level design concepts of GPflow, along with the fundamental role that TensorFlow plays in achieving our goals, we now give more detail about the functionality and architecture of GPflow. We then detail procedures to ensure quality and usability of our software.

6.6.1 Functionality

GPflow supports exact inference where possible, as well as a variety of approximation methods. One source of intractability is non-Gaussian likelihoods, so it is helpful to categorize the available likelihood functionality on this basis. Another major source of intractability is the adverse scaling of GP methods with the number of data points. To this end we support ‘variationally sparse’ methods which use prior conditional matching (see section 2.2.4) to guarantee that the approximation is scalable and close in a Kullback-Leibler sense to the posterior. Whether or not a given inference method uses variational sparsity is another useful way to categorize it.

Table 6.2 A table showing the inference classes in GPflow and the model options they support. The details are discussed in the text.

                        Gaussian likelihood    Non-Gaussian likelihood (variational)    Non-Gaussian likelihood (MCMC)
Full covariance         GPR                    VGP                                      GPMC
Variational sparsity    SGPR                   SVGP                                     SGPMC

The inference options, which are implemented as classes in GPflow, are summarized in table 6.2. We now give more detail on each class in the order they appear in the table. The GPR class implements exact Gaussian process regression with a Gaussian likelihood. The VGP class implements variational inference without sparsity for non-Gaussian likelihoods, based on the method of Opper and Archambeau (2009). The GPMC class allows MCMC without any sparsity assumptions. The SGPR class implements the sparse variational regression method of Titsias (2009a), which is also discussed in detail in chapter 3. The SVGP class is the sparse variational approximation discussed in chapter 4. SGPMC implements the variationally sparse MCMC discussed in chapter 5. Note that all the MCMC based inference methods support a Bayesian prior on the hyperparameters, whereas all other methods assume a point estimate. Our main MCMC method is Hamiltonian Monte Carlo (HMC) (Duane et al., 1987; Neal, 2010). Conveniently, we can obtain all the necessary gradients for this algorithm automatically using TensorFlow.

A variety of likelihood functions are implemented in GPflow. At the time of writing we have Gaussian, Poisson, Exponential, Student-t, Bernoulli, Gamma, Beta and Multinomial likelihoods. One of our goals at the outset of the project was extensibility (section 6.3). As discussed in chapter 4, sparse variational inference with a variety of smooth likelihoods is possible. The minimal requirement in GPflow for supporting a new likelihood is to supply TensorFlow functions that return p(y_d | f_d) and its logarithm. The base requirements for likelihoods using the MCMC based methods are the same. Likelihoods in GPflow are implemented by inheritance from a base class, which takes these functions and performs Gauss-Hermite quadrature where necessary for variational inference. Optionally, these Gaussian quadratures may be overridden in daughter classes with an analytic expression, if one is available. There is no need to implement gradients for likelihood functions.
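As an illustration of this minimal requirement, a likelihood such as the Poisson might be written as in the following sketch. The base class location and the logp method signature are assumptions made for the purpose of the example; the essential point is that only the log density is supplied, as TensorFlow code, and no gradient code is needed.

    import tensorflow as tf
    from gpflow.likelihoods import Likelihood  # import path assumed

    class Poisson(Likelihood):
        """Poisson likelihood with a log link, so the rate is exp(f)."""
        def logp(self, F, Y):
            # log p(y | f) = y f - exp(f) - log(y!)
            return Y * F - tf.exp(F) - tf.lgamma(Y + 1.)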

Kernels supported in GPflow include the stationary RBF, exponential, Matérn, cosine and periodic kernels, as well as the non-stationary linear kernel. We also support kernels that are compositions of base kernels through addition or products. Implementing a new kernel is again very straightforward. The inference methods in GPflow do not make strong assumptions about the kernel Gram matrices, and since we can use automatic differentiation there is no need to write gradient code. We demonstrate this simplicity by showing the GPflow Python source code for the linear kernel in figure 6.2. The kernel implements the following functionality

K(x⁽¹⁾, x⁽²⁾) = ∑_i σ_i² x_i⁽¹⁾ x_i⁽²⁾.    (6.9)

If automatic relevance determination (ARD) is activated then there is a variance parameter σ_i² for each input dimension; otherwise there is a single shared variance parameter. Common kernel functionality is inherited from a base class in line 1. The key code that needs to be written for a new kernel is a TensorFlow implementation of the Gram matrix computation (line 23) and a way to compute the diagonal of the Gram matrix (line 30). The class Param used in lines 18 and 20 will be explained in section 6.6.2.

6.6.2 Architectural considerations

As already stated, the GPflow interface is very similar to that of the open source GPy. GPy also had some influence on the general architecture of GPflow. For more discussion of the similarities and differences between the two packages, see section 6.4. The whole Python component of GPflow is intrinsically object-oriented. The code for the various inference methods in table 6.2 is structured in a class hierarchy, where common code is pulled out into a shared base class. Typically, the classes at the bottom of the hierarchy have TensorFlow code that expresses key quantities in a TensorFlow graph.

As an example, consider the code extract from the Gaussian process regression (GPR) class in figure 6.3. For the present discussion, it is not essential to understand all aspects of this code; we will consider only those key features which are currently relevant. As can be seen in line 1 of the figure, the GPR class inherits from a GPModel class which carries some more generic code. The constructor definition in line 9 takes input data X and output data Y, as well as kernel and mean function objects. These are stored for use during the life cycle of the GPR object. The function build_likelihood starting on line 21 defines a TensorFlow computational graph for the log marginal likelihood log p(y | θ, ϵ²) defined in equation (6.1). The function calls a subroutine multivariate_normal which returns a computational graph for the relevant Gaussian density.

The object-oriented paradigm, whilst very natural for the Python layer of the code, has a different emphasis to that of a computational graph. As a directed graph, the computational graph is arguably closer to a functional concept.

     1  class Linear(Kern):
     2      """
     3      The linear kernel
     4      """
     5      def __init__(self, input_dim, variance=1.0, active_dims=None, ARD=False):
     6          """
     7          - input_dim is the dimension of the input to the kernel
     8          - variance is the (initial) value for the variance parameter(s)
     9            if ARD=True, there is one variance per input
    10          - active_dims is a list of length input_dim which controls
    11            which columns of X are used.
    12          """
    13          Kern.__init__(self, input_dim, active_dims)
    14          self.ARD = ARD
    15          if ARD:
    16              # accept float or array:
    17              variance = np.ones(self.input_dim) * variance
    18              self.variance = Param(variance, transforms.positive)
    19          else:
    20              self.variance = Param(variance, transforms.positive)
    21          self.parameters = [self.variance]
    22
    23      def K(self, X, X2=None):
    24          X, X2 = self._slice(X, X2)
    25          if X2 is None:
    26              return tf.matmul(X * self.variance, tf.transpose(X))
    27          else:
    28              return tf.matmul(X * self.variance, tf.transpose(X2))
    29
    30      def Kdiag(self, X):
    31          return tf.reduce_sum(tf.square(X) * self.variance, 1)

Fig. 6.2 A GPflow implementation of a linear kernel using the Python programming language. The details are discussed in section 6.6.1.

     1  class GPR(GPModel):
     2      """
     3      Gaussian Process Regression.
     4
     5      This is a vanilla implementation of GP regression with a Gaussian
     6      likelihood. Multiple columns of Y are treated independently.
     7
     8      """
     9      def __init__(self, X, Y, kern, mean_function=Zero()):
    10          """
    11          X is a data matrix, size N x D
    12          Y is a data matrix, size N x R
    13          kern, mean_function are appropriate GPflow objects
    14          """
    15          likelihood = likelihoods.Gaussian()
    16          X = DataHolder(X, on_shape_change='pass')
    17          Y = DataHolder(Y, on_shape_change='pass')
    18          GPModel.__init__(self, X, Y, kern, likelihood, mean_function)
    19          self.num_latent = Y.shape[1]
    20
    21      def build_likelihood(self):
    22          """
    23          Construct a tensorflow function to compute the likelihood.
    24
    25          \log p(Y | theta).
    26
    27          """
    28          K = self.kern.K(self.X) + eye(tf.shape(self.X)[0]) * self.likelihood.variance
    29          L = tf.cholesky(K)
    30          m = self.mean_function(self.X)
    31
    32          return multivariate_normal(self.Y, m, L)

Fig. 6.3 An extract from the Python GPR class which implements Gaussian process regression in GPflow. The details are discussed in section 6.6.2.

There is some ability to hold persistent state in a TensorFlow computational graph through Variables, and this takes it away from a pure functional construct. These two programming paradigms, namely a largely functional computational graph and an object-oriented interface, need to live cleanly together in GPflow. This is achieved through the Param class, which allows parameters to be both properties of an object, which can be manipulated as such, and Variables in a TensorFlow graph, which can be optimized.
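A minimal sketch of the idea behind Param follows. The attribute and method names are illustrative assumptions; the point is that a single object carries both a NumPy value for the Python side and a tf.Variable for the graph side.

    import numpy as np
    import tensorflow as tf

    class Param:
        """A parameter living in both the object-oriented and graph worlds."""
        def __init__(self, value):
            self.value = np.asarray(value, dtype=np.float64)  # Python-side state
            self.tf_variable = None                           # graph-side state

        def make_tf_variable(self):
            # Called when the owning model builds its TensorFlow graph. The
            # optimizer updates the Variable; the result can then be copied
            # back into .value so the Python object stays in sync.
            self.tf_variable = tf.Variable(self.value)
            return self.tf_variable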

6.6.3 Project quality and usability considerations

To help with our goal of verifiably correct software we have a number of project rules and procedures. All GPflow source code is openly available on GitHub. This means that science done with GPflow is reproducible and transparent. This open attitude extends to the version history of the code, which is also all publicly available. The project uses continuous integration to run an automated test suite. Both the result of the automated tests and the test coverage are publicly visible on the front page. As already mentioned in table 6.1 and section 6.4, the test code coverage for GPflow is higher than that of similar packages where coverage statistics are published, achieving a level of 99%. The addition of new code to the GPflow master branch can only be done by a pull request. Our infrastructure ensures that it is not possible to merge a pull request that reduces the test coverage or fails the tests. One reason that we have been able to achieve high test coverage is the reduction in code that comes with not having to implement gradients by hand.

We regard test coverage as a necessary, but not sufficient, condition for good testing of a code base. Care is taken to make sure that tests check functionality rather than just run the code. The majority of tests are unit tests that check the correct implementation of a certain mathematical function. There are also scenario style tests that check a lot of functionality at once. For instance, when there are as many inducing points as data points and the inducing inputs equal the data inputs, sparse variational Gaussian process regression and exact Gaussian process regression should give identical results. Such sanity checks are useful because they test a lot of complex functionality at once.
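A sketch of such a scenario test is given below. The model classes follow table 6.2, but the import paths and method names are assumptions for the purpose of illustration.

    import numpy as np
    from gpflow import GPR, SGPR, kernels  # import paths assumed

    def test_sgpr_matches_gpr_when_inducing_inputs_equal_data_inputs():
        rng = np.random.RandomState(0)
        X = rng.rand(20, 1)
        Y = np.sin(10. * X) + 0.1 * rng.randn(20, 1)
        exact = GPR(X, Y, kern=kernels.RBF(1))
        sparse = SGPR(X, Y, kern=kernels.RBF(1), Z=X.copy())  # Z = X: bound is tight
        np.testing.assert_allclose(exact.compute_log_likelihood(),
                                   sparse.compute_log_likelihood(), rtol=1e-5)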

The project has a culture of code review. By agreement, a pull request may not be merged by a major contributor of code to that change. Instead, another GPflow developer reviews the code and makes a decision as to whether to merge. The discussion of the pull request may be seen on a web page and anyone may comment on potential new code.

A user manual for GPflow can be found at http://gpflow.readthedocs.io. The documentation is generated from Jupyter notebooks made during the development of GPflow by the contributors. It includes both example code and plots of the results. As such it is a good next port of call for people who want to use GPflow.

6.7 Experiments

In this section, we show that for a real use case our design decisions have enabled us to deliver on our goal of relatively fast software. Code for these timing experiments can be found at https://github.com/alexggmatthews/GPflow_profiling.

As a scenario, we chose to study the training algorithm from this thesis that took the longest, namely optimizing the multiclass MNIST classifier of section 4.5.5 using stochastic variational inference. We compared against the popular GPy software which, as discussed in section 6.4, is the closest in terms of features to GPflow. None of the other libraries discussed support this algorithm. Functionally, the algorithms are nearly identical in GPflow and GPy. In both cases we used the AdaDelta optimizer. The optimizer implementations required us to recalibrate the step sizes. Running the two implementations for the same number of cycles, we obtained accuracies within 0.1% of one another, despite the stochasticity of the algorithm. The minibatch size and number of inducing points were the same in both implementations and were unchanged from section 4.5.5.

Under controlled conditions, we did a series of trials, on each occasion measuring the time each package took to perform 50 iterations of the algorithm. For each set of trial conditions we repeated the experiment five times. To best counter any systematic bias from possible drift in base compute speed, we interleaved the trials of the two packages and the various experimental setups. The trials included a set of CPU experiments, where we varied the number of threads available to the two packages. For GPflow, we also measured the effect of adding a GPU on top of the maximal number of CPU threads considered. GPy does not presently have a GPU implementation of this algorithm.

GPy was set up as follows. We used the recent stable release v1.4.1. For the multiclass likelihood we used our own Cython implementation, which we heavily optimized for a previous paper. We confirmed that we were using the Intel Math Kernel Library optimizations of numeric Python that are released with the popular Anaconda Python bundle.

We now detail our GPflow settings. We used the GPflow branch profiling_mods, which was taken from the master branch and modified to allow us to manipulate the number of TensorFlow threads. We used float32 as the main TensorFlow real valued data type.

The hardware used was a single Linux research workstation, which had both CPUs and a GPU. We used an Intel Core i7-4930K CPU clocked at 3.40GHz. This model has six true cores with an additional six hyperthreads. In our experience, a hyperthread cannot be treated as a totally independent entity with intensive numerical code, so we conservatively limited ourselves to using a maximum of six threads in the experiments. The GPU was an NVIDIA GM200 GeForce GTX Titan X. None of the experiments reached any of the system memory limits.
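For concreteness, the shape of a single timing trial described above is sketched below; the function and variable names are ours and do not come from the released profiling code.

    import time

    def iterations_per_second(run_one_iteration, n_iterations=50):
        # Time n_iterations optimizer steps and return the rate.
        start = time.time()
        for _ in range(n_iterations):
            run_one_iteration()
        return n_iterations / (time.time() - start)

    # Trials of the two packages and the various thread settings are
    # interleaved across the five repeats to counter drift in machine load.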
The results of the timing experiments are shown in figure 6.4. We will consider the CPU comparison first. For one to four CPU threads, the speeds for GPflow and GPy are very similar. Given that efforts have been made to optimize both packages, this possibly means that in this case both libraries are near the full potential of the hardware. For five and six threads, we note that GPy has a lower mean speed with a high variance. We suggest that this could be a result of occasional difficulties synchronizing the threads. For both packages we note that there are diminishing returns in using additional CPU threads.

The GPU results are also shown in figure 6.4. It can be seen that the increase in speed from adding a GPU is considerable.

Having given a detailed technical analysis of the speed of GPflow, it is also useful to consider the practical implications of this work for research. The MNIST training algorithm uses 200,000 iterations of stochastic variational inference. Based on the measurements we have made, training using GPflow with six CPU threads would take approximately 41 hours, or just under two days. Adding a GPU would currently reduce the training time to about 5 hours. These gains could make a significant difference to the workflow of a researcher on this topic and allow avenues of research that would not otherwise have been practical.

6.8 Conclusion

In this chapter we have introduced GPflow, a new Gaussian process library which uses TensorFlow. We discussed our goals for the project, which we believe we have achieved. We offer a broad variety of core GP functionality, for instance supporting a variety of likelihoods and kernels. We use recent research progress to approximate intractable quantities of interest. In this direction, we have demonstrated empirically that the software is relatively fast, particularly on large datasets. We provide an intuitive user interface and our code is easily extensible by other researchers. Based on visible quality assurance procedures, people using or extending the code gain confidence that it functions correctly.

We achieved our objectives by identifying and pursuing the key distinguishing features which we motivated in this document. Variational inference provides us with a unified and flexible approximation framework. The use of automatic differentiation enabled us to reduce the amount of code necessary for the project.

[Figure 6.4 plot: iterations per second against the number of CPU threads n, for three series: GPflow with GPU and all CPU threads; GPflow with n CPU threads; GPy with n CPU threads.]

Fig. 6.4 A comparison of iterations of stochastic variational inference per second on the MNIST dataset for GPflow and GPy. Error bars shown represent one standard deviation computed from five repeats of the experiment.

We exploited GPU hardware to significantly speed up our algorithms. Using the Python programming language enabled an intuitive object oriented interface to the software. Throughout the development of the code we followed the principles of open source software development, for instance achieving 99% test coverage of our code. Having identified these features, a review of the available Gaussian process libraries showed that no existing package has them all. Engineering these features was made easier by the use of neural network software. Of the available packages of this type, we chose TensorFlow on the basis of a systematic comparison. Although TensorFlow helped our project considerably, we needed to add some linear algebra functionality used in Gaussian process libraries. In particular, we contributed reverse mode differentiation for the Cholesky decomposition to the project.

Having described the successful design process for GPflow, we then elaborated on the specifics of the software. The functionality was elucidated in more detail. We explained the architecture we chose and how quality and usability are ensured on the project. Finally, in systematic timed experiments, we compared GPflow to GPy on a real use case, namely that of stochastic variational inference training on the large MNIST dataset. We found that for small numbers of CPU threads GPflow and GPy have similar performance, although GPy showed some evidence of thread synchronization issues as the number of threads was increased. GPflow is currently the only package that can exploit GPU hardware for this algorithm, and we showed that the consequent speed gains are large enough to qualitatively improve the workflow for research in this area.

In terms of further work on the project, it will be divided into optimizing performance and introducing new functionality. The immediate priority in the performance direction is to work with the TensorFlow community to provide GPU kernels for more of the Ops we use. As this work progresses, it will open up the possibility of using multiple GPUs in parallel to further reduce execution time. New functionality will be divided into new features for the core library, new libraries from the GPflow team that depend on the core library and, we hope, more projects that use GPflow in the broader community. A likely growth area for new functionality is support for more complex models which are defined using Gaussian processes as a component. Examples include the Gaussian process latent variable model (Lawrence, 2003), the deep Gaussian process (Damianou and Lawrence, 2013) and the Gaussian process state space model (Frigola et al., 2014).

Chapter 7

Conclusion

7.1 Contributions

In this thesis we have studied the general problem of scalable variational inference in Gaussian process models. We have also contributed a new Gaussian process software library that includes these methods. Here we briefly summarize the main contributions.

Chapter 3 reconsidered the foundations of sparse variational approximations. The popular sparse variational approximation of Titsias (2009a) uses an optimization objective that is motivated using an augmentation argument. Such arguments state that variational inference in an augmented model, which includes augmentation parameters, is equivalent to variational inference in the original model. Under a broad definition of equivalence, we showed that such arguments are false. Fortunately, as we showed, the same optimization objective used in Titsias (2009a) can be rigorously derived under general conditions by considering a KL-divergence between the approximating process and the posterior process. This argument, which built on the prior work of Seeger (2003a; 2003b), therefore justifies the Titsias approximation framework, which works well in practice. To our knowledge this connection was unremarked in the literature prior to our work. We then used this theoretical framework to characterize broad conditions under which adding additional inducing points will yield a monotonic improvement in terms of KL-divergence. This is a property that many other sparse approximations do not have. We described the use of variational inducing random variables, which allow us to work with more expressive approximating families whilst maintaining the posterior process as the target distribution. We showed that variational interdomain approximations follow very naturally from this concept, whereas it is difficult to justify such extensions using an augmentation argument. Another example given of the explanatory power of the new theory is its ability to resolve challenging issues around applying sparse variational inference in models, like Cox processes, where the likelihood depends on infinitely many function points.

Chapter 4 addressed two commonly encountered challenges of inference in Gaussian processes simultaneously:

1. Approximate integration in non-conjugate Gaussian process modelling, using classification as the main focus.

2. The adverse O(N³) scaling with the number of input data points.

The fundamental nature of these two questions has led to a rich prior literature. A critical review of this literature showed that a number of key features emerge that need to be understood and integrated to obtain a competitive method. The features are as follows. The method must be able to deal with the non-analytic posterior associated with a non-Gaussian likelihood. Use of an M inducing point sparse method, where M ≪ N, reduces computational complexity from O(N³) to O(NM²). Where the underlying index set is continuous, it is very beneficial to allow inducing inputs that are not data inputs, which takes us beyond the so called ‘subset of data’ methods. In cases where an order O(NM²) method is still prohibitively slow because N is large, it is often possible to reduce the computational load by using a minibatch estimate of the training gradient. The challenge is to do this in a convergent way.

We showed that one unifying framework allows us to achieve all the desired features, namely sparse variational inference. Whereas many previous attempts in this direction have relied on questionable augmentation and opaque bounding arguments, the KL-between-processes view of chapter 3 allows us to give a lucid derivation of a method that works well in practice. In extensive experiments, we showed that the resulting algorithm gives competitive performance with the popular Generalized FITC approach. We demonstrated that the method can be scaled to give accurate results on datasets like the MNIST digit benchmark, which had been too challenging for Bayesian nonparametric methods.

Chapter 5 gave an example of how the theoretical view advocated in chapter 3 can lead to principled generalizations of sparse variational inference. It highlighted the concept of variational potentials, which allows interpretable analogies to familiar graphical model concepts. This insight clarifies how to relax the strong Gaussianity assumption on the inducing output distribution that was made in chapter 4. Since the implied optimal inducing output distribution is not analytically tractable, we used MCMC to draw samples instead. The result was an MCMC/variational hybrid method that has a flexible approximating distribution whilst maintaining the O(NM²) scaling characteristic of sparse methods. It also enabled us to extend the models we can approximate to include Gaussian process models which have a Bayesian prior on the hyperparameters.

We investigated Hamiltonian Monte Carlo (Duane et al., 1987; Neal, 2010) as a specific MCMC algorithm and gave a recipe for using the method in practice. In a series of empirical examples, we showed cases where variational MCMC gave qualitatively better approximations than the method of chapter 4. In other cases, we found that the simpler Gaussian approximation worked similarly well, which in those situations serves to validate the stronger assumptions the earlier method makes. We also gave experimental comparisons of the variational MCMC method to full O(N³) MCMC, finding that the hybrid method gave good agreement, with significantly faster computation.

Chapter 6 was about GPflow, a new Gaussian process library. Influenced by the contributions of the previous chapters in this work, we adopted variational inference as a key approximation method. The project was significantly helped by using TensorFlow, an open source computational graph framework to which we ourselves contributed linear algebra code. The Python layer of the software has an intuitive GPy-like (GPy, 2012) interface and an elegant object oriented architecture, as we detailed. Using automatic differentiation led to a substantial reduction in the code required. Adherence to open source principles has allowed us to ensure the quality of our software contribution; for instance, the code has 99% unit test coverage. The utility of GPU hardware to drastically speed up variational inference on a large dataset was demonstrated in timed experiments. The progress detailed led us to conclude that GPflow has the potential to improve the workflow for research on Gaussian processes, particularly at scale.

7.2 Some general considerations

7.2.1 A new theoretical beginning

It is humbling to think that much of chapter 3 could have been understood by Kolmogorov, whose early contributions included fundamental work on stochastic processes and information theory¹. From a broad perspective, the importance of theorems like 2.ii in chapter 3 is that they provide a confluence between a line of work mostly done in the field of machine learning and the larger mathematics literature. The new connection provides many research opportunities. Interdomain inducing points seem to be an underused tool, and it would be interesting to consider other continuous linear functionals. The extensive literature considering Gaussian processes from the perspective of Hilbert spaces may well be useful here.

¹ See for instance his collected works (Kolmogorov et al., 1992).

Since the inception of Bayesian nonparametrics, we have known that it is possible to work with infinite dimensional models using finite computation. The challenge of modelling increasingly large datasets necessarily leads us to consider a related issue: what are the general implications for Bayesian nonparametrics if the computation has to be not only finite but scalable? In this work, we have demonstrated examples where a good approximation can be found to the exact Bayesian posterior process at scale. Since the algorithms we used are very well specified theoretically, this raises the question of what quality guarantees we might derive for them. Seeger (2003b) has studied frequentist guarantees on Bayesian nonparametric models using PAC-Bayes theorems, but we speculate that this general area may now be ready to be taken further.

7.2.2 A third challenge

Further to the two challenges discussed in section 7.1, there is a third challenge for the Gaussian process community that I wish to highlight before closing:

3. To achieve good generalization with large high dimensional datasets.

This is, of course, not generally possible, but it has been demonstrated in important real world cases such as image classification benchmarks. It is currently an area of relative strength for deep neural networks. For example, whilst the results we report for the MNIST dataset in section 4.5.5 extended the scale at which accurate Bayesian nonparametric inference can be done, the accuracy achieved is still well behind neural network results without convolutional assumptions.

This is a problem with a long history. Neal's work (1996), showing that certain Bayesian neural networks correspond to Gaussian processes in the limit of large width, led people like Rasmussen and Williams (2006) to study them, motivated by a route to inference in flexible models that was relatively tractable and based on sound principles of probability. This still seems to us to be a good reason for researching these models. The question of whether some important representational power is lost in taking this limit was memorably summed up by MacKay (2002), who asked "did we throw out the baby with the bath water?".

One observation we would like to make is that the two difficult challenges of section 7.1 have hindered empirical research on these questions. One cannot hope to investigate the relationship between Gaussian processes, their deep extensions and representation learning without an approximation framework for GPs that is accurate, scalable and well understood. We hope that the ideas in this thesis may help in this direction as well.

References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. (2015). TensorFlow: Large-scale machine learning on heterogeneous systems.

Adler, R. J. (1981). The geometry of random fields. Society for Industrial and Applied Mathematics.

Álvarez, M. A. (2011). Convolved Gaussian process priors for multivariate regression with applications to dynamical systems. PhD thesis, University of Manchester.

Álvarez, M. A., Luengo, D., Titsias, M. K., and Lawrence, N. D. (2010). Efficient multioutput Gaussian processes through variational inducing kernels. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, pages 25–32.

Bauer, M. S., van der Wilk, M., and Rasmussen, C. E. (2016). Understanding probabilistic sparse Gaussian process approximations. arXiv preprint 1606.04820.

Bayer, J., Osendorfer, C., Diot-Girard, S., Ruckstiess, T., and Urban, S. (2016). Climin - a Pythonic framework for gradient-based function optimization. Technical report, Technical University of Munich.

Beal, M. J. (2003). Variational algorithms for approximate Bayesian inference. PhD thesis, Gatsby Computational Neuroscience Unit.

Billingsley, P. (1995). Probability and Measure. John Wiley & Sons, third edition.

Bishop, C. M. (2006). Pattern Recognition and Machine Learning (Information Science and Statistics). Springer.

Brown, R. (1828). A brief account of microscopical observations made in the months of June, July and August, 1827, on the particles contained in the pollen of plants; and on the general existence of active molecules in organic and inorganic bodies. Philosophical Magazine, pages 161–173.

Cao, Y., Brubaker, M. A., Fleet, D. J., and Hertzmann, A. (2013). Efficient optimization for sparse Gaussian process regression. In Burges, C., Bottou, L., Welling, M., Ghahramani, Z., and Weinberger, K., editors, Advances in Neural Information Processing Systems 26, pages 1097–1105. Curran Associates, Inc.

Capinski, M. and Kopp, P. (2004). Measure, Integral and Probability. Springer Undergraduate Mathematics Series. Springer London.

Carlin, B. and Louis, T. (2010). Bayes and Empirical Bayes Methods for Data Analysis, second edition. Chapman & Hall/CRC Texts in Statistical Science. Taylor & Francis.

Carpenter, B., Hoffman, M. D., Brubaker, M., Lee, D., Li, P., and Betancourt, M. (2015). The Stan Math Library: Reverse-mode automatic differentiation in C++. arXiv preprint 1509.07164.

Chai, K. M. A. (2012). Variational multinomial logit Gaussian process. Journal of Machine Learning Research, 13(1):1745–1808.

Ciresan, D. C., Meier, U., Gambardella, L. M., and Schmidhuber, J. (2011). Convolutional neural network committees for handwritten character classification. In ICDAR, pages 1250–1254.

Collobert, R., Bengio, S., and Mariéthoz, J. (2002). Torch: A modular machine learning software library.

Cox, R. T. (1946). Probability, frequency and reasonable expectation. American Journal of Physics, 14(1):1–13.

Csató, L. (2002). Gaussian processes: iterative sparse approximations. PhD thesis, Aston University.

Csató, L. and Opper, M. (2002). Sparse on-line Gaussian processes. Neural Computation, 14(3):641–668.

Dai, Z., Damianou, A., Hensman, J., and Lawrence, N. (2014). Gaussian process models with parallelization and GPU acceleration. arXiv preprint 1410.4984.

Damianou, A. and Lawrence, N. (2013). Deep Gaussian processes. In Carvalho, C. and Ravikumar, P., editors, Proceedings of the Sixteenth International Workshop on Artificial Intelligence and Statistics (AISTATS), pages 207–215.

Damianou, A. C., Titsias, M. K., and Lawrence, N. D. (2015). Variational inference for latent variables and uncertain inputs in Gaussian processes. Journal of Machine Learning Research (JMLR), 2.

De Finetti, B. (1974). Theory of Probability: a critical introductory treatment. Wiley Series in Probability and Mathematical Statistics. Wiley.

Doob, J. (1953). Stochastic Processes. Wiley Publications in Statistics. John Wiley & Sons.

Duane, S., Kennedy, A. D., Pendleton, B. J., and Roweth, D. (1987). Hybrid Monte Carlo. Physics Letters B, 195:216–222.

Efron, B. (2010). Large-scale inference: empirical Bayes methods for estimation, testing, and prediction. Institute of Mathematical Statistics Monographs. Cambridge University Press, Cambridge.

Einstein, A. (1905). Über die von der molekularkinetischen Theorie der Wärme geforderte Bewegung von in ruhenden Flüssigkeiten suspendierten Teilchen. Annalen der Physik, 322:549–560.

Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., and Lin, C.-J. (2008). LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874.

Figueiras-Vidal, A. and Lázaro-Gredilla, M. (2009). Inter-domain Gaussian processes for sparse inference using inducing features. In Bengio, Y., Schuurmans, D., Lafferty, J., Williams, C., and Culotta, A., editors, Advances in Neural Information Processing Systems 22, pages 1087–1095. Curran Associates, Inc.

Filippone, M. and Girolami, M. (2014). Pseudo-marginal Bayesian inference for Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(11):2214–2226.

Filippone, M., Zhong, M., and Girolami, M. (2013). A comparative evaluation of stochastic-based inference methods for Gaussian process models. Machine Learning, 93(1):93–114.

Frigola, R., Chen, Y., and Rasmussen, C. (2014). Variational Gaussian process state-space models. In Advances in Neural Information Processing Systems 27, pages 3680–3688. Curran Associates, Inc.

Gal, Y., van der Wilk, M., and Rasmussen, C. E. (2014). Distributed variational inference in sparse Gaussian process regression and latent variable models. In NIPS.

Ghahramani, Z. (2012). Bayesian non-parametrics and the probabilistic approach to modelling. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 371(1984).

Ghahramani, Z. and Beal, M. J. (2000). Graphical models and variational methods. In Opper, M. and Saad, D., editors, Advanced Mean Field Methods: Theory and Practice. MIT Press.

Ghosh, J. and Ramamoorthi, R. (2003). Bayesian Nonparametrics. Springer Series in Statistics. Springer.

Girolami, M. and Rogers, S. (2006). Variational Bayesian multinomial probit regression with Gaussian process priors. Neural Computation, 18(8):1790–1817.

Google (2016). Google C++ style guide. https://google.github.io/styleguide/cppguide.html.

GPy (since 2012). GPy: A Gaussian process framework in Python. http://github.com/SheffieldML/GPy.

Gray, R. M. (2011). Entropy and Information Theory. Springer-Verlag New York, Inc., New York, NY, USA, second edition.

Guennebaud, G., Jacob, B., et al. (2010). Eigen v3. http://eigen.tuxfamily.org.

Hensman, J., Fusi, N., and Lawrence, N. D. (2013). Gaussian processes for big data. In Conference on Uncertainty in Artificial Intelligence, pages 282–290.

Hensman, J. and Lawrence, N. D. (2014). Nested Variational Compression in Deep Gaussian Processes. arXiv preprint 1412.1370.
Hensman, J., Matthews, A. G. d. G., Filippone, M., and Ghahramani, Z. (2015a). MCMC for Variationally Sparse Gaussian Processes. In Advances in Neural Information Processing Systems 28, Montreal, Canada.
Hensman, J., Matthews, A. G. d. G., and Ghahramani, Z. (2015b). Scalable Variational Gaussian Process Classification. In 18th International Conference on Artificial Intelligence and Statistics, pages 351–360, San Diego, California, USA.
Hjort, N., Holmes, C., Müller, P., and Walker, S. (2010). Bayesian Nonparametrics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press.
Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. (2013). Stochastic Variational Inference. Journal of Machine Learning Research, 14:1303–1347.
Hunt, B. R., Sauer, T., and Yorke, J. A. (1992). Prevalence: A translation-invariant "almost every" on infinite-dimensional spaces. Bulletin of the American Mathematical Society, pages 217–238.
Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. (1999). An introduction to variational methods for graphical models. Mach. Learn., 37(2):183–233.
Kolmogorov, A. N., Chiriaev, A. N., and Lindquist, G., editors (1992). Selected Works of A. N. Kolmogorov. Volume II: Probability Theory and Mathematical Statistics. Kluwer Academic Publishers.
Kuss, M. and Rasmussen, C. (2005). Assessing approximate inference for binary Gaussian process classification. Journal of Machine Learning Research, 6:1679–1704.
Lawrence, N. (2003). Gaussian Process Latent Variable Models for visualisation of high dimensional data. In NIPS.
Lawrence, N., Seeger, M., and Herbrich, R. (2003). Fast Sparse Gaussian process methods: The Informative Vector Machine. In Advances in Neural Information Processing Systems 15, pages 625–632. MIT Press.
LeCun, Y. (2016). The MNIST database. http://yann.lecun.com/exdb/mnist/.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pages 2278–2324.
Lloyd, C., Gunter, T., Osborne, M., and Roberts, S. (2015). Variational Inference for Gaussian Process Modulated Poisson Processes. In Proceedings of The 32nd International Conference on Machine Learning, pages 1814–1822.
MacKay, D. J. (1997). Ensemble learning for hidden Markov models. Technical report, University of Cambridge.
MacKay, D. J. C. (2002). Information Theory, Inference & Learning Algorithms. Cambridge University Press, New York, NY, USA.

Maclaurin, D., Duvenaud, D., and Johnson, M. (2016). Autograd. https://github.com/HIPS/autograd.
Marimont, R. B. and Shapiro, M. B. (1979). Nearest neighbour searches and the curse of dimensionality. IMA Journal of Applied Mathematics, 24(1):59–70.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953). Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21:1087–1092.
Minka, T. (2001). Expectation propagation for approximate Bayesian inference. In Conference on Uncertainty in Artificial Intelligence, pages 362–369.
Møller, J., Syversveen, A. R., and Waagepetersen, R. P. (1998). Log Gaussian Cox processes. Scandinavian Journal of Statistics, 25(3):451–482.
Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective (Adaptive Computation and Machine Learning series). The MIT Press.
Murray, I. (2016a). https://github.com/imurray/chol-rev.
Murray, I. (2016b). Differentiation of the Cholesky decomposition. arXiv preprint 1602.07527.
Murray, I., Adams, R. P., and MacKay, D. J. (2010). Elliptical slice sampling. JMLR: W&CP, 9:541–548.
Naish-Guzman, A. and Holden, S. (2008). The generalized FITC approximation. In Platt, J., Koller, D., Singer, Y., and Roweis, S., editors, Advances in Neural Information Processing Systems 20, pages 1057–1064. MIT Press, Cambridge, MA.
Neal, R. M. (1996). Bayesian Learning for Neural Networks. Lecture Notes in Statistics. Springer, New York, Berlin, Paris.
Neal, R. M. (2010). MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 54:113–162.
Nickisch, H. and Rasmussen, C. E. (2008). Approximations for binary Gaussian process classification. Journal of Machine Learning Research, 9:2035–2078.
Opper, M. and Archambeau, C. (2009). The variational Gaussian approximation revisited. Neural Computation, 21(3):786–792.
Orbanz, P. and Teh, Y. W. (2010). Bayesian Nonparametric Models. In Encyclopedia of Machine Learning. Springer.
Pinski, F., Simpson, G., Stuart, A., and Weber, H. (2015a). Algorithms for Kullback-Leibler approximation for probability measures in infinite dimensions. SIAM Journal on Scientific Computing, 37:A2733–A2757.
Pinski, F., Simpson, G., Stuart, A., and Weber, H. (2015b). Kullback-Leibler approximation for probability measures on infinite dimensional spaces. SIAM J. Mathematical Analysis, 47:4091–4122.

Quiñonero-Candela, J. and Rasmussen, C. E. (2005). A unifying view of sparse approximate Gaussian process regression. J. Mach. Learn. Res., 6:1939–1959.
Rao, D. (2016). The unreasonable popularity of tensorflow. http://deliprao.com/archives/168.
Rasmussen, C. E. and Nickisch, H. (2010). Gaussian Processes for Machine Learning (GPML) toolbox. Journal of Machine Learning Research, 11:3011–3015.
Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. The MIT Press.
Rätsch, G., Onoda, T., and Müller, K.-R. (2001). Soft margins for AdaBoost. Mach. Learn., 42(3):287–320.
Robbins, H. and Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407.
Saul, A. D., Hensman, J., Vehtari, A., and Lawrence, N. D. (2016). Chained Gaussian processes. In 19th International Conference on Artificial Intelligence and Statistics, Cadiz, Spain.
Schervish, M. (1995). Theory of Statistics. Springer Series in Statistics. Springer.
Seeger, M. (2003a). Bayesian Gaussian process models: PAC-Bayesian generalisation error bounds and sparse approximations. PhD thesis, University of Edinburgh.
Seeger, M. (2003b). PAC-Bayesian generalisation error bounds for Gaussian process classification. J. Mach. Learn. Res., 3:233–269.
Sengupta, A. N. (2014). The Kolmogorov extension theorem.
Smith, S. P. (1995). Differentiation of the Cholesky algorithm. J. Comp. Graph. Stat., 4(2):134–147.
Snelson, E. and Ghahramani, Z. (2005). Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, pages 1257–1264.
Theano Development Team (2016). Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688.
Titsias, M. and Lázaro-Gredilla, M. (2014). Doubly stochastic variational Bayes for non-conjugate inference. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1971–1979. JMLR Workshop and Conference Proceedings.
Titsias, M. K. (2009a). Variational learning of inducing variables in sparse Gaussian processes. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, pages 567–574.
Titsias, M. K. (2009b). Variational Model Selection for Sparse Gaussian Process regression. Technical report.

Titsias, M. K., Lawrence, N., and Rattray, M. (2011). Markov chain Monte Carlo algorithms for Gaussian processes. In Barber, D., Cemgil, A. T., and Chiappa, S., editors, Bayesian Time Series Models.
Titsias, M. K. and Lawrence, N. D. (2010). Bayesian Gaussian process latent variable model. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics.
Turner, R. E. and Sahani, M. (2011). Probabilistic amplitude and frequency demodulation. In Advances in Neural Information Processing Systems 24, pages 981–989. The MIT Press.
Vanhatalo, J., Riihimäki, J., Hartikainen, J., Jylänki, P., Tolvanen, V., and Vehtari, A. (2013). GPstuff: Bayesian modeling with Gaussian processes. J. Mach. Learn. Res., 14(1):1175–1179.
Wang, Z., Mohamed, S., and De Freitas, N. (2013). Adaptive Hamiltonian and Riemann manifold Monte Carlo. In ICML, volume 28, pages 1462–1470.
Wiener, N. and Masani, P. (1976). Collected Works with Commentaries, volume 1. MIT Press.
Zeiler, M. D. (2012). ADADELTA: an adaptive learning rate method. arXiv preprint 1212.5701.
Zhu, C., Byrd, R. H., Lu, P., and Nocedal, J. (1997). Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Transactions on Mathematical Software (TOMS), 23(4):550–560.

Appendix A

Extending measures beyond the product σ-algebra

In section 1.1.4 we established that the product σ-algebra is missing some events that are potentially of interest in probabilistic models. In this section we outline some of the technical issues that surround consistently extending measures to a larger σ-algebra. One difficulty is that the number of potential extensions is infinite, and different extensions can give qualitatively different answers, as we now show through some examples. First we exhibit a simple example, due to Adler (1981), demonstrating that a random function is not fully specified by its finite dimensional marginals outside the product σ-algebra. Let $X = [0, 1]$ and define two random functions $f$ and $g$. All randomness will come from a single uniform random variate $\gamma$:

$$\gamma \sim \mathrm{Uniform}[0, 1] \,. \tag{A.1}$$

$f$ will be defined as:

$$f(x, \gamma) = 0 \quad \forall\, x, \gamma \,. \tag{A.2}$$

$g$ will be defined as:

$$g(x, \gamma) = \begin{cases} 0, & \text{if } x \neq \gamma \\ 1, & \text{if } x = \gamma \,. \end{cases} \tag{A.3}$$

These two random functions have the same finite dimensional marginals. Obviously the finite dimensional marginals of $f$ allocate all their probability to $f(t_i) = 0,\ i = 1, \dots, N$. This is also true for $g$ because, for any finite subset $\{x_1, \dots, x_N\}$ of $[0, 1]$, the probability that $\gamma$ lands in this subset is $0$. Thus the finite dimensional marginals of $g$ also allocate all their probability to $g(t_i) = 0,\ i = 1, \dots, N$. But compare the event

$$E_c = \{ \gamma : \text{the function is continuous } \forall\, x \in [0, 1] \} \tag{A.4}$$

which has probability 1 for $f$ and 0 for $g$.

The second example builds on the first and is adapted from Billingsley (1995). Consider the sample functions in figure 1.2. The plots were actually generated using a finite mesh of test points, which is then connected by the plotting tool. Intuition might suggest that such a process necessarily corresponds, in the limit of a very fine mesh, to some continuous underlying process. Here we will show that this intuition is wrong. Whilst there is a continuous version of the process in both cases, this is not a necessary consequence of the finite dimensional marginals. Intuition can be misleading in the theory of stochastic processes. Consider the Dirichlet function, the indicator function of the rationals $\mathbb{1}_{\mathbb{Q}} : \mathbb{R} \mapsto \{0, 1\}$. This function has the property that it is discontinuous everywhere. Now suppose that we have a random function $f(x, \omega)$, where we have made the random variable aspect of the function explicit by showing the dependence on $\omega$. Let us assume that, as a function of $x$, this function is everywhere continuous almost surely. Now again take $\gamma \sim \mathrm{Uniform}[0, 1]$ and then define:

$$g(x, \omega, \gamma) = f(x, \omega) + \mathbb{1}_{\mathbb{Q}}(x + \gamma) \,. \tag{A.5}$$

As in the previous example, the function $g$ has the same finite dimensional marginals as $f$, because the probability that $x + \gamma$ is rational is zero. But we now have a random function $g$ that is everywhere discontinuous almost surely.
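To make the key step explicit (a short calculation that is immediate from the construction): for any fixed inputs $x_1, \dots, x_N$,

$$\Pr\big( \exists\, i : x_i + \gamma \in \mathbb{Q} \big) = \Pr\Big( \gamma \in \bigcup_{i=1}^{N} (\mathbb{Q} - x_i) \Big) = 0 \,,$$

because a finite union of countable sets is countable and therefore has Lebesgue measure zero. Hence every finite dimensional marginal of $g$ coincides with the corresponding marginal of $f$.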
When are these issues encountered? The intuition for the dividing line is that questions involving finite and countable sets are addressed by the product σ-algebra, whereas intrinsically uncountable concepts like continuity, differentiability and boundedness need an extension. The theory of extensions beyond the product σ-algebra, in the case where the index set has a dense countable subset (for example, the rationals are a dense subset of the reals), was addressed by Doob (1953). Essentially the idea is to assume that the behaviour on the countable dense subset fully specifies the behaviour of the process. The process is then said to be separable.

Appendix B
Augmented models do not necessarily have the same variational optimum as unaugmented ones

In this section we prove theorem 4.(ii) numerically. It is sufficient to give one example where the augmented and unaugmented KL-divergences have different optima. Such an example was not difficult to find. We take the special case discussed in proposition 2, where $X = D$ and $I = Z$, and assume all relevant densities exist. We take the likelihood to be Gaussian, which means, as Titsias (2009a) showed, that his target KL-divergence has an analytic solution $q^*(f_Z)$ for the optimal inducing output distribution. We study the case of one inducing point and three randomly generated data points. The kernel variance, kernel lengthscale and noise variance are all fixed at 1.

If the divergences $\mathrm{KL}[\, q(f_{Z \cup D}) \,\|\, p(f_{Z \cup D} \mid Y) \,]$ and $\mathrm{KL}[\, q(f_D) \,\|\, p(f_D \mid Y) \,]$ have the same optima for all variational parameters, then substituting $q^*(f_Z)$ into both KL-divergences should give collapsed objectives that have the same optima for the one inducing input position. We plot the results of such an experiment in figure B.1. As is required by the theory, the unaugmented KL-divergence lower bounds the augmented one. Clearly the functions have different minima, so we have our counterexample and theorem 4.(ii) is proved.

[Figure B.1 shows two panels. Top: observed output against input position for the randomly generated regression data. Bottom: the augmented and unaugmented KL-divergences, in nats, against the inducing input position, swept from −1 to 6.]

Fig. B.1 Comparing an unaugmented KL-divergence to an augmented one on a simple randomly generated data set. Top: the randomly generated regression data. Bottom: a comparison of the two KL-divergences as a function of the single inducing input value. Clearly the two functions have different minima.
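The experiment is straightforward to reproduce. Below is a minimal sketch (illustrative code, not the original experiment script): it uses the standard identity that the augmented KL-divergence equals $\log p(y)$ minus the collapsed bound of Titsias (2009a), and computes the unaugmented divergence in closed form between the induced Gaussian $q(f_D)$ and the exact posterior, sweeping the single inducing input over a grid.

import numpy as np

def k(a, b, var=1.0, ls=1.0):
    # Squared exponential kernel with unit variance and lengthscale.
    return var * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

def gauss_kl(m0, S0, m1, S1):
    # KL[N(m0, S0) || N(m1, S1)] between multivariate normals.
    n = len(m0)
    S1inv = np.linalg.inv(S1)
    quad = (m1 - m0) @ S1inv @ (m1 - m0)
    return 0.5 * (np.trace(S1inv @ S0) + quad - n
                  + np.linalg.slogdet(S1)[1] - np.linalg.slogdet(S0)[1])

def log_gauss(y, S):
    # log N(y | 0, S).
    n = len(y)
    return -0.5 * (n * np.log(2 * np.pi) + np.linalg.slogdet(S)[1]
                   + y @ np.linalg.solve(S, y))

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 5.0, 3)                       # three random inputs
noise = 1.0                                        # fixed noise variance
Kdd = k(X, X)
y = rng.multivariate_normal(np.zeros(3), Kdd + noise * np.eye(3))

log_py = log_gauss(y, Kdd + noise * np.eye(3))     # exact log marginal likelihood
G = np.linalg.solve(Kdd + noise * np.eye(3), Kdd)  # exact posterior over f_D
post_m, post_S = G.T @ y, Kdd - Kdd @ G

for z in np.linspace(-1.0, 6.0, 8):
    Z = np.array([z])
    Kzz, Kzd = k(Z, Z), k(Z, X)
    Qdd = Kzd.T @ np.linalg.solve(Kzz, Kzd)
    # Collapsed Titsias bound; the augmented KL is log p(y) minus the bound.
    bound = log_gauss(y, Qdd + noise * np.eye(3)) - 0.5 * np.trace(Kdd - Qdd) / noise
    kl_aug = log_py - bound
    # Optimal q*(f_Z) = N(m, S) and the induced approximate q(f_D).
    Sigma = Kzz + Kzd @ Kzd.T / noise
    m = Kzz @ np.linalg.solve(Sigma, Kzd @ y) / noise
    S = Kzz @ np.linalg.solve(Sigma, Kzz)
    A = np.linalg.solve(Kzz, Kzd).T                # K_{D,Z} K_{Z,Z}^{-1}
    q_m, q_S = A @ m, Kdd - Qdd + A @ S @ A.T
    kl_unaug = gauss_kl(q_m, q_S, post_m, post_S)
    print(f"z = {z:5.2f}  augmented KL = {kl_aug:.4f}  unaugmented KL = {kl_unaug:.4f}")

Running such a script traces out the two curves of figure B.1: the unaugmented divergence is never larger than the augmented one, and the two are minimised at different inducing input positions.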
Appendix C
The equivalence between the inducing random variables framework and conditionally deterministic augmentation

Here we show that the variational inducing random variables approximation of section 3.5 corresponds to the augmentation framework of section 3.4 when the augmentation is conditionally deterministic. Let the measurable deterministic transformation in question be given by $h_\theta$. Throughout this section $A_I \subset \Omega_I$ and $A_X \subset \Omega_X$ will be measurable sets. A conditionally deterministic model augmentation has the property that:

$$P_I(A_I) = P_X\big( h_\theta^{-1}(A_I) \big) \,, \tag{C.1}$$

from which it follows that the joint distribution of $f_I$ and $f_X$ has the property

$$P_{X \cup I}(A_X \times \Omega_I) = P_I\big( h_\theta(A_X) \big) \,. \tag{C.2}$$

In the augmentation framework, the way to define the approximating measure is given in equation (3.32), which we repeat here suppressing some parameter dependencies:

$$\frac{dQ_{X \cup I}}{dP_{X \cup I}}(f_{X \cup I}) = \frac{dQ_I}{dP_I}\big( \pi_I(f_{X \cup I}) \big) \,. \tag{C.3}$$

Consider the marginal distribution $Q_X$ of this approximating measure on the original index set $X$. From (C.3) it has the property that

$$Q_X(A_X) := Q_{X \cup I}(A_X \times \Omega_I) = \int_{A_X \times \Omega_I} \frac{dQ_I}{dP_I}\big( \pi_I(f_{X \cup I}) \big) \, dP_{X \cup I}(f_{X \cup I}) \,. \tag{C.4}$$

Let $C = \{ f_{X \cup I} \in A_X \times \Omega_I : f_I = h_\theta(f_X) \}$. The complement of this set has measure zero under $P_{X \cup I}$. Therefore we have:

$$Q_X(A_X) = \int_C \frac{dQ_I}{dP_I}\big( \pi_I(f_{X \cup I}) \big) \, dP_{X \cup I}(f_{X \cup I}) \tag{C.5}$$
$$= \int_C \frac{dQ_I}{dP_I}\big( h_\theta(f_X) \big) \, dP_{X \cup I}(f_{X \cup I}) \tag{C.6}$$
$$= \int_{A_X \times \Omega_I} \frac{dQ_I}{dP_I}\big( h_\theta(f_X) \big) \, dP_{X \cup I}(f_{X \cup I}) \,. \tag{C.7}$$

Now applying the marginalization property of the integral we have

$$Q_X(A_X) = \int_{A_X} \frac{dQ_I}{dP_I}\big( h_\theta(f_X) \big) \, dP_X(f_X) \,. \tag{C.8}$$

This implies that $\frac{dQ_I}{dP_I}(h_\theta(f_X))$ is a version of the Radon-Nikodym derivative $\frac{dQ_X}{dP_X}$. But this is precisely the defining property of the variational inducing random variables approximation (3.38) with $I = R$, so the equivalence is proved.
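Although the statement is measure theoretic, it can be sanity-checked on a finite toy analogue, where Radon-Nikodym derivatives are simply ratios of probability mass functions. The sketch below (all names are illustrative, not from the thesis) builds a deterministic map h, reweights the joint prior on the graph of h as in (C.3), marginalises as in (C.4)-(C.7), and confirms agreement with the direct definition corresponding to (C.8):

import numpy as np

rng = np.random.default_rng(0)

# Finite toy spaces: Omega_X has 6 points, Omega_I has 3 points.
n_x, n_i = 6, 3
h = rng.integers(0, n_i, size=n_x)      # deterministic map h : Omega_X -> Omega_I

p_x = rng.dirichlet(np.ones(n_x))       # prior P_X on Omega_X
# Push-forward prior P_I(a) = P_X(h^{-1}({a})), the discrete analogue of (C.1).
p_i = np.array([p_x[h == a].sum() for a in range(n_i)])

# A positive density dQ_I/dP_I, normalised so that Q_I is a probability measure.
w = rng.uniform(0.5, 2.0, size=n_i)
w /= (w * p_i).sum()

# Joint prior P_{X u I}: all mass lies on the graph {(x, h(x))}.
p_joint = np.zeros((n_x, n_i))
p_joint[np.arange(n_x), h] = p_x

# Approximating joint, analogue of (C.3): reweight by dQ_I/dP_I at the I coordinate.
q_joint = p_joint * w[None, :]

# Marginalise over Omega_I, following (C.4)-(C.7).
q_x_from_joint = q_joint.sum(axis=1)

# Direct definition on the original space: dQ_X/dP_X = (dQ_I/dP_I) o h, as in (C.8).
q_x_direct = p_x * w[h]

assert np.isclose(q_x_from_joint.sum(), 1.0)
assert np.allclose(q_x_from_joint, q_x_direct)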
Appendix D
Properties of the normal distribution

D.1 Marginal Gaussian distributions

This identity can be found, for instance, in Bishop (2006), section 2.3.3. Consider two vector random variables $\mathbf{x}$ and $\mathbf{y}$. Let $\mathbf{x}$ have the marginal distribution

$$p(\mathbf{x}) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \Lambda) \tag{D.1}$$

Let $\mathbf{y}$ have the conditional distribution

$$p(\mathbf{y} \mid \mathbf{x}) = \mathcal{N}(\mathbf{y} \mid A\mathbf{x} + \mathbf{b}, L) \tag{D.2}$$

Then $\mathbf{y}$ has a marginal normal distribution given by

$$p(\mathbf{y}) = \mathcal{N}(\mathbf{y} \mid A\boldsymbol{\mu} + \mathbf{b}, L + A \Lambda A^T) \tag{D.3}$$
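As a quick numerical illustration of (D.1)-(D.3), the following sketch (arbitrary illustrative parameter values, not code from the thesis) samples from the generative process and compares empirical moments with the closed form:

import numpy as np

rng = np.random.default_rng(1)

# Illustrative parameter values for two 2-d vector random variables.
mu = np.array([0.5, -1.0])
Lam = np.array([[1.0, 0.3], [0.3, 0.8]])   # covariance of x
A = np.array([[2.0, 0.0], [1.0, -1.0]])
b = np.array([0.2, 0.7])
L = np.array([[0.5, 0.1], [0.1, 0.4]])     # conditional covariance of y given x

# Sample from the generative process x -> y.
n = 200_000
x = rng.multivariate_normal(mu, Lam, size=n)
y = x @ A.T + b + rng.multivariate_normal(np.zeros(2), L, size=n)

# Empirical moments should match the closed form (D.3).
assert np.allclose(y.mean(axis=0), A @ mu + b, atol=0.05)
assert np.allclose(np.cov(y.T), L + A @ Lam @ A.T, atol=0.05)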

D.2 Derivatives of Gaussian integrals

Consider a normally distributed random variable x.

$$p(x) = \mathcal{N}(x \mid \mu, \Lambda) \tag{D.4}$$

We are interested in the derivatives of the integral

$$I = \int_{\mathbb{R}} p(x) f(x) \, dx \tag{D.5}$$

with respect to the parameters of the normal distribution. These are given by

$$\frac{dI}{d\mu} = \int_{\mathbb{R}} p(x) \frac{\partial f(x)}{\partial x} \, dx \tag{D.6}$$

and

$$\frac{dI}{d\Lambda} = \frac{1}{2} \int_{\mathbb{R}} p(x) \frac{\partial^2 f(x)}{\partial x^2} \, dx \,. \tag{D.7}$$

These expressions have a long history, but their utility in variational inference was first noted by Opper and Archambeau (2009).
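Both identities are easy to verify numerically. The following sketch (an illustration with $f = \tanh$, not thesis code) compares finite differences of $I$ in the mean and variance against Gauss-Hermite estimates of the right-hand sides of (D.6) and (D.7):

import numpy as np

mu, var = 0.3, 0.7                                     # Lambda plays the role of a scalar variance
f = np.tanh
df = lambda x: 1.0 / np.cosh(x) ** 2                   # f'
d2f = lambda x: -2.0 * np.tanh(x) / np.cosh(x) ** 2    # f''

# Gauss-Hermite quadrature for I(mu, var) = E[f(x)] with x ~ N(mu, var).
nodes, weights = np.polynomial.hermite_e.hermegauss(60)

def gauss_expect(g, mu, var):
    return weights @ g(mu + np.sqrt(var) * nodes) / weights.sum()

eps = 1e-5
# Central finite differences of I with respect to mu and the variance.
dI_dmu = (gauss_expect(f, mu + eps, var) - gauss_expect(f, mu - eps, var)) / (2 * eps)
dI_dvar = (gauss_expect(f, mu, var + eps) - gauss_expect(f, mu, var - eps)) / (2 * eps)

# The identities (D.6) and (D.7).
assert np.isclose(dI_dmu, gauss_expect(df, mu, var), atol=1e-6)
assert np.isclose(dI_dvar, 0.5 * gauss_expect(d2f, mu, var), atol=1e-6)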
Appendix E
The Jacobian of the whitening transformation

Working within the hierarchical latent Gaussian model implied by the log density in equation (5.24), we are interested in the Jacobian of the transformation

$$v = \Lambda^{-1}(\theta) f_Z \tag{E.1}$$

where we have used the Cholesky decomposition $\Lambda$ of $K_{Z,Z}$, so that $\Lambda \Lambda^T = K_{Z,Z}$. We need to be slightly careful because $v$ depends both on $f_Z$ and on $\theta$, the latter through the Cholesky decomposition of the kernel Gram matrix. The old parameters are

$$\Phi_{\mathrm{Old}} = \begin{pmatrix} f_Z \\ \theta \end{pmatrix} \tag{E.2}$$

The new parameters are

$$\Phi_{\mathrm{New}} = \begin{pmatrix} v \\ \theta \end{pmatrix} \tag{E.3}$$

Define the change of variables by the vector valued function $g$ with

$$\Phi_{\mathrm{New}} = g(\Phi_{\mathrm{Old}}) \tag{E.4}$$

We will be interested in the Jacobian $J$ of this transformation, which has the block form

$$J(g) = \begin{pmatrix} \Lambda^{-1} & E \\ 0 & I \end{pmatrix} \tag{E.5}$$

The matrix $E$ has a complex form in terms of the hyperparameter differential of the Cholesky matrix. Fortunately we do not need to evaluate it to find the Jacobian determinant since, using the block formula for determinants:

$$\det\{ J(g) \} = \det\{ \Lambda^{-1}(\theta) \} \, \det\{ I - 0 \cdot \Lambda E \} = \det\{ \Lambda^{-1}(\theta) \} \tag{E.6}$$

as claimed in section 5.4.
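For concreteness, the transformation and its log-Jacobian can be sketched as follows (illustrative Python, assuming a squared exponential kernel; none of these names come from the thesis code). By (E.6), the log-determinant of the Jacobian is $-\sum_i \log \Lambda_{ii}$:

import numpy as np

def rbf_kernel(Z, variance=1.0, lengthscale=1.0):
    # Squared exponential Gram matrix on inducing inputs Z, with jitter.
    d2 = (Z[:, None] - Z[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2) + 1e-10 * np.eye(len(Z))

Z = np.linspace(0.0, 5.0, 8)
K = rbf_kernel(Z)
Lam = np.linalg.cholesky(K)               # K = Lam @ Lam.T

f_Z = np.random.default_rng(3).multivariate_normal(np.zeros(len(Z)), K)
v = np.linalg.solve(Lam, f_Z)             # whitened variables, equation (E.1)

# log |det J| = log det(Lam^{-1}) = -sum(log diag(Lam)), by (E.6).
log_det_J = -np.sum(np.log(np.diag(Lam)))

# By construction v is a draw from N(0, I), since Lam^{-1} K Lam^{-T} = I.
print(v.round(2), log_det_J)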
Appendix F
A C++ and Eigen implementation of Cholesky backpropagation for TensorFlow

F.1 Comments on code

The code is a version of Murray's blocked algorithm (2016b). At the time of writing, Murray himself has made implementations available in Python, MATLAB and FORTRAN (Murray, 2016a).

The pull request of the code to the main TensorFlow GitHub repository, which is managed by Google, was reviewed by Rasmus Munk Larsen. The review involved detailed comments about tighter integration with the TensorFlow architecture and improved the general readability of the code. It also required us to conform to the coding style required by the Google C++ standard (Google, 2016). There have since been some relatively small changes to the code which, unfortunately, have overwritten the full commit and author history in the version control system. The code included here is as it was when the pull request was accepted.

TensorFlow makes heavy use of Eigen (Guennebaud et al., 2010), an open source linear algebra library. Eigen itself makes heavy use of templating. In particular, it uses expression templates to intelligently remove unnecessary temporaries and to perform lazy evaluation where appropriate. The latter is a feature more commonly associated with functional programming languages and is impressive in a C++ library. Although all this comes at relatively little degradation to the interface that a coder sees, some care is required to preserve the best performance. For example, the use of Eigen::Ref on lines 32 and 33 allows us to avoid the creation of unnecessary temporaries when calling functions.

The standard way to test gradient code in TensorFlow is by finite differences. As part of the pull request we added this test functionality for the CholeskyGrad Op.
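For illustration, such a finite-difference check can be written against the TensorFlow Python API of the time roughly as follows (a sketch only: the matrix and shapes are illustrative, tf.cholesky and tf.test.compute_gradient_error refer to the 2016-era API, and we assume the CholeskyGrad kernel is registered as the gradient of tf.cholesky, as it was after the pull request):

import numpy as np
import tensorflow as tf

# A well-conditioned symmetric positive definite test matrix.
rng = np.random.RandomState(0)
A = rng.randn(5, 5)
x_init = A.dot(A.T) + 5.0 * np.eye(5)

with tf.Session():
    x = tf.constant(x_init)
    chol = tf.cholesky(x)
    # Maximum elementwise difference between the analytic Jacobian
    # (which exercises CholeskyGrad) and a finite-difference estimate.
    err = tf.test.compute_gradient_error(x, (5, 5), chol, (5, 5),
                                         x_init_value=x_init)
    print('max gradient error:', err)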

F.2 TensorFlow source code

1 /* Copyright 2015 Google Inc. All Rights Reserved.
2 Licensed under the Apache License, Version 2.0 (the "License");
3 you may not use this file except in compliance with the License.
4 You may obtain a copy of the License at
5 http://www.apache.org/licenses/LICENSE-2.0
6 Unless required by applicable law or agreed to in writing, software
7 distributed under the License is distributed on an "AS IS" BASIS,
8 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
9 See the License for the specific language governing permissions and
10 limitations under the License.
11 ==============================================================================*/
12
13 #include "tensorflow/core/framework/op.h"
14 #include "third_party/eigen3/Eigen/Core"
15
16 #include "tensorflow/core/framework/op_kernel.h"
17
18 #include "tensorflow/core/kernels/linalg_ops_common.h"
19 #include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
20 #include "tensorflow/core/framework/tensor_types.h"
21 #include "tensorflow/core/framework/types.h"
22
23 namespace tensorflow {
24
25 template <typename T> class CholeskyGrad : public OpKernel {
26  public:
27   explicit CholeskyGrad(OpKernelConstruction* context) : OpKernel(context) {}
28   using Matrix =
29       Eigen::Matrix<T, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;
30   using ConstMatrixMap = Eigen::Map<const Matrix>;
31   using MatrixMap = Eigen::Map<Matrix>;
32   using ConstRef = Eigen::Ref<const Matrix>;
33   using Ref = Eigen::Ref<Matrix>;
34
35   void Compute(OpKernelContext* context) override {
36     const Tensor& input_tensor_l = context->input(0);
37     const Tensor& input_tensor_grad = context->input(1);
38     // Check that input tensors represent a matrix.
39     OP_REQUIRES(context, TensorShapeUtils::IsMatrix(input_tensor_l.shape()),
40                 errors::InvalidArgument("In[0] is not a matrix"));
41     OP_REQUIRES(context, TensorShapeUtils::IsMatrix(input_tensor_grad.shape()),
42                 errors::InvalidArgument("In[1] is not a matrix"));
43     // Check that input tensors are square.
44     OP_REQUIRES(context,
45                 input_tensor_l.dim_size(0) == input_tensor_l.dim_size(1),
46                 errors::InvalidArgument("Input matrix must be square."));
47     OP_REQUIRES(context,
48                 input_tensor_grad.dim_size(0) == input_tensor_grad.dim_size(1),
49                 errors::InvalidArgument("Input matrix must be square."));
50
51     // Check that input tensors are of same size.
52     OP_REQUIRES(context,
53                 input_tensor_l.dim_size(0) == input_tensor_grad.dim_size(0),
54                 errors::InvalidArgument("Input matrices must be same size."));
55
56     // Create an output tensor
57     Tensor* output_tensor = NULL;
58     OP_REQUIRES_OK(context, context->allocate_output(
59                                 0, input_tensor_grad.shape(), &output_tensor));
60
61     if (output_tensor->NumElements() == 0) {
62       // the output shape is a 0-element matrix, so there is nothing to do.
63       return;
64     }
65     // The next lines are necessary to get Eigen matrix behaviour.
66     const ConstMatrixMap input_matrix_l_full(input_tensor_l.flat<T>().data(),
67                                              input_tensor_l.dim_size(0),
68                                              input_tensor_l.dim_size(1));
69     const ConstMatrixMap input_matrix_grad(input_tensor_grad.flat<T>().data(),
70                                            input_tensor_grad.dim_size(0),
71                                            input_tensor_grad.dim_size(1));
72     MatrixMap output_matrix(output_tensor->template flat<T>().data(),
73                             input_tensor_l.dim_size(0),
74                             input_tensor_l.dim_size(1));
75
76     // Algorithm only depends on lower triangular half on input_tensor_l.
77     const Matrix input_matrix_l =
78         input_matrix_l_full.template triangularView<Eigen::Lower>();
79     // Algorithm only depends on lower triangular half on input_matrix_grad.
80     output_matrix = input_matrix_grad.template triangularView<Eigen::Lower>();
81
82     const int64 kMatrixSize = input_matrix_l.rows();
83     const int64 kMaxBlockSize = 32;
84
85     for (int64 block_end = kMatrixSize; block_end > 0;
86          block_end -= kMaxBlockSize) {
87       /* This shows the block structure.
88
89          /      \
90          |      |
91          | R D  |
92          \ B C  /
93
94          Variable names representing the derivative matrix have a trailing '_bar'.
95       */
96
97       const int64 block_begin = std::max(0ll, block_end - kMaxBlockSize);
98       const int64 block_size = block_end - block_begin;
99       const int64 trailing_size = kMatrixSize - block_end;
100
101       auto B = input_matrix_l.block(block_end, 0, trailing_size, block_begin);
102       auto B_bar =
103           output_matrix.block(block_end, 0, trailing_size, block_begin);
104
105       auto C = input_matrix_l.block(block_end, block_begin, trailing_size,
106                                     block_size);
107       auto C_bar = output_matrix.block(block_end, block_begin, trailing_size,
108                                        block_size);
109
110       auto D = input_matrix_l.block(block_begin, block_begin, block_size,
111                                     block_size);
112       auto D_bar =
113           output_matrix.block(block_begin, block_begin, block_size, block_size);
114
115       auto R = input_matrix_l.block(block_begin, 0, block_size, block_begin);
116       auto R_bar = output_matrix.block(block_begin, 0, block_size, block_begin);
117
118       C_bar = D.adjoint().template triangularView<Eigen::Upper>()
119                   .solve(C_bar.adjoint()).adjoint();
120       D_bar -= (C_bar.adjoint() * C).template triangularView<Eigen::Lower>();
121       B_bar -= C_bar * R;
122       R_bar -= C_bar.adjoint() * B;
123       CholeskyGradUnblocked(D, D_bar);
124       R_bar -= (D_bar + D_bar.adjoint()) * R;
125     }
126     output_matrix = (0.5 * (output_matrix + output_matrix.transpose())).eval();
127   }
128   void CholeskyGradUnblocked(const ConstRef l_block, Ref grad_block) {
129     const int64 kMatrixSize = l_block.rows();
130     for (int64 k = kMatrixSize - 1; k >= 0; k--) {
131       /* This shows the block structure.
132
133          /     \
134          |     |
135          | r d |
136          \ B c /
137
138          Variable names representing the derivative matrix have a trailing '_bar'.
139       */
140
141       const int64 number_rows_B = kMatrixSize - (k + 1);
142       const int64 number_rows_r_stack_B = number_rows_B + 1;
143
144       auto r = l_block.block(k, 0, 1, k);
145       auto r_bar = grad_block.block(k, 0, 1, k);
146       auto d = l_block(k, k);  // This needs to be a scalar rather than a view.
147       auto d_bar = grad_block.block(k, k, 1, 1);
148       // B is not included explicitly because it is not used on its own.
149       auto B_bar = grad_block.block(k + 1, 0, number_rows_B, k);
150       auto c = l_block.block(k + 1, k, number_rows_B, 1);
151       auto c_bar = grad_block.block(k + 1, k, number_rows_B, 1);
152       // Result of vertical stacking d_bar and c_bar.
153       auto d_stack_c_bar = grad_block.block(k, k, number_rows_r_stack_B, 1);
154       // Result of vertical stacking of r and B.
155       auto r_stack_B = l_block.block(k, 0, number_rows_r_stack_B, k);
156       d_bar -= (c.adjoint() * c_bar) / d;
157       d_stack_c_bar /= d;
158       r_bar -= d_stack_c_bar.adjoint() * r_stack_B;
159       B_bar -= c_bar * r;
160       d_bar /= 2.;
161     }
162   }
163 };
164
165 REGISTER_LINALG_OP("CholeskyGrad", (CholeskyGrad<float>), float);
166 REGISTER_LINALG_OP("CholeskyGrad", (CholeskyGrad<double>), double);
167 }  // namespace tensorflow