
Deep Kernel Learning

Andrew Gordon Wilson∗ (CMU), Zhiting Hu∗ (CMU), Ruslan Salakhutdinov (University of Toronto), Eric P. Xing (CMU)

Abstract

We introduce scalable deep kernels, which combine the structural properties of deep learning architectures with the non-parametric flexibility of kernel methods. Specifically, we transform the inputs of a spectral mixture base kernel with a deep architecture, using local kernel interpolation, inducing points, and structure exploiting (Kronecker and Toeplitz) algebra for a scalable kernel representation. These closed-form kernels can be used as drop-in replacements for standard kernels, with benefits in expressive power and scalability. We jointly learn the properties of these kernels through the marginal likelihood of a Gaussian process. Inference and learning cost O(n) for n training points, and predictions cost O(1) per test point. On a large and diverse collection of applications, including a dataset with 2 million examples, we show improved performance over scalable Gaussian processes with flexible kernel learning models, and stand-alone deep architectures.

1 Introduction

"How can Gaussian processes possibly replace neural networks? Have we thrown the baby out with the bathwater?" questioned MacKay (1998). It was the late 1990s, and researchers had grown frustrated with the many design choices associated with neural networks – regarding architecture, activation functions, and regularisation – and the lack of a principled framework to guide these choices.

Gaussian processes had recently been popularised within the machine learning community by Neal (1996), who had shown that Bayesian neural networks with infinitely many hidden units converged to Gaussian processes with a particular kernel (covariance) function. Gaussian processes were subsequently viewed as flexible and interpretable alternatives to neural networks, with straightforward learning procedures. Where neural networks used finitely many highly adaptive basis functions, Gaussian processes typically used infinitely many fixed basis functions. As argued by MacKay (1998), Hinton et al. (2006), and Bengio (2009), neural networks could automatically discover meaningful representations in high-dimensional data by learning multiple layers of highly adaptive basis functions. By contrast, Gaussian processes with popular kernel functions were typically used as simple smoothing devices.

Recent approaches (e.g., Yang et al., 2015; Lloyd et al., 2014; Wilson, 2014; Wilson and Adams, 2013) have demonstrated that one can develop more expressive kernel functions, which are indeed able to discover rich structure in data without human intervention. Such methods effectively use infinitely many adaptive basis functions. The relevant question then becomes not which paradigm (e.g., kernel methods or neural networks) replaces the other, but whether we can combine the advantages of each approach. Indeed, deep neural networks provide a powerful mechanism for creating adaptive basis functions, with inductive biases which have proven effective for learning in many application domains, including visual object recognition, speech perception, language understanding, and information retrieval (Krizhevsky et al., 2012; Hinton et al., 2012; Socher et al., 2011; Kiros et al., 2014; Xu et al., 2015).

In this paper, we combine the non-parametric flexibility of kernel methods with the structural properties of deep neural networks. In particular, we use deep feedforward fully-connected and convolutional networks, in combination with spectral mixture covariance functions (Wilson and Adams, 2013), inducing points (Quiñonero-Candela and Rasmussen, 2005), structure exploiting algebra (Saatchi, 2011), and local kernel interpolation (Wilson and Nickisch, 2015; Wilson et al., 2015), to create scalable, expressive, closed-form covariance kernels for Gaussian processes.

∗Authors contributed equally. Appearing in Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS) 2016, Cadiz, Spain. JMLR: W&CP volume 51. Copyright 2016 by the authors.


As a non-parametric method, the information capacity of our model grows with the amount of available data, but its complexity is automatically calibrated through the marginal likelihood of the Gaussian process, without the need for regularization or cross-validation (Rasmussen and Ghahramani, 2001; Rasmussen and Williams, 2006; Wilson, 2014). The flexibility and automatic calibration provided by the non-parametric layer typically provides a high standard of performance, while reducing the need for extensive hand tuning from the user.

We further build on the ideas in KISS-GP (Wilson and Nickisch, 2015) and extensions (Wilson et al., 2015), so that our deep kernel learning model can scale linearly with the number of training instances n, instead of O(n^3) as is standard with Gaussian processes (GPs), while retaining a fully non-parametric representation. Our approach also scales as O(1) per test point, instead of the standard O(n^2) for GPs, allowing for very fast prediction times. Because KISS-GP creates an approximate kernel from a user specified kernel for fast computations, independently of a specific inference procedure, we can view the resulting kernel as a scalable deep kernel. We demonstrate the value of this scalability in the experimental results section, where it is the large datasets that provide the greatest opportunities for our model to discover expressive statistical representations.

We begin by reviewing related work in section 2, and providing background on Gaussian processes in section 3. In section 4 we derive scalable closed form deep kernels, and describe how to perform efficient automatic learning of these kernels through the Gaussian process marginal likelihood. In section 5, we show substantially improved performance over standard Gaussian processes, expressive kernel learning approaches, deep neural networks, and Gaussian processes applied to the outputs of trained deep networks, on a wide range of datasets. We also interpret the learned kernels to gain new insights into our modelling problems.

2 Related Work

Given the intuitive value of combining kernels and neural networks, it is encouraging that various distinct forms of such combinations have been considered in different contexts.

The Gaussian process regression network (Wilson et al., 2012) replaces all weight connections and activation functions in a Bayesian neural network with Gaussian processes, allowing the authors to model input dependent correlations between multiple tasks. Alternatively, Damianou and Lawrence (2013) replace every activation function in a Bayesian neural network with a Gaussian process transformation, in an unsupervised setting. While promising, both models are very task specific, and require sophisticated approximate Bayesian inference which is much more demanding than what is required by standard Gaussian processes or deep learning models, and typically does not scale beyond a few thousand training points. Similarly, Salakhutdinov and Hinton (2008) combine deep belief networks (DBNs) with Gaussian processes, showing improved performance over standard GPs with RBF kernels, in the context of semi-supervised learning. However, their model relies heavily on unsupervised pre-training of DBNs, with the GP component unable to scale beyond a few thousand training points. Likewise, Calandra et al. (2014) combine a feedforward neural network transformation with a Gaussian process, showing an ability to learn sharp discontinuities. However, similar to many other approaches, the resulting model can only scale to at most a few thousand data points.

In a frequentist setting, Yang et al. (2014) combine convolutional networks, pre-trained on ImageNet, with a scalable Fastfood (Le et al., 2013) expansion for the RBF kernel applied to the final layer. The resulting method is scalable and flexible, but the network parameters generally must first be trained separately from the Fastfood features, and the combined model remains parametric, due to the parametric expansion provided by Fastfood. Careful attention must still be paid to training procedures, regularization, and manual calibration of the network architecture. In a similar manner, Huang et al. (2015) and Snoek et al. (2015) have combined deep architectures with parametric Bayesian models. Huang et al. (2015) pursue an unsupervised pre-training procedure using deep autoencoders, showing improved performance over GPs using standard kernels. Snoek et al. (2015) show promising performance on Bayesian optimisation tasks, for tuning the parameters of a deep neural network, but do not use a Bayesian (marginal likelihood) objective for training network parameters.

Our approach is distinct in that we combine deep feedforward and convolutional architectures with spectral mixture covariances (Wilson and Adams, 2013), inducing points, Kronecker and Toeplitz algebra, and local kernel interpolation (Wilson and Nickisch, 2015; Wilson et al., 2015), to derive expressive and scalable closed form kernels, where all parameters are trained jointly with a unified supervised objective, as part of a non-parametric Gaussian process framework, without requiring approximate Bayesian inference. Moreover, the simple joint learning procedure in our approach can be applied in general settings. Indeed we show that the proposed model outperforms state of the art stand-alone deep learning architectures and Gaussian processes with advanced kernel learning procedures on a wide range of datasets, demonstrating its practical significance.

We also show that jointly training all the weights of a deep kernel through the marginal likelihood of a non-parametric GP provides significant advantages over training a GP applied to the output layer of a trained deep neural network. Moreover, we achieve scalability while retaining non-parametric model structure by leveraging the very recent KISS-GP approach (Wilson and Nickisch, 2015) and extensions in Wilson et al. (2015) for efficiently representing kernel functions, to produce scalable deep kernels.

3 Gaussian Processes

We briefly review Gaussian processes (GPs) and the computational requirements for predictions and kernel learning (see e.g., Rasmussen and Williams (2006) for a comprehensive discussion of GPs).

We assume a dataset \mathcal{D} of n input (predictor) vectors X = \{x_1, \dots, x_n\}, each of dimension D, which index an n \times 1 vector of targets y = (y(x_1), \dots, y(x_n))^\top. If f(x) \sim \mathcal{GP}(\mu, k_\gamma), then any collection of function values f has a joint Gaussian distribution,

f = f(X) = [f(x_1), \dots, f(x_n)]^\top \sim \mathcal{N}(\mu, K_{X,X}),   (1)

with mean vector and covariance matrix defined by the mean function and covariance kernel of the Gaussian process: \mu_i = \mu(x_i), and (K_{X,X})_{ij} = k_\gamma(x_i, x_j), where the kernel k_\gamma is parametrized by \gamma. Assuming additive Gaussian noise, y(x) \mid f(x) \sim \mathcal{N}(y(x); f(x), \sigma^2), the predictive distribution of the GP evaluated at the n_* test points indexed by X_* is given by

f_* \mid X_*, X, y, \gamma, \sigma^2 \sim \mathcal{N}(\mathbb{E}[f_*], \mathrm{cov}(f_*)),   (2)
\mathbb{E}[f_*] = \mu_{X_*} + K_{X_*,X} [K_{X,X} + \sigma^2 I]^{-1} (y - \mu_X),
\mathrm{cov}(f_*) = K_{X_*,X_*} - K_{X_*,X} [K_{X,X} + \sigma^2 I]^{-1} K_{X,X_*}.

K_{X_*,X}, for example, represents the n_* \times n matrix of covariances between the GP evaluated at X_* and X. \mu_X and \mu_{X_*} are mean vectors evaluated at X and X_*, and K_{X,X} is the n \times n covariance matrix evaluated at training inputs X. All covariance matrices implicitly depend on the kernel hyperparameters \gamma.

GPs with RBF kernels correspond to models which have an infinite basis expansion in a dual space, and have compelling theoretical properties: these models are universal approximators, and have prior support to within an arbitrarily small epsilon band of any continuous function (Micchelli et al., 2006). Indeed the properties of the distribution over functions induced by a Gaussian process are controlled by the kernel function. For example, the popular RBF kernel,

k_{RBF}(x, x') = \exp(-\tfrac{1}{2} \|x - x'\|^2 / \ell^2)   (3)

encodes the inductive bias that function values closer together in the input space are more correlated. The complexity of the functions in the input space is determined by the interpretable length-scale hyperparameter \ell. Shorter length-scales correspond to functions which vary more rapidly with the inputs x.

The structure of our data is discovered through learning interpretable kernel hyperparameters. The marginal likelihood of the targets y, the probability of the data conditioned only on kernel hyperparameters \gamma, provides a principled probabilistic framework for kernel learning:

\log p(y \mid \gamma, X) \propto -\left[ y^\top (K_\gamma + \sigma^2 I)^{-1} y + \log |K_\gamma + \sigma^2 I| \right],   (4)

where we have used K_\gamma as shorthand for K_{X,X} given \gamma. Note that the expression for the log marginal likelihood in Eq. (4) pleasingly separates into automatically calibrated model fit and complexity terms (Rasmussen and Ghahramani, 2001). Kernel learning is performed by optimizing Eq. (4) with respect to \gamma.

The computational bottleneck for inference is solving the linear system (K_{X,X} + \sigma^2 I)^{-1} y, and for kernel learning it is computing the log determinant \log |K_{X,X} + \sigma^2 I|. The standard approach is to compute the Cholesky decomposition of the n \times n matrix K_{X,X}, which requires O(n^3) operations and O(n^2) storage. After inference, the predictive mean costs O(n), and the predictive variance costs O(n^2), per test point x_*.
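To make the quantities in Eqs. (1)–(4) concrete, the following is a minimal NumPy sketch of exact GP regression, assuming a zero mean function and the RBF kernel of Eq. (3); the function and variable names are our own and are not tied to any particular library.

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0):
    # k_RBF(x, x') = exp(-0.5 ||x - x'||^2 / lengthscale^2), Eq. (3)
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_regression(X, y, Xs, lengthscale=1.0, noise=0.1):
    """Exact GP predictive mean/variance (Eq. 2) and log marginal likelihood (Eq. 4)."""
    K = rbf_kernel(X, X, lengthscale) + noise ** 2 * np.eye(len(X))
    L = np.linalg.cholesky(K)                              # the O(n^3) bottleneck
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))    # (K + sigma^2 I)^{-1} y
    Ks = rbf_kernel(Xs, X, lengthscale)
    mean = Ks @ alpha                                      # predictive mean
    v = np.linalg.solve(L, Ks.T)
    var = rbf_kernel(Xs, Xs, lengthscale).diagonal() - (v ** 2).sum(axis=0)
    # Log marginal likelihood up to constants: data fit term plus complexity penalty.
    lml = -0.5 * y @ alpha - np.log(np.diag(L)).sum()
    return mean, var, lml

# Toy usage on synthetic data.
X = np.random.rand(100, 2)
y = np.sin(3 * X[:, 0]) + 0.1 * np.random.randn(100)
mean, var, lml = gp_regression(X, y, np.random.rand(5, 2))
```

Kernel learning then amounts to maximizing lml with respect to the hyperparameters; deep kernel learning does exactly this, with a far richer parametrization of the kernel.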
4 Deep Kernel Learning

In this section we show how we can construct kernels which encapsulate the expressive power of deep architectures, and how to learn the properties of these kernels as part of a scalable probabilistic GP framework.

Specifically, starting from a base kernel k(x_i, x_j \mid \theta) with hyperparameters \theta, we transform the inputs (predictors) x as

k(x_i, x_j \mid \theta) \to k(g(x_i, w), g(x_j, w) \mid \theta, w),   (5)

where g(x, w) is a non-linear mapping given by a deep architecture, such as a deep convolutional network, parametrized by weights w. The popular RBF kernel (Eq. (3)) is a sensible choice of base kernel k(x_i, x_j \mid \theta). For added flexibility, we also propose to use spectral mixture base kernels (Wilson and Adams, 2013):

k_{SM}(x, x' \mid \theta) = \sum_{q=1}^{Q} a_q \frac{|\Sigma_q|^{1/2}}{(2\pi)^{D/2}} \exp\!\left(-\tfrac{1}{2} \|\Sigma_q^{1/2}(x - x')\|^2\right) \cos\langle x - x', 2\pi\mu_q \rangle.   (6)
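As an illustration of Eqs. (5) and (6), the sketch below (our own code, not the paper's implementation) evaluates a spectral mixture base kernel with diagonal Σ_q on features produced by a transformation g. Here g is a fixed random projection standing in for the learned deep network, and the names weights, scales, and means are our labels for a_q, the diagonals of Σ_q, and µ_q.

```python
import numpy as np

def spectral_mixture_kernel(X1, X2, weights, scales, means):
    # Eq. (6) with diagonal Sigma_q: weights[q] = a_q, scales[q] = diag(Sigma_q), means[q] = mu_q.
    diff = X1[:, None, :] - X2[None, :, :]          # pairwise differences tau = x - x'
    D = X1.shape[1]
    K = np.zeros((X1.shape[0], X2.shape[0]))
    for a_q, s_q, m_q in zip(weights, scales, means):
        norm = np.sqrt(np.prod(s_q)) / (2 * np.pi) ** (D / 2)   # |Sigma_q|^{1/2} / (2 pi)^{D/2}
        gauss = np.exp(-0.5 * (diff ** 2 * s_q).sum(-1))        # exp(-0.5 ||Sigma_q^{1/2} tau||^2)
        K += a_q * norm * gauss * np.cos(2 * np.pi * (diff * m_q).sum(-1))
    return K

def deep_kernel(X1, X2, g, base_kernel, **theta):
    # Eq. (5): transform the inputs with g(., w), then apply the base kernel.
    return base_kernel(g(X1), g(X2), **theta)

# Toy usage: g maps 10-dimensional inputs to 2 features, as a stand-in for a deep network.
W = np.random.randn(10, 2)
g = lambda X: np.tanh(X @ W)
Q = 4
K = deep_kernel(np.random.rand(6, 10), np.random.rand(6, 10), g, spectral_mixture_kernel,
                weights=np.ones(Q) / Q, scales=np.ones((Q, 2)), means=np.random.rand(Q, 2))
```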


Figure 1: Deep Kernel Learning: A Gaussian process with a deep kernel maps D dimensional inputs x through L parametric hidden layers followed by a hidden layer with an infinite number of basis functions, with base kernel hyperparameters \theta. Overall, a Gaussian process with a deep kernel produces a probabilistic mapping with an infinite number of adaptive basis functions parametrized by \gamma = \{w, \theta\}. All parameters \gamma are learned jointly through the marginal likelihood of the Gaussian process.

The parameters of the spectral mixture kernel \theta = \{a_q, \Sigma_q, \mu_q\} are mixture weights, bandwidths (inverse length-scales), and frequencies. The spectral mixture (SM) kernel, which forms an expressive basis for all stationary covariance functions, can discover quasi-periodic stationary structure with an interpretable and succinct representation, while the deep learning transformation g(x, w) captures non-stationary and hierarchical structure.

We use the deep kernel of Eq. (5) as the covariance function of a Gaussian process to model data \mathcal{D} = \{x_i, y_i\}_{i=1}^n. Conditioned on all kernel hyperparameters, we can interpret our model as applying a Gaussian process with base kernel k_\theta to the final hidden layer of a deep network. Since a GP with (RBF or SM) base kernel k_\theta corresponds to an infinite basis function representation, our network effectively has a hidden layer with an infinite number of hidden units. The overall model is shown in Figure 1.

We emphasize, however, that we jointly learn all deep kernel hyperparameters, \gamma = \{w, \theta\}, which include w, the weights of the network, and \theta, the parameters of the base kernel, by maximizing the log marginal likelihood of the Gaussian process (see Eq. (4)). Indeed compartmentalizing our model into a base kernel and deep architecture is for pedagogical clarity. When applying a Gaussian process one can use our deep kernel, which operates as a single unit, as a drop-in replacement for e.g., standard ARD or Matérn kernels (Rasmussen and Williams, 2006), since learning and inference follow the same procedures. In the experiments, we show that jointly learning all deep kernel parameters has advantages over training a GP applied to the output layer of a trained deep neural network.

For kernel learning, we use the chain rule to compute derivatives of the log marginal likelihood \mathcal{L} with respect to the deep kernel hyperparameters:

\frac{\partial \mathcal{L}}{\partial \theta} = \frac{\partial \mathcal{L}}{\partial K_\gamma} \frac{\partial K_\gamma}{\partial \theta}, \qquad \frac{\partial \mathcal{L}}{\partial w} = \frac{\partial \mathcal{L}}{\partial K_\gamma} \frac{\partial K_\gamma}{\partial g(x, w)} \frac{\partial g(x, w)}{\partial w}.

The implicit derivative of the log marginal likelihood with respect to our n \times n data covariance matrix K_\gamma is given by

\frac{\partial \mathcal{L}}{\partial K_\gamma} = \frac{1}{2} \left( K_\gamma^{-1} y y^\top K_\gamma^{-1} - K_\gamma^{-1} \right),   (7)

where we have absorbed the noise covariance \sigma^2 I into our covariance matrix, and treat it as part of the base kernel hyperparameters \theta. \partial K_\gamma / \partial \theta are the derivatives of the deep kernel with respect to the base kernel hyperparameters (such as length-scale), conditioned on the fixed transformation of the inputs g(x, w). Similarly, \partial K_\gamma / \partial g(x, w) are the implicit derivatives of the deep kernel with respect to g, holding \theta fixed. The derivatives with respect to the weight variables, \partial g(x, w) / \partial w, are computed using standard backpropagation.

For scalability, we replace all instances of K_\gamma with the KISS-GP covariance matrix (Wilson and Nickisch, 2015; Wilson et al., 2015)

K_\gamma \approx M K^{deep}_{U,U} M^\top := K_{KISS},   (8)

where M is a sparse matrix of interpolation weights, containing only 4 non-zero entries per row for local cubic interpolation, and K_{U,U} is a covariance matrix created from our deep kernel, evaluated over m latent inducing points U = [u_i]_{i=1,\dots,m}. We place the inducing points over a regular multidimensional lattice, and exploit the resulting decomposition of K_{U,U} into a Kronecker product of Toeplitz matrices for extremely fast matrix vector multiplications (MVMs), without requiring any grid structure in the data inputs X or the transformed inputs g(x, w). Because KISS-GP operates by creating an approximate kernel which admits fast computations, and is independent from a specific inference and learning procedure, we can view the KISS approximation applied to our deep kernels as a stand-alone kernel, k(x, z) = m_x^\top K^{deep}_{U,U} m_z, which can be combined with Gaussian processes or other kernel machines for scalable learning.

For inference we solve K_{KISS}^{-1} y using linear conjugate gradients (LCG), an iterative procedure for solving linear systems which only involves matrix vector multiplications (MVMs). The number of iterations required for convergence to within machine precision is j \ll n, and in practice j depends on the conditioning of the KISS-GP covariance matrix rather than the number of training points n. For estimating the log determinant in the marginal likelihood we follow the approach described in Wilson and Nickisch (2015) with extensions in Wilson et al. (2015).

KISS-GP training scales as O(n + h(m)) (where h(m) is typically close to linear in m), versus conventional scalable GP approaches which require O(m^2 n + m^3) computations (Quiñonero-Candela and Rasmussen, 2005) and need m \ll n for tractability, which results in severe deteriorations in predictive performance. The ability to have large m \approx n allows KISS-GP to have near-exact accuracy in its approximation (Wilson and Nickisch, 2015), retaining a non-parametric representation, while providing linear scaling in n and O(1) time per test point prediction (Wilson et al., 2015). We empirically demonstrate this scalability and accuracy in our experiments of section 5.
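The structure of Eq. (8) can be illustrated with a small self-contained sketch. The version below is our own simplified illustration, not the paper's implementation: it uses a one-dimensional transformed feature space, local linear rather than local cubic interpolation (two non-zero weights per row instead of four), and dense matrices where an efficient implementation would exploit sparsity and Toeplitz/Kronecker structure.

```python
import numpy as np

def rbf_kernel_1d(z1, z2, lengthscale=0.3):
    return np.exp(-0.5 * (z1[:, None] - z2[None, :]) ** 2 / lengthscale ** 2)

def interpolation_weights(z, grid):
    # Local interpolation of each point z onto a regular 1-D grid of inducing points.
    # (The paper uses local cubic interpolation, giving 4 non-zeros per row.)
    M = np.zeros((len(z), len(grid)))
    h = grid[1] - grid[0]
    idx = np.clip(((z - grid[0]) / h).astype(int), 0, len(grid) - 2)
    frac = (z - grid[idx]) / h
    M[np.arange(len(z)), idx] = 1.0 - frac
    M[np.arange(len(z)), idx + 1] = frac
    return M

n, m = 1000, 50
z = np.sort(np.random.rand(n))              # z = g(x, w): transformed training inputs
grid = np.linspace(0.0, 1.0, m)             # inducing points U on a regular lattice

M = interpolation_weights(z, grid)          # n x m, only 2 non-zeros per row here
K_uu = rbf_kernel_1d(grid, grid)            # m x m; Toeplitz because the grid is regular
K_kiss = M @ K_uu @ M.T                     # Eq. (8): K_gamma approx M K_UU M^T

print("max abs error vs exact kernel:", np.abs(K_kiss - rbf_kernel_1d(z, z)).max())
```

Matrix-vector products with K_KISS then reduce to a sparse multiply with M plus a structured multiply with K_UU, which is what makes the LCG-based inference above scale linearly in n.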


Table 1: Comparative RMSE performance and runtime on UCI regression datasets, with n training points and d the input dimensions. The results are averaged over 5 equal partitions (90% train, 10% test) of the data, ± one standard deviation. "Best" denotes the best-performing kernel according to Yang et al. (2015) (note that often the best performing kernel is GP-SM). Following Yang et al. (2015), as exact Gaussian processes are intractable on the large data used here, the Fastfood finite basis function expansions are used for approximation in GP (RBF, SM, best). We verified on datasets with n < 10,000 that exact GPs with RBF kernels provide comparable performance to the Fastfood expansions. For datasets with n < 6,000 we used a fully-connected DNN with a [d-1000-500-50-2] architecture, and for n > 6,000 we used a [d-1000-1000-500-50-2] architecture. DNN+GP is a GP applied to the fixed, pre-trained output layer of the DNN. We used an RBF kernel and the KISS-GP approximation for direct comparison with our proposed deep kernel learning (DKL). We consider scalable DKL with RBF and SM base kernels. For the SM base kernel, we set Q = 4 for datasets with n < 10,000 training instances, and use Q = 6 for larger datasets.

Datasets | n | d | GP-RBF | GP-SM | GP-best | DNN | DNN+GP | DKL-RBF | DKL-SM | DNN (s) | DKL-RBF (s) | DKL-SM (s)
Gas | 2,565 | 128 | 0.21±0.07 | 0.14±0.08 | 0.12±0.07 | 0.11±0.05 | 0.11±0.05 | 0.11±0.05 | 0.09±0.06 | 7.4 | 7.8 | 10.5
Skillcraft | 3,338 | 19 | 1.26±3.14 | 0.25±0.02 | 0.25±0.02 | 0.25±0.00 | 0.25±0.00 | 0.25±0.00 | 0.25±0.00 | 15.8 | 15.9 | 17.1
SML | 4,137 | 26 | 6.94±0.51 | 0.27±0.03 | 0.26±0.04 | 0.25±0.02 | 0.25±0.01 | 0.24±0.01 | 0.23±0.01 | 1.1 | 1.5 | 1.9
Parkinsons | 5,875 | 20 | 3.94±1.31 | 0.00±0.00 | 0.00±0.00 | 0.31±0.04 | 0.31±0.04 | 0.29±0.04 | 0.29±0.04 | 3.2 | 3.4 | 6.5
Pumadyn | 8,192 | 32 | 1.00±0.00 | 0.21±0.00 | 0.20±0.00 | 0.25±0.02 | 0.25±0.02 | 0.24±0.02 | 0.23±0.02 | 7.5 | 7.9 | 9.8
PoleTele | 15,000 | 26 | 12.6±0.3 | 5.40±0.3 | 4.30±0.2 | 3.42±0.05 | 3.36±0.04 | 3.28±0.04 | 3.11±0.07 | 8.0 | 8.3 | 27.0
Elevators | 16,599 | 18 | 0.12±0.00 | 0.090±0.001 | 0.089±0.002 | 0.099±0.001 | 0.097±0.002 | 0.084±0.002 | 0.084±0.002 | 8.9 | 9.2 | 11.8
Kin40k | 40,000 | 8 | 0.34±0.01 | 0.19±0.02 | 0.06±0.00 | 0.11±0.01 | 0.11±0.01 | 0.05±0.00 | 0.03±0.01 | 19.8 | 20.7 | 25.0
Protein | 45,730 | 9 | 1.64±1.66 | 0.50±0.02 | 0.47±0.01 | 0.49±0.01 | 0.49±0.01 | 0.46±0.01 | 0.43±0.01 | 143 | 155 | 144
KEGG | 48,827 | 22 | 0.33±0.17 | 0.12±0.01 | 0.12±0.01 | 0.12±0.01 | 0.12±0.00 | 0.11±0.00 | 0.10±0.01 | 31.3 | 34.2 | 61.0
CTslice | 53,500 | 385 | 7.13±0.11 | 2.21±0.06 | 0.59±0.07 | 0.41±0.06 | 0.41±0.02 | 0.36±0.01 | 0.34±0.02 | 36.4 | 44.3 | 80.4
KEGGU | 63,608 | 27 | 0.29±0.12 | 0.12±0.00 | 0.12±0.00 | 0.12±0.00 | 0.12±0.00 | 0.11±0.00 | 0.11±0.00 | 39.5 | 43.0 | 41.1
3Droad | 434,874 | 3 | 12.9±0.1 | 10.3±0.2 | 9.90±0.10 | 7.36±0.07 | 7.04±0.06 | 6.91±0.04 | 6.91±0.04 | 239 | 256 | 292
Song | 515,345 | 90 | 0.55±0.00 | 0.46±0.00 | 0.45±0.00 | 0.45±0.02 | 0.45±0.01 | 0.44±0.00 | 0.43±0.01 | 518 | 539 | 590
Buzz | 583,250 | 77 | 0.88±0.01 | 0.51±0.01 | 0.51±0.01 | 0.49±0.00 | 0.49±0.00 | 0.48±0.00 | 0.46±0.01 | 486 | 523 | 770
Electric | 2M | 11 | 0.23±0.00 | 0.053±0.000 | 0.053±0.000 | 0.058±0.002 | 0.054±0.002 | 0.050±0.002 | 0.048±0.002 | 3458 | 3542 | 4881

5 Experiments

We evaluate the proposed deep kernel learning method on a wide range of regression problems, including a large and diverse collection of regression tasks from the UCI repository (section 5.1), orientation extraction from face patches (section 5.2), magnitude recovery of handwritten digits (section 5.3), and step function recovery (section 5.4 and the supplementary material). We show that the proposed algorithm substantially outperforms GPs with expressive kernel learning approaches, and deep neural networks, without any significant increases in computational overhead.

All experiments were performed on a Linux machine with eight 4.0GHz CPU cores and 32GB RAM. We implemented DNNs based on Caffe (Jia et al., 2014), a general deep learning platform.

For our deep kernel learning model, we first train a deep neural network using SGD with the squared loss objective, and rectified linear activation functions. After the neural network has been pre-trained, a KISS-GP model was fitted using the top-level features of the DNN model as inputs. Using this pre-training initialization, our joint deep kernel learning (DKL) model of section 4 is then trained by optimizing all the hyperparameters \gamma of our deep kernel, by backpropagating derivatives through the marginal likelihood of the Gaussian process (see Eq. 7).

5.1 UCI regression tasks

We consider a large set of UCI regression problems of varying sizes and properties. Table 1 reports test root mean squared error (RMSE) for 1) many scalable Gaussian process kernel learning methods based on Fastfood (Yang et al., 2015); 2) stand-alone deep neural networks (DNNs); and 3) our proposed combined deep kernel learning (DKL) model using both RBF and SM base kernels.

For smaller datasets, where the number of training examples n < 6,000, we used a fully-connected neural network with a d-1000-500-50-2 architecture; for larger datasets we used a d-1000-1000-500-50-2 architecture.¹

¹We found [d-1000-1000-500-50] architectures provide a similar level of performance, but scalable Kronecker algebra is most effective if the network maps into D ≤ 5 dimensional spaces.
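Below is a minimal sketch of the two-stage training procedure described above, written in PyTorch purely for automatic differentiation (the paper's implementation is based on Caffe and KISS-GP). The feature extractor follows the d-1000-500-50-2 fully-connected architecture with ReLU activations; for brevity the marginal likelihood is computed exactly on a modest number of points with an RBF base kernel, whereas the paper uses the KISS-GP approximation of Eq. (8) to reach much larger n. The specific dimensions, learning rates, and iteration counts are placeholders.

```python
import torch
import torch.nn as nn

d = 26                                                   # input dimension (dataset dependent placeholder)
net = nn.Sequential(nn.Linear(d, 1000), nn.ReLU(),
                    nn.Linear(1000, 500), nn.ReLU(),
                    nn.Linear(500, 50), nn.ReLU(),
                    nn.Linear(50, 2))                    # g(x, w): maps d inputs to 2 features
log_ell = torch.zeros(1, requires_grad=True)             # base kernel log length-scale (theta)
log_noise = torch.tensor([-2.0], requires_grad=True)     # log noise standard deviation

def neg_log_marginal_likelihood(X, y):
    # -log p(y | gamma, X), Eq. (4), with K_gamma built from the deep kernel of Eq. (5).
    Z = net(X)                                           # transformed inputs g(x, w)
    d2 = (Z.unsqueeze(1) - Z.unsqueeze(0)).pow(2).sum(-1)
    K = torch.exp(-0.5 * d2 / torch.exp(log_ell) ** 2)   # RBF base kernel on the features
    K = K + torch.exp(log_noise) ** 2 * torch.eye(len(X))
    L = torch.linalg.cholesky(K)
    alpha = torch.cholesky_solve(y.unsqueeze(1), L)      # (K + sigma^2 I)^{-1} y
    return 0.5 * y @ alpha.squeeze(1) + torch.log(torch.diag(L)).sum()

X = torch.randn(500, d)
y = torch.sin(X[:, 0]) + 0.1 * torch.randn(500)          # synthetic data for the sketch

# Stage 1: pre-train the network alone with SGD and the squared loss.
head = nn.Linear(2, 1)                                   # temporary regression head
pre = torch.optim.SGD(list(net.parameters()) + list(head.parameters()), lr=1e-3)
for _ in range(200):
    pre.zero_grad()
    ((head(net(X)).squeeze(1) - y) ** 2).mean().backward()
    pre.step()

# Stage 2: jointly learn all of gamma = {w, theta} through the GP marginal likelihood.
opt = torch.optim.Adam(list(net.parameters()) + [log_ell, log_noise], lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    neg_log_marginal_likelihood(X, y).backward()
    opt.step()
```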

Figure 2: Left: Randomly sampled training and test examples. Right: The two dimensional outputs of the convolutional network on a set of test cases. Each point is shown using a line segment that has the same orientation as the input face.

Table 1 shows that on most of the datasets, our DKL method strongly outperforms not only Gaussian processes with the standard RBF kernel, but also the best-performing kernels selected from a wide range of alternative kernel learning procedures (Yang et al., 2015). We further compared DKL to stand-alone deep neural networks which have the exact same architecture as the DNN component of DKL, and DNN+GP, which is a GP applied to a pre-trained DNN. We see that DNN+GP outperforms stand-alone DNNs, showing the non-parametric flexibility of kernel methods. By combining KISS-GP with DNNs as part of a joint DKL procedure, we obtain consistently better results than DNN and DNN+GP over all 16 datasets. Moreover, using a spectral mixture base kernel (Eq. (6)) to create a deep kernel provides notable additional performance improvements. By effectively learning the salient features from raw data, plain DNNs generally achieve competitive performance compared to expressive GPs. Combining the complementary advantages of these approaches into scalable deep kernels consistently brings substantial additional performance gains.

We next investigate the runtime of DKL. Table 1, right panel, compares DKL with a stand-alone DNN in terms of runtime for evaluating the objective and derivatives (i.e. one forward and backpropagation pass for the DNN; one computation of the marginal likelihood and all relevant derivatives for DKL). We see that in addition to improving accuracy, combining KISS-GP with DNNs for deep kernels introduces only negligible runtime costs: KISS-GP imposes an additional runtime of about 10% over a stand-alone DNN. Overall, these results show the general applicability and practical significance of our scalable DKL approach.

5.2 Face orientation extraction

We now consider the task of predicting the orientation of a face extracted from a gray-scale image patch, explored in Salakhutdinov and Hinton (2008). We investigate our DKL procedure for efficiently learning meaningful representations from high-dimensional highly-structured image data.

The Olivetti face data set contains ten 64 × 64 images of forty different people, for 400 images total. Following Salakhutdinov and Hinton (2008), we constructed datasets of 28 × 28 images by randomly rotating (uniformly from −90° to +90°), cropping, and subsampling the original 400 images. We then randomly select 30 people uniformly and collect their images as training data, while using the images of the remaining 10 people as test data. Figure 2 shows randomly sampled examples from the training and test data.

For training DKL on the Olivetti face patches we used a convolutional network consisting of 2 convolutional layers followed by 4 fully-connected layers, mapping a face patch to a 2-dimensional feature vector, with an SM base kernel. We describe this convolutional architecture in detail in the supplementary material.

Table 2: RMSE performance on Olivetti and MNIST. For comparison, in the face orientation extraction, we trained DKL on the same amount (12,000) of training instances as with DBN+GP, but used all labels; whereas DBN+GP (as with GP) scaled to only 1,000 labeled images and modeled the remaining data through unsupervised pretraining of a DBN. CNN+GP is a GP applied to a fixed, pre-trained CNN. We used an RBF base kernel within GPs.

Datasets | GP | DBN+GP | CNN | CNN+GP | DKL
Olivetti | 16.33 | 6.42 | 6.34 | 6.42 | 6.07
MNIST | 1.25 | 1.03 | 0.59 | 0.56 | 0.53

Table 2 shows the RMSE of the predicted face orientations. The DBN+GP model, proposed by Salakhutdinov and Hinton (2008), first extracts features from raw data using a Deep Belief Network (DBN), and then applies a Gaussian process with an RBF kernel. However, their approach could only handle up to a few thousand labelled datapoints, due to the O(n^3) complexity of standard Gaussian processes. The remaining data were modeled through unsupervised learning of a DBN, leaving the large amount of available labels unused.

Our proposed deep kernel methods, by contrast, scale linearly with the size of training data, and are capable of directly modeling the full labeled data to accurately recover salient patterns. Figure 2, right panel, shows that the deep kernel discovers features essential for orientation prediction, while filtering out irrelevant factors such as identities and scales.

Figure 3: Left: RMSE vs. n, the number of training examples. Middle: Runtime vs. n. Right: Total training time vs. n. The dashed line in black indicates a slope of 1. CNNs are used within DKL. We set Q = 4 for the SM kernel.

Figure 3, left panel, further validates the benefit of scaling to large data. As more training data are used, our model continues to increase in accuracy. Indeed, it is the large datasets that will provide the greatest opportunities for our model to discover expressive statistical representations.

Figure 4: The log spectral densities of the DKL-SM and DKL-SE base kernels are in black and red, respectively.

In Figure 4 we show the spectral density (the Fourier transform) of the base kernels learned through our deep kernel learning method. The expressive spectral mixture (SM) kernel discovers a structure with two peaks in the frequency domain. The RBF kernel is only able to use a single Gaussian in the spectral domain, centred at the origin. In an attempt to capture the significant mass near a frequency of 25, the RBF kernel spectral density spreads itself across the whole frequency domain, missing the important local correlations near a frequency s = 0, thus erroneously discarding much of the network features as white noise, since a broad spectral peak corresponds to a short length-scale. This result provides intuition for why spectral mixture base kernels generally perform much better than RBF base kernels, despite the flexibility of the deep architecture.

We further see the benefit of an SM base kernel in Figure 5, where we show the learned covariance matrices constructed from the whole deep kernels (composition of base kernel and deep architecture) for RBF and SM base kernels. The covariance matrix is evaluated on a set of test inputs, where we randomly sample 400 instances from the test set and sort them in terms of the orientation angles of the input faces. We see that the deep kernels with both RBF and SM base kernels discover that faces with similar rotation angles are highly correlated, concentrating their largest entries on the diagonal (i.e., face pairs with similar orientations). Deep kernel learning with an SM base kernel captures these correlations more strongly than the RBF base kernel, which is somewhat more diffuse.

In Figure 5, right panel, we also show the learned covariance matrix for an RBF kernel with a standard Gaussian process applied to the raw data inputs. We see that the entries are very diffuse. In essence, through deep kernel learning, we can learn a metric where faces with similar rotation angles are highly correlated, and thus overcome the fundamental limitations of a Euclidean distance metric (used by standard kernels), where similar rotation angles are not particularly correlated, regardless of what hyper-parameters are learned with Euclidean kernels.

We next measure the scalability of our model. Figure 3, middle panel, shows the runtimes in seconds, as a function of training instances, for evaluating the objective and any relevant derivatives. We see that, with the scalable KISS-GP, the joint model achieves a roughly linear asymptotic scaling, with a slope of 1. In Figure 3, right panel, we show how the total training time (i.e., the time for CNN pre-training plus the time for DKL with CNN architecture joint training) changes with varying the data size n. In addition to the linear scaling which is necessary for modeling large data, the fixed added time in combining KISS-GP with CNNs is modest, especially considering the gains in performance and expressive power.

5.3 Digit magnitude extraction

We map images of handwritten digits to a single real value that is as close as possible to the integer represented by the digit in the image, as in Salakhutdinov and Hinton (2008). The MNIST digit dataset contains 60,000 training and 10,000 test 28 × 28 images of ten handwritten digits (0 to 9). We used a convolutional neural network with a similar architecture as LeNet (LeCun et al., 1998) (detailed in the supplementary material). Table 2 shows that a CNN performs considerably better than GP and DBN+GP, and DKL (with CNN architecture) further improves over CNN.

5.4 Step function recovery

Figure 5: Left: The induced covariance matrix using DKL-SM kernel on a set of test cases, where the test samples are ordered by the orientations of the input faces. Middle: The respective covariance matrix using DKL-RBF kernel. Right: The respective covariance matrix using regular RBF kernel. The models are trained with n = 12,000, and Q = 4 for the SM base kernel.

Figure 6: Recovering a step function. We show the predictive mean and 95% of the predictive probability mass for regular GPs with RBF and SM kernels, and DKL with SM base kernel. We set Q = 4 for SM kernels.

We have so far considered RMSE for comparison to alternative methods where posterior predictive distributions are not readily available, or on problems where RMSE has historically been used as a benchmark. However, an advantage of DKL over stand-alone deep architectures is the ability to naturally produce a posterior predictive distribution, which is especially useful in applications such as reinforcement learning and Bayesian optimisation. In Figure 6, we consider an example where we use DKL to learn the posterior predictive distribution for a step function with many challenging discontinuities. This problem is particularly difficult for conventional GP approaches, due to strong smoothness assumptions intrinsic to popular kernels. GPs with SM kernels improve upon RBF kernels, but neither can properly adapt to the many sharp changes in covariance structure. By contrast, DKL-SM accurately encodes the discontinuities of the function, and has reasonable uncertainty over the whole domain. Further details are in the supplement.

6 Discussion

We have explored scalable deep kernels, which combine the structural properties of deep architectures with the non-parametric flexibility of kernel methods. In particular, we transform the inputs of a base kernel with a deep architecture, and then leverage local kernel interpolation, inducing points, and structure exploiting algebra (e.g., Kronecker and Toeplitz methods) for a scalable kernel representation. These scalable kernels can then be combined with Gaussian process inference and learning procedures for O(n) training and O(1) testing time. Moreover, we use spectral mixture covariances as a base kernel, which provides a significant additional boost in representational power. Overall, our scalable deep kernels can be used in place of standard kernels, following the same inference and learning procedures, but with benefits in expressive power and efficiency. We show on a wide range of experiments the general applicability and practical significance of our approach, consistently outperforming scalable GPs with expressive kernels, stand-alone DNNs, and GPs applied to the outputs of trained DNNs.

A major challenge in developing expressive kernel learning approaches is the Euclidean and absolute distance based metrics which are pervasive in most families of kernel functions, such as the ARD and Matérn kernels. Indeed, although intuitive in some cases, one cannot expect Euclidean and absolute distance as measures of similarity to be generally applicable, and they are especially problematic in high dimensional input spaces (Aggarwal et al., 2001). Modern approaches attempt to learn a flexible parametric family, for example, through weighted combinations of known kernels (e.g., Gönen and Alpaydın, 2011), but are still fundamentally limited to these standard notions of distance. As we have seen in the Olivetti faces examples, our approach allows for the whole functional form of the metric to be learned in a flexible manner, through expressive transformations of the input space. We expect such metric learning to be particularly valuable in high dimensional classification problems, which we view as a promising direction for future research. We hope that this work will help bring together research on neural networks and kernel methods, to inspire many new models and unifying perspectives which combine the complementary advantages of these approaches.


Acknowledgements: We thank NIH R01GM093156 and NIH R01GM087694, NSF IIS-1218282, and ONR N000141410684 grants for support.

References

Aggarwal, C. C., Hinneburg, A., and Keim, D. A. (2001). On the surprising behavior of distance metrics in high dimensional space. Springer.

Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning.

Calandra, R., Peters, J., Rasmussen, C. E., and Deisenroth, M. P. (2014). Manifold Gaussian processes for regression. arXiv preprint arXiv:1402.5876.

Damianou, A. and Lawrence, N. (2013). Deep Gaussian processes. In Artificial Intelligence and Statistics.

Gönen, M. and Alpaydın, E. (2011). Multiple kernel learning algorithms. Journal of Machine Learning Research, 12:2211–2268.

Hinton, G. E., Deng, L., Yu, D., Dahl, G. E., Mohamed, A.-r., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N., and Kingsbury, B. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97.

Hinton, G. E., Osindero, S., and Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554.

Huang, W., Zhao, D., Sun, F., Liu, H., and Chang, E. (2015). Scalable Gaussian process regression using deep neural networks. In Proceedings of the 24th International Conference on Artificial Intelligence, pages 3576–3582. AAAI Press.

Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. (2014). Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093.

Kiros, R., Salakhutdinov, R., and Zemel, R. (2014). Unifying visual-semantic embeddings with multimodal neural language models. TACL.

Krizhevsky, A., Sutskever, I., and Hinton, G. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems.

Le, Q., Sarlos, T., and Smola, A. (2013). Fastfood – computing Hilbert space expansions in loglinear time. In Proceedings of the 30th International Conference on Machine Learning, pages 244–252.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.

Lloyd, J. R., Duvenaud, D., Grosse, R., Tenenbaum, J. B., and Ghahramani, Z. (2014). Automatic construction and natural-language description of nonparametric regression models. In Association for the Advancement of Artificial Intelligence (AAAI).

MacKay, D. J. (1998). Introduction to Gaussian processes. In Bishop, C. M., editor, Neural Networks and Machine Learning, chapter 11, pages 133–165. Springer-Verlag.

Micchelli, C. A., Xu, Y., and Zhang, H. (2006). Universal kernels. The Journal of Machine Learning Research, 7:2651–2667.

Neal, R. (1996). Bayesian Learning for Neural Networks. Springer Verlag.

Quiñonero-Candela, J. and Rasmussen, C. (2005). A unifying view of sparse approximate Gaussian process regression. The Journal of Machine Learning Research, 6:1939–1959.

Rasmussen, C. E. and Ghahramani, Z. (2001). Occam's razor. In Neural Information Processing Systems (NIPS).

Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. The MIT Press.

Saatchi, Y. (2011). Scalable Inference for Structured Gaussian Process Models. PhD thesis, University of Cambridge.

Salakhutdinov, R. and Hinton, G. (2008). Using deep belief nets to learn covariance kernels for Gaussian processes. Advances in Neural Information Processing Systems, 20:1249–1256.

Snoek, J., Rippel, O., Swersky, K., Kiros, R., Satish, N., Sundaram, N., Patwary, M., Ali, M., and Adams, R. P. (2015). Scalable Bayesian optimization using deep neural networks. In International Conference on Machine Learning.

Socher, R., Huang, E., Pennington, J., Ng, A., and Manning, C. (2011). Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Advances in Neural Information Processing Systems 24, pages 801–809.

Wilson, A. G. (2014). Covariance kernels for fast automatic pattern discovery and extrapolation with Gaussian processes. PhD thesis, University of Cambridge.

Wilson, A. G. and Adams, R. P. (2013). Gaussian process kernels for pattern discovery and extrapolation. International Conference on Machine Learning (ICML).

Wilson, A. G., Dann, C., and Nickisch, H. (2015). Thoughts on massively scalable Gaussian processes. Technical Report, Carnegie Mellon University. http://www.cs.cmu.edu/~andrewgw/msgp.html.

Wilson, A. G., Knowles, D. A., and Ghahramani, Z. (2012). Gaussian process regression networks. In International Conference on Machine Learning (ICML), Edinburgh. Omnipress.

Wilson, A. G. and Nickisch, H. (2015). Kernel interpolation for scalable structured Gaussian processes (KISS-GP). International Conference on Machine Learning (ICML).

Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A. C., Salakhutdinov, R., Zemel, R. S., and Bengio, Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. ICML.

Yang, Z., Moczulski, M., Denil, M., de Freitas, N., Smola, A., Song, L., and Wang, Z. (2014). Deep fried convnets. arXiv preprint arXiv:1412.7149.

Yang, Z., Smola, A. J., Song, L., and Wilson, A. G. (2015). A la carte – learning fast kernels. Artificial Intelligence and Statistics.
