arXiv:2010.15538v3 [stat.ML] 9 Apr 2021

Matérn Gaussian Processes on Graphs

Viacheslav Borovitskiy∗¹,⁵, Iskander Azangulov∗¹, Alexander Terenin∗², Peter Mostowsky¹, Marc Peter Deisenroth³, Nicolas Durrande⁴

¹ St. Petersburg State University
² Imperial College London
³ Centre for Artificial Intelligence, University College London
⁴ Secondmind
⁵ St. Petersburg Department of Steklov Mathematical Institute of Russian Academy of Sciences

∗ Equal contribution. Code available at: http://github.com/spbu-math-cs/Graph-Gaussian-Processes. Correspondence to: [email protected].

Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS) 2021, San Diego, California, USA. PMLR: Volume 130. Copyright 2021 by the author(s).

Abstract

Gaussian processes are a versatile framework for learning unknown functions in a manner that permits one to utilize prior information about their properties. Although many different Gaussian process models are readily available when the input space is Euclidean, the choice is much more limited for Gaussian processes whose input space is an undirected graph. In this work, we leverage the stochastic partial differential equation characterization of Matérn Gaussian processes, a widely-used model class in the Euclidean setting, to study their analog for undirected graphs. We show that the resulting Gaussian processes inherit various attractive properties of their Euclidean and Riemannian analogs and provide techniques that allow them to be trained using standard methods, such as inducing points. This enables graph Matérn Gaussian processes to be employed in mini-batch and non-conjugate settings, thereby making them more accessible to practitioners and easier to deploy within larger learning frameworks.

1 Introduction

Gaussian process (GP) models have become ubiquitous in machine learning, and have been shown to be a data efficient approach in a wide variety of applications (Rasmussen and Williams, 2006). Key elements behind the success of GP models include their ability to assess and propagate uncertainty, as well as encode different kinds of prior information about the function they seek to approximate. For example, by choosing different covariance kernels, one can encode different degrees of differentiability, or specific patterns, such as periodicity and symmetry. Although the input and output spaces of GP models are typically subsets of $\mathbb{R}$ or $\mathbb{R}^d$, this is by no means a restriction of the GP framework, and it is possible to define models for other types of input or output spaces (Lindgren et al., 2011; Mallasto and Feragen, 2018; Borovitskiy et al., 2020).

In many applications, such as predicting street congestion within a road network, using kernels based on the Euclidean distance between two locations in the city does not make much sense. In particular, locations that are spatially close may have different traffic patterns, for instance, if two nearby roads are disconnected or if traffic is present only in one direction of travel. Here, it is much more natural for the model to account directly for a distance based on the graph structure. In this work, we study GPs whose inputs or outputs are indexed by the vertices of an undirected graph, where each edge between adjacent nodes is assigned a positive weight.

Gaussian Markov random fields (GMRF) (Rue and Held, 2005) provide a canonical framework for such settings. A GMRF builds a graph GP by introducing a Markov structure on the graph's vertices, and results in models and algorithms that are computationally efficient. These constructions are well-defined and effective, but they require Markovian assumptions, which limit model flexibility. Although it may be tempting to replace the Euclidean distance, which can typically be found in the expression of stationary kernels, by the graph distance, this typically does not result in a well-defined covariance kernel (Feragen et al., 2015).¹
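The caveat about substituting graph distance into a stationary kernel can be checked on a small case. The sketch below is only an illustration under our own assumptions: the 4-cycle graph, the unit edge weights, the squared exponential form, and the length scale `KAPPA = 1.5` are hypothetical choices, not taken from the paper. It builds the candidate "kernel" matrix from shortest-path distances and evaluates one quadratic form, which comes out negative, so the matrix is not positive semi-definite.

```python
import math

# Shortest-path distances on the 4-cycle 0-1-2-3-0 with unit edge weights.
# The graph and the length scale are illustrative assumptions.
GRAPH_DIST = [
    [0, 1, 2, 1],
    [1, 0, 1, 2],
    [2, 1, 0, 1],
    [1, 2, 1, 0],
]
KAPPA = 1.5


def sq_exp_on_graph_distance(dist, kappa):
    """Plug graph distance d into exp(-d^2 / (2 kappa^2)), entry by entry."""
    return [[math.exp(-d * d / (2.0 * kappa ** 2)) for d in row] for row in dist]


def quadratic_form(matrix, v):
    """Return v^T M v for a square matrix stored as nested lists."""
    n = len(v)
    return sum(v[i] * matrix[i][j] * v[j] for i in range(n) for j in range(n))


K = sq_exp_on_graph_distance(GRAPH_DIST, KAPPA)

# Alternating signs around the cycle: a valid covariance matrix would give
# a non-negative value for every vector, including this one.
v = [1.0, -1.0, 1.0, -1.0]
q = quadratic_form(K, v)
print(f"v^T K v = {q:.4f}")
```

For this vector the quadratic form equals 4(1 - 2a + b), with a and b the kernel values at distances 1 and 2, which makes the sign easy to check by hand for other length scales.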
As a consequence, working with GPs on graphs requires one to define bespoke kernels, which, for graphs with finite sets of nodes, can be viewed as parameterized structured covariance matrices that encode dependence between vertices. A few covariance structures dedicated to graph GPs have been explored in the literature, such as the diffusion kernel or random walk kernels (Kondor and Lafferty, 2002; Vishwanathan et al., 2010), but the available choices are limited compared to typical Euclidean input spaces and this results in impaired modeling abilities.

¹ Approaches of this type can still be employed, provided one introduces additional restrictions on the class of graphs and/or on allowable ranges for parameters such as length scale. See Anderes et al. (2020) for an example.

In this work, we study graph analogs of kernels from the Matérn family, which are among the most commonly used kernels for Euclidean input spaces (Rasmussen and Williams, 2006; Stein, 1999). These can be used as GP input covariances or GP output cross-covariances. Our approach is related to Whittle (1963), Lindgren et al. (2011), and Borovitskiy et al. (2020), where GPs with Matérn kernels are defined on Euclidean and Riemannian manifolds via their stochastic partial differential equation (SPDE) representation. For the graph Matérn GPs with integer smoothness parameters we obtain sparse precision matrices that can be exploited for improving the computational speed. For example, they can benefit from the well-established GMRF framework (Rue and Held, 2005) or from recent advances in non-conjugate GP inference on graphs (Durrande et al., 2019). As an alternative, we present a Fourier feature approach to building Matérn kernels on graphs with its own set of advantages, such as hyperparameter optimization without incurring the cost of computing a matrix inverse at each optimization step.

We also discuss important properties of the graph Matérn kernels, such as their convergence to the Euclidean and Riemannian Matérn kernels when the graph becomes more and more dense. In particular, this allows graph Matérn kernels to be used as a substitute for manifold Matérn kernels (Borovitskiy et al., 2020) when the manifold in question is obtained as intermediate output of an upstream manifold learning algorithm in the form of a graph.

2 Gaussian processes

Let $X$ be a set. A random function $f : X \to \mathbb{R}$ is a Gaussian process $f \sim \mathrm{GP}(\mu, k)$ with mean function $\mu(\cdot)$ and kernel $k(\cdot, \cdot)$ if, for any finite set of points $x \in X^n$, the random vector $f(x)$ is multivariate Gaussian with mean vector $\mu = \mu(x)$ and covariance matrix $K_{xx} = k(x, x)$. Without loss of generality, we assume that the prior mean $\mu$ is zero.

For a given set of training data $(x_i, y_i)$, we define the model $y_i = f(x_i) + \varepsilon_i$ where $f \sim \mathrm{GP}(0, k)$ and $\varepsilon_i \sim N(0, \sigma^2)$. The posterior of $f$ given the observations is another GP. Its conditional mean and covariance are

$$\mu_{|y}(\cdot) = K_{\cdot x}(K_{xx} + \sigma^2 I)^{-1} y \tag{1}$$

$$k_{|y}(\cdot, \cdot') = k(\cdot, \cdot') - K_{\cdot x}(K_{xx} + \sigma^2 I)^{-1} K_{x \cdot'} \tag{2}$$

which uniquely characterize the posterior distribution (Rasmussen and Williams, 2006). Following Wilson et al. (2020), posterior sample paths can be written as

$$f(\cdot) \mid y = f(\cdot) + K_{\cdot x}(K_{xx} + \sigma^2 I)^{-1}(y - f(x) - \varepsilon) \tag{3}$$

where $(\cdot)$ is an arbitrary set of locations, $f$ on the right-hand side is a sample from the prior, and $\varepsilon \sim N(0, \sigma^2 I)$.

2.1 The Matérn kernel

When $X = \mathbb{R}^d$ and $\tau = x - x'$, Matérn kernels are defined as

$$k_\nu(x, x') = \sigma^2 \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\sqrt{2\nu}\,\frac{\lVert\tau\rVert}{\kappa}\right)^{\nu} K_\nu\!\left(\sqrt{2\nu}\,\frac{\lVert\tau\rVert}{\kappa}\right) \tag{4}$$

where $K_\nu$ is the modified Bessel function of the second kind (Gradshteyn and Ryzhik, 2014). The parameters $\sigma^2$, $\kappa$ and $\nu$ are positive scalars that have a natural interpretation: $\sigma^2$ is the variance of the GP, the length-scale $\kappa$ controls how distances are measured in the input space, and $\nu$ determines mean-square differentiability of the GP (Rasmussen and Williams, 2006). As $\nu \to \infty$, the Matérn kernel converges to the widely-used squared exponential kernel

$$k_\infty(x, x') = \sigma^2 \exp\left(-\frac{\lVert\tau\rVert^2}{2\kappa^2}\right). \tag{5}$$

Euclidean Matérn kernels are known to possess favorable asymptotic properties in the large-data regime; see Stein (2010) and Kaufman and Shaby (2013) for details.

In our setting, an important property of Matérn kernels is their connection to stochastic partial differential equations (SPDEs). Whittle (1963) has shown that Matérn GPs on $X = \mathbb{R}^d$ satisfy the SPDE

$$\left(\frac{2\nu}{\kappa^2} - \Delta\right)^{\frac{\nu}{2} + \frac{d}{4}} f = \mathcal{W} \tag{6}$$

for $\nu < \infty$, where $\Delta$ is the Laplacian and $\mathcal{W}$ is Gaussian white noise (Lifshits, 2012) re-normalized by a certain constant; see Lindgren et al. (2011) or Borovitskiy et al. (2020) for details. Similarly, the limiting squared exponential GP satisfies

$$e^{-\frac{\kappa^2}{4}\Delta} f = \mathcal{W} \tag{7}$$

where $e^{-\frac{\kappa^2}{4}\Delta}$ is the (rescaled) heat semigroup (Evans, 2010; Grigoryan, 2009). These equations have been studied as a means to extend Matérn Gaussian processes to Riemannian manifolds, such as the sphere and torus (Lindgren et al., 2011; Borovitskiy et al., 2020). Since these spaces can be discretized to form a graph, we explore the relationship between these settings in the sequel.

2.2 Gaussian processes on graphs

A number of approaches have been proposed to define GPs over a weighted undirected graph $G = (V, E)$.

Let $G$ be a weighted undirected graph whose weights are non-negative; for an unweighted graph, assume all weights are equal to one. Denote its adjacency matrix by $\mathbf{W}$, its diagonal degree matrix by $\mathbf{D}$ with $\mathbf{D}_{ii} = \sum_j \mathbf{W}_{ij}$, and define the graph Laplacian by

$$\boldsymbol{\Delta} = \mathbf{D} - \mathbf{W}. \tag{8}$$

The graph Laplacian is a symmetric, positive semi-definite matrix, which we view as a linear operator acting on a $|V|$-dimensional real space. Note that this operator should be viewed as an analog of $-\Delta$ in the Euclidean or Riemannian setting.² Notwithstanding
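The graph Laplacian of equation (8) can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the 4-vertex weighted graph `W`, the test vector `x`, and the helper names are hypothetical. Besides building D - W, the sketch checks the standard identity that the quadratic form of the Laplacian equals half the weighted sum of squared differences across all ordered vertex pairs, which is one way to see why the matrix is positive semi-definite.

```python
def graph_laplacian(W):
    """Return D - W, where D is the diagonal matrix of weighted degrees."""
    n = len(W)
    degrees = [sum(row) for row in W]
    return [
        [(degrees[i] if i == j else 0.0) - W[i][j] for j in range(n)]
        for i in range(n)
    ]


def quadratic_form(matrix, x):
    """Return x^T M x for a square matrix stored as nested lists."""
    n = len(x)
    return sum(x[i] * matrix[i][j] * x[j] for i in range(n) for j in range(n))


# Symmetric adjacency matrix of a small weighted undirected graph
# (an assumed example; zero entries mean "no edge").
W = [
    [0.0, 2.0, 1.0, 0.0],
    [2.0, 0.0, 0.0, 3.0],
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 3.0, 1.0, 0.0],
]
L = graph_laplacian(W)

# x^T (D - W) x should equal 0.5 * sum_ij W_ij (x_i - x_j)^2, which is >= 0.
x = [1.0, -2.0, 0.5, 3.0]
lhs = quadratic_form(L, x)
rhs = 0.5 * sum(
    W[i][j] * (x[i] - x[j]) ** 2 for i in range(4) for j in range(4)
)
print(L)
print(lhs, rhs)
```

Each row of the resulting matrix sums to zero, so the constant vector lies in its null space; this is the discrete counterpart of the Laplacian annihilating constant functions.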
