Distribution of Gaussian Process Arc Lengths
Justin D. Bewsher (University of Oxford), Alessandra Tosi (Mind Foundry, Oxford), Michael A. Osborne (University of Oxford), Stephen J. Roberts (University of Oxford)
Abstract

We present the first treatment of the arc length of the Gaussian Process (gp) with more than a single output dimension. Gps are commonly used for tasks such as trajectory modelling, where path length is a crucial quantity of interest. Previously, only paths in one dimension have been considered, with no theoretical consideration of higher-dimensional problems. We fill the gap in the existing literature by deriving the moments of the arc length for a stationary gp with multiple output dimensions. A new method is used to derive the mean of a one-dimensional gp over a finite interval, by considering the distribution of the arc length integrand. This technique is used to derive an approximate distribution over the arc length of a vector-valued gp in $\mathbb{R}^n$ by moment matching the distribution. Numerical simulations confirm our theoretical derivations.

1 INTRODUCTION

The authors believe that an understanding of the arc length properties of a gp will open up promising avenues of research. Arc length statistics have been used in the analysis of multivariate time series [21]. In [19], the authors minimise the arc length of a deterministic curve, which then implicitly defines a gp. In another related paper [7], the authors use gps as approximations to geodesics and compute arc lengths using a naïve Monte Carlo method.

We envision the arc length as a cost function in Bayesian optimisation, as a tool in path-planning problems [11], and as a way to construct meaningful features from functional data.

Consider a Euclidean space $X = \mathbb{R}^n$ and a differentiable injective function $\gamma : [0, T] \to \mathbb{R}^n$. Then the image of $\gamma$ is a curve with length
\[ \mathrm{length}(\gamma) = \int_0^T |\gamma'(t)| \, dt. \tag{1} \]
Importantly, the length of the curve is independent of the choice of parametrization of the curve [4]. For the specific case where $X = \mathbb{R}^2$, with the parametrization in terms of $t$, $\gamma = (y(t), x(t))$, we have:
\[ \mathrm{length}(\gamma) = \int_0^T |\gamma'(t)| \, dt = \int_0^T \sqrt{y'(t)^2 + x'(t)^2} \, dt. \tag{2} \]
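As a quick numerical sanity check of (1)–(2) (our illustration, not part of the paper): approximating a curve by a fine polyline and summing the chord lengths recovers the circumference of the unit circle.

```python
import numpy as np

# Numerical check of Eqs. (1)-(2): approximate a curve in R^2 by a fine
# polyline and sum the chord lengths. For the unit circle
# gamma(t) = (cos t, sin t), t in [0, 2*pi], the true length is 2*pi.
t = np.linspace(0.0, 2 * np.pi, 10_001)
x, y = np.cos(t), np.sin(t)

# Chord-length sum: sum_i sqrt(dx_i^2 + dy_i^2) -> integral of |gamma'(t)| dt.
length = np.sum(np.sqrt(np.diff(x) ** 2 + np.diff(y) ** 2))
print(length)  # ~ 6.28318 (= 2*pi)
```

The chord sum slightly underestimates the true length for any finite grid, but the error shrinks quadratically with the step size.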
If we can write $y = f(x)$, $x = t$, then our expression reduces to the commonly known expression for the arc length of a function:
\[ \mathrm{length}(\gamma) = s = \int_0^T |\gamma'(t)| \, dt = \int_0^T \sqrt{1 + (f'(t))^2} \, dt, \tag{3} \]
where we have introduced $s$ as a shorthand for the length of our curve. In some cases the exact form of $s$ can be computed; for more complicated curves, such as Bézier and spline curves, we must appeal to numerical methods to compute the length.

Gaussian Processes (gps) [22] are a ubiquitous tool in machine learning. They provide a flexible non-parametric approach to non-linear data modelling. Gps have been used in a number of machine learning problems, including latent variable modelling [10], dynamical time-series modelling [20] and Bayesian optimisation [18].

At present, a gap exists in the literature; we fill that gap by providing the first analysis of the moments of the arc length of a vector-valued gp. Previous work tackles only the univariate case, making it inapplicable to important applications in, for example, medical vision (or brain imaging) [6] and path planning [15].

Our interest lies in considering the length of a function modelled with a gp. Intuition suggests that the length of a gp will concentrate around the mean function, with the statistical properties dictated by the choice of kernel and the corresponding hyperparameters.

Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS) 2017, Fort Lauderdale, Florida, USA. JMLR: W&CP volume 54. Copyright 2017 by the author(s).
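This intuition can be checked empirically. The sketch below (ours; the squared-exponential kernel with unit amplitude and length-scale 0.2 is an assumed illustrative choice) draws sample paths from a zero-mean gp prior and evaluates the discretised arc length of (3) for each; the lengths concentrate, with a spread small relative to the mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw sample paths f ~ GP(0, k) on a grid over [0, 1], using a
# squared-exponential kernel (assumed illustrative choice), and compute
# the discretised arc length of Eq. (3) for each sample.
t = np.linspace(0.0, 1.0, 200)
ell = 0.2  # length-scale (assumed)
K = np.exp(-0.5 * (t[:, None] - t[None, :]) ** 2 / ell**2)
L = np.linalg.cholesky(K + 1e-6 * np.eye(len(t)))  # jitter for stability

samples = L @ rng.standard_normal((len(t), 2000))  # 2000 sample paths

# s ~= sum_i sqrt(dt_i^2 + df_i^2), the polyline version of Eq. (3).
dt = np.diff(t)[:, None]
df = np.diff(samples, axis=0)
s = np.sum(np.sqrt(dt**2 + df**2), axis=0)

# The arc lengths concentrate: the spread is small relative to the mean.
print(s.mean(), s.std())
```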
Previous work [2] considered the derivative process $f'(t)$, which is itself a gp. A direct calculation was performed and an exact form for the mean was obtained in terms of modified Bessel functions; a form for the variance is also presented. This result is also derived in [14, 3]. Analysis has also been presented on the arc length of a high-level excursion from the mean [16].

However, other important questions have not been explored within the literature. In particular, the shape of the distribution has not been communicated, computing the arc length of a posterior gp has not been addressed, nor has anyone considered the arc length of gps in anything other than $\mathbb{R}$. These issues are addressed within this paper; we present a new derivation of the mean of a one-dimensional gp and derive the moments of a gp in $\mathbb{R}^n$.

The paper is structured as follows. In Section 2 we review the theory of gps and introduce the notation necessary to deal with gps defined on $\mathbb{R}^n$. In Section 3 we examine the one-dimensional case, deriving a distribution over the arc length increment before computing the mean and variance of the arc length. In Section 4 we consider the general case.

The derivative of the posterior mean can be calculated:
\[ \frac{\partial m_*}{\partial x_*} = \frac{\partial k(x_*, X)}{\partial x_*} \left( k(X, X) + \sigma^2 I \right)^{-1} y, \tag{8} \]
wherever the derivative of the kernel function can be calculated. The covariance for the derivative process can likewise be derived, and hence a full distribution over the derivative process can be specified.

Alternatively, a derivative gp can be defined for any twice-differentiable kernel in terms of our prior distribution. If $f \sim \mathcal{GP}(\mu, k)$, then we write the derivative process as
\[ f' \sim \mathcal{GP}(\partial \mu, \partial^2 k). \tag{9} \]
The auto-correlation, $\rho(\tau)$, of the derivative process is related to the auto-correlation of the original process via the relation [13]:
\[ \rho_{f'}(\tau) = -\frac{d^2}{d\tau^2} \rho_f(\tau) = \frac{\partial^2}{\partial x \, \partial x'} k(x - x'), \tag{10} \]
where $\tau = x - x'$. The variance of $f'$ is therefore given by $\sigma_{f'}^2 = \rho_{f'}(0)$.
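For a concrete example (ours, under assumed hyperparameters): taking the squared-exponential kernel $k(\tau) = \sigma^2 \exp(-\tau^2 / 2\ell^2)$, relation (10) gives $\sigma_{f'}^2 = -k''(0) = \sigma^2/\ell^2$, which a finite-difference check confirms.

```python
import numpy as np

# For the squared-exponential kernel k(tau) = sigma^2 exp(-tau^2 / (2 ell^2)),
# Eq. (10) gives the derivative-process variance
#     sigma_{f'}^2 = rho_{f'}(0) = -k''(0) = sigma^2 / ell^2.
# We confirm this with a central second difference of k at tau = 0.
sigma, ell = 1.5, 0.3  # assumed hyperparameters

def k(tau):
    return sigma**2 * np.exp(-0.5 * tau**2 / ell**2)

h = 1e-4
kpp0 = (k(h) - 2.0 * k(0.0) + k(-h)) / h**2  # ~ k''(0)
var_fprime = -kpp0

print(var_fprime, sigma**2 / ell**2)  # both ~ 25.0
```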
In Section 4, where a closed-form distribution over the arc length increment is not available, we provide a moment-matched approximation that proves faithful to the true distribution. This distribution allows us to compute the corresponding moments of the arc length. Section 5 presents numerical simulations demonstrating the theoretical results. Finally, we conclude with thoughts on the use of arc length priors.

2 GAUSSIAN PROCESSES

2.1 Single Output Gaussian Processes

Consider a stochastic process $f : X \to \mathbb{R}$. If $f$ is a gp with mean function $\mu$ and kernel $k$, we write
\[ f \sim \mathcal{GP}(\mu, k). \tag{4} \]
We can think of a gp as an extension of the multivariate Gaussian distribution to function values; as in the multivariate case, a gp is completely specified by its mean and covariance function. For a detailed introduction see [22]. Given a set of observations $S = \{x_i, y_i\}_{i=1}^N$ with Gaussian noise, the posterior distribution for an unseen datum $x_*$ is
\[ p(f(x_*) \mid S, x_*) = \mathcal{N}\big(f(x_*); m(x_*), C(x_*, x_*)\big). \tag{5} \]

2.2 Vector Valued Gaussian Processes

The development of multi-output gps proceeds in a manner similar to the single-output case; for a detailed review see [1]. The outputs are random variables associated with different processes evaluated at potentially different values of $x$. We consider a vector-valued gp:
\[ \mathbf{f} \sim \mathcal{GP}(\mathbf{m}, \mathbf{K}), \tag{11} \]
where $\mathbf{m} \in \mathbb{R}^D$ is the mean vector whose components $\{m_d(x)\}_{d=1}^D$ are the mean functions associated with each output, and $\mathbf{K}$ is now a positive-definite matrix-valued function: $(\mathbf{K}(x, x'))_{d,d'}$ is the covariance between $f_d(x)$ and $f_{d'}(x')$. Given inputs $X$, our prior over $\mathbf{f}(X)$ is now
\[ \mathbf{f}(X) \sim \mathcal{N}(\mathbf{m}(X), \mathbf{K}(X, X)). \tag{12} \]
$\mathbf{m}(X)$ is a $DN$-length vector that concatenates the mean vectors for each output, and $\mathbf{K}(X, X)$ is an $ND \times ND$ block-partitioned matrix. In the vector-valued case the predictive equations for an unseen datum, $x_*$, become:
\[ \mathbf{m}(x_*) = \mathbf{K}_{x_*}^T (\mathbf{K}(X, X) + \Sigma)^{-1} \mathbf{y}, \tag{13} \]
\[ \mathbf{C}(x_*, x_*) = \mathbf{K}(x_*, x_*) - \mathbf{K}_{x_*}^T (\mathbf{K}(X, X) + \Sigma)^{-1} \mathbf{K}_{x_*}. \tag{14} \]
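A minimal sketch (ours) of the predictive equations (13)–(14) for $D = 2$ outputs, assuming for illustration a separable covariance $B \otimes k$, i.i.d. per-output noise collected in the block-diagonal $\Sigma$, and made-up targets and hyperparameters:

```python
import numpy as np

# Sketch of the vector-valued predictive equations (13)-(14) for D = 2
# outputs. Assumptions (ours, for illustration): a separable covariance
# K(X, X) = B (x) k(X, X), i.i.d. per-output noise collected in the
# block-diagonal Sigma, and made-up targets.

def k(a, b, ell=0.3):
    # squared-exponential kernel between input vectors a and b
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

X = np.linspace(0.0, 1.0, 8)             # training inputs
B = np.array([[1.0, 0.6], [0.6, 1.0]])   # output-correlation matrix
noise = np.array([0.01, 0.04])           # per-output noise variances

K_XX = np.kron(B, k(X, X))                        # DN x DN block covariance
Sigma = np.kron(np.diag(noise), np.eye(len(X)))   # block-diagonal noise

# Targets stacked output-by-output: y = [y_1(X), y_2(X)].
y = np.concatenate([np.sin(2 * np.pi * X), np.cos(2 * np.pi * X)])

x_star = np.array([0.5])
K_x = np.kron(B, k(X, x_star))  # DN x D cross-covariance

A = np.linalg.solve(K_XX + Sigma, K_x)
m_star = A.T @ y                                      # Eq. (13)
C_star = np.kron(B, k(x_star, x_star)) - A.T @ K_x    # Eq. (14)
print(m_star, np.diag(C_star))
```

Note the posterior variances are smaller than the prior variances (the diagonal of $B$), as expected after conditioning on data.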
Here $\Sigma$ is a block-diagonal matrix with the prior noise of each output along the diagonal. For a single output ($D = 1$), letting $X = [x_1, \ldots, x_n]$ and defining $k_{x_*} = k(X, x_*)$, these reduce to the familiar posterior mean and covariance:
\[ m(x_*) = k_{x_*}^T \left( k(X, X) + \sigma^2 I \right)^{-1} y, \tag{6} \]
\[ C(x_*, x_*) = k(x_*, x_*) - k_{x_*}^T \left( k(X, X) + \sigma^2 I \right)^{-1} k_{x_*}. \tag{7} \]
The problem now focuses on specifying the form of the covariance matrix $\mathbf{K}$. We are interested in separable kernels of the form:
\[ \mathbf{K}(x, x')_{d,d'} = k(x, x') \, k_T(d, d'), \tag{15} \]
where $k$ and $k_T$ are themselves valid kernels. The kernel can then be specified in the form:
\[ \mathbf{K}(x, x') = k(x, x') B, \tag{16} \]
where $B$ is a $D \times D$ matrix. For a data set $X$:
\[ \mathbf{K}(X, X) = B \otimes k(X, X), \tag{17} \]
with $\otimes$ representing the Kronecker product. $B$ specifies the degree of correlation between the outputs. Various choices of $B$ result in what is known as the Intrinsic Model of Coregionalisation (IMC) or the Linear Model of Coregionalisation (LMC).

2.3 Kernel Choices

A gp prior is specified by a choice of kernel, which encodes our belief about the nature of our function behaviour. In the case of infinitely differentiable func-

3.1 Integrand Distribution

We present a new method for deriving the mean and variance of the arc length of a one-dimensional gp by first considering the transformation of a normally distributed variable under the non-linear map $g(x) = (1 + x^2)^{1/2}$. Specifically, we consider the distribution of a normally distributed random variable under the transformation $g$:
\[ Y = g(X) = \sqrt{1 + X^2}, \qquad X \sim \mathcal{N}(\mu, \sigma^2). \tag{19} \]
We consider the more general case where $\mu \neq 0$. Intuitively, we expect our distribution for $Y$ to be a skewed Chi distribution. We are able to directly compute the probability density function for $Y$ by considering the cumulative distribution and using the standard rules for the transformation of probability functions: P(Y <