Distribution of Gaussian Process Arc Lengths

Justin D. Bewsher (University of Oxford), Alessandra Tosi (Mind Foundry, Oxford), Michael A. Osborne (University of Oxford), Stephen J. Roberts (University of Oxford)

Abstract

We present the first treatment of the arc length of the Gaussian Process (GP) with more than a single output dimension. GPs are commonly used for tasks such as trajectory modelling, where path length is a crucial quantity of interest. Previously, only paths in one dimension have been considered, with no theoretical consideration of higher dimensional problems. We fill the gap in the existing literature by deriving the moments of the arc length for a stationary GP with multiple output dimensions. A new method is used to derive the mean of a one-dimensional GP over a finite interval, by considering the distribution of the arc length integrand. This technique is used to derive an approximate distribution over the arc length of a vector valued GP in $\mathbb{R}^n$ by moment matching the distribution. Numerical simulations confirm our theoretical derivations.

Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS) 2017, Fort Lauderdale, Florida, USA. JMLR: W&CP volume 54. Copyright 2017 by the author(s).

1 INTRODUCTION

Gaussian Processes (GPs) [22] are a ubiquitous tool in machine learning. They provide a flexible non-parametric approach to non-linear data modelling. GPs have been used in a number of machine learning problems, including latent variable modelling [10], dynamical time-series modelling [20] and Bayesian optimisation [18].

At present, a gap exists in the literature; we fill that gap by providing the first analysis of the moments of the arc length of a vector-valued GP. Previous work tackles only the univariate case, making it inapplicable to important applications in, for example, medical vision (or brain imaging) [6] and path planning [15].

The authors believe that an understanding of the arc length properties of a GP will open up promising avenues of research. Arc length statistics have been used to analyze multivariate time series modelling [21]. In [19], the authors minimize the arc length of a deterministic curve which then implicitly defines a GP. In another related paper [7], the authors use GPs as approximations to geodesics and compute arc lengths using a naïve Monte Carlo method.

We envision the arc length as a cost function in Bayesian optimisation, as a tool in path planning problems [11] and a way to construct meaningful features from functional data.

Consider a Euclidean space $X = \mathbb{R}^n$ and a differentiable injective function $\gamma : [0, T] \to \mathbb{R}^n$. Then the image of the curve, $\gamma$, is a curve with length:

\[ \mathrm{length}(\gamma) = \int_0^T |\gamma'(t)| \, dt. \tag{1} \]

Importantly, the length of the curve is independent of the choice of parametrization of the curve [4]. For the specific case where $X = \mathbb{R}^2$, with the parametrization in terms of $t$, $\gamma = (y(t), x(t))$, we have:

\[ \mathrm{length}(\gamma) = \int_0^T |\gamma'(t)| \, dt = \int_0^T \sqrt{y'(t)^2 + x'(t)^2} \, dt. \tag{2} \]

If we can write $y = f(x)$, $x = t$, then our expression reduces to the commonly known expression for the arc length of a function:

\[ \mathrm{length}(\gamma) = s = \int_0^T |\gamma'(t)| \, dt = \int_0^T \sqrt{1 + (f'(t))^2} \, dt, \tag{3} \]

where we have introduced $s$ as a shorthand for the length of our curve. In some cases the exact form of $s$ can be computed; for more complicated curves, such as Bézier and spline curves, we must appeal to numerical methods to compute the length.

Our interest lies in considering the length of a function modelled with a GP. Intuition suggests that the length of a GP will concentrate around the mean function with the statistical properties dictated by the choice of kernel and the corresponding hyperparameters.
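As a concrete illustration of the numerical route mentioned for Eqs. (1) and (3), the sketch below approximates a curve by a polyline and sums the chord lengths. The function name `arc_length` and the choice of 10,000 segments are assumptions made for this example; the paper does not prescribe a particular numerical scheme.

```python
import math

def arc_length(curve, t_start, t_end, n_segments=10_000):
    """Approximate the length of t -> curve(t) on [t_start, t_end].

    The curve is replaced by a polyline through n_segments + 1 evenly
    spaced parameter values, and the chord lengths are summed.
    """
    total = 0.0
    previous = curve(t_start)
    for i in range(1, n_segments + 1):
        t = t_start + (t_end - t_start) * i / n_segments
        current = curve(t)
        total += math.dist(previous, current)  # Euclidean chord length
        previous = current
    return total

# Graph of f(t) = t^2 on [0, 1], written as the planar curve (t, f(t)).
parabola_length = arc_length(lambda t: (t, t * t), 0.0, 1.0)
# Closed form of Eq. (3) for f(t) = t^2: the integral of sqrt(1 + 4 t^2).
parabola_exact = math.sqrt(5.0) / 2.0 + math.asinh(2.0) / 4.0

# Unit circle traversed once; its length is 2 * pi.
circle_length = arc_length(
    lambda t: (math.cos(t), math.sin(t)), 0.0, 2.0 * math.pi
)
```

The same routine works for curves in any $\mathbb{R}^n$, since `math.dist` accepts points of arbitrary dimension.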

Previous work [2] considered the derivative process $f'(t)$, which is itself a GP. A direct calculation was performed and an exact form for the mean was obtained in terms of modified Bessel functions; a form for the variance is also presented. This result is also derived in [14, 3]. Analysis has been presented on the arc length of a high-level excursion from the mean [16].

However, other important questions have not been explored within the literature. In particular, the shape of the distribution has not been communicated, computing the arc length of a posterior GP has not been addressed, nor has anyone considered the arc length of GPs in anything other than $\mathbb{R}$. These issues are addressed within this paper; we present a new derivation of the mean of a one dimensional GP and derive the moments of a GP in $\mathbb{R}^n$.

The paper is structured as follows. In Section 2 we review the theory of GPs and introduce the notation necessary to deal with GPs defined on $\mathbb{R}^n$. In Section 3 we examine the one dimensional case, deriving a distribution over the arc length increment before computing the mean and variance of the arc length. In Section 4 we consider the general case. A closed form distribution is not possible for the increment, therefore we provide a moment-matched approximation that proves to have high fidelity to the true distribution. This distribution allows us to compute the corresponding moments for the arc length. Section 5 presents numerical simulations demonstrating the theoretical results. Finally we conclude with thoughts on the use of arc length priors.

2 GAUSSIAN PROCESSES

2.1 Single Output Gaussian Processes

Consider a stochastic process from a domain $f : X \to \mathbb{R}$. Then if $f$ is a GP, with mean function $\mu$ and kernel $k$, we write

\[ f \sim \mathcal{GP}(\mu, k). \tag{4} \]

We can think of a GP as an extension of the multivariate Gaussian distribution for function values and, as in the multivariate case, a GP is completely specified by its mean and covariance function. For a detailed introduction see [22]. Given a set of observations $S = \{x_i, y_i\}_{i=1}^N$ with Gaussian noise $\sigma^2$, the posterior distribution for an unseen datum, $x_*$, is

\[ p(f(x_*) \mid S, x_*, \theta) = \mathcal{N}(f(x_*), m(x_*), k(x_*, x_*)), \tag{5} \]

where $\theta$ collects the hyperparameters. Letting $X = [x_1, \ldots, x_n]$, and defining $k_{x_*} = K(X, x_*)$, the posterior mean and covariance are:

\[ m(x_*) = k_{x_*}^T (k(X, X) + \sigma^2 I)^{-1} y \tag{6} \]
\[ C(x_*, x_*) = k(x_*, x_*) - k_{x_*}^T (k(X, X) + \sigma^2 I)^{-1} k_{x_*} \tag{7} \]

The derivative of the posterior mean can be calculated:

\[ \frac{\partial m_*}{\partial x_*} = \frac{\partial k(x_*, X)}{\partial x_*} (k(X, X) + \sigma^2 I)^{-1} y, \tag{8} \]

wherever the derivative of the kernel function can be calculated. The covariance for the derivative process can likewise be derived and hence a full distribution over the derivative process can be specified.

Alternatively, a derivative GP can be defined for any twice-differentiable kernel in terms of our prior distribution. If $f \sim \mathcal{GP}(\mu, k)$, then we write the derivative process as

\[ f' \sim \mathcal{GP}(\partial \mu, \partial^2 k). \tag{9} \]

The auto-correlation, $\rho(\tau)$, of the derivative process is related to the autocorrelation of the original process via the relation [13]:

\[ \rho_{f'}(\tau) = -\frac{d^2}{d\tau^2} \rho_f(\tau) = \frac{\partial^2}{\partial x \, \partial x'} k(x - x'), \tag{10} \]

where $\tau = x - x'$. The variance of $f'$ is therefore given by $\sigma^2_{f'} = \rho_{f'}(0)$.

2.2 Vector Valued Gaussian Processes

The development of the multi-output GP proceeds in a manner similar to the single output case; for a detailed review see [1]. The outputs are random variables associated with different processes evaluated at potentially different values of $x$. We consider a vector valued GP:

\[ \mathbf{f} \sim \mathcal{GP}(\mathbf{m}, \mathbf{K}), \tag{11} \]

where $\mathbf{m} \in \mathbb{R}^D$ is the mean vector, $\{m_d(x)\}_{d=1}^D$ are the mean functions associated with each output, and $\mathbf{K}$ is now a positive definite matrix valued function. $(\mathbf{K}(x, x'))_{d,d'}$ is the covariance between $f_d(x)$ and $f_{d'}(x')$. Given input $X$, our prior over $\mathbf{f}(X)$ is now

\[ \mathbf{f}(X) \sim \mathcal{N}(\mathbf{m}(X), \mathbf{K}(X, X)). \tag{12} \]

$\mathbf{m}(X)$ is a $DN$-length vector that concatenates the mean vectors for each output and $\mathbf{K}(X, X)$ is an $ND \times ND$ block partitioned matrix. In the vector valued case the predictive equations for an unseen datum, $x_*$, become:

\[ \mathbf{m}(x_*) = \mathbf{K}_{x_*}^T (\mathbf{K}(X, X) + \Sigma)^{-1} \mathbf{y} \tag{13} \]
\[ \mathbf{C}(x_*, x_*) = \mathbf{K}(x_*, x_*) - \mathbf{K}_{x_*}^T (\mathbf{K}(X, X) + \Sigma)^{-1} \mathbf{K}_{x_*}, \tag{14} \]

where $\Sigma$ is a block diagonal matrix with the prior noise of each output along the diagonal. The problem now focuses on specifying the form of the covariance matrix $\mathbf{K}$. We are interested in separable kernels of the form:

\[ \mathbf{K}(x, x')_{d,d'} = k(x, x') \, k_T(d, d'), \tag{15} \]

where $k$ and $k_T$ are themselves valid kernels. The kernel can then be specified in the form:

\[ \mathbf{K}(x, x') = k(x, x') B \tag{16} \]

where $B$ is a $D \times D$ matrix. For a data set $X$:

\[ \mathbf{K}(X, X) = B \otimes k(X, X), \tag{17} \]

with $\otimes$ representing the Kronecker product. $B$ specifies the degree of correlation between the outputs. Various choices of $B$ result in what is known as the Intrinsic Model of Coregionalisation (IMC) or Linear Model of Coregionalisation (LMC).

2.3 Kernel Choices

A GP prior is specified by a choice of kernel, which encodes our belief about the nature of our function behaviour. In the case of infinitely differentiable func…

3.1 Integrand Distribution

We present a new method for deriving the mean and variance of the arc length of a one-dimensional GP by first considering the transformation of a normally distributed variable under the non-linear transformation $g(x) = (1 + x^2)^{1/2}$. Specifically, we can consider the distribution of a normally distributed random variable under the transformation $g$:

\[ Y = g(X) = \sqrt{1 + X^2}, \qquad X \sim \mathcal{N}(\mu, \sigma^2). \tag{19} \]

We consider the more general case where $\mu \neq 0$. Intuitively, we expect our distribution for $Y$ to be a skewed distribution. We are able to directly compute the probability density function for $Y$ by considering the cumulative distribution and using the standard rules for the transformation of probability functions:

P (Y

Figure 1: Histogram of samples from $\sqrt{1 + X^2}$, where $X \sim \mathcal{N}(\mu, \sigma^2)$, overlaid with the corresponding distribution. We display the effects of varying $\mu$ and $\sigma$.
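The comparison shown in Figure 1 can be reproduced in a few lines. The sketch below is an illustration rather than the authors' code: it uses the standard change-of-variables result that $Y < y$ exactly when $-\sqrt{y^2 - 1} < X < \sqrt{y^2 - 1}$, and the function names, the seed, and the values $\mu = 0.5$, $\sigma = 1.2$ are assumptions chosen for the example.

```python
import math
import random

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def cdf_y(y, mu, sigma):
    """P(Y < y) for Y = sqrt(1 + X^2) with X ~ N(mu, sigma^2)."""
    if y <= 1.0:
        return 0.0  # Y is never smaller than 1
    r = math.sqrt(y * y - 1.0)  # Y < y  <=>  -r < X < r
    return normal_cdf((r - mu) / sigma) - normal_cdf((-r - mu) / sigma)

def pdf_y(y, mu, sigma):
    """Density of Y, obtained by differentiating cdf_y with respect to y."""
    if y <= 1.0:
        return 0.0
    r = math.sqrt(y * y - 1.0)
    norm = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    bumps = (math.exp(-((r - mu) ** 2) / (2.0 * sigma ** 2))
             + math.exp(-((r + mu) ** 2) / (2.0 * sigma ** 2)))
    return norm * bumps * y / r

# Monte Carlo check with illustrative parameter values.
mu, sigma = 0.5, 1.2
rng = random.Random(0)
samples = [math.sqrt(1.0 + rng.gauss(mu, sigma) ** 2) for _ in range(100_000)]
empirical_cdf = sum(s < 1.5 for s in samples) / len(samples)
analytic_cdf = cdf_y(1.5, mu, sigma)
```

The density has an integrable singularity at $y = 1$, which is why the check compares cumulative probabilities instead of histogram heights near that point.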

3.2 Arc Length Statistics

Having derived expressions for the arc length integrand distribution we are able to evaluate the moments of the arc length. Specifically, we consider a zero mean GP with kernel $K$:

\[ f \sim \mathcal{N}(0, K) \tag{24} \]

The derivative process is a GP [22] defined by:

\[ f' \sim \mathcal{N}(0, \partial^2 K) \tag{25} \]

We take the expectation of the arc length. Since the integrand is non-negative, Fubini's Theorem [5] allows us to interchange the expectation and the integral:

\[ E[s] = E\left[ \int_0^T \sqrt{1 + (f')^2} \, dt \right] \tag{26} \]
\[ \phantom{E[s]} = \int_0^T E\left[ \sqrt{1 + (f')^2} \right] dt. \tag{27} \]

The variance of $f'$ is given by $\sigma^2_{f'} = R_{f'}(0)$. At each point along the integral the expectation of the integrand is the same, therefore we arrive at:

\[ E[s] = \int_0^T \left[ \frac{1}{\sqrt{2\pi}\,\sigma_{f'}} \, \Gamma\!\left(\tfrac{1}{2}\right) U\!\left(\tfrac{1}{2}, 2, \frac{1}{2\sigma^2_{f'}}\right) \right] dt \tag{28} \]
\[ \phantom{E[s]} = T \, \frac{1}{\sqrt{2\pi}\,\sigma_{f'}} \, \Gamma\!\left(\tfrac{1}{2}\right) U\!\left(\tfrac{1}{2}, 2, \frac{1}{2\sigma^2_{f'}}\right), \tag{29} \]

where we have used the expectation of the integrand for the zero-mean case. Using identities related to the confluent hypergeometric function we can rewrite the mean as:

\[ E[s] = \frac{T \exp\!\left(1 / 4\sigma^2_{f'}\right)}{2\sqrt{2\pi}\,\sigma_{f'}} \left[ \mathrm{BF}_0\!\left(\frac{1}{4\sigma^2_{f'}}\right) + \mathrm{BF}_1\!\left(\frac{1}{4\sigma^2_{f'}}\right) \right], \tag{30} \]

where $\mathrm{BF}_i$ is the modified Bessel function of the second kind of order $i$. For a posterior distribution of the arc length, given data observations, we would use Eqn 69 along with the posterior derivative mean, $\mu_{f'} = \partial m_* / \partial x_*$, and variance function $\sigma^2_{f'}$ of the posterior GP, to compute the expected length:

\[ E[s] = \sum_{l=0}^{\infty} \int_0^T \frac{1}{\sqrt{2\pi}\,\sigma_{f'}} \exp\!\left(-\frac{\mu^2_{f'}}{2\sigma^2_{f'}}\right) \frac{\Gamma\!\left(l + \tfrac{1}{2}\right)}{(2l)!} \left(\frac{\mu_{f'}}{\sigma^2_{f'}}\right)^{2l} U\!\left(l + \tfrac{1}{2}, l + 2, \frac{1}{2\sigma^2_{f'}}\right) dt, \tag{31} \]

where $\mu_{f'}$ and $\sigma_{f'}$ depend on $t$. We have derived a closed form expression for the mean of the arc length of a one dimensional zero mean GP, reproducing the original result from [2] whilst providing a way to compute the arc length mean of a GP posterior distribution. The variance involves the computation of the second moment, a calculation involving the bi-variate form of the integrand distribution; we do not derive that in this paper. An alternate derivation is reported in [2].
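To make the zero-mean result concrete, the following sketch evaluates the Bessel-function form of the mean, Eq. (30), and compares it with a direct numerical integration of $E[\sqrt{1 + X^2}]$. It is a standard-library illustration under stated assumptions: the helper names are invented for this example, and the Bessel function is computed from its integral representation $K_\nu(z) = \int_0^\infty e^{-z \cosh t} \cosh(\nu t)\,dt$; in practice a library routine such as SciPy's `scipy.special.kve` would normally take its place.

```python
import math

def scaled_bessel_k(order, z, upper=25.0, n_steps=25_000):
    """exp(z) * K_order(z) for z > 0 by the trapezoid rule.

    The factor exp(z) is folded into the integrand so that large z
    (small derivative variance) does not overflow.
    """
    h = upper / n_steps
    total = 0.0
    for i in range(n_steps + 1):
        t = i * h
        weight = 0.5 if i in (0, n_steps) else 1.0
        decay = math.exp(-z * (math.cosh(t) - 1.0))
        total += weight * decay * math.cosh(order * t)
    return total * h

def mean_arc_length_1d(T, deriv_var):
    """Eq. (30): expected arc length of a zero-mean stationary GP on [0, T]."""
    sigma = math.sqrt(deriv_var)
    z = 1.0 / (4.0 * deriv_var)
    bessel_sum = scaled_bessel_k(0, z) + scaled_bessel_k(1, z)
    return T * bessel_sum / (2.0 * math.sqrt(2.0 * math.pi) * sigma)

def mean_integrand_quadrature(deriv_var, n_steps=40_000):
    """E[sqrt(1 + X^2)] for X ~ N(0, deriv_var), by the trapezoid rule."""
    sigma = math.sqrt(deriv_var)
    lo, hi = -12.0 * sigma, 12.0 * sigma
    h = (hi - lo) / n_steps
    norm = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    total = 0.0
    for i in range(n_steps + 1):
        x = lo + i * h
        weight = 0.5 if i in (0, n_steps) else 1.0
        density = norm * math.exp(-x * x / (2.0 * deriv_var))
        total += weight * math.sqrt(1.0 + x * x) * density
    return total * h

# Squared exponential kernel with unit signal variance and unit length
# scale, for which the derivative variance equals 1.
expected_length = mean_arc_length_1d(T=1.0, deriv_var=1.0)
quadrature_length = 1.0 * mean_integrand_quadrature(1.0)
```

As the derivative variance shrinks the sample paths flatten and the expected length approaches $T$, which gives a quick sanity check on any implementation.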

3.2.1 Kernel Derivatives

The value of the mean arc length is determined solely by the derivative variance, $\sigma^2_{f'}$. For stationary kernels, $k(x, x') = k(x - x')$, this equates to:

\[ \sigma^2_{f'} = \left. \frac{\partial^2}{\partial x \, \partial x'} k(x - x') \right|_{x = x'}. \tag{32} \]

Table 1 lists common kernels [22] together with the derivative process variance expressed in terms of their hyperparameters. The effect of the choice of hyperparameters on the expected length is shown in Figure 2.

Table 1: Derivative process variance, $\sigma^2_{f'}$, in terms of kernel hyperparameters for a range of common kernels. In each case $\sigma^2$ is the output (signal) variance hyperparameter and $\ell$ is the input dimension length scale hyperparameter.

Kernel | Squared Exponential | Matérn, $\nu = 3/2$ | Matérn, $\nu = 5/2$ | Rational Quadratic
$\sigma^2_{f'}$ | $\sigma^2 / \ell^2$ | $3\sigma^2 / \ell^2$ | $5\sigma^2 / 3\ell^2$ | $\sigma^2 / \ell^2$

Figure 2: Values of the expected arc length (colour shading) for various values of the SE kernel parameters. The plot shows the heat map of the log of the arc length to show sufficient detail. The length is dominated by the input scale parameter.

4 MULTI-DIMENSIONAL ARC LENGTH

In this section we present the first treatment of the arc length of a GP in more than one output dimension. We present an approximation to the arc length integrand distribution and use this to compute the moments of the arc length. For the vector case, we now consider a vector GP and its corresponding derivative process:

\[ \mathbf{f} \sim \mathcal{GP}(0, \mathbf{K}), \qquad \mathbf{f}' \sim \mathcal{GP}(0, \partial^2 \mathbf{K}), \tag{33} \]

where $\mathbf{K} = B \otimes k$, with a coregionalisation matrix $B$ and a stationary kernel $k$. The arc length for the vector case is given by:

\[ s = \int_a^b |\mathbf{f}'| \, dt. \tag{34} \]

As we did in the one-dimensional case we first consider the distribution of the arc length integrand $|\mathbf{f}'|$ and then use this to derive the moments of the arc length itself.

4.1 Integrand Distribution

We are interested in the distribution over the arc length. Ultimately we are interested in $\mathbb{R}^3$; however, the theory we present is valid for any $\mathbb{R}^n$. We consider the random variable $W$, defined by:

\[ W = |\mathbf{x}| = (\mathbf{x}^T \mathbf{x})^{1/2} = \sqrt{\sum_i^n x_i^2} \tag{35} \]
\[ \mathbf{x} \sim \mathcal{N}(\mu, \Sigma), \tag{36} \]

with $\mathbf{x}, \mu \in \mathbb{R}^n$ and $\Sigma \in \mathbb{R}^{n \times n}$ a full-rank covariance matrix. This is the square root of the sum of squares of correlated normal variables. It is well known that the sum of squares of independent identically distributed normal variables is Chi-squared distributed and that the corresponding square root is Chi distributed [9]. At first glance it seems that we should easily be able to identify this transformed distribution; however, the full covariance between the elements of $\mathbf{x}$ hinders the derivation of a straightforward distribution.

Substantial work has been done on the distribution of quadratic forms, $Q(\mathbf{x}) = \mathbf{x}^T A \mathbf{x}$ [12], where $\mathbf{x}$ is the $n \times 1$ normal vector defined previously and $A$ is a symmetric $n \times n$ matrix. It is possible to write:

\[ Q(\mathbf{x}) = \mathbf{x}^T A \mathbf{x} = \sum_i^n \lambda_i (U_i + b_i)^2, \tag{37} \]

where the $U_i$ are i.i.d. normal variables with zero mean and unit variance, the $\lambda_i$ are the eigenvalues of $\Sigma$ and $b_i$ is the $i$th component of $\mathbf{b} = P^T \Sigma^{-1/2} \mu$, with $P$ a matrix that diagonalises $\Sigma^{1/2} A \Sigma^{1/2}$.

Observing the summation of the quadratic form in Eqn 37, we see that our distribution is a weighted sum of Chi-squared variables. Unfortunately, there exists no simple closed-form solution for this distribution; however, it is possible to express this distribution via power-series of Laguerre polynomials and some approximations have been used [12].

We note that a Chi-squared distribution is a gamma distributed variable for the case where the shape parameter is $v/2$ and the scale factor is 2. Therefore we will approximate $Q(\mathbf{x}) = \mathbf{x}^T \mathbf{x}$ with a single gamma random variable by moment matching the first two moments. The mean and variance of $Q(\mathbf{x})$ are given by:

\[ E[Q(\mathbf{x})] = \mathrm{tr}(\Sigma) + \mu^T \mu, \tag{38} \]
\[ V[Q(\mathbf{x})] = 2\,\mathrm{tr}(\Sigma\Sigma) + 4 \mu^T \Sigma \mu, \tag{39} \]

where $\mathrm{tr}(\cdot)$ denotes the trace of a matrix. The pdf of a gamma random variable with shape $k_G$ and scale $\theta_G$ is given by

\[ p_G(x : k_G, \theta_G) = \frac{x^{k_G - 1} \exp\!\left(-x / \theta_G\right)}{\theta_G^{k_G} \, \Gamma(k_G)}. \tag{40} \]

The first two moments are:

\[ \mu_G = k_G \theta_G, \qquad \sigma_G^2 = k_G \theta_G^2. \tag{41} \]

Solving for $k_G$ and $\theta_G$:

\[ k_G = \frac{\mu_G^2}{\sigma_G^2}, \qquad \theta_G = \frac{\sigma_G^2}{\mu_G}. \tag{42} \]

Equating moments, we set $\mu_G = E[Q(\mathbf{x})]$ and $\sigma_G^2 = V[Q(\mathbf{x})]$. Thus, $Q$ is approximated as a gamma random variable and we write $Q(\mathbf{x}) \sim \mathrm{Gamma}(k_G, \theta_G)$.

Now we are in a position to consider the quantity $\sqrt{Q}$. Here we use the fact that if a random variable $Q \sim \mathrm{Gamma}(k_G, \theta_G)$, then the random variable $W = \sqrt{Q}$ is a Nakagami random variable $W \sim \mathrm{Nakagami}(m, \Omega)$, with parameters given by $m = k_G$ and $\Omega = k_G \theta_G$. The Nakagami distribution [8] is:

\[ p_{\mathrm{Nak}}(x; m, \Omega) = \frac{2 m^m}{\Gamma(m) \, \Omega^m} \, x^{2m - 1} \exp\!\left(-\frac{m}{\Omega} x^2\right). \tag{43} \]

Using the values for $k_G$ and $\theta_G$ obtained via our moment matched approximation and transforming to the Nakagami distribution, we say $\sqrt{Q}$ is approximated as a Nakagami distribution with parameters:

\[ m = \frac{\mu_G^2}{\sigma_G^2}, \qquad \Omega = \mu_G. \tag{44} \]

In terms of our original distribution $\mathbf{x} \sim \mathcal{N}(\mu, \Sigma)$, we therefore have $W = \sqrt{\mathbf{x}^T \mathbf{x}} \sim \mathrm{Nakagami}(m, \Omega)$, with:

\[ m = \frac{[\mathrm{tr}(\Sigma) + \mu^T \mu]^2}{2\,\mathrm{tr}(\Sigma\Sigma) + 4 \mu^T \Sigma \mu}, \qquad \Omega = \mathrm{tr}(\Sigma) + \mu^T \mu. \tag{45} \]

The mean and variance are:

\[ E[W] = \frac{\Gamma\!\left(m + \tfrac{1}{2}\right)}{\Gamma(m)} \left(\frac{\Omega}{m}\right)^{1/2} \tag{46} \]
\[ V[W] = \Omega \left( 1 - \frac{1}{m} \left( \frac{\Gamma\!\left(m + \tfrac{1}{2}\right)}{\Gamma(m)} \right)^{2} \right). \tag{47} \]

The method we have used to derive the distribution of the arc length integrand is summarised in Eqn 48:

\[ \mathcal{N}(\mu, \Sigma) \;\xrightarrow[\text{Approximate}]{\;Q\;}\; \mathrm{Gamma}(k_G, \theta_G) \;\xrightarrow[\text{Exact}]{\;\sqrt{Q}\;}\; \mathrm{Nakagami}(m, \Omega) \tag{48} \]

Numerical samples of $Q(\mathbf{x})$ and $\sqrt{Q(\mathbf{x})}$ and the pdfs of the corresponding gamma and Nakagami distributions are shown in Figure 3 for $d = 3$. The approximated distributions show a reasonable approximation for a range of $\mu$ and $\Sigma$.

The gamma approximation to the quadratic form is exact when all the eigenvalues of the covariance are identical; in that case we have only a single gamma random variable.

4.2 Arc Length Statistics

We are now in a position to consider the arc length directly. Taking the expectation of the arc length, recalling that expectation is a linear operator and using Fubini's theorem:

\[ E[s] = \int_0^T E\left[ (\mathbf{f}'^T \mathbf{f}')^{1/2} \right] dt. \tag{49} \]

Recalling the form of our kernel as $\mathbf{K}(x, x') = B \otimes k(x, x')$, the infinitesimal distribution of $\mathbf{f}'$ is constant with respect to $t$ with covariance given by:

\[ \Sigma_{f'} = B \otimes \left. \frac{\partial^2}{\partial x \, \partial x'} k(x, x') \right|_{x = x'} = B \sigma^2_{f'}. \tag{50} \]

Therefore the expected arc length is:

\[ E[s] \approx T \, \frac{\Gamma\!\left(m_{f'} + \tfrac{1}{2}\right)}{\Gamma(m_{f'})} \left(\frac{\Omega_{f'}}{m_{f'}}\right)^{1/2}, \tag{51} \]

with

\[ m_{f'} = \frac{[\mathrm{tr}(\Sigma_{f'})]^2}{2\,\mathrm{tr}(\Sigma_{f'} \Sigma_{f'})}, \qquad \Omega_{f'} = \mathrm{tr}(\Sigma_{f'}), \tag{52} \]

where we have used the Nakagami approximation to the arc length integrand to evaluate the mean. As in the one-dimensional case, the expected length of the GP is determined solely by the choice of kernel and the length of the interval. The calculation of the variance requires the second moment:

\[ E[s^2] = \iint E\left[ |\mathbf{f}'_{t_1}| \, |\mathbf{f}'_{t_2}| \right] dt_1 \, dt_2. \tag{53} \]
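The moment-matching recipe of Eqs. (38) to (47) and the expected length of Eq. (51) translate directly into code. The sketch below is an illustration under stated assumptions: the function names are invented for this example, matrices are plain nested lists, and the example values ($B$ equal to the $3 \times 3$ identity and a derivative variance of 3, which corresponds to a Matérn $\nu = 3/2$ kernel with unit hyperparameters in Table 1) are chosen only for demonstration.

```python
import math

def nakagami_params(mu, cov):
    """Moment-matched Nakagami parameters (m, Omega) of |x|, x ~ N(mu, cov)."""
    n = len(mu)
    trace = sum(cov[i][i] for i in range(n))
    trace_sq = sum(cov[i][j] * cov[j][i] for i in range(n) for j in range(n))
    mu_sq = sum(v * v for v in mu)
    mu_cov_mu = sum(mu[i] * cov[i][j] * mu[j]
                    for i in range(n) for j in range(n))
    mean_q = trace + mu_sq                     # Eq. (38)
    var_q = 2.0 * trace_sq + 4.0 * mu_cov_mu   # Eq. (39)
    return mean_q ** 2 / var_q, mean_q         # Eq. (45)

def gamma_ratio(m):
    """Gamma(m + 1/2) / Gamma(m), computed in log space for stability."""
    return math.exp(math.lgamma(m + 0.5) - math.lgamma(m))

def nakagami_mean(m, omega):
    """Eq. (46)."""
    return gamma_ratio(m) * math.sqrt(omega / m)

def nakagami_var(m, omega):
    """Eq. (47)."""
    return omega * (1.0 - gamma_ratio(m) ** 2 / m)

def expected_arc_length(T, B, deriv_var):
    """Eq. (51) for a zero-mean GP with separable kernel K = B (x) k."""
    cov = [[b * deriv_var for b in row] for row in B]  # Eq. (50)
    m, omega = nakagami_params([0.0] * len(B), cov)
    return T * nakagami_mean(m, omega)

# Three independent outputs (B = identity) with derivative variance 3.
B = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
m, omega = nakagami_params([0.0, 0.0, 0.0], B)
length = expected_arc_length(1.0, B, 3.0)
```

With an identity $B$ all eigenvalues of the covariance coincide, so this is the case in which the gamma approximation is exact and the result can be checked against the mean of a scaled Chi distribution.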

Figure 3: Samples from $Q(\mathbf{x})$ and $\sqrt{Q(\mathbf{x})}$ overlaid with the approximated gamma and Nakagami distributions. $\mathbf{x} \sim \mathcal{N}(0, \Sigma)$ in the top row, and $\mathbf{x} \sim \mathcal{N}(\mu, \Sigma)$ in the bottom row, with $\mu$ and $\Sigma$ randomly generated. Similar plots are obtained for different values of $\mu$ and $\Sigma$. The gamma and Nakagami distributions provide a reasonable approximation to the shape of the distribution, whilst capturing the true mean and variance.

Making use of the Nakagami approximation to our integrand, we need the mixed moment of two correlated Nakagami variables. Let us write $|\mathbf{f}'_{t_1}| \approx W_1$, $|\mathbf{f}'_{t_2}| \approx W_2$, with $W_1 \sim \mathrm{Nakagami}(m_{f'}, \Omega_{f'})$ and $W_2 \sim \mathrm{Nakagami}(m_{f'}, \Omega_{f'})$. The mixed moments of two correlated Nakagami variables with the same parameters are given by [17]:

\[ E[W_1^n W_2^l] = \left(\frac{\Omega}{m}\right)^{(n + l)/2} \frac{\Gamma\!\left(m + \tfrac{n}{2}\right) \Gamma\!\left(m + \tfrac{l}{2}\right)}{[\Gamma(m)]^2} \; {}_2F_1\!\left(-\frac{n}{2}, -\frac{l}{2}, m : \rho(\tau)\right), \tag{54} \]

where $\rho(\tau)$ is the correlation between the gamma variables that the Nakagami distribution was derived from and ${}_2F_1$ is the hypergeometric function:

\[ {}_2F_1(a, b, c : z) = \sum_{n=0}^{\infty} \frac{(a)_n (b)_n}{(c)_n} \frac{z^n}{n!} \tag{55} \]

with the Pochhammer symbol $(q)_n$ defined as $(q)_n = q(q + 1) \cdots (q + n - 1)$ and $(q)_0 = 1$. Setting $n = l = 1$, the second moment can now be expressed as a power series in $\rho$:

\[ E[s^2] = \frac{\Omega}{m} \frac{[\Gamma(m + 1/2)]^2}{[\Gamma(m)]^2} \sum_{n=0}^{\infty} \frac{\left(-\tfrac{1}{2}\right)_n \left(-\tfrac{1}{2}\right)_n}{(m)_n \, n!} \int_0^T \!\! \int_0^T \rho(t_1 - t_2)^n \, dt_1 \, dt_2. \tag{56} \]

We derive the correlation function:

\[ \rho(t - t') = \left[ \frac{\partial^2}{\partial t \, \partial t'} k(t, t') \right]^2 \frac{1}{\sigma^4_{f'}}. \tag{57} \]

Eqn 56 can be solved numerically (noting that the two-dimensional integral is readily tackled using traditional methods of quadrature) and the variance is then computed by $V[s] = E[s^2] - E[s]^2$.

4.3 Arc Length Posterior

The moments of the arc length of a GP posterior follow a similar derivation. The posterior mean of the arc length is:

\[ E[s] \approx \int_0^T \frac{\Gamma\!\left(m_{f'} + \tfrac{1}{2}\right)}{\Gamma(m_{f'})} \left(\frac{\Omega_{f'}}{m_{f'}}\right)^{1/2} dt, \tag{58} \]

where $m_{f'}$ and $\Omega_{f'}$ are the Nakagami parameters, which now depend on the mean and covariance functions of the GP posterior, which themselves are functions of $t$. This expression has no closed form and requires a numerical integration (which, again, can be efficiently approximated with quadrature). The posterior second moment is given by:

\[ E[s^2] \approx \sum_{n=0}^{\infty} \frac{\left[\left(-\tfrac{1}{2}\right)_n\right]^2}{n!} \int_0^T \!\! \int_0^T \left(\frac{\Omega_1}{m_1}\right)^{1/2} \left(\frac{\Omega_2}{m_2}\right)^{1/2} \frac{\Gamma(m_1 + 1/2) \, \Gamma(m_2 + 1/2)}{\Gamma(m_1) \, \Gamma(m_2) \, (m_2)_n} \, \rho(|t_1 - t_2|)^n \, dt_1 \, dt_2, \tag{59} \]

where $m_i$ and $\Omega_i$ again depend on the mean and covariance functions of the GP posterior and are evaluated at $t_i$.
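The prior second moment and variance of Section 4.2 can be evaluated with a short quadrature routine. The sketch below is one possible arrangement, not the authors' implementation: it sums the hypergeometric series of Eq. (55) at each lag instead of integrating the power series of Eq. (56) term by term, and it uses the fact that for a stationary kernel the double integral of $g(|t_1 - t_2|)$ over $[0, T]^2$ equals $2\int_0^T (T - \tau)\, g(\tau)\, d\tau$. The function names, the truncation at 400 terms, and the squared exponential example with $B$ equal to the $3 \times 3$ identity are assumptions made for illustration.

```python
import math

def hyp2f1_series(a, b, c, z, n_terms=400):
    """Truncated series of Eq. (55); assumes |z| <= 1 and c > a + b."""
    term, total = 1.0, 1.0
    for n in range(n_terms):
        term *= (a + n) * (b + n) / ((c + n) * (n + 1)) * z
        total += term
    return total

def gamma_ratio(m):
    """Gamma(m + 1/2) / Gamma(m)."""
    return math.exp(math.lgamma(m + 0.5) - math.lgamma(m))

def mixed_moment(m, omega, rho):
    """E[W1 W2] for two Nakagami(m, omega) variables: Eq. (54), n = l = 1."""
    series = hyp2f1_series(-0.5, -0.5, m, rho)
    return (omega / m) * gamma_ratio(m) ** 2 * series

def se_rho(tau, ell):
    """Eq. (57) for a squared exponential kernel with length scale ell."""
    u = (tau / ell) ** 2
    r = (1.0 - u) * math.exp(-0.5 * u)  # normalised derivative covariance
    return r * r

def arc_length_moments(T, m, omega, rho_fn, n_steps=2000):
    """Mean (Eq. 51) and variance (Eqs. 53 to 56) of the prior arc length."""
    mean = T * gamma_ratio(m) * math.sqrt(omega / m)
    h = T / n_steps
    second = 0.0
    for i in range(n_steps):  # midpoint rule over the lag tau
        tau = (i + 0.5) * h
        second += 2.0 * (T - tau) * mixed_moment(m, omega, rho_fn(tau)) * h
    return mean, second - mean ** 2

# B = 3x3 identity and unit SE hyperparameters: Sigma_f' is the identity,
# so Eq. (52) gives m = 1.5 and Omega = 3.
mean_s, var_s = arc_length_moments(1.0, 1.5, 3.0, lambda tau: se_rho(tau, 1.0))
```

For a single output ($m = 1/2$) the integrand is exactly half-normal, and the mixed moment reduces to the known closed form $(2\Omega/\pi)\,(\sqrt{1 - r^2} + r \arcsin r)$ with $\rho = r^2$, which provides a convenient check of the series.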

5 SIMULATIONS

In this section we generate samples from our GP prior and compute the arc length, focusing on the vector case. We show the effect of the kernel choice and show the fidelity of our theoretical results. To generate our curves we specify a zero mean GP kernel, $\mathbf{K} = B \otimes k(t, t')$, with fixed $B$, and we use the Matérn kernel with $\nu = 3/2$, which we call the M32 kernel:

\[ k(t, t') = \sigma^2 \left( 1 + \frac{\sqrt{3}\,\|t - t'\|}{\ell} \right) \exp\!\left( -\frac{\sqrt{3}\,\|t - t'\|}{\ell} \right). \tag{60} \]

We draw a sample $\mathbf{f}_i = (x_i, y_i, z_i)$ evaluated at evenly spaced $t$. The arc length of the GP draw is then computed numerically.

Unit variance and length scale parameters are chosen and the arc length is computed over the interval $t \in [0, 1]$. Figure 4 shows the sample lengths, the theoretical mean and variance, and the Nakagami distribution of a single arc length integrand. Our theoretical results are close to the numerically generated values. The plot of the Nakagami distribution demonstrates the wide variance of an individual arc length integrand with respect to the overall variance. We see that the integration over the input domain has a sort of 'shrinking' effect on the overall variance when compared to the individual variance. No estimation methods are required to calculate the arc length statistics. Our approximated equations are closed form (for the mean) and a quadrature problem (for the variance).

Figure 4: Histogram of GP lengths. The theoretical and empirical mean are shown and the corresponding variance. The Nakagami distribution of the integrand is also shown. We can see the integral over integrands has the effect of shrinking the variance relative to a single integrand.

6 CONCLUSION

In this paper we derive the moments of the arc length of a vector valued GP. To the best of the authors' knowledge, this is the first treatment of the arc length in more than one dimension. The increment distribution was approximated via its moments by a Nakagami distribution, which provides a closed form for the mean, Eqn 51, and an expression for the second moment, Eqn 56, of the arc length. Importantly, we are also able to derive the first, Eqn 58, and second, Eqn 59, moment of the arc length of a GP posterior, conditioned on observations of the function.

The moments were shown to depend on the choice of kernel, the hyperparameters and the length of the interval. Numerical experiments confirmed the fidelity of our approximation to the arc length integrand and the arc length moments. We also provide a visual understanding of the distribution.

We see knowledge of the arc length as a valuable tool which will allow us to encode more information into our prior over kernel choices. The explicit relation between the arc length moments and the kernel hyperparameters allows us to use prior information to better initialize and constrain our models, in particular in cases where lengths correspond to interpretable quantities, such as a path trajectory.

Potential avenues of future research include analysis of the non-stationarity of curves, curve minimisation problems, and generating curves of a given length. Furthermore, we see potential application in Bayesian optimization, as a path planning tool and for constructing interpretable features from functional data.

Acknowledgements

AT and MO are grateful for the support of funding from the Korea Institute of Energy Technology Evaluation and Planning (KETEP).

The authors are grateful for initial conversations with Tom Gunter who highlighted the gap in the literature.

References

[1] M. A. Alvarez, L. Rosasco, and N. D. Lawrence, Kernels for Vector-Valued Functions: a Review, Now Publishers Inc, 2012.

[2] R. Barakat and E. Baumann, Mean and Variance of the Arc Length of a Gaussian Process on a Finite Interval, International Journal of Control, 12 (1970), pp. 377–383.

[3] S. Corrsin and O. M. Phillips, Contour Length and Surface Area of Multiple-Valued Random Variables, Journal of the Society for Industrial and Applied Mathematics, 9 (1961), pp. 395–404.

[4] M. P. do Carmo, Differential Geometry of Curves and Surfaces, Pearson, 1976.

[5] G. Fubini, Sugli Integrali Multipli, vol. 5, 1907.

[6] S. Hauberg, M. Schober, M. Liptrot, P. Hennig, and A. Feragen, A random Riemannian metric for probabilistic shortest-path tractography, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 9349 (2015), pp. 597–604.

[7] P. Hennig and S. Hauberg, Probabilistic Solutions to Differential Equations and their Application to Riemannian Statistics, 17th International Conference on Artificial Intelligence and Statistics (AISTATS) 2014, (2014).

[8] W. C. Hoffman, Statistical Methods in Radio Wave Propagation, 1958, pp. 3–36.

[9] N. L. Johnson, S. Kotz, and N. Balakrishnan, Continuous Univariate Distributions, vol. 1, Wiley, 2nd ed., 1994.

[10] N. D. Lawrence, Probabilistic non-linear principal component analysis with Gaussian process latent variable models, Journal of Machine Learning Research, 6 (2005), pp. 1783–1816.

[11] R. Marchant and F. Ramos, Bayesian optimisation for informative continuous path planning, in IEEE International Conference on Robotics and Automation (ICRA), 2014.

[12] A. M. Mathai and S. B. Provost, Quadratic Forms in Random Variables: Theory and Applications, Marcel Dekker Inc., 1992.

[13] D. Middleton, An Introduction to Statistical Communication Theory, 1960.

[14] I. Miller and J. E. Freund, Expected Arc Length of a Gaussian Process on a Finite Interval, Journal of the Royal Statistical Society. Series B (Methodology), 18 (1956), pp. 257–258.

[15] M. Moll and L. Kavraki, Path planning for minimal energy curves of constant length, IEEE International Conference on Robotics and Automation, 2004. Proceedings. ICRA '04. 2004, 3 (2004), pp. 2826–2831.

[16] V. P. Nosko, On the Distribution of the Arc Length of a High-Level Excursion of a Stationary Gaussian Process, Society for Industrial and Applied Mathematics, (1985), pp. 521–523.

[17] J. Reig, L. Rubio, and N. Cardona, Bivariate Nakagami-m distribution with arbitrary fading parameters, Electronics Letters, 38 (2002), pp. 1715–1717.

[18] J. Snoek, H. Larochelle, and R. Adams, Practical Bayesian Optimization of Machine Learning Algorithms, in Advances in Neural Information Processing Systems 25, 2012, pp. 2960–2968.

[19] A. Tosi, S. Hauberg, A. Vellido, and N. D. Lawrence, Metrics for Probabilistic Geometries, Uncertainty in Artificial Intelligence, (2014), p. 800.

[20] J. M. Wang, D. J. Fleet, and A. Hertzmann, Gaussian Process Dynamical Models, Neural Information Processing Systems (NIPS), 18 (2005), p. 3.

[21] T. D. Wickramarachchi, C. Gallagher, and R. Lund, Arc length asymptotics for multivariate time series, Applied Stochastic Models in Business and Industry, 31 (2015), pp. 264–281.

[22] C. K. I. Williams and C. E. Rasmussen, Gaussian Processes for Machine Learning, MIT Press, Cambridge, 2006.

Supplementary Material

Density of $Y = \sqrt{1 + X^2}$

Cumulative distribution of Y :

P (Y

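\[ E_{p_Y(y)}[y] = \int_1^{\infty} y \, p_Y(y) \, dy \tag{67} \]
\[ \phantom{E_{p_Y(y)}[y]} = \int_1^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma} \left[ \exp\!\left( -\frac{(\sqrt{y^2 - 1} + \mu)^2}{2\sigma^2} \right) + \exp\!\left( -\frac{(\sqrt{y^2 - 1} - \mu)^2}{2\sigma^2} \right) \right] y^2 (y^2 - 1)^{-1/2} \, dy. \tag{68} \]

At first glance this looks to be an intractable integral; however, with the change of variables $y = (x^2 + 1)^{1/2}$ and by expanding the exponential cross terms we arrive at:

\[ E[y] = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{\mu^2}{2\sigma^2} \right) \sum_{l=0}^{\infty} \frac{\Gamma\!\left(l + \tfrac{1}{2}\right)}{(2l)!} \left( \frac{\mu}{\sigma^2} \right)^{2l} U\!\left( l + \tfrac{1}{2}, l + 2, \frac{1}{2\sigma^2} \right). \tag{69} \]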
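The series in Eq. (69) can be checked numerically against a direct integration of $E[\sqrt{1 + X^2}]$. The sketch below is a standard-library illustration and rests on assumptions: the product $\Gamma(l + \tfrac{1}{2})\, U(l + \tfrac{1}{2}, l + 2, z)$ is evaluated from the integral representation $\Gamma(a)\, U(a, b, z) = \int_0^\infty e^{-zt}\, t^{a-1} (1 + t)^{b-a-1}\, dt$ with the substitution $t = x^2$, the function names are invented for this example, and the series is truncated at 40 terms. A library routine such as SciPy's `scipy.special.hyperu` would be the usual choice for $U$ in practice.

```python
import math

def gamma_times_u(l, sigma, upper_sigmas=12.0, n_steps=6000):
    """Gamma(l + 1/2) * U(l + 1/2, l + 2, 1 / (2 sigma^2)).

    After substituting t = x^2 the integral representation becomes
    2 * int_0^inf exp(-z x^2) x^(2l) sqrt(1 + x^2) dx, evaluated here
    with the trapezoid rule on [0, upper_sigmas * sigma].
    """
    z = 1.0 / (2.0 * sigma ** 2)
    h = upper_sigmas * sigma / n_steps
    total = 0.0
    for i in range(n_steps + 1):
        x = i * h
        weight = 0.5 if i in (0, n_steps) else 1.0
        total += (weight * math.exp(-z * x * x)
                  * x ** (2 * l) * math.sqrt(1.0 + x * x))
    return 2.0 * total * h

def mean_y_series(mu, sigma, n_terms=40):
    """Eq. (69), truncated after n_terms terms."""
    prefactor = (math.exp(-mu ** 2 / (2.0 * sigma ** 2))
                 / (math.sqrt(2.0 * math.pi) * sigma))
    total = 0.0
    for l in range(n_terms):
        coeff = (mu / sigma ** 2) ** (2 * l) / math.factorial(2 * l)
        total += coeff * gamma_times_u(l, sigma)
    return prefactor * total

def mean_y_direct(mu, sigma, n_steps=40_000):
    """E[sqrt(1 + X^2)] for X ~ N(mu, sigma^2), by the trapezoid rule."""
    lo, hi = mu - 12.0 * sigma, mu + 12.0 * sigma
    h = (hi - lo) / n_steps
    norm = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    total = 0.0
    for i in range(n_steps + 1):
        x = lo + i * h
        weight = 0.5 if i in (0, n_steps) else 1.0
        density = norm * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))
        total += weight * math.sqrt(1.0 + x * x) * density
    return total * h

# Illustrative parameter values, not taken from the paper.
series_value = mean_y_series(0.5, 1.2)
direct_value = mean_y_direct(0.5, 1.2)
```

For $\mu = 0$ only the $l = 0$ term survives, which recovers the zero-mean expression used in Section 3.2.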