arXiv:2010.09830v3 [math.ST] 24 Feb 2021

Remarks on multivariate Gaussian Process

Zexun Chen*1, Jun Fan2, and Kuo Wang3

1 College of Engineering, Mathematics and Physical Sciences, University of Exeter, EX4 4QF, UK
2 College of Science, United Arab Emirates University, P.O. Box 15551, UAE
3 College of Mathematics, Physics and Information Engineering, Jiaxing University, 314033, China

* Corresponding author: Zexun Chen, Email: z.chen3@exeter.ac.uk

Abstract

Gaussian processes occupy one of the leading places in modern statistics and probability theory due to their importance and a wealth of strong results. The common use of Gaussian processes is in connection with problems related to estimation, detection, and many statistical or machine learning models. With the fast development of Gaussian process applications, it is necessary to consolidate the fundamentals of vector-valued stochastic processes, in particular multivariate Gaussian processes, which is the essential theory for many applied problems with multiple correlated responses. In this paper, we propose a precise definition of multivariate Gaussian processes based on Gaussian measures on vector-valued function spaces, and provide an existence proof. In addition, several fundamental properties of multivariate Gaussian processes, such as strict stationarity and independence, are introduced. We further derive multivariate Brownian motion including Itô lemma as a special case of a multivariate Gaussian process, and present a brief introduction to multivariate Gaussian process regression as a useful statistical learning method for multi-output prediction problems.

Keywords: Gaussian measure, Gaussian process, multivariate Gaussian process, multivariate Gaussian distribution, matrix-variate Gaussian distribution, pre-Brownian motion

1 Introduction

In the theory of stochastic processes, Gaussian processes play an essential role in the construction of some general results, e.g., Brownian motion, as they both arise naturally from the requirement of independent increments. Furthermore, an understanding of the fundamentals of Gaussian processes also gives a better understanding of many results in stochastic analysis. These factors, together with the simplicity and wealth of important results in the field, have led Gaussian processes to be considered one of the outstanding sub-fields of modern probability theory.

Nowadays Gaussian processes (GP) are also often considered in the context of supervised machine learning, as a useful statistical learning method that uses lazy learning and a measure of the similarity between points (the kernel function) to predict the value for an unseen point from training data. Rather than inferring a distribution over the parameters of a parametric function, a GP can be used to infer a distribution over functions directly: a GP defines a prior over functions and, given some observed function values, it can achieve a posterior over functions, which is then used as a non-parametric model in order to predict the value for an unseen point from training data. GP has been proven to be an effective method for nonlinear problems due to many desirable properties, such as a clear structure with Bayesian interpretation, a simple integrated approach of obtaining and expressing uncertainty in predictions, and the capability of capturing a wide variety of data features by hyper-parameters [13, 2]. Since Neal [11] revealed that many Bayesian neural networks converge to Gaussian processes in the limit of an infinite number of hidden units [16], GP has been widely used as an alternative to neural networks in order to solve complicated regression and classification problems in many areas, e.g., Bayesian optimisation [5], time series forecasting [3, 10], feature selection [14], and so on.

With the development of Gaussian processes related to machine learning algorithms, the application of Gaussian processes has faced a conspicuous limitation. The classical GP model can only be used to deal with a single output or single response problem because the process itself is defined on R, and as a result the correlation between multiple tasks or responses cannot be taken into consideration [2, 15]. In order to overcome the drawback above, many advanced Gaussian process models have been proposed, including dependent Gaussian processes [2], Gaussian process regression with multiple response variables [15], and Gaussian process regression for vector-valued functions [1]. The general idea of these methods is to vectorise the multi-response variables and construct a "big" covariance, which describes the correlations between the inputs as well as between the outputs. Intrinsically, these approaches depend on the fact that matrix-variate Gaussian distributions can be reformulated as multivariate Gaussian distributions, and they are still conventional Gaussian process regression models since the reformulation merely vectorises the multi-response variables, which are assumed to follow a developed case of GP with a reproduced kernel [6].

In another development, Chen et al. [4] defined multivariate Gaussian processes (MV-GP) and proposed a unified framework to perform multi-output prediction using Gaussian processes. This framework does not rely on the equivalence between the vectorised matrix-variate Gaussian distribution and the multivariate Gaussian distribution, and it can be easily used to produce a general elliptical process model, for example, the multivariate Student-t process (MV-TP) for multi-output prediction. Both multivariate Gaussian process regression (MV-GPR) and multivariate Student-t process regression (MV-TPR) have closed-form expressions for the marginal likelihoods and predictive distributions under this unified framework and thus can adopt the same optimization approaches as used in the conventional GP regression. Although Chen et al. [4] showed the usefulness of the proposed methods via data-driven examples, some theoretical issues of multivariate Gaussian processes are still not clear, e.g., the existence of MV-GP. When it comes to the theoretical fundamentals of stochastic processes, a close look at measure theory is indispensable.
Briefly speaking, (multivariate) Gaussian distributions are Gaussian measures on R^n, and Gaussian processes are Gaussian measures on the function space (R^T, F) (for details refer to Definition 2.3, Definition 2.5, and Theorem 2.2 below). Based on the relationship between Gaussian measures and Gaussian processes, we properly define multivariate Gaussian processes by extending Gaussian measures on function spaces to vector-valued function spaces.

The paper is organised as follows. Section 2 introduces some preliminaries of Gaussian processes, including some useful properties and the proof of existence. Section 3 presents some theoretical definitions of the multivariate Gaussian process with the proof of existence. The examples and an application of multivariate Gaussian processes, which show their usefulness, are presented in Section 4 and Section 5. Conclusions and a discussion are given in Section 6.

2 Preliminary of Gaussian process

2.1 Stochastic process

A stochastic (or random) process is defined by a collection of random variables defined on a common probability space (Ω, F, P), where Ω is a sample space, F is a σ-algebra and P is a probability measure; the random variables, indexed by some set T, all take values in the same mathematical space S, which must be measurable with respect to some σ-algebra Σ [8]. In other words, for a given probability space (Ω, F, P) and a measurable space (S, Σ), a stochastic process is a collection of S-valued random variables, which can be written as {f(t) : t ∈ T}.

A stochastic process can also be interpreted or defined as an S^T-valued random variable, where S^T is the space of all the possible S-valued functions of t ∈ T that map from the set T into the space S [7]. The set T is usually one of these:

R,  R^n,  R_+ = [0, +∞),  Z = {· · · , −1, 0, 1, · · · },  Z_+ = (0, 1, · · · ).

If T = Z or Z_+, we always call it a random sequence. If T = R^n with n > 1, the process is often considered as a random field. The set S is called the state space and is usually formulated as one of these:

R,  {0, 1},  Z_+,  D = {A, B, C, · · · }.

Indeed, the random variables of the process are not required to take values in one of the above sets, but they must take values in the same measurable space S. For example, if S = R^d or D^d with d > 1, the process is called a vector-valued process.

2.2 Gaussian measure and distribution

Definition 2.1 (Gaussian measure on R). Let B(R) denote the completion of the Borel σ-algebra on R, and let λ : B(R) → [0, +∞] denote the usual Lebesgue measure. Then the Borel probability measure γ : B(R) → [0, 1] is Gaussian with mean µ ∈ R and variance σ² > 0 if

γ(A) = ∫_A (1/√(2πσ²)) exp(−(x − µ)²/(2σ²)) dλ(x)

for any measurable set A ∈ B(R).

A random variable X on a probability space (Ω, F, P) is Gaussian with mean µ and variance σ² if its distribution measure is Gaussian, i.e.

P(X ∈ A) = γ(A).

As we know, from the view of random variables, we have the following.
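To make Definition 2.1 concrete, here is a minimal numerical sketch (ours, not from the paper; the mean, variance and interval are arbitrary) checking that the measure of an interval A = (a, b], obtained by integrating the density against Lebesgue measure, agrees with the usual normal CDF.

```python
# Numerical check of Definition 2.1 (illustrative values only).
import numpy as np
from scipy import integrate, stats

mu, sigma = 1.0, 2.0          # assumed mean and standard deviation
a, b = -0.5, 3.0              # the measurable set A = (a, b]

density = lambda x: np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
gamma_quad, _ = integrate.quad(density, a, b)                      # integral of the density over A
gamma_cdf = stats.norm.cdf(b, mu, sigma) - stats.norm.cdf(a, mu, sigma)

print(gamma_quad, gamma_cdf)  # both approximately 0.615
```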

Definition 2.2. An n-dimensional random vector X = (X1, · · · , Xn) is Gaussian if and only if ⟨a, X⟩ := a^T X = ∑_i a_i X_i is a Gaussian random variable for all a = (a1, · · · , an) ∈ R^n.

In terms of measure, we have the following.

Definition 2.3 (Gaussian measure on R^n). Let γ be a Borel probability measure on R^n. For each a ∈ R^n, denote by Y a random variable defined as the mapping x ↦ ⟨a, x⟩ ∈ R on the probability space (R^n, B(R^n), γ). The Borel probability measure γ is a Gaussian measure on R^n if and only if the random variable Y is Gaussian for each a.

A matrix Gaussian distribution in statistics is a probability distribution obtained by generalising the multivariate Gaussian distribution to matrix-valued random variables, and it can be defined through the multivariate Gaussian distribution.

Definition 2.4 (Matrix Gaussian distribution). An n × d random matrix X is said to be Gaussian [6]:

X ∼ MN_{n,d}(M, U, V)

if and only if

vec(X) ∼ N_{nd}(vec(M), V ⊗ U),

where ⊗ denotes the Kronecker product and vec(X) denotes the vectorisation of X.
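The equivalence in Definition 2.4 also gives a direct way to simulate matrix Gaussian variables. The following numpy sketch (our illustration, not part of the paper; sizes and covariances are arbitrary) draws X ∼ MN_{n,d}(M, U, V) by sampling vec(X) with the Kronecker covariance and un-vectorising.

```python
# Sample a matrix Gaussian via its vectorised multivariate Gaussian form (illustrative).
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 2
M = np.zeros((n, d))
U = np.eye(n) + 0.3 * np.ones((n, n))      # n x n column covariance (between rows of X)
V = np.array([[1.0, 0.6], [0.6, 2.0]])     # d x d row covariance (between columns of X)

cov_vec = np.kron(V, U)                    # covariance of vec(X) under column stacking
x_vec = rng.multivariate_normal(M.flatten(order="F"), cov_vec)
X = x_vec.reshape((n, d), order="F")       # un-vectorise back to an n x d matrix
print(X)
```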

Theorem 2.1 (Marginalization and conditional distribution [4, 6]). Let

X ∼ MN_{n,d}(M, Σ, Λ)

and partition X, M, Σ and Λ as

X = [X1r; X2r] = [X1c, X2c],   M = [M1r; M2r] = [M1c, M2c],

where X1r and M1r have n1 rows, X2r and M2r have n2 rows, X1c and M1c have d1 columns, X2c and M2c have d2 columns, and

Σ = [Σ11, Σ12; Σ21, Σ22],   Λ = [Λ11, Λ12; Λ21, Λ22],

where Σ11 is n1 × n1, Σ22 is n2 × n2, Λ11 is d1 × d1 and Λ22 is d2 × d2; here n1, n2, d1 and d2 are the row or column lengths of the corresponding blocks. Then,

1. X1r ∼ MN_{n1,d}(M1r, Σ11, Λ),
   X2r | X1r ∼ MN_{n2,d}(M2r + Σ21 Σ11^{-1}(X1r − M1r), Σ22·1, Λ);

2. X1c ∼ MN_{n,d1}(M1c, Σ, Λ11),
   X2c | X1c ∼ MN_{n,d2}(M2c + (X1c − M1c) Λ11^{-1} Λ12, Σ, Λ22·1);

where Σ22·1 and Λ22·1 are the Schur complements [19] of Σ11 and Λ11, respectively,

Σ22·1 = Σ22 − Σ21 Σ11^{-1} Σ12,   Λ22·1 = Λ22 − Λ21 Λ11^{-1} Λ12.
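For intuition on item 1 of Theorem 2.1, the sketch below (ours, not from the paper; block sizes and values are arbitrary) computes the conditional mean and the Schur-complement column covariance of X2r given X1r.

```python
# Conditional row-block of a matrix Gaussian, following Theorem 2.1 (illustrative).
import numpy as np

n1, n2, d = 2, 3, 2
n = n1 + n2
M = np.zeros((n, d))
Sigma = 0.5 * np.eye(n) + 0.5 * np.ones((n, n))   # n x n column covariance
Lam = np.array([[1.0, 0.4], [0.4, 1.0]])          # d x d row covariance

S11, S12 = Sigma[:n1, :n1], Sigma[:n1, n1:]
S21, S22 = Sigma[n1:, :n1], Sigma[n1:, n1:]
X1r = np.array([[0.3, -0.1], [1.2, 0.5]])          # an observed n1 x d block (assumed values)

cond_mean = M[n1:] + S21 @ np.linalg.solve(S11, X1r - M[:n1])   # M2r + S21 S11^{-1}(X1r - M1r)
cond_col_cov = S22 - S21 @ np.linalg.solve(S11, S12)            # Schur complement S22.1
# The conditional row covariance stays equal to Lam.
print(cond_mean, cond_col_cov, sep="\n")
```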

2.3 Gaussian process

Consider the space R^T of all R-valued functions on T. A subset of the form {f : f(ti) ∈ Ai, 1 ≤ i ≤ n} for some n ≥ 1, ti ∈ T and some Borel sets Ai ⊆ R is called a cylinder set. Let F be the σ-algebra generated by all cylinder sets.

Also, we may consider the product topology on R^T, defined as the smallest topology that makes the projection maps Π_{t1,...,tn}(f) = [f(t1), ..., f(tn)] from R^T to R^n measurable, and define F as the Borel σ-algebra of this topology. We can obtain:

Definition 2.5 (Gaussian measure on (R^T, F)). A measure γ on (R^T, F) is called a Gaussian measure if for any n ≥ 1 and t1, · · · , tn ∈ T, the push-forward measure γ ∘ Π_{t1,··· ,tn}^{-1} on R^n is a Gaussian measure.

Theorem 2.2 (Relationship between Gaussian process and Gaussian measure). If X = (Xt)_{t∈T} is a Gaussian process, then the push-forward measure γ = P ∘ X^{-1} with X : Ω → R^T is Gaussian on R^T, namely, γ is a Gaussian measure on (R^T, F). Conversely, if γ is a Gaussian measure on (R^T, F), then on the probability space (R^T, F, γ), the co-ordinate random variable Π = (Πt)_{t∈T} is from a Gaussian process.

The proof of the relationship between Gaussian process and Gaussian measure can be found in [12].

Theorem 2.3 (Existence of Gaussian process). For any index set T, any mean function µ : T → R and any covariance function (function with covariance form) k : T × T → R, there exists a probability space (Ω, F, P) and a Gaussian process GP(µ, k) on this space, whose mean function is µ and covariance function is k. It is denoted as X ∼ GP(µ, k).

Proof. Thanks to Theorem 2.2, we just need to prove the existence of a Gaussian measure with the specific mean vector generated by the mean function and the specific covariance matrix generated by the covariance function. Given n > 1, for every t1, · · · , tn ∈ T, a Gaussian measure γ_{t1,··· ,tn} on R^n satisfies the assumptions of the Daniell-Kolmogorov theorem, because the projection of the Gaussian distribution on R^n with n-dimensional mean vector [µ(t1), · · · , µ(tn)] ∈ R^n and n × n covariance matrix K = (k_{i,j}) ∈ R^{n×n} onto the first n − 1 co-ordinates is precisely a Gaussian distribution with (n−1)-dimensional mean vector [µ(t1), · · · , µ(t_{n−1})] ∈ R^{n−1} and (n−1) × (n−1) covariance matrix K = (k_{i,j}) ∈ R^{(n−1)×(n−1)}. By the Daniell-Kolmogorov theorem, there exists a probability space (Ω, F, P) as well as a Gaussian process X = (Xt)_{t∈T} ∼ GP(µ, k) defined on this space such that any finite dimensional distribution of [X_{t1}, · · · , X_{tn}] is given by the measure γ_{t1,··· ,tn}.

3 Multivariate Gaussian process

Following the classical theory of Gaussian measures and Gaussian processes, we can introduce the Gaussian measure on R^{n×d} and the Gaussian measure on ((R^d)^T, G), and finally define the multivariate Gaussian process.

According to Definition 2.3 and Definition 2.4, we can give a definition of a Gaussian measure on R^{n×d}.

Definition 3.1 (Gaussian measure on R^{n×d}). Let γ be a Borel probability measure on R^{n×d}. For each a ∈ R^{nd}, denote by Y a random variable defined as the mapping x ↦ ⟨a, vec(x)⟩ ∈ R on the probability space (R^{n×d}, B(R^{n×d}), γ). The Borel probability measure γ is a Gaussian measure on R^{n×d} if and only if the random variable Y is Gaussian for each a.

Similarly to the construction for the Gaussian process, we now consider the space (R^d)^T of all R^d-valued functions on T. Let G be the σ-algebra generated by all cylinder sets, where each cylinder set is defined as a subset of the form {f : f(ti) ∈ Bi, 1 ≤ i ≤ n} for some n ≥ 1, ti ∈ T and some Borel sets Bi ⊆ R^d. Also, we can define the smallest topology on (R^d)^T that makes the projection mappings Ξ_{t1,...,tn}(f) = [f(t1), ..., f(tn)] from (R^d)^T to R^{n×d} measurable, and define G as the Borel σ-algebra of this topology. Thus we can give a definition of a Gaussian measure on ((R^d)^T, G).

Definition 3.2 (Gaussian measure on ((R^d)^T, G)). A measure γ on ((R^d)^T, G) is called a Gaussian measure if for any n ≥ 1 and t1, · · · , tn ∈ T, the push-forward measure γ ∘ Ξ_{t1,··· ,tn}^{-1} on R^{n×d} is a Gaussian measure.

Given the relationship between Gaussian process and Gaussian measure in Theorem 2.2, we can now define the multivariate Gaussian process (MV-GP).

Definition 3.3 (d-variate Gaussian process). Given a Gaussian measure γ on ((R^d)^T, G), d ≥ 1, the co-ordinate random vector Ξ = (Ξt)_{t∈T} on the probability space ((R^d)^T, G, γ) is said to be from a d-variate Gaussian process.

Theorem 3.1 (Existence of d-variate Gaussian process). For any index set T, any vector-valued mean function u : T → R^d, any covariance function k : T × T → R and any positive semi-definite parameter matrix Λ ∈ R^{d×d}, there exists a probability space (Ω, G, P) and a d-variate Gaussian process f(x) on this space, whose mean function is u, covariance function is k and parameter matrix is Λ, such that,

• E[f(t)] = u(t), ∀ t ∈ T,

• E[(f(ts) − u(ts))(f(tl) − u(tl))^T] = tr(Λ) k(ts, tl), ∀ ts, tl ∈ T,

• E[(F_{t1,··· ,tn} − M_{t1,··· ,tn})^T (F_{t1,··· ,tn} − M_{t1,··· ,tn})] = tr(K_{t1,··· ,tn}) Λ, ∀ n ≥ 1, t1, · · · , tn ∈ T,

where

M_{t1,··· ,tn} = [u(t1)^T, · · · , u(tn)^T]^T,
F_{t1,··· ,tn} = [f(t1)^T, · · · , f(tn)^T]^T,
K_{t1,··· ,tn} = [k(t1, t1), · · · , k(t1, tn); ⋮ , ⋱ , ⋮ ; k(tn, t1), · · · , k(tn, tn)].

It is denoted f ∼ MGP_d(u, k, Λ).

Proof. Given n > 1, for every t1, · · · , tn ∈ T, a Gaussian measure γ_{t1,··· ,tn} on R^{n×d} satisfies the assumptions of the Daniell-Kolmogorov theorem, because the projection of a matrix Gaussian distribution on R^{n×d} with mean [u(t1)^T, · · · , u(tn)^T]^T ∈ R^{n×d}, n × n column covariance matrix K = (k_{i,j}) ∈ R^{n×n} and d × d row covariance matrix Λ ∈ R^{d×d}, onto the first n − 1 co-ordinates, is precisely the matrix Gaussian distribution with mean [u(t1)^T, · · · , u(t_{n−1})^T]^T ∈ R^{(n−1)×d}, (n−1) × (n−1) column covariance matrix K = (k_{i,j}) ∈ R^{(n−1)×(n−1)} and row covariance matrix Λ ∈ R^{d×d}. This is due to the marginalization property of the matrix Gaussian distribution shown in Theorem 2.1. By the Daniell-Kolmogorov theorem, there exists a probability space (Ω, G, P) as well as a d-variate Gaussian process X = (Xt)_{t∈T} ∼ MGP_d(u, k, Λ) defined on this space such that any finite dimensional distribution of [X_{t1}, · · · , X_{tn}] is given by the measure γ_{t1,··· ,tn}.

Following the existence of the d-variate Gaussian process, we can also establish some properties as follows.

Proposition 3.1 (Strictly stationary). A d-variate Gaussian process MGP_d(u, k, Λ) is strictly stationary if

u(t) = u(t + h),   k(ts + h, tl + h) = k(ts, tl),   ∀ t, ts, tl, h ∈ T.

Proof. Assume f ∼ MGP_d(u, k, Λ). Then for ∀ n ≥ 1, t1, · · · , tn ∈ T,

[f(t1)^T, ..., f(tn)^T]^T ∼ MN(u_{t1,··· ,tn}, K_{t1,··· ,tn}, Λ),

where

u_{t1,··· ,tn} = [u(t1); ⋮ ; u(tn)],   K_{t1,··· ,tn} = [k(t1, t1), · · · , k(t1, tn); ⋮ , ⋱ , ⋮ ; k(tn, t1), · · · , k(tn, tn)].

Given any time increment h ∈ T, there also exists

[f(t1 + h)^T, ..., f(tn + h)^T]^T ∼ MN(u_{t1+h,··· ,tn+h}, K_{t1+h,··· ,tn+h}, Λ),

where

u_{t1+h,··· ,tn+h} = [u(t1 + h); ⋮ ; u(tn + h)],   K_{t1+h,··· ,tn+h} = [k(t1 + h, t1 + h), · · · , k(t1 + h, tn + h); ⋮ , ⋱ , ⋮ ; k(tn + h, t1 + h), · · · , k(tn + h, tn + h)].

Since u(t) = u(t + h) and k(ts + h, tl + h) = k(ts, tl) for all t, ts, tl, h ∈ T,

u_{t1+h,··· ,tn+h} = [u(t1 + h); ⋮ ; u(tn + h)] = [u(t1); ⋮ ; u(tn)] = u_{t1,··· ,tn},

K_{t1+h,··· ,tn+h} = [k(t1 + h, t1 + h), · · · , k(t1 + h, tn + h); ⋮ , ⋱ , ⋮ ; k(tn + h, t1 + h), · · · , k(tn + h, tn + h)] = [k(t1, t1), · · · , k(t1, tn); ⋮ , ⋱ , ⋮ ; k(tn, t1), · · · , k(tn, tn)] = K_{t1,··· ,tn}.

Therefore, [f(t1 + h)^T, ..., f(tn + h)^T]^T has the same distribution as [f(t1)^T, ..., f(tn)^T]^T. Due to the arbitrary choice of n ≥ 1 and t1, · · · , tn ∈ T, f ∼ MGP_d(u, k, Λ) is a strictly stationary process.

Proposition 3.2 (Independence). A collection of d functions {fi}_{i=1,2,··· ,d} identically and independently follows a Gaussian process GP(µ, k) if and only if

f = [f1, f2, · · · , fd] ∼ MGP_d(u, k, Λ),

where u = [µ, · · · , µ] ∈ R^d and Λ is any diagonal positive semi-definite matrix.

Proof. Necessity: if f ∼ MGP_d(u, k, Λ), then for ∀ n ≥ 1, t1, · · · , tn ∈ T,

[f(t1)^T, ..., f(tn)^T]^T ∼ MN(u_{t1,··· ,tn}, K_{t1,··· ,tn}, Λ),

where

u_{t1,··· ,tn} = [u(t1); ⋮ ; u(tn)],   K_{t1,··· ,tn} = [k(t1, t1), · · · , k(t1, tn); ⋮ , ⋱ , ⋮ ; k(tn, t1), · · · , k(tn, tn)].

Rewriting the left-hand side column by column, we obtain

[ξ1, ξ2, · · · , ξd] ∼ MN(u_{t1,··· ,tn}, K_{t1,··· ,tn}, Λ),

where ξi = [fi(t1), fi(t2), · · · , fi(tn)]^T. Since Λ is a diagonal matrix, for any i ≠ j,

E[ξi^T ξj] = tr(K_{t1,··· ,tn}) Λij = tr(K_{t1,··· ,tn}) · 0 = 0.

Because ξi and ξj are any finite collections of realisations of fi and fj respectively from the same Gaussian process, fi and fj are uncorrelated. Since joint finite realisations of fi and fj follow a Gaussian distribution, non-correlation implies independence.

Sufficiency: if {fi}_{i=1,2,··· ,d} ∼ GP(µ, k) are independent, then for ∀ n ≥ 1, t1, · · · , tn ∈ T and for any i ≠ j,

0 = E[ξi^T ξj] = tr(K_{t1,··· ,tn}) Λij.

Since tr(K_{t1,··· ,tn}) is non-zero, Λij must be 0. Due to the arbitrary choices of i and j, Λ must be diagonal. That is to say, [ξ1, ξ2, · · · , ξd] can be written as a matrix Gaussian distribution MN(u_{t1,··· ,tn}, K_{t1,··· ,tn}, Λ) where Λ is a diagonal positive semi-definite matrix. Since the above result holds for ∀ n ≥ 1, t1, · · · , tn ∈ T, {fi}_{i=1,2,··· ,d} can be considered as identically and independently following the Gaussian process GP(µ, k).

4 Example: special cases

Intuitively, a special case is the centred multivariate Gaussian process, where the vector-valued mean function u = 0. Fifty realisation samples generated from a centred multivariate Gaussian process are demonstrated in Figure 1 (left). Furthermore, we can derive the multivariate Gaussian white noise and the multivariate Brownian motion.
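The centred case of Figure 1 (left) can be reproduced in a few lines; the sketch below is our illustration based on the matrix Gaussian construction in Theorem 3.1, using the kernel stated in the figure caption and an assumed parameter matrix Λ (the paper does not state the Λ used for the figure).

```python
# Draw a realisation of a centred 2-variate Gaussian process on a grid (illustrative sketch).
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0.0, 5.0, 100)                                        # sampled index set
K = 1.5 * np.exp(-(t[:, None] - t[None, :]) ** 2 / (2 * 0.5 ** 2))    # column covariance from the caption's kernel
Lam = np.array([[1.0, 0.7], [0.7, 1.0]])                              # assumed 2 x 2 parameter matrix

jitter = 1e-8 * np.eye(len(t))
Lk = np.linalg.cholesky(K + jitter)        # Cholesky factor of K
Ll = np.linalg.cholesky(Lam)               # Cholesky factor of Lam
Z = rng.standard_normal((len(t), 2))
F = Lk @ Z @ Ll.T                          # F ~ MN(0, K, Lam): one n x 2 sample path
print(F.shape)                             # (100, 2); repeat the draw for the 50 realisations shown in Figure 1
```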

4.1 Multivariate Gaussian white noise

Proposition 4.1 (d-variate Gaussian white noise). A d-variate Gaussian process MGP_d(u, k, Λ) is said to be d-variate Gaussian white noise if u = 0 and k(ts, tl) = σ² δ(ts − tl), where δ is the Dirac delta function and ts, tl ∈ T.

Proof. Let f = [f1, · · · , fd] ∼ MGP_d(u, k, Λ). Then E[f(t)] = u(t) = 0, ∀ t ∈ T, and

E[(f(ts) − u(ts))(f(tl) − u(tl))^T] = tr(Λ) k(ts, tl) = { 0 if ts ≠ tl;  σ² tr(Λ) if ts = tl },   ∀ ts, tl ∈ T.

Furthermore, for ∀ n ≥ 1, t1, · · · , tn ∈ T, there exists

E[F_{t1,··· ,tn}^T F_{t1,··· ,tn}] = tr(K_{t1,··· ,tn}) Λ = tr(σ² I_{n×n}) Λ = nσ²Λ,

where F_{t1,··· ,tn} = [f(t1)^T, · · · , f(tn)^T]^T.

7 5 5

4 4

3 3

2 2 Input x Input x

1 1 10 10 0 0 5 5 -10 -10 -5 0 -5 0 0 0 -5 -5 5 5 Output y Output y Output y 10 -10 2 Output y 10 -10 2 1 1

Figure 1: The 50 random realisation sample points generated from a 2-variate Gaussian process. Left: centred 2-variate Gaussian process with Gaussian covariance function k(ts, tl) = 1.5 exp(−(ts − tl)²/(2 · 0.5²)). Right: 2-variate Gaussian white noise as a 2-variate Gaussian process with covariance function k(ts, tl) = 1.5 δ(ts − tl).

Remark 4.1. We observe in the proof that d-variate Gaussian white noise has the independence property of white noise along the index set T, but it has correlation along the d-variate dimension. Therefore, d-variate Gaussian white noise is also called variate-dependent Gaussian white noise or variate-correlated Gaussian white noise, which is distinct from the traditional d-dimensional independent Gaussian white noise. Fifty realisation samples generated from multivariate Gaussian white noise are shown in Figure 1 (right).

4.2 Multivariate Brownian motion

According to Chapter 2 of the book by Le Gall [9], Brownian motion can be defined through a Gaussian white noise whose intensity is the Lebesgue measure. Since Brownian motion is a special case of a Gaussian process with continuous sample paths, mean function u = 0 and covariance function k(s, t) = min(s, t), we propose an example, the d-variate Brownian motion, as a special case of the d-variate Gaussian process with vector-valued mean function u = 0, covariance function k(s, t) = min(s, t) and parameter matrix Λ. Based on Theorem 3.1, we extend some properties of the traditional Brownian motion to the more general vector-valued case.

Definition 4.1 (d-variate Brownian motion). A d-variate Gaussian process MGP_d(u, k, Λ) is said to be d-variate Brownian motion if all sample paths are continuous, u = 0 and k(ts, tl) = min(ts, tl).

Let Bt be a d-variate Brownian motion, so that for all 0 ≤ t1 ≤ · · · ≤ tn the random variable Z = (B_{t1}^T, ..., B_{tn}^T)^T ∈ R^{n×d} has a normal distribution on the probability space (Ω, G, P) mentioned before in Theorem 3.1. There exist a matrix M ∈ R^{n×d} and two non-negative definite matrices C = [c_{jm}] ∈ R^{n×n} and Λ = [λ_{ab}] ∈ R^{d×d} such that

E[exp(i ∑_{j=1}^n W_{j,·}^T Z_{j,·})] = exp(−(1/2) ∑_{j,m} W_{j,·}^T c_{jm} W_{m,·} + i ∑_j W_{j,·}^T M_{j,·}),

E[exp(i ∑_{a=1}^d W_{·,a}^T Z_{·,a})] = exp(−(1/2) ∑_{a,b} W_{·,a}^T λ_{ab} W_{·,b} + i ∑_a W_{·,a}^T M_{·,a}),

where W = [w_{ja}] ∈ R^{n×d} and i is the imaginary unit. Moreover, we also have the mean value M = E[Z]

and two covariance matrices

c_{jm} = E[(Z_{j,·} − M_{j,·})(Z_{m,·} − M_{m,·})^T],
λ_{ab} = E[(Z_{·,a} − M_{·,a})^T (Z_{·,b} − M_{·,b})].

Assume that the mean matrix M here is a zero matrix, i.e. E[Z] = E[Z | t = 0] = 0, that the diagonal entries of Λ equal 1 (so that tr(Λ) = d), and that

C = [t1, t1, · · · , t1; t1, t2, · · · , t2; ⋮ , ⋮ , ⋱ , ⋮ ; t1, t2, · · · , tn].

Hence, E[Bt] = 0 for all t ≥ 0 and

E[(Bt)(Bt)^T] = dt,   E[(Bt)(Bs)^T] = d min(s, t),
E[(Bt)^T(Bt)] = tΛ,   E[(Bt)^T(Bs)] = min(s, t)Λ.

Moreover, we have

E[(Bt − Bs)(Bt − Bs)^T] = E[Bt Bt^T − 2 Bs Bt^T + Bs Bs^T] = d|t − s|,
E[(Bt − Bs)^T(Bt − Bs)] = E[Bt^T Bt − 2 Bs^T Bt + Bs^T Bs] = |t − s| Λ.

Note that this d-variate Brownian motion Bt still has independent increments, since E[(B_{ti} − B_{ti−1})(B_{tj} − B_{tj−1})^T] = 0 and E[(B_{ti} − B_{ti−1})^T(B_{tj} − B_{tj−1})] = 0 when ti < tj holds for all 0 < t1 < · · · < tn.

Remark 4.2. Similar to d-variate Gaussian white noise, d-variate Brownian motion also has the independence property along the index set T, but it has correlation along the d-variate dimension. Therefore, d-variate Brownian motion is also called variate-dependent Brownian motion or variate-correlated Brownian motion, which is distinct from the "traditional" d-dimensional Brownian motion. Actually, the "traditional" d-dimensional Brownian motion is a special case of d-variate Brownian motion with diagonal matrix Λ.
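For intuition, the following sketch (ours, not from the paper; Λ and the discretisation are assumed) simulates a 2-variate Brownian motion from Gaussian increments with row covariance dt·Λ and checks the property E[(Bt)^T(Bt)] = tΛ by Monte Carlo.

```python
# Simulate variate-correlated Brownian motion and verify E[B_t^T B_t] ≈ t * Lam (illustrative).
import numpy as np

rng = np.random.default_rng(2)
d, n_steps, n_paths, dt = 2, 200, 5000, 0.01
Lam = np.array([[1.0, 0.8], [0.8, 1.0]])        # assumed d x d parameter matrix (unit diagonal)
Ll = np.linalg.cholesky(Lam)

# Increments: for each path and step, a 1 x d Gaussian row with covariance dt * Lam.
dB = np.sqrt(dt) * rng.standard_normal((n_paths, n_steps, d)) @ Ll.T
B = dB.cumsum(axis=1)                            # B_t accumulated along the time axis

t = n_steps * dt
BT = B[:, -1, :]                                 # terminal values B_t, shape (n_paths, d)
emp = BT.T @ BT / n_paths                        # Monte Carlo estimate of E[B_t^T B_t]
print(emp)                                       # approximately t * Lam = 2.0 * Lam
```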

Since the d-variate Brownian motion is a Brownian motion, we then introduce the Itô lemma for the d-variate Brownian motion. Let Bt = [B1(t), · · · , Bd(t)] be the d-variate Brownian motion derived in Section 4.2. Then we have the following lemma.

Lemma 4.1 (Itô lemma for the d-variate Brownian motion). Let F be a twice continuously differentiable real function on R^{d+1} and let Λ = [λ_{i,j}] ∈ R^{d×d} be the covariance matrix for the d-variate dimension. Then,

F(t, B1(t), · · · , Bd(t)) = F(0, B1(0), · · · , Bd(0)) + ∑_{i=1}^d ∫_0^t (∂F/∂Bi)(s, B1(s), · · · , Bd(s)) dBi(s)
  + ∫_0^t { (∂F/∂s)(s, B1(s), · · · , Bd(s)) + (1/2) ∑_{i,j=1}^d (∂²F/∂Bi∂Bj)(s, B1(s), · · · , Bd(s)) λ_{i,j} } ds.

Proof. By the Itô lemma and the definition of the d-variate Brownian motion, we obtain

F(t, B1(t), · · · , Bd(t)) = F(0, B1(0), · · · , Bd(0)) + ∫_0^t (∂F/∂s)(s, B1(s), · · · , Bd(s)) ds
  + ∑_{i=1}^d ∫_0^t (∂F/∂Bi)(s, B1(s), · · · , Bd(s)) dBi(s)
  + (1/2) ∑_{i,j=1}^d ∫_0^t (∂²F/∂Bi∂Bj)(s, B1(s), · · · , Bd(s)) d⟨Bi, Bj⟩(s).

The proof is complete since d⟨Bi, Bj⟩(s) = λ_{i,j} ds.
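As a quick illustration of how Λ enters (our example, not from the paper), take d = 2 and F(t, B1, B2) = B1 B2. Then ∂F/∂B1 = B2, ∂F/∂B2 = B1, ∂²F/∂B1∂B2 = ∂²F/∂B2∂B1 = 1, and all other derivatives vanish, so Lemma 4.1 gives

d(B1(t) B2(t)) = B2(t) dB1(t) + B1(t) dB2(t) + λ_{1,2} dt,

whereas for the traditional d-dimensional Brownian motion with diagonal Λ the drift term λ_{1,2} dt disappears.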

5 Application: multivariate Gaussian process regression

A useful application is multi-output prediction using the multivariate Gaussian process. The multivariate Gaussian process provides a solid and unified framework to make predictions with multiple responses by taking advantage of their correlations. As a regression problem, multivariate Gaussian process regression (MV-GPR) has closed-form expressions for the marginal likelihoods and predictive distributions, and thus parameter estimation can adopt the same optimization approaches as used in the conventional Gaussian process [4]. As a summary of MV-GPR in [4], the noise-free multi-output regression model is considered and the noise term is incorporated into the kernel function. Given n pairs of observations {(xi, yi)}_{i=1}^n, xi ∈ R^p, yi ∈ R^d, we assume the following model:

f ∼ MGP_d(0, k′, Λ),   yi = f(xi), for i = 1, · · · , n,

where Λ is an undetermined covariance (correlation) matrix (the relationship between different outputs), k′(xi, xj) = k(xi, xj) + δij σn², and δij is the Kronecker delta. According to the multivariate Gaussian process, it follows that the collection of functions [f(x1), ..., f(xn)] follows a matrix-variate Gaussian distribution,

[f(x1)^T, ..., f(xn)^T]^T ∼ MN(0, K′, Λ),

where K′ is the n × n covariance matrix of which the (i, j)-th element is [K′]ij = k′(xi, xj). Therefore, the distribution of the predictive targets

f∗ = [f∗1, ..., f∗m]^T

at the test locations

X∗ = [x_{n+1}, ..., x_{n+m}]^T

is given by

p(f∗ | X, Y, X∗) = MN(M̂, Σ̂, Λ̂),

where

M̂ = K′(X∗, X) K′(X, X)^{−1} Y,
Σ̂ = K′(X∗, X∗) − K′(X∗, X) K′(X, X)^{−1} K′(X∗, X)^T,

and Λ̂ = Λ.

Here K′(X, X) is an n × n matrix of which the (i, j)-th element is [K′(X, X)]ij = k′(xi, xj), K′(X∗, X) is an m × n matrix of which the (i, j)-th element is [K′(X∗, X)]ij = k′(x_{n+i}, xj), and K′(X∗, X∗) is an m × m matrix with the (i, j)-th element [K′(X∗, X∗)]ij = k′(x_{n+i}, x_{n+j}). In addition, the expectation and the covariance are obtained as

E[f∗] = M̂ = K′(X∗, X) K′(X, X)^{−1} Y,
cov(vec(f∗)) = Σ̂ ⊗ Λ̂ = [K′(X∗, X∗) − K′(X∗, X) K′(X, X)^{−1} K′(X∗, X)^T] ⊗ Λ.

From the view of data science, the hyperparameters involved in the covariance function (kernel) k′(·, ·) and the row covariance matrix Λ of MV-GPR need to be estimated from the training data using many approaches [17], such as maximum likelihood estimation, maximum a posteriori and Monte Carlo [18].
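The predictive equations above translate directly into code. The following numpy sketch is our illustrative implementation under an assumed squared-exponential kernel, noise level and Λ; it is not code released with [4].

```python
# Illustrative MV-GPR prediction following the formulas above (not the authors' code).
import numpy as np

def k(A, B, ls=1.0, amp=1.0):
    """Squared-exponential kernel between two sets of 1-d inputs (assumed choice)."""
    return amp * np.exp(-(A[:, None] - B[None, :]) ** 2 / (2 * ls ** 2))

rng = np.random.default_rng(3)
X = np.linspace(0, 5, 30)                                  # n training inputs (p = 1 here)
Y = np.column_stack([np.sin(X), np.cos(X)]) + 0.05 * rng.standard_normal((30, 2))   # d = 2 outputs
Xs = np.linspace(0, 5, 100)                                # m test inputs
sn2 = 0.05 ** 2                                            # assumed noise variance sigma_n^2
Lam = np.array([[1.0, 0.5], [0.5, 1.0]])                   # assumed row covariance Lambda

Kxx = k(X, X) + sn2 * np.eye(len(X))                       # K'(X, X)
Ksx = k(Xs, X)                                             # K'(X*, X), m x n
Kss = k(Xs, Xs)

alpha = np.linalg.solve(Kxx, Y)                            # K'(X,X)^{-1} Y
M_hat = Ksx @ alpha                                        # predictive mean, m x d
Sigma_hat = Kss - Ksx @ np.linalg.solve(Kxx, Ksx.T)        # predictive column covariance, m x m
cov_vec = np.kron(Lam, Sigma_hat)                          # cov(vec(f*)), ordering depends on the vec convention
print(M_hat.shape, Sigma_hat.shape)
```

In practice the kernel hyperparameters, σn² and Λ would be fitted by maximising the marginal likelihood, as noted above.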

6 Conclusion

In this paper, we give a proper definition of the multivariate Gaussian process (MV-GP) and establish some related properties of this process, such as strict stationarity and independence. We also provide examples of multivariate Gaussian white noise and multivariate Brownian motion, including the Itô lemma, and present a useful application to multivariate Gaussian process regression in statistical learning based on our definition.

Acknowledgements

The authors would like to thank Dr Youssef El-Khatib for his comments and Dr. Gregory Markowsky for his kind proofreading and very helpful comments.

References

[1] Mauricio A Alvarez, Lorenzo Rosasco, Neil D Lawrence, et al. Kernels for vector-valued functions: A review. Foundations and Trends® in Machine Learning, 4(3):195–266, 2012.

[2] Phillip Boyle and Marcus Frean. Dependent Gaussian processes. In Advances in Neural Information Processing Systems, pages 217–224, 2005.

[3] Sofiane Brahim-Belhouari and Amine Bermak. Gaussian process for nonstationary time series prediction. Computational Statistics & Data Analysis, 47(4):705–712, 2004.

[4] Zexun Chen, Bo Wang, and Alexander N Gorban. Multivariate Gaussian and Student-t process regression for multi-output prediction. Neural Computing and Applications, 32(8):3005–3028, 2020.

[5] Peter I Frazier. A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811, 2018.

[6] Arjun K Gupta and Daya K Nagar. Matrix variate distributions, volume 104. CRC Press, 1999.

[7] Olav Kallenberg. Foundations of modern probability. Springer Science & Business Media, 2006.

[8] John Lamperti. Stochastic processes: a survey of the mathematical theory, volume 23. Springer Science & Business Media, 2012.

[9] Jean-François Le Gall. Brownian motion, martingales, and stochastic calculus, volume 274. Springer, 2016.

[10] David JC MacKay. Gaussian processes - a replacement for supervised neural networks? 1997.

[11] Radford M Neal. Bayesian learning for neural networks, volume 118. Springer Science & Business Media, 2012.

[12] Balram S Rajput and Stamatis Cambanis. Gaussian processes and Gaussian measures. The Annals of Mathematical Statistics, pages 1944–1952, 1972.

[13] Carl Edward Rasmussen. Evaluation of Gaussian processes and other methods for non-linear regression. University of Toronto, 1999.

[14] Terrance Savitsky, Marina Vannucci, and Naijun Sha. Variable selection for nonparametric Gaussian process priors: Models and computational strategies. Statistical Science, 26(1):130, 2011.

[15] Bo Wang and Tao Chen. Gaussian process regression with multiple response variables. Chemometrics and Intelligent Laboratory Systems, 142:159–165, 2015.

[16] Christopher KI Williams. Computing with infinite networks. In Advances in Neural Information Processing Systems, pages 295–301, 1997.

[17] Christopher KI Williams and David Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342–1351, 1998.

[18] Christopher KI Williams and Carl Edward Rasmussen. Gaussian processes for regression. In Advances in Neural Information Processing Systems, pages 514–520, 1996.

[19] Fuzhen Zhang. The Schur complement and its applications, volume 4. Springer Science & Business Media, 2006.
