
INTRODUCTION TO GAUSSIAN PROCESSES

STEVEN P. LALLEY

1. DEFINITIONS AND EXAMPLES

Definition 1.1. A Gaussian process $\{X_t\}_{t\in T}$ indexed by a set $T$ is a family of (real-valued) random variables $X_t$, all defined on the same probability space, such that for any finite subset $F\subset T$ the random vector $X_F := \{X_t\}_{t\in F}$ has a (possibly degenerate) Gaussian distribution; if these finite-dimensional distributions are all non-degenerate then the Gaussian process is said to be non-degenerate. Equivalently, $\{X_t\}_{t\in T}$ is Gaussian if every finite linear combination $\sum_{t\in F} a_t X_t$ is either identically zero or has a Gaussian distribution on $\mathbb{R}$. A Gaussian process $\{X_t\}_{t\in T}$ is centered if $EX_t = 0$ for every $t\in T$, and its covariance function is the (symmetric) bivariate function

(1.1)   $R(s,t) = \mathrm{cov}(X_s, X_t) = E(X_s - EX_s)(X_t - EX_t)$.

The canonical metric $d$ associated with a non-degenerate Gaussian process $\{X_t\}_{t\in T}$ is

(1.2)   $d(s,t) = \sqrt{E|X_s - X_t|^2}$.

Covariance Functions: Recall that the covariance matrix of a multivariate Gaussian random vector is a symmetric, nonnegative definite matrix; the distribution is said to be non-degenerate if its covariance matrix is strictly positive definite. The mean vector and covariance matrix uniquely determine a Gaussian distribution; consequently, the mean function and covariance function of a Gaussian process completely determine all of the finite-dimensional distributions (that is, the joint distributions of finite subsets $X_F$ of the random variables). Thus, if two Gaussian processes $X = \{X_t\}_{t\in T}$ and $Y = \{Y_t\}_{t\in T}$ have the same mean and covariance functions, then for any event $B$ that depends on only finitely many coordinates,

(1.3)   $P\{X\in B\} = P\{Y\in B\}$.

Since any event can be arbitrarily well-approximated by events that depend on only finitely many coordinates, it follows that the equality (1.3) holds for all events $B$. Therefore, the processes $X$ and $Y$ are identical in law.

Gaussian processes: Examples

Example 1.1. The most important one-parameter Gaussian processes are the Wiener process $\{W_t\}_{t\ge 0}$ (Brownian motion), the Ornstein-Uhlenbeck process $\{Y_t\}_{t\in\mathbb{R}}$, and the Brownian bridge $\{W^\circ_t\}_{t\in[0,1]}$. These are the centered Gaussian processes with covariance functions

(1.4)   $EW_sW_t = \min(s,t)$,
(1.5)   $EY_sY_t = \exp\{-|t-s|\}$,
(1.6)   $EW^\circ_sW^\circ_t = \min(s,t) - st$.

Note: In certain situations we truncate the parameter space $T$ – in particular, sometimes we are interested in the Wiener process $W_t$ only for $t\in[0,1]$, or in the Ornstein-Uhlenbeck process $Y_t$ for $t\ge 0$.

Exercise 1.1. Check that if $W_t$ is a standard Wiener process, then the derived processes
$$W^\circ_t := W_t - tW_1 \quad\text{and}\quad Y_t := e^{-t}W_{e^{2t}}$$
have the same covariance functions as given above, and so these derived processes have the same "finite-dimensional distributions" as the Brownian bridge and Ornstein-Uhlenbeck process, respectively. Also, check that for any scalar $\alpha > 0$ the process
$$\tilde W_t := \alpha^{-1}W_{\alpha^2 t}$$
has the same covariance function, and therefore also the same finite-dimensional distributions, as $W_t$. (This correspondence is called Brownian scaling.)
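The covariance identities in Exercise 1.1 are easy to check empirically. The following sketch (my own illustration, not part of the notes) simulates discretized Wiener paths with numpy and compares the empirical covariance of the derived process $W_t - tW_1$ with $\min(s,t) - st$; the grid size, sample count, and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_paths, n_steps = 20000, 200
dt = 1.0 / n_steps
t_grid = np.linspace(dt, 1.0, n_steps)      # grid 1/n, 2/n, ..., 1

# Discretized Wiener paths: cumulative sums of independent N(0, dt) increments.
W = np.cumsum(rng.normal(0.0, np.sqrt(dt), (n_paths, n_steps)), axis=1)

# Derived process W°_t = W_t - t W_1 from Exercise 1.1.
bridge = W - W[:, -1:] * t_grid

i, j = 59, 139                               # s = 0.30, t = 0.70 on this grid
s, t = t_grid[i], t_grid[j]
print("empirical cov:", (bridge[:, i] * bridge[:, j]).mean())
print("min(s,t) - st:", min(s, t) - s * t)   # exact value 0.09
```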

Exercise 1.2. Let $W_t$ be a standard Wiener process, and let $f(t)$ be any continuous (nonrandom) function. Define
$$Z_t = \int_0^t W_s f(s)\,ds.$$
(The integral is well-defined because the Wiener process has continuous paths.) Show that $Z_t$ is a Gaussian process, and calculate its covariance function. HINT: First show that if a sequence $X_n$ of Gaussian random variables converges in distribution, then the limit distribution is Gaussian (but possibly degenerate).

Example 1.2. Let $\xi_1,\xi_2,\dots$ be independent, identically distributed unit normals. Then for any finite set of frequencies $\omega_i\ge 0$, the process
(1.7)   $X_t := \sum_{i=1}^m \xi_i\cos(\omega_i t)$
indexed by $t\in\mathbb R$ is a Gaussian process. This process has smooth sample paths (they are just random linear combinations of cosine waves). Note that for any finite set $F$ of cardinality larger than $m$ the random vector $X_F$ has a degenerate Gaussian distribution (why?).
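A quick numerical illustration (not from the notes, numpy assumed): with $m$ frequencies, the covariance matrix of the process (1.7) at any set of more than $m$ times has rank at most $m$, which is exactly the degeneracy remarked on above.

```python
import numpy as np

rng = np.random.default_rng(1)
omegas = np.array([1.0, 2.5, 7.0])        # m = 3 arbitrary frequencies
xi = rng.standard_normal(omegas.size)     # i.i.d. unit normals

def X(t):
    """Sample-path values X_t = sum_i xi_i cos(omega_i t) at the times t."""
    return np.cos(np.outer(np.atleast_1d(t), omegas)) @ xi

t_grid = np.linspace(0.0, 10.0, 5)        # 5 > m sample times
# Covariance matrix R(s,t) = sum_i cos(omega_i s) cos(omega_i t) has rank at most m,
# so the Gaussian vector X_F at these 5 times is degenerate.
C = np.cos(np.outer(t_grid, omegas))
print(np.linalg.matrix_rank(C @ C.T))     # prints 3, not 5
```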

Example 1.3. The two-parameter Brownian sheet $\{W_s\}_{s\in\mathbb R^2_+}$ is the mean-zero Gaussian process indexed by ordered pairs $s=(s_1,s_2)$ of nonnegative reals with covariance function

(1.8)   $EW_sW_t = \min(s_1,t_1)\min(s_2,t_2)$.

Observe that for each fixed $r>0$, the one-parameter process $Z^r_s := W_{s,r}$ has the same covariance function as a standard Wiener process multiplied by $\sqrt r$. Thus, the Brownian sheet has slices in the two coordinate directions that look like scaled Wiener processes. For figures showing simulations of Brownian sheets, see Mandelbrot's book Fractal Geometry of Nature.

Construction of Gaussian Processes. It is not at all obvious that the Gaussian processes in Examples 1.1 and 1.3 exist, nor what kind of sample paths/sheets they will have. The difficulty is that uncountably many random variables are involved. We will show that not only do all of the processes above exist, but that they have continuous sample functions. This will be done in two steps: First, we will show that Gaussian processes with countable index sets can always

be constructed from i.i.d. unit normals. Then, in section 2, we will show that under certain restrictions on the covariance function a Gaussian process can be extended continuously from a countable dense index set to a continuum. The following example shows that some restriction on the covariance is necessary.

Exercise 1.3. (A) Show that there is no Gaussian process $\{X_t\}_{t\in[0,1]}$ with continuous sample paths and covariance function

$$R(s,t)=0\ \text{ for }s\neq t \quad\text{and}\quad R(s,s)=1.$$

(B) Show that there is no probability space $(\Omega,\mathcal F,P)$ that supports random variables $\{X_t\}_{t\in[0,1]}$ such that (i) the mapping $X:\Omega\times[0,1]\to\mathbb R$ defined by $X(\omega,t)=X_t(\omega)$ is jointly measurable (here the implied $\sigma$-algebra on $[0,1]$ is the set of Lebesgue measurable subsets of $[0,1]$); and (ii) the process $\{X_t\}_{t\in[0,1]}$ is centered Gaussian with covariance function $R$ as in part (A).

Processes with Countable Index Sets. For each $m=1,2,\dots$, let $F_m$ be a (Borel) probability distribution on $\mathbb R^m$. Assume that these are mutually consistent in the following sense: for each Borel subset $B$ of $\mathbb R^m$,

(1.9)   $F_m(B) = F_{m+1}(B\times\mathbb R)$;

that is, $F_m$ is the marginal joint distribution of the first $m$ coordinates of a random vector with

distribution $F_{m+1}$. I will show that on some probability space are defined random variables $X_i$ such that for each $m$, the random vector $(X_1,X_2,\dots,X_m)$ has distribution $F_m$. In fact, any probability space that supports an i.i.d. sequence $U_i$ of uniform-[0,1] random variables will suffice.

Recall that any probability distribution $F$ on $\mathbb R$ can be "simulated" using a single uniform-[0,1] random variable $U$, by the quantile transform method. Hence, there is a (Borel) function $\varphi_1$ such that $X_1 := \varphi_1(U_1)$ has distribution $F_1$. Now suppose that $X_i$ have been constructed for $i\le m$ using the values $U_i$, with $i\le m$, in such a way that the joint distribution of the random vector $(X_i)_{i\le m}$ is $F_m$. Let $G_{m+1}$ be the conditional distribution on $\mathbb R^1$ of the $(m+1)$st coordinate of an $F_{m+1}$-distributed random vector given that the first $m$ coordinates have values $(X_i)_{i\le m}$. Use $U_{m+1}$ to produce a random variable $X_{m+1}$ with conditional distribution $G_{m+1}$ given the values $(X_i)_{i\le m}$. Then the joint distribution of $(X_i)_{i\le m+1}$ will be $F_{m+1}$, by the consistency hypothesis.

Now let $R(s,t)$ be a positive definite function indexed by a countable set $T$, that is, a symmetric function with the property that for every finite subset $F$ of $T$, the matrix $\Sigma_F := (R(s,t))_{s,t\in F}$ is positive definite. Without loss of generality, assume that $T=\mathbb N$. Then the sequence of distributions $F_m = \mathrm{Normal}(0,\Sigma_m)$ is mutually consistent (why?). Therefore, by the preceding paragraph, there exists a sequence of random variables $X_i$ such that for each finite subset $F\subset T$ the joint distribution of $X_F$ is Gaussian with mean zero and covariance $\Sigma_F$. $\square$
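For a finite index set the construction above reduces to a concrete recipe: apply a lower-triangular square root of the covariance matrix (a Cholesky factor) to a vector of i.i.d. standard normals; conditioning on the first $m$ coordinates corresponds to the triangular structure. A minimal numpy sketch, assuming a strictly positive definite covariance matrix (here the Wiener covariance $\min(s,t)$ restricted to a finite grid); the function name and parameters are mine, not from the notes.

```python
import numpy as np

def sample_gaussian_process(R, n_paths=1, rng=None):
    """Sample centered Gaussian vectors with covariance matrix R using a
    lower-triangular (Cholesky) square root: X = L Z with Z i.i.d. N(0,1)."""
    rng = rng or np.random.default_rng()
    L = np.linalg.cholesky(R)                      # requires R strictly positive definite
    Z = rng.standard_normal((R.shape[0], n_paths))
    return (L @ Z).T                               # shape (n_paths, |T|)

# Example: the Wiener covariance R(s,t) = min(s,t) on a finite grid of times.
t = np.linspace(0.01, 1.0, 100)
R = np.minimum.outer(t, t)
paths = sample_gaussian_process(R, n_paths=5, rng=np.random.default_rng(2))
```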

2. CONTINUOUS EXTENSIONS AND MAXIMA OF GAUSSIAN PROCESSES

2.1. Continuity. Exercise 1.3 shows that there are covariance functions $R$ such that no centered Gaussian process with covariance function $R$ has continuous sample paths. The example provided by Exercise 1.3 is somewhat pathological, though, in that the covariance $R$ is discontinuous

(and the corresponding canonical metric leads to the discrete topology). The following example shows that there are continuous covariance functions that are incompatible with sample-path continuity.

Example 2.1. Let $\mathbf n = n_1n_2\cdots$ be an infinite sequence of integers $n_i\ge 2$. Define the rooted infinite tree $T=T_{\mathbf n}$ to be the graph whose vertex set consists of all finite sequences $i_1i_2\cdots i_k$ such that $1\le i_j\le n_j$ for each $j\le k$, and whose edge set consists of all pairs $(\mathbf i,\mathbf i')$ such that $\mathbf i'$ is obtained by appending a single letter (integer) to $\mathbf i$, for instance

$$((2,7,13),(2,7,13,4)).$$
The empty sequence $\emptyset$ is the root of the tree. The boundary $\partial T$ is defined to be the set of all infinite self-avoiding paths $\mathbf i = i_1i_2\cdots$ that start at the root. For each $0<\alpha<1$ there is a natural metric $d_\alpha$ on $\partial T$ defined as follows: $d_\alpha(\mathbf i,\mathbf i') = \alpha^{N(\mathbf i,\mathbf i')}$, where $N(\mathbf i,\mathbf i')$ is the index of the first coordinate where the sequences $\mathbf i,\mathbf i'$ differ, e.g.,
$$N((2,7,4,5,8,1,\dots),(2,7,4,3,3,1,\dots)) = 4.$$

Define a centered Gaussian process $\{X_{\mathbf i}\}_{\mathbf i\in\partial T}$ indexed by the boundary $\partial T$ as follows: first, attach to each edge $e$ connecting vertices at levels $n-1$ and $n$ a mean-zero Gaussian random variable $\xi_e$ with variance $4^{-n}$, in such a way that the random variables $\{\xi_e\}_{e\in\mathcal E}$ are mutually independent. For $\mathbf i = i_1i_2\cdots\in\partial T$ define
$$X_{\mathbf i} = \sum_{n=1}^\infty \xi_{e_n(\mathbf i)} \quad\text{where}\quad e_n(\mathbf i) = (i_1i_2\cdots i_{n-1},\ i_1i_2\cdots i_{n-1}i_n),$$
that is, sum the random variables $\xi_e$ along the path from the root leading to $\mathbf i$. Observe that for each $\mathbf i\in\partial T$ the random variable $X_{\mathbf i}$ is centered Gaussian with variance $1/3$. You should check that the canonical metric is $d_\alpha$ for the choice $\alpha = ?$ (you figure it out). Theorem 1 below implies that if, for instance, $n_k\le 2^{2^k}$ then the Gaussian process $\{X_{\mathbf i}\}_{\mathbf i\in\partial T}$ has a version with continuous paths. However, if $n_k\ge 2^{2^{2^k}}$ then it doesn't. (Exercise!)

Definition 2.1. Let $(T,d)$ be a compact metric space. For each $\epsilon>0$ the Lebesgue covering number $N(\epsilon)$ is the minimum number of (open) $\epsilon$-balls needed to cover $T$.

Theorem 1. Let $d$ be the canonical metric of a non-degenerate, centered Gaussian process $\{X_t\}_{t\in T}$, and let $N(\epsilon)$ be the Lebesgue covering number function. If for some $\epsilon_0>0$
(2.1)   $\int_0^{\epsilon_0}\sqrt{\log N(\epsilon)}\,d\epsilon < \infty$
then the Gaussian process has a version with uniformly continuous sample paths. Consequently (since $T$ is compact), for this version,

(2.2)   $\sup_{t\in T}X_t = \max_{t\in T}X_t < \infty$  a.s.

Remark 1. This result can also be used as an extension theorem. In particular, if $(T,d)$ is not a complete metric space then the theorem implies that under the assumption (2.1) the Gaussian process has a uniformly continuous extension to the completion of $(T,d)$, by the next lemma.

Lemma 2.1. Let $D$ be a dense subset of a metric space $T$, and let $f:D\to\mathbb R$ be uniformly continuous on $D$, i.e., for each $\epsilon>0$ there exists $\delta>0$ such that $|f(y)-f(x)|<\epsilon$ for any two points $x,y\in D$ at distance $d(x,y)<\delta$. Then there is a unique uniformly continuous extension of $f$ to $T$.

Proof. This is a routine exercise in elementary real analysis. $\square$

Proof of Theorem 1. Assume without loss of generality that $\epsilon_0 = 1$ (if not, rescale). Fix $r=4$, and for each $n=0,1,2,\dots$ let $T_n$ be a subset of $T$ with cardinality $N(r^{-n})$ such that the $r^{-n}$-balls centered at the points of $T_n$ cover $T$. Clearly, the set $D=\cup_{n\ge 0}T_n$ is a countable dense subset of $T$. The plan will be to use Borel-Cantelli to show that with probability one the restriction of the Gaussian process to the index set $D$ is uniformly continuous; Lemma 2.1 will then imply that there is a continuous extension to $T$.

For each $n\ge 0$ define $D_n = \cup_{k=0}^n T_k$; the sets $D_n$ are nested, and $D_n$ has cardinality $N_n\le (n+1)N(r^{-n})$, and so the hypothesis (2.1) implies that

$$\sum_{n=0}^\infty r^{-n}A_n < \infty \quad\text{where}\quad A_n = \sqrt{2\log(4^nN_n^2)}.$$
Consider the events
$$B_n = \{\exists\, s,t\in D_n : |X_s - X_t|\ge A_n\,d(s,t)\}.$$
Since there are at most $N_n^2$ pairs of points in $D_n$, Bonferroni implies that
$$P(B_n)\le CN_n^2\exp\{-n\log 4 - \log N_n^2\}\le C4^{-n}.$$
(Here $C$ is any constant such that if $Z$ is a standard Gaussian random variable then $P\{Z>x\}\le C\exp\{-x^2/2\}$ for all $x\ge 0$.) This is summable, so Borel-Cantelli implies that with probability one only finitely many of the events $B_n$ occur. In particular, w.p.1 there exists $K<\infty$ (random) such that $\sum_{n\ge K}\mathbf 1_{B_n} = 0$. We will show that this implies that the mapping $s\mapsto X_s$ is uniformly continuous on $D$.

Fix $\epsilon>0$, and let $m=m(\epsilon)$ be such that $\sum_{n\ge m}r^{-n}A_n < \epsilon$. Suppose that $s,t\in D$ are any two points at distance less than $r^{-m}$. Let $l\ge 0$ be the smallest nonnegative integer such that $s,t\in D_{m+l}$. Then there are chains of points
$$s = s_k\in T_{m+k},\ s_{k-1}\in T_{m+k-1},\ \dots,\ s_0\in T_m,$$
$$t = t_l\in T_{m+l},\ t_{l-1}\in T_{m+l-1},\ \dots,\ t_0\in T_m$$
connecting $s$ to $t$ such that at every link of the chain

$$d(s_j,s_{j-1})\le 4^{-m-j+2},\qquad d(t_j,t_{j-1})\le 4^{-m-j+2},\qquad\text{and}\qquad d(s_0,t_0)\le 4^{-m+2}.$$
If $m\ge K$ (as it will be eventually as $\epsilon\to 0$) then along any link $(u,v)$ of these chains $|X_u - X_v| < A_j\,d(u,v)$, where $j$ is the depth of the link. It then follows that

$$|X_s - X_t| < 3\epsilon.$$
Thus, the restriction of the Gaussian process to $D$ is, with probability one, uniformly continuous. $\square$

Corollary 2.2. If $T$ is a compact subset of $\mathbb R^k$ and if $R(s,t)$ is a symmetric, Lipschitz continuous, nonnegative definite kernel on $T\times T$, then there is a centered Gaussian process $\{X_t\}_{t\in T}$ with continuous sample paths and covariance function $R(s,t)$.

Proof. Exercise. $\square$

Corollary 2.3. Let $T$ be a compact interval in $\mathbb R^1$ and assume that $R(s,t)$ satisfies the hypotheses of Corollary 2.2. If $R$ is twice continuously differentiable, and if the mixed second partials are Lipschitz continuous on $T\times T$, then the centered Gaussian process $\{X_t\}_{t\in T}$ with covariance function $R(s,t)$ has a differentiable version, and the derivative process $\{X'_t\}_{t\in T}$ is a centered Gaussian process with covariance function
(2.3)   $\mathrm{cov}(X'_s,X'_t) = \dfrac{\partial^2 R}{\partial s\,\partial t}(s,t)$.

Proof (Sketch). Define a Gaussian process on $T\times T\setminus\{(t,t)\}_{t\in T}$ by
$$Y_{s,t} = \frac{X_t - X_s}{t - s}.$$
The hypotheses on $R$ guarantee that the covariance function of the process $\{Y_{s,t}\}$ has a Lipschitz continuous extension to $T\times T$, so the preceding corollary implies that $Y_{s,t}$ extends continuously to all of $T\times T$. $\square$

2.2. Maximum of a Gaussian process. Theorem 1 implies that if the metric entropy criterion (2.1) holds then the Gaussian process has a version with continuous sample paths, and for this version the maximum must be attained with probability one. A minor variation of the argument used to prove Theorem 1 shows that the expectation of the max is bounded by a multiple of the metric entropy integral.

Theorem 2. There is a universal constant $K<\infty$ such that for any non-degenerate, centered Gaussian process $\{X_t\}_{t\in T}$ and any countable dense subset $D\subset T$,
(2.4)   $E\sup_{t\in D}X_t \le K\int_0^{\mathrm{diam}(T)/2}\sqrt{\log N(\epsilon)}\,d\epsilon.$

This is stated so that it holds for any version, whether or not it is continuous. However, it is easy to see, given Theorem 1, that the statement is equivalent to the statement that for a continuous version,
$$E\max_{t\in T}X_t \le K\int_0^{\mathrm{diam}(T)/2}\sqrt{\log N(\epsilon)}\,d\epsilon.$$
What is even more striking is that the sup is highly concentrated around its mean. This is an easy consequence of the Gaussian concentration inequality proved earlier in the course.
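A rough numerical sanity check of Theorem 2 (my own illustration, not from the notes): for the Wiener process on $[0,1]$ the canonical metric is $d(s,t)=\sqrt{|s-t|}$, so $N(\epsilon)$ is roughly $1/(2\epsilon^2)$, the diameter is 1, and the entropy integral is finite; Monte Carlo gives $E\max_t W_t\approx\sqrt{2/\pi}\approx 0.8$, consistent with a bound of the form $K\times$ (entropy integral). The grid sizes and the approximation to $N(\epsilon)$ are assumptions of the sketch.

```python
import numpy as np

# Entropy integral for T = [0,1] with d(s,t) = sqrt(|s-t|): an eps-ball is an
# interval of length 2*eps**2, so N(eps) is roughly ceil(1/(2*eps**2)).
eps = np.linspace(1e-4, 0.5, 100000)            # integrate up to diam(T)/2 = 1/2
N = np.ceil(1.0 / (2.0 * eps**2))
entropy_integral = np.sqrt(np.log(N)).sum() * (eps[1] - eps[0])   # roughly 0.8

# Monte Carlo estimate of E[max_{0<=t<=1} W_t] (exact value sqrt(2/pi) ~ 0.80).
rng = np.random.default_rng(3)
n_paths, n_steps = 20000, 1000
W = np.cumsum(rng.normal(0, np.sqrt(1 / n_steps), (n_paths, n_steps)), axis=1)
print("entropy integral:", entropy_integral)
print("Monte Carlo E[max W_t]:", W.max(axis=1).mean())
```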

Theorem 3. Let $\{X_t\}_{t\in T}$ be a centered Gaussian process on a countable set $T$ that is almost surely bounded. Write $X^* = \sup_{t\in T}X_t$. If $\sigma_T^2 := \sup_{t\in T}EX_t^2 < \infty$, then $EX^* < \infty$ and for any $u>0$
(2.5)   $P\{X^*\ge EX^* + u\}\le \exp\{-u^2/2\sigma_T^2\}.$

Proof. Consider first the case where $T=[m]$ is finite, and without loss of generality assume that $\sigma_T^2\le 1$. Then $X := \{X_t\}_{t\in T}$ is just a mean-zero Gaussian random vector, and $X^*$ is the value of the maximum coordinate. This has finite expectation because, for instance, it is bounded by $\sum_{j\le m}|X_j|$. Furthermore, if $k\le m$ is the rank of the covariance matrix then there is a linear transformation $A:\mathbb R^k\to\mathbb R^m$ and a standard Gaussian random vector $Y$ in $\mathbb R^k$ such that $X = AY$. Because $\sigma_T^2\le 1$, the components $X_j$ of $X$ have variances bounded by 1, and so the rows of $A$ all have length bounded by 1. Consequently, the mapping $A^*:\mathbb R^k\to\mathbb R$ that takes $y\in\mathbb R^k$ to the largest coordinate of the $m$-vector $Ay$ is 1-Lipschitz. Hence, the inequality (2.5) is an immediate consequence of the Gaussian concentration theorem.

Now let $T$ be countable, and let $T_1\subset T_2\subset\cdots$ be finite subsets of $T$ whose union is $T$. Set $\mu_n = E\max_{t\in T_n}X_t$; since the sets $T_n$ are nested, the sequence $\mu_n$ is nondecreasing. There are two possibilities: either $\lim\mu_n<\infty$ or $\lim\mu_n=\infty$. But if $\lim\mu_n=\infty$ then it must be the case that $X^*=\infty$ almost surely, because $X^*\ge\max_{t\in T_n}X_t$, and since (2.5) holds for each $T_n$, the maximum of $X_t$ on $T_n$ is highly concentrated about $\mu_n$. (Exercise: Fill in the details.) Thus, if $X^*<\infty$ almost surely then $\lim\mu_n<\infty$, and by monotone convergence
$$EX^* = E\lim_{n\to\infty}\max_{t\in T_n}X_t = \lim_{n\to\infty}\mu_n < \infty.$$
Finally, to prove the concentration inequality, observe that for any $u>0$,

$$P\{X^*> EX^*+u\} = \lim_n P\{\max_{T_n}X_t > EX^*+u\} \le \lim_n P\{\max_{T_n}X_t > \mu_n+u\} \le \exp\{-u^2/2\sigma_T^2\}. \qquad\square$$
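The concentration inequality (2.5) is easy to see in simulation. The following sketch (mine, not from the notes) discretizes the Wiener process on $[0,1]$, for which $\sigma_T^2 = 1$, and compares the empirical upper tail of the maximum with the bound $\exp\{-u^2/2\}$.

```python
import numpy as np

# Empirical check of (2.5) for a discretized Wiener process on [0,1]
# (sigma_T^2 = max_t Var(W_t) = 1).
rng = np.random.default_rng(4)
n_paths, n_steps = 50000, 200
W = np.cumsum(rng.normal(0, np.sqrt(1 / n_steps), (n_paths, n_steps)), axis=1)
X_star = W.max(axis=1)
mean_max = X_star.mean()
for u in (0.5, 1.0, 1.5):
    emp = (X_star >= mean_max + u).mean()
    print(f"u={u}: empirical tail {emp:.4f} <= bound {np.exp(-u**2 / 2):.4f}")
```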

3. REPRODUCING KERNEL HILBERT SPACE OF A GAUSSIAN PROCESS

Let $R(s,t)$ be the covariance function of a centered, non-degenerate Gaussian process $\{X_t\}_{t\in T}$, and let $L^2_X$ be the closure, relative to the $L^2$-norm, of the vector space consisting of all finite linear combinations of the random variables $X_t$. Since $L^2$-limits of centered Gaussian random variables are centered Gaussian random variables, every random variable in $L^2_X$ is Gaussian. The reproducing kernel Hilbert space (abbreviated RKHS) associated to the Gaussian process $\{X_t\}_{t\in T}$ is a Hilbert space of real-valued functions on $T$ that is naturally isometric to $L^2_X$. The isometry between these Hilbert spaces leads to useful spectral representations of the Gaussian process, notably the Karhunen-Loève representation. The construction of the RKHS is relatively painless, but because the idea behind it is a bit obscure we will first look at another more transparent isometry of Hilbert spaces, the Wiener integral.

3.1. The Wiener Integral. The idea – at least, some version of it – lies behind Wiener’s original 1920 construction of the Wiener process. Wiener realized that if a centered Gaussian process

$\{W_t\}_{t\in[0,1]}$ has the covariance function (1.4) then for $s<t$ and $u<v$,

(3.1)   $E(W_t-W_s)(W_v-W_u) = \int_0^1 \mathbf 1_{(s,t]}(x)\,\mathbf 1_{(u,v]}(x)\,dx,$

that is, the $L^2$ inner product of Wiener process increments is identical to the $L^2$ inner product of the corresponding time interval indicator functions relative to Lebesgue measure. Because both expectation and Lebesgue integral are linear, it follows that the inner product identity (3.1) extends to all finite linear combinations. Now finite linear combinations of indicator functions of intervals are just step functions, and step functions are dense, relative to the $L^2$-distance, in $L^2[0,1]$, so the linear mapping

(3.2)   $T\Big(\sum_{i=1}^k a_i\mathbf 1_{(s_i,t_i]}\Big) = \sum_{i=1}^k a_i(W(t_i)-W(s_i))$

extends to a linear mapping $T$ of $L^2[0,1]$ into the space $L^2(P)$ of square-integrable random variables, by the following general principle.

Proposition 3.1. Let $H_0$ be a dense, linear subspace of a Hilbert space $H$, and let $J:H_0\to H'$ be a linear isometry mapping $H_0$ into another Hilbert space $H'$. Then $J$ extends uniquely to a linear isometry $J:H\to H'$.

Proof. Exercise. If you don't know what a Hilbert space is, just assume that $H$ and $H'$ are closed linear subspaces of $L^2$ spaces. $\square$

Corollary 3.2. If $\{W_t\}_{t\in[0,1]}$ is a Wiener process defined on some probability space $(\Omega,\mathcal F,P)$ then the mapping $T(\mathbf 1_{(s,t]}) = W_t - W_s$ extends in a unique way to a linear isometry $T:L^2[0,1]\to L^2(P)$.

The isometry $T$ is called the Wiener isometry or the Wiener integral. Since a Wiener process is a Gaussian process, all linear combinations (3.2) of Wiener increments are mean-zero Gaussian random variables, so all step functions are mapped by $T$ to Gaussians. Since $L^2$-limits of Gaussian random variables are Gaussian, it follows that the image $T(L^2[0,1])$ contains only Gaussian random variables.

One of the fundamental theorems of Hilbert space theory is that any Hilbert space has a complete orthonormal basis. The Hilbert space $L^2[0,1]$ is separable, so it has countable orthonormal bases (for instance, the Fourier basis and the Haar basis). Let $\{\psi_n\}_{n\ge 1}$ be any such basis. Since the Wiener integral $T$ is an isometry, the random variables $\xi_n = T\psi_n$ are orthogonal – in particular, uncorrelated, and therefore independent – standard normals, and these form a complete orthonormal basis of $T(L^2[0,1])$. By Parseval's theorem, any $f\in L^2[0,1]$ has the representation
$$f = \sum_{n=1}^\infty\langle f,\psi_n\rangle\,\psi_n \quad\text{where}\quad \langle f,g\rangle = \int_0^1 f(x)g(x)\,dx,$$
where the series converges in $L^2$-norm, and in particular, for any $t\in[0,1]$,
$$\mathbf 1_{[0,t]} \overset{L^2}{=} \sum_{n=1}^\infty\Big(\int_0^t\psi_n(s)\,ds\Big)\psi_n.$$
Consequently, for any $t\in[0,1]$,
(3.3)   $W_t \overset{L^2}{=} \sum_{n=1}^\infty \xi_n\Psi_n(t) \quad\text{where}\quad \Psi_n(t) := \int_0^t\psi_n(s)\,ds.$

This expansion is valid for any orthonormal basis $\{\psi_n\}_{n\ge 1}$. For the Fourier cosine basis $\{1,\sqrt 2\cos(\pi nt)\}_{n\ge 1}$, the indefinite integrals $\Psi_n$ are sines, and the expansion reads

(3.4)   $W_t = \xi_0 t + \sum_{n=1}^\infty \frac{\sqrt 2\,\sin(\pi nt)}{\pi n}\,\xi_n.$

Wiener's 1920 insight was that the isometry can be "reversed", that is, if $(\Omega,\mathcal F,P)$ is a probability space that supports a sequence $\{\xi_n\}_{n\ge 1}$ of i.i.d. standard normal random variables then the mapping $T\psi_n = \xi_n$ extends to a linear isometry of $L^2[0,1]$ into $L^2(P)$, and therefore (3.3) defines a Wiener process. This doesn't prove that the Wiener process has continuous sample paths, because the convergence of the series (3.3) is only $L^2$-convergence for each $t$. But Wiener was able to show, by brute force, that for the Fourier basis the series (3.4) actually converges uniformly in $t$, w.p.1, and so at least for this basis the series defines a Wiener process with continuous paths. Paul Lévy later realized that it is much easier to prove uniform convergence of the series (3.3) for the Haar basis $\{\psi_n\}$ than for the Fourier basis. This is the existence proof given in most standard textbooks today.
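A numerical illustration of the series construction (not part of the notes): truncate the expansion (3.4) (equivalently (3.3) for the cosine basis), draw many independent coefficient vectors, and check that the empirical covariance is close to $\min(s,t)$. The truncation level, grid, and seed are arbitrary choices; numpy assumed.

```python
import numpy as np

rng = np.random.default_rng(5)
n_terms, n_paths = 1000, 2000
t = np.linspace(0.0, 1.0, 201)
n = np.arange(1, n_terms + 1)

# Basis integrals: Psi_0(t) = t and Psi_n(t) = sqrt(2) sin(pi n t) / (pi n), n >= 1.
Psi = np.concatenate(
    [t[:, None], np.sqrt(2) * np.sin(np.pi * np.outer(t, n)) / (np.pi * n)], axis=1)

xi = rng.standard_normal((n_paths, n_terms + 1))   # i.i.d. standard normal coefficients
W = xi @ Psi.T                                     # each row is a truncated-series path

# Empirical covariance at s = 0.25, t = 0.75 should be close to min(s,t) = 0.25.
i, j = 50, 150
print((W[:, i] * W[:, j]).mean())
```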

Exercise 3.1. Let $f\in L^2[0,1]$ be a strictly positive function with $L^2$-norm 1, i.e., such that $\int_0^1 f(s)^2\,ds = 1$. For $t\in[0,1]$ define
$$F(t) = \int_0^t f(s)^2\,ds,$$
$$\tau(t) = F^{-1}(t) = \min\{s : F(s) = t\},\quad\text{and}$$
$$Y(t) = T(f\,\mathbf 1_{[0,t]}).$$
Show that $Y(\tau(t))$ is a Wiener process, that is, a mean-zero Gaussian process with covariance function (1.4). Interpretation: $f(s)\,dW_s$ is a Brownian increment whose "volatility" is multiplied by $|f(s)|$. Thus, $\tau(t)$ runs the integral until the total accumulated squared volatility (variance) reaches $t$.

Exercise 3.2. Let $g\in L^2[0,1]$ and define $G(t) = \int_t^1 g(s)\,ds$. Note that $G(t)$ is continuous (why?). Also, if $W(t)$ is a standard Wiener process with continuous paths, then the integral $\int_0^1 g(s)W(s)\,ds$ is well-defined as a Riemann integral. Show that
$$T(G) = \int_0^1 g(s)W(s)\,ds.$$
Also, show that
$$ET(G)^2 = \int_0^1 G(s)^2\,ds = \iint_{[0,1]^2} g(s)g(t)\min(s,t)\,ds\,dt.$$

3.2. Reproducing Kernel Hilbert Space. A symmetric, real-valued function $R:T\times T\to\mathbb R$ is said to be a positive definite kernel if for every finite subset $F\subset T$ the $|F|\times|F|$ matrix $(R(s,t))_{s,t\in F}$ is positive definite. The covariance function of any nondegenerate Gaussian process has this property. Every positive definite kernel $R$ on $T$ induces a canonical metric $d$ on $T$ by

(3.5)   $d(s,t) = \sqrt{R(t,t)+R(s,s)-2R(s,t)}.$

Lemma 3.3. The definition (3.5) specifies a metric on T .

Proof. This is most easily proved by making a connection with Gaussian processes. To prove that $d$ is a metric on $T$ it suffices to prove that it is a metric on any finite subset of $T$ (because the defining properties of a metric – symmetry, positivity, and the triangle inequality – involve only two or three points at a time). Now if $F$ is a finite subset of $T$ then the matrix $(R(i,j))_{i,j\in F}$ is a symmetric, positive definite $|F|\times|F|$ matrix, and hence is the covariance matrix of a centered, nondegenerate Gaussian random vector $(X_i)_{i\in F}$. By construction,
$$d(i,j)^2 = E|X_i - X_j|^2,$$
so $d$ is the canonical metric of the Gaussian process $(X_i)_{i\in F}$. But the canonical metric of a non-degenerate Gaussian process is a genuine metric. $\square$

Assume for the remainder of this section that $R(s,t)$ is a positive definite kernel on $T$. For each $s\in T$ let $R_s:T\to\mathbb R$ be the $s$-section of $R$, that is, the function defined by $R_s(t) = R(s,t)$. Let $H_0$ be the vector space of real-valued functions on $T$ consisting of all finite linear combinations

$\sum_i a_iR_{s_i}$. Define an inner product on $H_0$ as follows:

(3.6)   $\Big\langle \sum_i a_iR_{s_i},\ \sum_j b_jR_{t_j}\Big\rangle = \sum_i\sum_j a_ib_j\,R(s_i,t_j).$

This is clearly bilinear, and since $R$ is assumed to be positive definite it defines an inner product (as you should check!). This inner product is designed so that the mapping $s\mapsto R_s$ is an isometry of $T$ (with the canonical metric) into $H_0$ (with the metric induced by the inner product (3.6)): in particular, for any $s,t\in T$,
(3.7)   $\|R_s - R_t\|^2 = R(t,t)+R(s,s)-2R(s,t) = d(s,t)^2.$
Observe that when $R$ is the covariance function of a centered Gaussian process $\{X_t\}_{t\in T}$, the inner product (3.6) makes the mapping $T:H_0\to L^2(P)$ defined by

(3.8)   $T\Big(\sum_i a_iR_{s_i}\Big) = \sum_i a_iX_{s_i}$

a linear isometry. The pairing of the inner product (3.6) with the positive definite kernel $R$ is explicitly made so that the kernel $R(s,t)$ will be a reproducing kernel relative to the inner product, in the following sense: for any function $\varphi\in H_0$ and any $t\in T$,
(3.9)   $\varphi(t) = \langle R_t,\varphi\rangle.$

Definition 3.1. The reproducing kernel Hilbert space $H$ associated with the covariance kernel $R$ is the closure of $H_0$ with respect to the norm induced by the inner product (3.6).

In the special case where $R$ is the covariance function of a centered Gaussian process $\{X_t\}_{t\in T}$, Proposition 3.1 implies that the linear isometry $T$ defined by (3.8) extends in a unique way to all of $H$, since by construction $H_0$ is a dense linear subspace of $H$.

Lemma 3.4. If $T$ is separable relative to the canonical metric then so is the reproducing kernel Hilbert space $H$, and if $T$ is compact then the set $\{R_s\}_{s\in T}$ is compact, and therefore bounded, in $H$.

Proof. By construction, the mapping $s\mapsto R_s$ is an isometry. Therefore, if $T$ is compact (separable) then the set $\{R_s\}_{s\in T}$ is compact (separable) in $H$. If the set $\{R_s\}_{s\in T}$ is separable, it has a countable dense subset $D$, and if this is the case then the (countable) set of rational linear combinations of elements of $D$ is dense in the reproducing kernel Hilbert space $H$. $\square$

Lemma 3.5. The set $\{R_s\}_{s\in T}$ is bounded in the reproducing kernel Hilbert space $H$ if and only if $\sup_{t\in T}R(t,t)<\infty$. If this is the case then
(a) the function $(s,t)\mapsto R(s,t)$ is Lipschitz in each variable $s,t$; and
(b) for every element $\varphi\in H$ the function $t\mapsto\langle R_t,\varphi\rangle$ is a Lipschitz continuous function on $T$, with Lipschitz constant bounded by $\|\varphi\|$ and sup-norm bounded by $\|\varphi\|\max_{s\in T}\|R_s\|$.
Thus, the elements $\varphi\in H$ can be identified with Lipschitz functions $t\mapsto\langle R_t,\varphi\rangle$ on $T$.

Proof. By the reproducing property (or equivalently the definition of the inner product (3.6)), $\|R_s\|^2 = \langle R_s,R_s\rangle = R(s,s)$. Hence, $\sup_{t\in T}R(t,t)<\infty$ if and only if $\sup_{t\in T}\|R_t\|<\infty$.

Assume now that $C = \sup_{s\in T}\|R_s\|<\infty$. By the reproducing property and the Cauchy-Schwarz inequality, for any $s_1,s_2,t\in T$,
$$|R(s_1,t)-R(s_2,t)| = |\langle R_{s_1}-R_{s_2},R_t\rangle| \le \|R_{s_1}-R_{s_2}\|\,\|R_t\| = d(s_1,s_2)\,\|R_t\| \le C\,d(s_1,s_2);$$
consequently, $R(s,t)$ is Lipschitz in $s$, with Lipschitz constant bounded by $C$. Since $R(s,t)=R(t,s)$, it follows that $R$ is Lipschitz in the second variable $t$ as well.

Finally, let $\varphi$ be any element of the reproducing kernel Hilbert space $H$, and consider the mapping $t\mapsto\langle\varphi,R_t\rangle$. For any two points $s,t\in T$,
$$|\langle R_t,\varphi\rangle - \langle R_s,\varphi\rangle| = |\langle R_t-R_s,\varphi\rangle| \le d(s,t)\,\|\varphi\|.$$
This implies that the function $t\mapsto\langle R_t,\varphi\rangle$ is Lipschitz, with Lipschitz constant bounded by $\|\varphi\|$. Furthermore, by Cauchy-Schwarz, for any $t\in T$,
$$|\langle R_t,\varphi\rangle| \le \|\varphi\|\sup_{t\in T}\|R_t\|. \qquad\square$$

Note: Lest you be confused about the identification of elements $\varphi\in H$ with functions $t\mapsto\langle R_t,\varphi\rangle$: keep in mind that $H$ was constructed as the abstract completion of the inner product space $H_0$, so (technically) its elements are equivalence classes of Cauchy sequences in $H_0$, not functions on $T$ (even though the elements of $H_0$ are functions on $T$). Nevertheless, when the hypotheses of Lemma 3.5 hold, we will act as if the elements $\varphi\in H$ are Lipschitz functions and write $\varphi(t)=\langle R_t,\varphi\rangle$.

Proposition 3.6. If $T$ is compact relative to the canonical metric $d$ of the Gaussian process then for any (countable) orthonormal basis $\{\psi_n\}_{n\ge 1}$ of $H$, and any $s,t\in T$,
(3.10)   $R(s,t) = \sum_{n\ge 1}\langle R_s,\psi_n\rangle\,\psi_n(t) = \sum_{n\ge 1}\psi_n(s)\psi_n(t),$
and the convergence is uniform for $s,t\in T$.

Note: This is a variant of Mercer's theorem. The hypothesis that $T$ is compact ensures that the reproducing kernel Hilbert space is separable, and hence that every orthonormal basis is countable (or has cardinality $|T|$ if $T$ is finite). If $T$ is finite the expansion (3.10) follows from the spectral theorem for finite symmetric matrices, so we will restrict attention to the case where $T$ is compact but infinite.

Proof. By construction, the mapping $s\mapsto R_s$ is an isometry. Therefore, if $T$ is compact then by Lemma 3.5, $R$ is Lipschitz in each variable. This implies that the mapping $s\mapsto R(s,s)$ is Lipschitz, and that $\max_{s\in T}R(s,s)<\infty$.

If $T$ is compact and infinite then every orthonormal basis for the reproducing kernel Hilbert space is countable. Let $\{\psi_n\}_{n\ge 1}$ be any orthonormal basis. By Parseval's theorem, for any $s\in T$ the $s$-section $R_s$ has the expansion
(3.11)   $R_s = \sum_{n\ge 1}\langle R_s,\psi_n\rangle\,\psi_n = \sum_{n\ge 1}\psi_n(s)\,\psi_n$
where the convergence is in the $H$-norm. (The second identity follows from the reproducing property of the kernel.) Moreover, for any $s\in T$,
(3.12)   $R(s,s) = \|R_s\|^2 = \sum_{n\ge 1}\langle R_s,\psi_n\rangle^2 = \sum_{n\ge 1}\psi_n(s)^2.$
Since $T$ is compact, the function $s\mapsto R(s,s)$ is Lipschitz, by Lemma 3.5, and therefore continuous. Moreover, the functions $\psi_n(s)^2$ are (Lipschitz) continuous, also by Lemma 3.5, and obviously nonnegative. Consequently, Dini's theorem¹ implies that the series $\sum_{n\ge 1}\psi_n(s)^2$ converges to $R(s,s)$ uniformly for $s\in T$.

To complete the proof we will show that the uniform convergence of the series $\sum_{n\ge 1}\psi_n(s)^2$ implies that the series $\sum_{n\ge 1}\psi_n(s)\psi_n(t)$ converges uniformly for $(s,t)\in T\times T$. The uniform convergence follows by Cauchy-Schwarz, because for any $m\ge 1$,

$$\Big|\sum_{n\ge m}\psi_n(s)\psi_n(t)\Big|^2 \le \sum_{n\ge m}\psi_n(s)^2\,\sum_{n\ge m}\psi_n(t)^2;$$
since $\sum_{n\ge 1}\psi_n(s)^2$ converges uniformly for $s\in T$, the right side of the inequality can be made uniformly small by choosing $m$ large. To see that the limit of the series $\sum_{n\ge 1}\psi_n(s)\psi_n(t)$ is $R(s,t)$, we use the reproducing property of the kernel together with the $H$-convergence (3.11). This implies that

$$R(s,t) = \langle R_s,R_t\rangle = \sum_{n\ge 1}\psi_n(s)\langle\psi_n,R_t\rangle = \sum_{n\ge 1}\psi_n(s)\psi_n(t). \qquad\square$$

Proposition 3.7. If $T$ is compact relative to the canonical metric then on any probability space that supports a sequence $\{\xi_n\}_{n\ge 1}$ of independent $N(0,1)$ random variables there exists a centered Gaussian process $\{X_t\}_{t\in T}$ with covariance function $R$. For any orthonormal basis $\{\psi_n\}_{n\ge 1}$ and any $t\in T$ the random variable $X_t$ is the almost sure limit of the series
(3.13)   $X_t = \sum_{n=1}^\infty\xi_n\psi_n(t).$

¹ Dini's theorem states that if $f_n(t)$ is a nondecreasing sequence of nonnegative continuous functions on a compact space $T$ such that $f_n(t)\uparrow f(t)$ for every $t\in T$, where $f$ is a continuous function, then the convergence is uniform.

Proof. If $T$ is separable then so is the reproducing kernel Hilbert space $H$, by the same argument as in the proof of Proposition 3.6. Consequently, $H$ has a countable orthonormal basis $\{\psi_n\}$, and for each $s\in T$ the $s$-section $R_s$ has the expansion (3.11), and has norm $\|R_s\|$ given by (3.12). This implies that the series $\sum_{n\ge 1}\psi_n(s)^2$ converges to a finite limit.

Suppose that $\{\xi_n\}_{n\ge 1}$ is a sequence of independent $N(0,1)$ random variables. Then for any $s\in T$ the series
$$X_s := \sum_{n\ge 1}\psi_n(s)\,\xi_n$$
converges in $L^2(P)$ and almost surely, since the variances $\psi_n(s)^2$ are summable. (The sequence of partial sums is an $L^2$-bounded martingale, and a basic result of martingale theory asserts that any such martingale converges both a.s. and in $L^2(P)$.) By the usual arguments, the limit random variable $X_s$ is centered Gaussian, and the process $\{X_s\}_{s\in T}$ is Gaussian. The covariance function is
$$EX_sX_t = E\sum_n\sum_m\psi_n(s)\psi_m(t)\xi_n\xi_m = \sum_{n\ge 1}\psi_n(s)\psi_n(t) = R(s,t). \qquad\square$$

Theorem 4. (Itô-Nisio) Let $\{X_t\}_{t\in T}$ be a centered Gaussian process with covariance function $R$ and continuous sample paths. Assume also that $T$ is compact relative to the canonical metric $d$. Let $\{\psi_n\}_{n\ge 1}$ be any orthonormal basis for the reproducing kernel Hilbert space, and denote by $\xi_n = T\psi_n$ the image of $\psi_n$ under the canonical isometry $T$ (see equation (3.8)). Then the random variables $\xi_n$ are independent Normal-(0,1), and

(3.14)   $X_t = \sum_{n\ge 1}\xi_n\psi_n(t).$
With probability one, the series converges uniformly in $t$.

The main point of the theorem is the uniform convergence. The hypothesis that the Gaussian process has a version with continuous paths is crucial, as there are Gaussian processes with the property that every version is discontinuous at every point with probability one (see Example 2.1 above). Theorem 4 is somewhat harder to prove than the other results of this section, so I will omit it and refer you to the original paper of Itô and Nisio. Note: Since the random variables $\xi_n$ are i.i.d., the series $\sum\xi_n^2 = \infty$ a.s. Therefore, the expansion (3.14) does not converge in $H$, and in particular the sample paths $t\mapsto X_t$ of the Gaussian process are a.s. not elements of $H$. (Keep this in mind when reading Example 3.2 below.)

3.3. Examples of RKHS.

Example 3.1. Consider first the case where the index set $T=[m]$ is finite, so that the covariance kernel $R(s,t)$ is just a symmetric, positive definite $m\times m$ matrix. Functions on $T$ are just $m$-vectors. To characterize the inner product (3.6), recall that $\langle R_s,R_t\rangle = R(s,t)$; this can be obtained by taking the $s,t$ entry of the matrix $RR^{-1}R$, in matrix notation (with $R_s$ and $R_t$ viewed as

column vectors), $R(s,t) = R_s^TR^{-1}R_t$. Thus, the inner product (3.6) can be written in matrix form as
$$\langle u,v\rangle = u^TR^{-1}v.$$
Example 3.1 is straightforward, but it is useful to keep in mind when attempting to characterize/identify the reproducing kernel Hilbert spaces of infinite-parameter Gaussian processes. The rough idea is that the inner product (3.6) can be viewed as the quadratic form of $R^{-1}$, where $R^{-1}$ is, in a suitable sense, an inverse of the integral operator associated with the kernel $R$.
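A small numpy check of the finite-dimensional picture (my own, not from the notes): with the inner product $\langle u,v\rangle = u^TR^{-1}v$, the section $R_s$ (the $s$-th column of $R$) reproduces point evaluation, $\langle R_s,\varphi\rangle = \varphi(s)$.

```python
import numpy as np

rng = np.random.default_rng(6)
m = 5
A = rng.standard_normal((m, m))
R = A @ A.T + np.eye(m)          # an arbitrary symmetric positive definite kernel on T = [m]
R_inv = np.linalg.inv(R)

phi = rng.standard_normal(m)     # any "function" on the finite index set
s = 2
R_s = R[:, s]                    # the section R_s(t) = R(s,t)

# Reproducing property: <R_s, phi> = R_s^T R^{-1} phi = phi(s).
print(R_s @ R_inv @ phi, phi[s])
```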

Example 3.2. Recall that the covariance function of the standard Wiener process $\{W_t\}_{t\in[0,1]}$ is $R(s,t) = s\wedge t$, where $\wedge$ denotes min. Each $s$-section $R_s$ has the property that $R_s(0) = 0$, so every function in $H_0$ takes the value 0 at $t=0$. Each $s$-section $R_s$ has derivative $R'_s$ equal to the indicator function $\mathbf 1_{[0,s]}$ of the interval $[0,s]$, so every $\varphi\in H_0$ is differentiable at all but finitely many points, and the derivative is bounded. Hence, for every $\varphi\in H_0$,

$$\langle R_s,\varphi\rangle = \varphi(s) = \int_0^1 R'_s(t)\varphi'(t)\,dt.$$
It follows that the inner product (3.6) on $H_0$ coincides with the usual inner product of derivatives, that is,

(3.15)   $\langle\varphi,\psi\rangle = \int_0^1\varphi'(s)\psi'(s)\,ds.$
Since linear combinations of the indicators $\mathbf 1_{[0,s]}$ are dense in $L^2[0,1]$, it follows that $H_0$ is dense in the space of all functions $\varphi:[0,1]\to\mathbb R$ with $L^2$-derivative $\varphi'$ and $\varphi(0)=0$. Consequently, this space is the reproducing kernel Hilbert space for Brownian motion. It is sometimes called the Cameron-Martin Hilbert space.
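As a quick numerical confirmation (not from the notes): with the Cameron-Martin inner product (3.15), the sections of the Wiener covariance reproduce the kernel, $\langle R_s,R_t\rangle = \int_0^1\mathbf 1_{[0,s]}\mathbf 1_{[0,t]} = \min(s,t)$. The grid resolution below is an arbitrary choice.

```python
import numpy as np

# Check that <R_s, R_t> = int_0^1 R_s'(u) R_t'(u) du = min(s, t) for the Wiener kernel.
u = np.linspace(0, 1, 100001)
du = u[1] - u[0]

def section_derivative(s):
    """Derivative of the section R_s(t) = min(s,t), i.e. the indicator of [0, s]."""
    return (u <= s).astype(float)

s, t = 0.3, 0.8
inner = np.sum(section_derivative(s) * section_derivative(t)) * du
print(inner, min(s, t))   # both approximately 0.3
```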

Example 3.3. The Ornstein-Uhlenbeck process $\{Y_t\}_{t\in\mathbb R}$ is best viewed as a process whose parameter space is the entire real line, since it is stationary. This space isn't compact, so Proposition 3.6 does not apply. Nevertheless, the reproducing kernel Hilbert space remains a useful object.

Proposition 3.8. The reproducing kernel Hilbert space $H$ associated with the double-exponential kernel $R(s,t) = e^{-|t-s|}$ consists of all functions $\varphi\in L^2(\mathbb R)$ whose derivative $\varphi'$ is also an element of $L^2(\mathbb R)$. The inner product of the RKHS is given by

(3.16)   $\langle\varphi,\psi\rangle = \int\varphi\psi + \int\varphi'\psi'.$

Proof. Two things must be proved: first, that for the covariance kernel of the Ornstein-Uhlenbeck process the inner product (3.6) reduces to (3.16), and second, that the space $H_0$ is dense in the Hilbert space of functions $\varphi\in L^2(\mathbb R)$ with derivative $\varphi'\in L^2(\mathbb R)$. I will skip the second, since we will have no further use for the result. For the first it suffices to prove that for any $s,t\in\mathbb R$,

$$\langle R_s,R_t\rangle = R(s,t) = \int R_sR_t + \int R'_sR'_t.$$
This is an entirely routine calculation (which you should do, to be sure that I haven't missed any factors of 2 or $\pi$). This proves that (3.16) holds for all functions in $H_0$. $\square$

Note: The density proof isn't hard – it can be done using standard elementary results of Fourier analysis. The Hilbert space $H$ is the Sobolev space $H^1(\mathbb R)$.

Example 3.4. Consider next a moving average process. Let $\{a_k\}_{0\le k\le K}$ be a finite sequence of real numbers, and let $\{\xi_n\}_{n\in\mathbb Z}$ be independent standard Gaussian random variables. The moving average process with averaging kernel $\{a_k\}$ is the centered Gaussian process $\{X_n\}_{n\in\mathbb Z}$ defined by
(3.17)   $X_n = \sum_{k=0}^K a_k\xi_{n-k}.$

Lemma 3.9.
(3.18)   $EX_nX_m = \dfrac{1}{2\pi}\int_{-\pi}^\pi e^{in\theta}e^{-im\theta}\,|A(e^{i\theta})|^2\,d\theta \quad\text{where}\quad A(e^{i\theta}) = \sum_{k=0}^K a_ke^{ik\theta}.$

This representation of the covariance kernel is called the spectral representation. The density $|A(e^{i\theta})|^2$ is called the spectral density of the moving average process. For many purposes – for instance, linear prediction – the spectral representation is more useful than the time-domain representation.

Proof. Since a moving average process is stationary in time, its covariance function $R(m,n) = EX_nX_m$ depends only on the difference $m-n$. Set $r_k = EX_kX_0$; then by the Fourier inversion formula, (3.18) is equivalent to the assertion that the Fourier series of the sequence $\{r_k\}$ sums (in $L^2[-\pi,\pi]$) to $|A(e^{i\theta})|^2$. That this is so can be seen as follows. First, since the random variables $\xi_n$ are i.i.d. standard Gaussians,

$$r_k = EX_kX_0 = E\Big(\sum_m a_m\xi_{k-m}\Big)\Big(\sum_n a_n\xi_{-n}\Big) = \sum_m a_{k+m}a_m.$$
Therefore,

$$\sum_k r_ke^{ik\theta} = \sum_k\sum_m a_{k+m}a_me^{ik\theta} = \Big(\sum_k\sum_m a_{k+m}e^{i(k+m)\theta}\,a_me^{-im\theta}\Big) = |A(e^{i\theta})|^2. \qquad\square$$

The spectral representation (3.18) of the covariance kernel can be used to give a corresponding spectral description of the reproducing kernel Hilbert space. Let's identify a section $R_m(n) = r_{n-m}$ with its Fourier series:
(3.19)   $R_m \;\longleftrightarrow\; \sum_k r_{k-m}e^{ik\theta} = e^{im\theta}|A(e^{i\theta})|^2.$
The inner product on the reproducing kernel Hilbert space is determined by the rule $\langle R_m,R_n\rangle = R(m,n) = r_{m-n}$. Now Lemma 3.9 implies that
$$r_{m-n} = \frac{1}{2\pi}\int_{-\pi}^\pi e^{im\theta}e^{-in\theta}|A(e^{i\theta})|^2\,d\theta = \frac{1}{2\pi}\int_{-\pi}^\pi\Big(e^{im\theta}|A(e^{i\theta})|^2\Big)\overline{\Big(e^{in\theta}|A(e^{i\theta})|^2\Big)}\,|A(e^{i\theta})|^{-2}\,d\theta.$$

Thus, with the identification (3.19), the inner product on the reproducing kernel Hilbert space corresponds to the inner product
(3.20)   $\langle f,g\rangle = \dfrac{1}{2\pi}\int_{-\pi}^\pi f(\theta)\overline{g(\theta)}\,|A(e^{i\theta})|^{-2}\,d\theta.$
Therefore, the identification (3.19) extends to a linear isometry of the reproducing kernel Hilbert space onto the Hilbert space $\hat H$ of all complex functions $f:[-\pi,\pi]\to\mathbb C$ with real Fourier coefficients that satisfy
$$\|f\|^2 = \frac{1}{2\pi}\int_{-\pi}^\pi |f(\theta)|^2\,|A(e^{i\theta})|^{-2}\,d\theta < \infty.$$
(Note: Even though this expression involves complex numbers, the Hilbert space $\hat H$ is actually a real Hilbert space, because functions $f(\theta)$ with real Fourier coefficients satisfy $\overline{f(\theta)} = f(-\theta)$, so the inner product can be rewritten as an integral over the interval $[0,\pi]$ using only the real parts of the functions.)

Finally, let's give a complete description of the reproducing kernel Hilbert space itself. Since $T=\mathbb Z$, the sections $R_m$ of the covariance are two-sided sequences, and the rule $\langle R_m,R_n\rangle = R(m,n) = r_{m-n}$ for the inner product can be expressed directly as
(3.21)   $r_{m-n} = \sum_k\sum_l r_{m-k}\,s_{k-l}\,r_{n-l};$
our problem is to determine the appropriate sequence $\{s_j\}$. But this can be gotten using the fact that the Fourier transform is an $L^2$-isometry (or alternatively using the Fourier inversion formula): in particular, the right side of (3.21) can be written as
$$\sum_k\sum_l r_{m-k}\,s_{k-l}\,r_{n-l} = \frac{1}{2\pi}\int_{-\pi}^\pi e^{im\theta}|A(e^{i\theta})|^2\,S(e^{i\theta})\,e^{-in\theta}|A(e^{i\theta})|^2\,d\theta,$$
where $S(e^{i\theta}) = \sum_k s_ke^{ik\theta}$. Here we have used the fact that the Fourier series of the sequence $\{r_k\}$ is $|A(e^{i\theta})|^2$. This shows that
$$S(e^{i\theta}) = |A(e^{i\theta})|^{-2},$$
and so the coefficients $s_k$ in the inner product (3.21) can be recovered by another Fourier inversion:
$$s_k = \frac{1}{2\pi}\int_{-\pi}^\pi e^{-ik\theta}|A(e^{i\theta})|^{-2}\,d\theta.$$

Exercise 3.3. Assume that the polynomial $A(z) = \sum_{k=0}^K a_kz^k$ has $K$ distinct roots $\zeta_j$ all located outside the unit disk. Use partial fractions to give explicit formulas for the coefficients $s_k$. What happens when some of the roots of $A(z)$ are inside the unit circle?
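The identity (3.18) can be verified numerically for any particular averaging kernel; the following sketch (my own, with an arbitrary kernel $a$) compares the time-domain autocovariance $\sum_m a_{k+m}a_m$ with a Riemann sum for $\frac{1}{2\pi}\int e^{ik\theta}|A(e^{i\theta})|^2\,d\theta$.

```python
import numpy as np

a = np.array([1.0, 0.5, -0.3, 0.2])        # arbitrary averaging kernel a_0, ..., a_K

def r(k):
    """Autocovariance r_k = sum_m a_{k+m} a_m (zero when |k| > K)."""
    k = abs(k)
    return float(np.sum(a[k:] * a[:len(a) - k])) if k < len(a) else 0.0

theta = np.linspace(-np.pi, np.pi, 200001)
A = np.polyval(a[::-1], np.exp(1j * theta))   # A(e^{i theta}) = sum_k a_k e^{i k theta}
spec = np.abs(A) ** 2                         # spectral density |A(e^{i theta})|^2

for k in range(4):
    # Riemann sum for (1/2pi) * int e^{i k theta} |A|^2 d theta (real since |A|^2 is even).
    integral = (np.sum(np.exp(1j * k * theta) * spec) * (theta[1] - theta[0])).real / (2 * np.pi)
    print(k, r(k), round(integral, 6))
```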

4. SINGULARITY AND ABSOLUTE CONTINUITY

Let $\{X_t\}_{t\in T}$ be a centered, nondegenerate Gaussian process with (positive definite) covariance function $R(s,t)$, and assume that the index set $T$ is compact relative to the canonical metric $d$. Assume also that the sample paths $t\mapsto X_t$ are continuous. Given a non-random, continuous function $f:T\to\mathbb R$ – the "signal" – when can one distinguish between $\{X_t+f_t\}_{t\in T}$ and $\{X_t\}_{t\in T}$?

More precisely, let $P$ and $Q$ be probability measures on the measurable space $(\Omega,\mathcal F)$ such that under $P$ the process $(X_t)_{t\in T}$ is centered Gaussian with covariance function $R$, while under $Q$ the process $(X_t-f_t)_{t\in T}$ is centered Gaussian with covariance function $R$. Assume that the $\sigma$-algebra $\mathcal F$ is generated by the random variables $(X_t)_{t\in T}$. Recall that there is a linear isometry $J_X:H\to L^2(P)$ that maps $R_s$ to the coordinate variable $X_s$ (see equation (3.8)).

Theorem 5. The probability measures $P,Q$ are mutually singular on $(\Omega,\mathcal F)$ unless $f$ is an element of the reproducing kernel Hilbert space $H$ of $R$. If $f\in H$ then $P,Q$ are mutually absolutely continuous, with likelihood ratio (Radon-Nikodym derivative)

(4.1)   $\dfrac{dQ}{dP} = \exp\{J_X(f) - \|f\|^2/2\}.$

Before turning to the proof of Theorem 5 let’s look at the related but at least superficially sim- pler problem of distinguishing signal plus noise from noise for . Here T = ￿, and the covariance function is the simplest of all possible covariance functions, to wit, R(i , j )=δi ,j : thus, the Gaussian process (Xn )n ￿ consists of independent, identically distributed standard Gauss- ∈ ian random variables. The “signal” is a fixed non-random sequence (a n )n ￿ of real numbers. ∈ Let the probability measures µ and ν be the joint distributions of the sequences (Xn )n ￿ and ∈ (Xn + a n )n ￿, respectively. ∈ 2 Theorem 6. (Kakutani) The measures µ and ν are absolutely continuous if n 1 a n < , and 2 2 ≥ ∞ mutually singular if n 1 a n = .If n 1 a n < then the series n 1 a n yn converges almost surely under µ and ≥ ∞ ≥ ∞ ≥ ￿ ￿ ￿ ￿ d ν 1 2 (4.2) ((yn )n 1)=L = exp a n yn a . d µ 2 n ≥ ∞ ￿n 1 − n 1 ￿ ￿≥ ￿≥ Proof. First some generalities. Let µ,ν be two probability measures on , , where is (￿∞ ) the usual Borel field, that is, the smallest σ algebra that contains all cylinder￿∞ sets. For￿ each∞ − n ￿ denote by n the σ algebra generated by the first n coordinate variables. Suppose that the∈ restrictions of￿µ and ν −to n are mutually absolutely continuous for each n (this is the case, for instance, when µ and ν are￿ the laws of any two nondegenerate Gaussian processes, or when µ and ν are product measures where each factor is absolutely continuous relative to Lebesgue measure on ￿). Then the sequence d ν L n : = d µ n ￿ ￿￿ of likelihood ratios is a positive martingale under µ relative to the filtration n n 1, and hence ≥ converges to a nonnegative limit L almost surely. (EXERCISE: Verify that {L￿n is} a martingale.) There are two interesting cases: ∞

(a) EµL = 1, and ∞ (b) µ L = 0 = 1. { ∞ } 18 STEVEN P.LALLEY

These are not the only possibilities, in general, but in both the Gaussian cases and in the case of product measures either (a) or (b) will hold. In case (a) the martingale L n is closed, that is, 1 L n = E (L n ) for each n and L n L in L . In this case the measures µ and ν are absolutely ∞ ∞ continuous| on￿ , with Radon-Nikodym→ derivative L , because for any F n , ￿∞ ∞ ∈￿ EµL 1F = EµL n 1F = ν(F ) ∞ and so the measures ν and F EµL 1F agree on n 1 n . In case (b), on the other hand, the ∞ ≥ measures µ and ν are singular￿→ on n . To see this, first∪ note￿ that if L n 0 a.s. under µ then there is a sequence of constants ￿n > 0 such￿ that such that lim￿n = 0 and such→ that with µ probability 1, − µ(L n ￿n eventually)=1. ≤ (EXERCISE: Prove this.) Define τm to be the first time n m that L n ￿n , and let Fm be the event ≥ ≤ τm < . These events are nested: Fm Fm+1 for each m. By construction, µ(Fm )=1 for every m{ 1,∞ and} hence µ( m Fm )=1. But for each⊃ m 1, ≥ ∩ ≥ ∞ ∞ ν(Fm )= ν(τm = n)= EµL n 1 τm =n = EµL τm ￿m . { } n=m n=m ≤ ￿ ￿ Consequently, ν( m Fm )=0. ∩ Now to Kakutani’s theorem: Under µ the coordinate variables yn are i.i.d. standard Gaussian, whereas under ν they are independent Gaussians with a n and variances 1. Thus, the likelihood ratio on n is ￿ 2 L n = exp An z n An /2 where n { − } n 2 2 An = a j and z n = a j yj /An . j =1 j =1 ￿ ￿ 2 2 If A = a j < then the series j 1 a j yj converges almost surely and in L (µ), and the limit ≥ random variable∞ is centered Gaussian with variance A. Consequently, L n L where half of the ￿ ￿2 → ∞ theorem. Suppose then that a j = . Under µ the random variable z n is standard Gaussian, and so ∞ ￿ 2 lim µ An z n > A /4 0. n n = →∞ { } This implies that L n 0 in µ probability. But the martingale convergence theorem implies that L n L almost surely→ under−µ, and so it follows that L = 0 a.s. µ. → ∞ ∞ ￿

Proof of Theorem 5. Since $T$ is compact it has a countable dense subset $D = \{s_1,s_2,\dots\}$. The strategy of the proof will be to reduce the problem to the situation covered by Kakutani's theorem

by applying Gram-Schmidt to the random variables $X_{s_1},X_{s_2},\dots$.

Assume then that under $P$ the process $\{X_t\}_{t\in T}$ is centered Gaussian with covariance function $R(s,t)$, and that under $Q$ the process $\{X_t-f_t\}_{t\in T}$ is centered Gaussian with covariance function $R(s,t)$. Assume also that under both $P$ and $Q$ the sample paths $t\mapsto X_t$ are continuous; this implies that the $\sigma$-algebra $\mathcal F$ generated by the entire family $\{X_t\}_{t\in T}$ coincides with the $\sigma$-algebra generated by $\{X_t\}_{t\in D}$, and so
$$\mathcal F = \sigma\Big(\cup_{n\ge 1}\mathcal F_n\Big)$$

where $\mathcal F_n = \sigma((X_{s_i})_{i\le n})$. Let $\xi_1,\xi_2,\dots$ be the random variables obtained by applying the Gram-Schmidt algorithm to the sequence $X_{s_1},X_{s_2},\dots$, that is,
(4.3)   $\xi_1 = X_{s_1}/R(s_1,s_1)^{1/2},$

$$\xi_2 = \frac{X_{s_2} - R(s_2,s_1)X_{s_1}/R(s_1,s_1)}{\big(R(s_2,s_2) - R(s_2,s_1)^2/R(s_1,s_1)\big)^{1/2}},\quad\text{etc.}$$

Since we have assumed that the covariance kernel $R(s,t)$ is positive definite, the random variables $\{X_{s_i}\}_{i\ge 1}$ are linearly independent in $L^2(P)$, and hence the Gram-Schmidt mapping (4.3) is well-defined (that is, there are no divisions by 0). Observe that for each $n\ge 1$ the Gram-Schmidt equations (4.3) can be written in matrix form:

(4.4)   $\begin{pmatrix}\xi_1\\ \xi_2\\ \vdots\\ \xi_n\end{pmatrix} = T_n\begin{pmatrix}X_{s_1}\\ X_{s_2}\\ \vdots\\ X_{s_n}\end{pmatrix},$

where for each $n$ the matrix $T_n$ is lower triangular and invertible. Consequently, for each $n$,

$$\mathcal F_n = \sigma((\xi_i)_{i\le n}).$$
Thus, the random variables $\xi_n$ generate the $\sigma$-algebra $\mathcal F$. Note also that the entries of the matrices $T_n$ are determined solely by the entries $R(s_i,s_j)$ with $i,j\le n$, and that for each $n$ the $n\times n$ minor of $T_{n+1}$ is $T_n$, that is,

$$T_n = \begin{pmatrix} b_{1,1} & 0 & 0 & \cdots & 0\\ b_{2,1} & b_{2,2} & 0 & \cdots & 0\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ b_{n,1} & b_{n,2} & b_{n,3} & \cdots & b_{n,n}\end{pmatrix}$$
for fixed scalars $b_{i,j}$.

By construction, the random variables $(\xi_n)_{n\ge 1}$ are independent, identically distributed standard Gaussians under $P$. Under $Q$, however, they are independent Gaussians with common variance 1, but with means $a_n = E_Q\xi_n$ depending on the function $f$. These means can be computed by taking $E_Q$-expectations on both sides in (4.4): this implies that for each $n$,

(4.5)   $\begin{pmatrix}a_1\\ a_2\\ \vdots\\ a_n\end{pmatrix} = T_n\begin{pmatrix}f_{s_1}\\ f_{s_2}\\ \vdots\\ f_{s_n}\end{pmatrix}.$

In particular, for each $n$ the vector $(a_i)_{i\le n}$ is gotten by applying the Gram-Schmidt transformation to the vector $(f_{s_i})_{i\le n}$.

Suppose that $f\in H$. Let $J_X:H\to L^2(P)$ be the canonical linear isometry from the reproducing kernel Hilbert space $H$ into $L^2(P)$ (recall that this is determined by the identification $J_X(R_s) = X_s$), and let $\psi_n = J_X^{-1}\xi_n$. Since $J_X$ is a linear isometry, the sequence $(\psi_n)_{n\ge 1}$ constitutes an orthonormal basis of $H$, and the functions $\psi_n$ are obtained from the sequence $R_{s_1},R_{s_2},\dots$ by

Gram-Schmidt, and for each $n$ the linear transformation taking $(R_{s_i})_{i\le n}$ to $(\psi_i)_{i\le n}$ is the same as that taking $(X_{s_i})_{i\le n}$ to $(\xi_i)_{i\le n}$. This implies that
$$a_n = \langle f,\psi_n\rangle.$$
Since $f = \sum_{n\ge 1}\langle f,\psi_n\rangle\psi_n$, it follows that $\sum_{n\ge 1}a_n^2 = \|f\|^2 < \infty$. Thus, Kakutani's theorem implies that in this case the measures $P$ and $Q$ are mutually absolutely continuous, and the likelihood ratio is

$$\frac{dQ}{dP} = \exp\Big\{\sum_{n\ge 1}a_n\xi_n - \frac12\sum_{n\ge 1}a_n^2\Big\}.$$
The formula (4.1) follows, because
$$J_X(f) = \sum a_n\xi_n \quad\text{and}\quad \|f\|^2 = \sum a_n^2.$$

Now suppose that $f\notin H$. For each $n\ge 1$, let $f^n\in H$ be the unique linear combination in $\mathrm{span}((R_{s_i})_{i\le n})$ such that $f^n(s_i) = f(s_i)$ for each $i\le n$. (There is such a linear combination, because for each $n$ the matrix $(R(s_i,s_j))_{i,j\le n}$ is positive definite, and hence invertible.) For each $n$ the means $(a_i)_{i\le n}$ are related to the entries $(f_{s_i})_{i\le n}$ by (4.5), and so
$$\begin{pmatrix}a_1\\ a_2\\ \vdots\\ a_n\end{pmatrix} = T_n\begin{pmatrix}f^n_{s_1}\\ f^n_{s_2}\\ \vdots\\ f^n_{s_n}\end{pmatrix}.$$
But since $f^n$ is a linear combination of the functions $(R_{s_i})_{i\le n}$, it follows that for $1\le m\le n$,
$$a_m = \langle f^n,\psi_m\rangle,$$
and so
$$\sum_{j=1}^n a_j^2 = \|f^n\|^2.$$
Now if it were the case that $\sum_{n=1}^\infty a_n^2 < \infty$ then $g := \sum_{n\ge 1}a_n\psi_n \in H$, and the last two equations would imply that for each $n$ the function $f^n$ is the orthogonal projection of $g$ on the subspace spanned by the functions $(R_{s_i})_{i\le n}$. But this would imply that $g(s_i) = f(s_i)$ for every $i\ge 1$, and since the sequence $(s_i)_{i\ge 1}$ is dense in $T$ it would follow that $f = g$ everywhere. Since we have assumed that $f\notin H$ this cannot be the case. Therefore, $\sum_{n=1}^\infty a_n^2 = \infty$, and Kakutani's theorem implies that $P$ and $Q$ are mutually singular. $\square$

UNIVERSITY OF CHICAGO, DEPARTMENT OF STATISTICS, USA