
Gaussian Processes


The class of Gaussian processes is one of the most widely used families of stochastic processes for modeling dependent data observed over time, or space, or time and space. The popularity of such processes stems primarily from two essential properties. First, a Gaussian process is completely determined by its mean and covariance functions. This property facilitates model fitting as only the first- and second-order moments of the process require specification. Second, solving the prediction problem is relatively straightforward. The best predictor of a Gaussian process at an unobserved location is a linear function of the observed values and, in many cases, these functions can be computed rather quickly using recursive formulas.

The fundamental characterization, as described below, of a Gaussian process is that all the finite-dimensional distributions have a multivariate normal (or Gaussian) distribution. In particular, the distribution of each observation must be normally distributed. There are many applications, however, where this assumption is not appropriate. For example, consider observations x_1, ..., x_n, where x_t denotes a 1 or 0, depending on whether or not the air pollution on the tth day at a certain site exceeds a government standard. A model for these data should only allow the values 0 and 1 for each daily observation, thereby precluding the normality assumption imposed by a Gaussian model. Nevertheless, Gaussian processes can still be used as building blocks to construct more complex models that are appropriate for non-Gaussian data. See [3–5] for more on modeling non-Gaussian data.

Basic Properties

A real-valued stochastic process {X_t, t in T}, where T is an index set, is a Gaussian process if all the finite-dimensional distributions have a multivariate normal distribution. That is, for any choice of distinct values t_1, ..., t_k in T, the random vector X = (X_{t_1}, ..., X_{t_k})' has a multivariate normal distribution with mean vector m = EX and covariance matrix Σ = cov(X, X), which will be denoted by

    X ~ N(m, Σ)

Provided the covariance matrix Σ is nonsingular, the random vector X has a Gaussian probability density given by

    f_X(x) = (2π)^{-n/2} (det Σ)^{-1/2} exp{ -(1/2)(x - m)' Σ^{-1} (x - m) }    (1)

In environmental applications, the subscript t will typically denote a point in time, or space, or space and time. For simplicity, we shall restrict attention to the case of time series for which t represents time. In such cases, the index set T is usually [0, ∞) for time series recorded continuously, or {0, 1, ...} for time series recorded at equally spaced time units. The mean and covariance functions of a Gaussian process are defined by

    μ(t) = E X_t    (2)

and

    γ(s, t) = cov(X_s, X_t)    (3)

respectively. While Gaussian processes depend only on these two quantities, modeling can be difficult without introducing further simplifications on the form of the mean and covariance functions. The assumption of stationarity frequently provides the proper level of simplification without sacrificing much generalization. Moreover, after applying elementary transformations to the data, the assumption of stationarity of the transformed data is often quite plausible.

A Gaussian time series {X_t} is said to be stationary if

1. μ(t) = E X_t = μ is independent of t, and
2. γ(t + h, t) = cov(X_{t+h}, X_t) is independent of t for all h.

For stationary processes, it is conventional to express the covariance function as a function on T instead of on T x T. That is, we define γ(h) = cov(X_{t+h}, X_t) and call it the autocovariance function of the process. For stationary Gaussian processes {X_t}, we have

3. X_t ~ N(μ, γ(0)) for all t, and
4. (X_{t+h}, X_t)' has a bivariate normal distribution with covariance matrix

    [ γ(0)  γ(h) ]
    [ γ(h)  γ(0) ]

for all t and h.

A general stochastic process {X_t} satisfying conditions 1 and 2 is said to be weakly or second-order stationary. The first- and second-order moments of weakly stationary processes are invariant with respect to time translations. A stochastic process {X_t} is strictly stationary if the distribution of (X_{t_1}, ..., X_{t_n}) is the same as that of (X_{t_1+s}, ..., X_{t_n+s}) for any s. In other words, the distributional properties of the time series are unchanged by time shifts. Since a multivariate normal distribution is completely determined by its mean vector and covariance matrix, a weakly stationary Gaussian process is also strictly stationary.
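Concretely, because every finite-dimensional distribution of a stationary Gaussian process is multivariate normal with mean μ1 and an n x n covariance matrix whose (i, j) entry is γ(i − j), a realization can be simulated directly from μ and γ(·). The following is a minimal sketch in Python/NumPy; the function name and the AR(1)-type autocovariance used here are purely illustrative assumptions, not taken from the text.

    import numpy as np

    def sample_stationary_gp(mu, gamma, n, rng):
        """Draw one realization (X_1, ..., X_n) of a stationary Gaussian process
        with mean mu and autocovariance function gamma(h)."""
        h = np.arange(n)
        # Covariance matrix with (i, j) entry gamma(|i - j|) (a Toeplitz matrix).
        Gamma = gamma(np.abs(h[:, None] - h[None, :]))
        L = np.linalg.cholesky(Gamma)            # Gamma = L L'
        return mu + L @ rng.standard_normal(n)   # X ~ N(mu * 1, Gamma)

    # Illustrative choice: AR(1)-type autocovariance gamma(h) = sigma2 * phi^|h| / (1 - phi^2)
    phi, sigma2 = 0.6, 1.0
    gamma = lambda h: sigma2 * phi ** np.abs(h) / (1.0 - phi ** 2)
    x = sample_stationary_gp(mu=0.0, gamma=gamma, n=200, rng=np.random.default_rng(0))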

An autocovariance function γ(·) has the properties:

1. γ(0) ≥ 0,
2. |γ(h)| ≤ γ(0) for all h,
3. γ(h) = γ(−h), i.e. γ(·) is an even function.

Autocovariance functions have another fundamental property, namely that of non-negative definiteness,

    Σ_{i,j=1}^{n} a_i γ(t_i − t_j) a_j ≥ 0    (4)

for all positive integers n, real numbers a_1, ..., a_n, and t_1, ..., t_n in T. Note that the expression on the left of (4) is merely the variance of a_1 X_{t_1} + ... + a_n X_{t_n} and hence must be non-negative. Conversely, if a function γ(·) is non-negative definite and even, then it must be an autocovariance function of some stationary Gaussian process.

Gaussian Linear Processes

If {X_t, t = 0, ±1, ±2, ...} is a stationary Gaussian process with mean 0, then the Wold decomposition allows X_t to be expressed as a sum of two independent components,

    X_t = Σ_{j=0}^{∞} ψ_j Z_{t−j} + V_t    (5)

where {Z_t} is a sequence of independent and identically distributed (iid) normal random variables with mean 0 and variance σ², {ψ_j} is a sequence of square summable coefficients with ψ_0 = 1, and {V_t} is a deterministic process that is independent of {Z_t}. The Z_t are referred to as innovations and are defined by Z_t = X_t − E(X_t | X_{t−1}, X_{t−2}, ...). A process {V_t} is deterministic if V_t is completely determined by its past history {V_s, s < t}. When the deterministic component is absent, {X_t} is called a Gaussian linear process and has the representation

    X_t = Σ_{j=0}^{∞} ψ_j Z_{t−j}    (6)

The autocovariance of {X_t} has the form

    γ(h) = σ² Σ_{j=0}^{∞} ψ_j ψ_{j+h}    (7)

The class of autoregressive (AR) processes, and its extensions, autoregressive moving-average (ARMA) processes, is dense in the class of Gaussian linear processes. A Gaussian AR(p) process satisfies the recursions

    X_t = φ_1 X_{t−1} + ... + φ_p X_{t−p} + Z_t    (8)

where {Z_t} is an iid sequence of N(0, σ²) random variables, and the polynomial φ(z) = 1 − φ_1 z − ... − φ_p z^p has no zeros inside or on the unit circle. The AR(p) process has a linear representation (6) where the coefficients ψ_j are found as functions of the φ_j (see [2]). Moreover, for any Gaussian linear process, there exists an AR(p) process such that the difference in the two autocovariance functions can be made arbitrarily small for all lags. In fact, the autocovariances can be matched up perfectly for the first p lags.
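As a quick numerical check of (7), the sketch below (Python/NumPy; the parameter values are illustrative assumptions) uses the linear representation of a Gaussian AR(1), for which ψ_j = φ^j, and compares the truncated sum in (7) with the standard closed form γ(h) = σ² φ^h / (1 − φ²) for h ≥ 0.

    import numpy as np

    # Check (7) numerically for a Gaussian AR(1): X_t = phi X_{t-1} + Z_t.
    # Its linear representation (6) has psi_j = phi**j, so (7) gives
    # gamma(h) = sigma2 * sum_j phi**j * phi**(j+h) = sigma2 * phi**h / (1 - phi**2).
    phi, sigma2 = 0.6, 2.0
    J = 500                                  # truncate the infinite sum at J terms
    psi = phi ** np.arange(J)

    def gamma_linear(h):
        """Autocovariance (7) of the linear process, truncated at J terms."""
        return sigma2 * np.sum(psi[: J - h] * psi[h:])

    for h in range(4):
        closed_form = sigma2 * phi ** h / (1.0 - phi ** 2)
        print(h, gamma_linear(h), closed_form)   # the two columns agree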


Prediction

Recall that if two random vectors X_1 and X_2 have a joint normal distribution, i.e.

    ( X_1 )       ( ( m_1 )   ( Σ_11  Σ_12 ) )
    ( X_2 )  ~  N ( ( m_2 ) , ( Σ_21  Σ_22 ) )

and Σ_22 is nonsingular, then the conditional distribution of X_1 given X_2 has a multivariate normal distribution with mean

    m_{X_1|X_2} = m_1 + Σ_12 Σ_22^{-1} (X_2 − m_2)    (9)

and covariance matrix

    Σ_{X_1|X_2} = Σ_11 − Σ_12 Σ_22^{-1} Σ_21    (10)

The key observation here is that the best mean square error predictor of X_1 in terms of X_2 (i.e. the multivariate function g(X_2) that minimizes E||X_1 − g(X_2)||², where ||·|| is Euclidean distance) is E(X_1 | X_2) = m_{X_1|X_2}, which is a linear function of X_2. Also, the covariance matrix of the prediction error, Σ_{X_1|X_2}, does not depend on the value of X_2. These results extend directly to the prediction problem for Gaussian processes.

Suppose {X_t, t = 1, 2, ...} is a stationary Gaussian process with mean μ and autocovariance function γ(·), and that, based on the random vector consisting of the first n observations, X_n = (X_1, ..., X_n)', we wish to predict the next observation X_{n+1}. Prediction for other lead times is analogous to this special case. Applying the formula in (9), the best one-step-ahead predictor of X_{n+1} is given by

    X̂_{n+1} := E(X_{n+1} | X_1, ..., X_n) = μ + φ_{n1}(X_n − μ) + ... + φ_{nn}(X_1 − μ)    (11)

where

    (φ_{n1}, ..., φ_{nn})' = Γ_n^{-1} γ_n    (12)

Γ_n = cov(X_n, X_n), and γ_n = cov(X_{n+1}, X_n) = (γ(1), ..., γ(n))'. The mean square error of prediction is given by

    v_n = γ(0) − γ_n' Γ_n^{-1} γ_n    (13)

These formulas assume that Γ_n is nonsingular. If Γ_n is singular, then there is a linear relationship among X_1, ..., X_n and the prediction problem can then be recast by choosing a generating prediction subset consisting of linearly independent variables. The covariance matrix of this prediction subset will be nonsingular. A mild and easily verifiable condition for ensuring nonsingularity of Γ_n for all n is that γ(h) → 0 as h → ∞ with γ(0) > 0 (see [1]).
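As a concrete illustration of (11)-(13), the following sketch forms Γ_n and γ_n from the autocovariance function and solves the linear system in (12) directly (Python/NumPy; the function name, the example data, and the AR(1)-type autocovariance are assumptions made for illustration only).

    import numpy as np

    def one_step_prediction(gamma, x, mu):
        """Best one-step predictor of X_{n+1} from x = (x_1, ..., x_n), per (11)-(13)."""
        n = len(x)
        lags = np.arange(n)
        Gamma_n = gamma(np.abs(lags[:, None] - lags[None, :]))   # cov(X_n, X_n)
        gamma_n = gamma(np.arange(1, n + 1))                     # (gamma(1), ..., gamma(n))'
        phi_n = np.linalg.solve(Gamma_n, gamma_n)                # (12), without explicit inversion
        # In (11), phi_{n1} multiplies X_n, ..., phi_{nn} multiplies X_1, hence the reversal.
        x_pred = mu + phi_n @ (x[::-1] - mu)
        v_n = gamma(0) - gamma_n @ phi_n                         # (13)
        return x_pred, v_n

    phi, sigma2 = 0.6, 1.0
    gamma = lambda h: sigma2 * phi ** np.abs(h) / (1.0 - phi ** 2)
    x = np.array([0.3, -1.2, 0.8, 0.5])
    print(one_step_prediction(gamma, x, mu=0.0))   # for this AR(1)-type gamma: (phi*x_n, sigma2)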

While (12) and (13) completely solve the prediction problem, these equations require the inversion of an n x n covariance matrix, which may be difficult and time consuming for large n. The Durbin–Levinson algorithm (see [1]) allows one to compute the coefficient vector φ_n = (φ_{n1}, ..., φ_{nn})' and the one-step prediction error v_n recursively from φ_{n−1}, v_{n−1}, and the autocovariance function.

The Durbin–Levinson Algorithm

The coefficients φ_n in the calculation of the one-step predictor (11) and the mean square error of prediction (13) can be computed recursively from the equations

    φ_{nn} = [ γ(n) − Σ_{j=1}^{n−1} φ_{n−1,j} γ(n−j) ] / v_{n−1}

    (φ_{n,1}, ..., φ_{n,n−1})' = (φ_{n−1,1}, ..., φ_{n−1,n−1})' − φ_{nn} (φ_{n−1,n−1}, ..., φ_{n−1,1})'

    v_n = v_{n−1} (1 − φ_{nn}²)    (14)

where φ_{11} = γ(1)/γ(0) and v_0 = γ(0).

If {X_t} follows the AR(p) process in (8), then the recursions simplify a great deal. In particular, for n > p, the coefficients are φ_{nj} = φ_j for j = 1, ..., p and φ_{nj} = 0 for j > p, giving

    X̂_{n+1} = φ_1 X_n + ... + φ_p X_{n+1−p}    (15)

with v_n = σ².

The sequence of coefficients {φ_{jj}, j ≥ 1} is called the partial autocorrelation function and is a useful tool for model identification. The partial autocorrelation at lag j is interpreted as the correlation between X_1 and X_{j+1} after correcting for the intervening observations X_2, ..., X_j. Specifically, φ_{jj} is the correlation of the two residuals obtained by regressing X_1 and X_{j+1} on the intermediate observations X_2, ..., X_j.
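A compact implementation of the recursions (14) is sketched below (Python/NumPy; the function name and the illustrative autocovariance are assumptions). It returns the final coefficient vector φ_n, the mean square errors v_0, ..., v_n, and, as a by-product, the partial autocorrelations φ_{11}, ..., φ_{nn} discussed above.

    import numpy as np

    def durbin_levinson(gamma, n):
        """Run (14) for lags 1..n.  gamma: autocovariance function gamma(h).
        Returns (phi, v, pacf): phi = (phi_{n1}, ..., phi_{nn}),
        v = (v_0, ..., v_n), pacf = (phi_{11}, ..., phi_{nn})."""
        g = gamma(np.arange(n + 1)).astype(float)
        phi = np.zeros(0)
        v = [g[0]]                                    # v_0 = gamma(0)
        pacf = []
        for k in range(1, n + 1):
            # phi_kk = (gamma(k) - sum_j phi_{k-1,j} gamma(k-j)) / v_{k-1}
            phi_kk = (g[k] - np.dot(phi, g[k - 1:0:-1])) / v[-1]
            phi = np.append(phi - phi_kk * phi[::-1], phi_kk)
            v.append(v[-1] * (1.0 - phi_kk ** 2))     # v_k = v_{k-1}(1 - phi_kk^2)
            pacf.append(phi_kk)
        return phi, np.array(v), np.array(pacf)

    # For the AR(1)-type gamma used above, the partial autocorrelations
    # should equal phi at lag 1 and be (numerically) zero thereafter.
    phi1, sigma2 = 0.6, 1.0
    gamma = lambda h: sigma2 * phi1 ** np.abs(h) / (1.0 - phi1 ** 2)
    print(durbin_levinson(gamma, 5)[2])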

Of particular interest is the relationship between φ_{nn} and the reduction in the one-step mean square error as the number of predictors is increased from n − 1 to n. The one-step prediction error has the following decomposition in terms of the partial autocorrelation function:

    v_n = γ(0) (1 − φ_{11}²) ··· (1 − φ_{nn}²)    (16)

For a Gaussian process, X_{n+1} − X̂_{n+1} is normally distributed with mean 0 and variance v_n. Thus,

    X̂_{n+1} ± z_{1−α/2} v_n^{1/2}

constitute (1 − α)100% prediction bounds for the observation X_{n+1}, where z_{1−α/2} is the (1 − α/2) quantile of the standard normal distribution. In other words, X_{n+1} lies between the bounds X̂_{n+1} ± z_{1−α/2} v_n^{1/2} with probability 1 − α.

Estimation for Gaussian Processes

One of the advantages of Gaussian models is that an explicit and closed form of the likelihood is readily available. Suppose that {X_t, t = 1, 2, ...} is a stationary Gaussian time series with mean μ and autocovariance function γ(·). Denote the data vector by X_n = (X_1, ..., X_n)' and the vector of one-step predictors by X̂_n = (X̂_1, ..., X̂_n)', where X̂_1 = μ and X̂_j = E(X_j | X_1, ..., X_{j−1}) for j ≥ 2. If Γ_n denotes the covariance matrix of X_n, which we assume is nonsingular, then the likelihood of X_n is

    L(Γ_n, μ) = (2π)^{−n/2} (det Γ_n)^{−1/2} exp{ −(1/2)(X_n − μ1)' Γ_n^{−1} (X_n − μ1) }    (17)

where 1 = (1, ..., 1)'. Typically, Γ_n will be expressible in terms of a finite number of unknown parameters, β_1, ..., β_r, so that the maximum likelihood estimators of these parameters and μ are those values that maximize L for the given dataset. Under mild regularity assumptions, the resulting maximum likelihood estimators are approximately normally distributed with covariance matrix given by the inverse of the Fisher information.

In most settings, direct closed-form maximization of L with respect to the parameter set is not achievable. In order to maximize L using numerical methods, either derivatives or repeated calculation of the function are required. For moderate to large sample sizes n, calculation of both the determinant of Γ_n and the quadratic form in the exponential of L can be difficult and time consuming. On the other hand, there is a useful representation of the likelihood in terms of the one-step prediction errors and their mean square errors. By the form of X̂_n, we can write

    X_n − X̂_n = A_n X_n    (18)

where A_n is a lower triangular square matrix with ones on the diagonal. Inverting this expression, we have

    X_n = C_n (X_n − X̂_n)    (19)

where C_n is also lower triangular with ones on the diagonal. Since X_j − E(X_j | X_1, ..., X_{j−1}) is uncorrelated with X_1, ..., X_{j−1}, it follows that the vector X_n − X̂_n consists of uncorrelated, and hence independent, normal random variables with mean 0 and variance v_{j−1}, j = 1, ..., n. Taking covariances on both sides of (19) and setting D_n = diag{v_0, ..., v_{n−1}}, we find that

    Γ_n = C_n D_n C_n'    (20)

and

    (X_n − μ1)' Γ_n^{−1} (X_n − μ1) = (X_n − X̂_n)' D_n^{−1} (X_n − X̂_n) = Σ_{j=1}^{n} (X_j − X̂_j)² / v_{j−1}    (21)

It follows that det Γ_n = v_0 v_1 ··· v_{n−1}, so that the likelihood reduces to

    L(Γ_n, μ) = (2π)^{−n/2} (v_0 v_1 ··· v_{n−1})^{−1/2} exp{ −(1/2) Σ_{j=1}^{n} (X_j − X̂_j)² / v_{j−1} }    (22)

The calculation of the one-step prediction errors and their mean square errors required in the computation of L based on (22) can be simplified further for a variety of time series models such as ARMA processes. We illustrate this for an AR process.
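Combining the Durbin–Levinson recursion (14) with the innovations form (22) gives an O(n²) evaluation of the Gaussian log-likelihood that never forms or inverts Γ_n. The following self-contained sketch (Python/NumPy; the function name is an assumption) computes the one-step predictors and mean square errors recursively and returns the log of (22).

    import numpy as np

    def gaussian_loglik(x, mu, gamma):
        """Log of the likelihood (22) for data x under a stationary Gaussian model
        with mean mu and autocovariance function gamma, via the innovations form."""
        x = np.asarray(x, dtype=float) - mu        # work with the mean-corrected series
        n = len(x)
        g = gamma(np.arange(n)).astype(float)
        phi = np.zeros(0)                          # current predictor coefficients
        v_prev = g[0]                              # v_0 = gamma(0)
        loglik = 0.0
        for j in range(n):
            # one-step predictor of x[j] from x[0..j-1]; phi holds (phi_{j1}, ..., phi_{jj})
            x_hat = np.dot(phi, x[j - 1::-1]) if j > 0 else 0.0
            loglik += -0.5 * (np.log(2 * np.pi * v_prev) + (x[j] - x_hat) ** 2 / v_prev)
            if j < n - 1:                          # update phi and v via (14)
                phi_kk = (gamma(j + 1) - np.dot(phi, g[j:0:-1])) / v_prev
                phi = np.append(phi - phi_kk * phi[::-1], phi_kk)
                v_prev = v_prev * (1.0 - phi_kk ** 2)
        return loglik

    # Example use with the illustrative AR(1)-type autocovariance:
    # gaussian_loglik(x, 0.0, lambda h: 0.6 ** np.abs(h) / (1.0 - 0.36))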

Gaussian Likelihood for an AR(p) Process

If {X_t} is the AR(p) process specified in (8) with mean μ, then one can take advantage of the simple form for the one-step predictors and associated mean square errors. The likelihood becomes

    L(φ_1, ..., φ_p, μ, σ²) = (2πσ²)^{−(n−p)/2} exp{ −(1/(2σ²)) Σ_{j=p+1}^{n} (X_j − X̂_j)² }
                              × (2π)^{−p/2} (v_0 v_1 ··· v_{p−1})^{−1/2} exp{ −(1/2) Σ_{j=1}^{p} (X_j − X̂_j)² / v_{j−1} }    (23)

where, for j > p, X̂_j = μ + φ_1(X_{j−1} − μ) + ... + φ_p(X_{j−p} − μ) are the one-step predictors. The likelihood is a product of two terms, the conditional density of X_n given X_p and the density of X_p, where X_p = (X_1, ..., X_p)'. Often, just the conditional maximum likelihood estimator is computed, which is found by maximizing the first term. For the AR process, the conditional maximum likelihood estimator can be computed in closed form.
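That closed form amounts to ordinary least squares: regress X_j on (1, X_{j−1}, ..., X_{j−p}) for j = p + 1, ..., n. A minimal sketch follows (Python/NumPy; the function name and the handling of the mean through an intercept are choices made here, not prescribed by the text).

    import numpy as np

    def ar_conditional_mle(x, p):
        """Conditional ML (least squares) estimates for an AR(p) with intercept.
        Regresses x_j on (1, x_{j-1}, ..., x_{j-p}) for j = p+1, ..., n."""
        x = np.asarray(x, dtype=float)
        n = len(x)
        y = x[p:]
        X = np.column_stack([np.ones(n - p)] + [x[p - k:n - k] for k in range(1, p + 1)])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ coef
        sigma2 = np.mean(resid ** 2)               # conditional MLE of sigma^2
        intercept, phi = coef[0], coef[1:]
        mu = intercept / (1.0 - phi.sum())         # mean implied by the intercept (requires sum(phi) != 1)
        return phi, mu, sigma2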

Figure 1  Average maximum temperature, 1895–1993 (vertical axis: temperature; horizontal axis: year). Regression line is 16.83 + 0.00845t

Figure 2  QQ plot for normality of the innovations (vertical axis: estimated innovations of the September temperatures; horizontal axis: quantiles of the standard normal)

Example  This example consists of the average maximum temperature over the month of September for the years 1895–1993 in an area of the US whose vegetation is characterized as tundra. The time series x_1, ..., x_99 is plotted in Figure 1. Here we investigate the possibility of the data exhibiting a slight linear trend. After inspecting the residuals from fitting a least squares regression line to the data, we entertain a time series model of the form

    X_t = β_0 + β_1 t + W_t    (24)

where {W_t} is the Gaussian AR(1) process

    W_t = φ_1 W_{t−1} + Z_t    (25)

and {Z_t} is a sequence of iid N(0, σ²) random variables.
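One way to carry out such a fit is to maximize the exact Gaussian likelihood numerically using the innovations form: for AR(1) errors, v_0 = σ²/(1 − φ_1²) and v_{j−1} = σ² for j ≥ 2, while X̂_1 = β_0 + β_1 and X̂_j = β_0 + β_1 j + φ_1(X_{j−1} − β_0 − β_1(j − 1)) for j ≥ 2. Below is a sketch in Python with NumPy/SciPy; the data array temps and the starting values are placeholders, not values from the article, and this is one possible implementation rather than the authors' computation.

    import numpy as np
    from scipy.optimize import minimize

    def neg_loglik(params, x):
        """Negative Gaussian log-likelihood of the trend + AR(1)-error model (24)-(25)."""
        b0, b1, phi, log_sigma2 = params
        sigma2 = np.exp(log_sigma2)            # keeps sigma^2 > 0
        if not -1.0 < phi < 1.0:               # stationarity of the AR(1) errors
            return np.inf
        t = np.arange(1, len(x) + 1)
        w = x - (b0 + b1 * t)                  # residual (AR(1)) component W_t
        # innovations form: first term uses v_0 = sigma^2/(1 - phi^2), the rest use sigma^2
        v0 = sigma2 / (1.0 - phi ** 2)
        ll = -0.5 * (np.log(2 * np.pi * v0) + w[0] ** 2 / v0)
        e = w[1:] - phi * w[:-1]               # one-step prediction errors for t >= 2
        ll += -0.5 * np.sum(np.log(2 * np.pi * sigma2) + e ** 2 / sigma2)
        return -ll

    # temps: the observed series x_1, ..., x_99 (placeholder; not reproduced here)
    # fit = minimize(neg_loglik, x0=[16.8, 0.01, 0.1, 0.0], args=(temps,), method="Nelder-Mead")
    # b0_hat, b1_hat, phi_hat = fit.x[0], fit.x[1], fit.x[2]; sigma2_hat = np.exp(fit.x[3])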

After maximizing the Gaussian likelihood over the parameters β_0, β_1, φ_1, and σ², we find that the maximum likelihood estimate of the mean function is 16.83 + 0.00845t. The maximum likelihood estimates of φ_1 and σ² are 0.1536 and 1.3061, respectively. The maximum likelihood estimates of β_0 and β_1 can be viewed as generalized least squares estimates assuming that the residual process follows the estimated AR(1) model. The resulting standard errors of these estimates are 0.27781 and 0.00482, respectively, which provides some doubt about the significance of a nonzero slope of the line. Without modeling the dependence in the residuals, the slope would have been deemed significant using classical inference procedures. By modeling the dependence in the residuals, the evidence in favor of a nonzero slope has diminished somewhat. The QQ plot of the estimated innovations is displayed in Figure 2. This plot shows that the assumption of Gaussian innovations in the AR(1) model is not unreasonable. Further details about inference procedures for regression models with time series errors can be found in [2, Chapter 6].

References

[1] Brockwell, P.J. & Davis, R.A. (1991). Time Series: Theory and Methods, 2nd Edition, Springer-Verlag, New York.
[2] Brockwell, P.J. & Davis, R.A. (1996). Introduction to Time Series and Forecasting, Springer-Verlag, New York.
[3] Diggle, P.J., Liang, K.-Y. & Zeger, S.L. (1996). Analysis of Longitudinal Data, Clarendon Press, Oxford.
[4] Fahrmeir, L. & Tutz, G. (1994). Multivariate Statistical Modeling Based on Generalized Linear Models, Springer-Verlag, New York.
[5] Rosenblatt, M. (2000). Gaussian and Non-Gaussian Linear Time Series and Random Fields, Springer-Verlag, New York.

RICHARD A. DAVIS