A NOTE ON KRIGING AND GAUSSIAN PROCESSES

Mohammad Shekaramiz, Todd K. Moon, and Jacob H. Gunther
Electrical and Computer Engineering Department and Information Dynamics Laboratory, Utah State University
[email protected], [email protected], [email protected]

1. INTRODUCTION TO GAUSSIAN PROCESSES [1–4]

• Training data: input-output pairs $\mathcal{D} = \{(x_i, y_i) \mid i = 1, 2, \ldots, n\}$
• Each input is a vector $x_i$ of dimension $d$
• Each target is a real-valued scalar $y_i = f(x_i)$ for some unknown function $f$, possibly corrupted by noise
• Collect the inputs into a matrix $X \in \mathbb{R}^{d \times n}$ and the corresponding outputs into a vector $y \in \mathbb{R}^n$, giving the input-output training set $\mathcal{D} := \{X, y\}$
• Assumption: there exists an underlying process for the unknown function $f$
• Objective: predict the output $y_*$ at a test input $x_*$, i.e., $y_* = f(x_*)$, using the training set $\mathcal{D}$
• Restated objective: infer a distribution over the function $f$ given the training data, i.e., $p(f \mid \mathcal{D} = \{X, y\})$, and then use this distribution to make predictions for $y_*$ at new (test) inputs $X_*$, i.e., compute
$$p(y_* \mid X_*, \mathcal{D} = \{X, y\}) = \int p(y_* \mid f, X_*)\, p(f \mid \mathcal{D})\, df$$
• Remark: we need a direct representation of the function $f$ via $p(f \mid \mathcal{D})$ rather than a parametric description $p(\theta \mid \mathcal{D})$. In other words, we want a way to perform Bayesian inference over "functions"
• Solution: the "Gaussian process" (GP)
• How does it work? A GP assumes that $p(f(x_1), \ldots, f(x_n))$ is jointly Gaussian with some mean $\mu(X)$ and covariance $\Sigma(X)$ given by $\Sigma_{ij} = \kappa(x_i, x_j)$, where $\kappa$ is a positive definite (p.d.) kernel function, i.e.,
$$p\big(f(x_1), \ldots, f(x_n)\big) \sim \mathcal{N}\big(\mu(X), \Sigma(X)\big)$$
• Key idea: if $x_i$ and $x_j$ are deemed similar by the kernel, we expect the outputs of the function at those points to be similar as well
• Graphical illustration of a Gaussian process: as an example, a GP for two training data points and one test point is illustrated in Fig. 1, which is a mixture of a directed and an undirected graphical model describing
$$p(y, f \mid X) = \mathcal{N}\big(f \mid m(X), K(X)\big) \prod_i p\big(y_i \mid f(x_i)\big)$$

Figure 1: Example showing a graphical representation of two input-output training data and one test data [1].

1.1. Definition of Gaussian Process Modeling

• A GP is a way of defining a distribution over functions: it defines a prior over the function $f$ and then, via Bayesian inference, converts it into a posterior once data are observed
• A Gaussian process is a stochastic process whose realizations consist of a collection of random variables (r.v.s) associated with every point in a range of time or space; each such r.v. has a normal distribution
• A GP is completely specified by a mean function and a positive definite (p.d.) covariance function. This property facilitates model fitting, since only the first- and second-order moments of the process need to be specified.

1.2. GPs for Regression

• Let the prior on the regression function be a GP, denoted by
$$f(x) \sim \mathcal{GP}\big(m(x), \kappa(x, x')\big),$$
where $m(x) = \mathbb{E}[f(x)]$ is the mean function and $\kappa(x, x') = \mathbb{E}\big[(f(x) - m(x))(f(x') - m(x'))^T\big]$ is the covariance function
• For any finite set of $n$ data points, this process defines a joint Gaussian distribution
$$p(f \mid X) = \mathcal{N}(f \mid \mu, K),$$
where $K_{ij} = \kappa(x_i, x_j)$ and $\mu = [m(x_1), \ldots, m(x_n)]$

1.2.1. Prediction Using GP Modeling for Noise-Free Observations

• In this case, we expect the GP, when predicting $f(x)$ at a location $x$ it has already seen, to return the answer $f(x)$ with no uncertainty
• In other words, the GP regression model should act as an interpolator
• The joint distribution of the training and test data has the form
$$\begin{bmatrix} f \\ f_* \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu \\ \mu_* \end{bmatrix}, \begin{bmatrix} K & K_* \\ K_*^T & K_{**} \end{bmatrix} \right),$$
where $f$ contains the output training data and $f_*$ are the predictions sought at the test inputs
• Inference:
$$p(f_* \mid X_*, X, f) = \mathcal{N}(f_* \mid \mu_*, \Sigma_*),$$
where
$$\mu_* = \mu(X_*) + K_*^T K^{-1} \big(f - \mu(X)\big), \qquad \Sigma_* = K_{**} - K_*^T K^{-1} K_*$$
• Prediction: $\mu_*$ can be used as the prediction for $f_*$
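To make the noise-free prediction equations above concrete, here is a minimal numerical sketch (not part of the original note). It assumes a squared-exponential kernel, a zero prior mean, and a tiny diagonal jitter for numerical stability; all function names are illustrative.

```python
import numpy as np

def sq_exp_kernel(A, B, length_scale=1.0):
    """Squared-exponential kernel between column sets A (d x m) and B (d x n)."""
    d2 = (np.sum(A**2, axis=0)[:, None] + np.sum(B**2, axis=0)[None, :]
          - 2.0 * A.T @ B)
    return np.exp(-0.5 * d2 / length_scale**2)

def gp_predict_noise_free(X, f, X_star, kernel=sq_exp_kernel):
    """mu_* = K_*^T K^{-1} f,  Sigma_* = K_** - K_*^T K^{-1} K_*
    (zero prior mean assumed, so the mu(.) terms drop out)."""
    jitter = 1e-10 * np.eye(X.shape[1])  # numerical jitter only; the model is noise-free
    K = kernel(X, X) + jitter            # K: train/train covariance
    K_s = kernel(X, X_star)              # K_*: train/test cross-covariance
    K_ss = kernel(X_star, X_star)        # K_**: test/test covariance
    mu_star = K_s.T @ np.linalg.solve(K, f)
    Sigma_star = K_ss - K_s.T @ np.linalg.solve(K, K_s)
    return mu_star, Sigma_star

# Interpolation check: predicting at an already-seen input returns the seen value
X = np.array([[0.0, 1.0, 2.5]])   # d = 1, n = 3, inputs stored as columns
f = np.sin(X).ravel()
mu, Sigma = gp_predict_noise_free(X, f, X[:, :1])
print(mu[0], f[0], Sigma[0, 0])   # mu ~ f(x_1), Sigma ~ 0
```

Evaluating the predictor at a training input returns the training value with (near-)zero posterior variance, which is exactly the interpolation behavior described above.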
1.2.2. Prediction Using GP Modeling for Noisy Observations

• Suppose the $i$th input-output training pair is $(x_i, y_i)$, where
$$y_i = f(x_i) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma_y^2), \quad \forall i = 1, \ldots, n$$
• This model does not necessarily perform interpolation, but it must come close to the observed data
• Define
$$K_y := \operatorname{cov}(y \mid X) = K + \sigma_y^2 I_N,$$
where $K$ is the covariance matrix of the training data
• Joint density of the observed and test data:
$$\begin{bmatrix} y \\ f_* \end{bmatrix} \sim \mathcal{N}\left(0, \begin{bmatrix} K_y & K_* \\ K_*^T & K_{**} \end{bmatrix}\right),$$
where the zero-mean case is considered for simplicity
• Posterior predictive density:
$$p(f_* \mid X_*, X, y) = \mathcal{N}(f_* \mid \mu_*, \Sigma_*),$$
where
$$\mu_* = K_*^T K_y^{-1} y, \qquad \Sigma_* = K_{**} - K_*^T K_y^{-1} K_*$$
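As an illustration of the noisy-observation formulas, the sketch below simply replaces $K$ with $K_y = K + \sigma_y^2 I$ in the posterior. It is an assumed implementation (zero prior mean) and reuses the illustrative `sq_exp_kernel` helper from the earlier sketch.

```python
import numpy as np

def gp_predict_noisy(X, y, X_star, sigma_y, kernel):
    """mu_* = K_*^T K_y^{-1} y,  Sigma_* = K_** - K_*^T K_y^{-1} K_*,
    with K_y = K + sigma_y^2 I (zero prior mean assumed)."""
    n = X.shape[1]
    K_y = kernel(X, X) + sigma_y**2 * np.eye(n)  # K_y = cov(y | X)
    K_s = kernel(X, X_star)
    K_ss = kernel(X_star, X_star)
    mu_star = K_s.T @ np.linalg.solve(K_y, y)
    Sigma_star = K_ss - K_s.T @ np.linalg.solve(K_y, K_s)
    return mu_star, Sigma_star

# Usage (sq_exp_kernel as defined in the previous sketch):
# mu, Sigma = gp_predict_noisy(X, y, X_star, sigma_y=0.1, kernel=sq_exp_kernel)
# With sigma_y > 0 the posterior mean smooths the data rather than interpolating it.
```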
1.2.3. Prediction Using Semi-Parametric GP Modeling

• A semi-parametric GP combines a parametric model with a non-parametric model
• It uses a linear model for the mean of the process:
$$f(x) = \beta^T \phi(x) + r(x),$$
where $r(x) \sim \mathcal{GP}\big(0, \kappa(x, x')\big)$ models the residuals; assuming that $\beta \sim \mathcal{N}(b, B)$,
$$f(x) \sim \mathcal{GP}\big(\phi(x)^T b,\; \kappa(x, x') + \phi(x)^T B \phi(x')\big)$$
• Posterior predictive density:
$$p(f_* \mid X_*, X, y) \sim \mathcal{N}(\mu_{f_*}, \Sigma_{f_*}),$$
where
$$\begin{aligned}
\mu_{f_*} &= \Phi_*^T \bar{\beta} + K_*^T K_y^{-1} \big(y - \Phi^T \bar{\beta}\big) \\
\Sigma_{f_*} &= K_{**} - K_*^T K_y^{-1} K_* + R^T \big(B^{-1} + \Phi K_y^{-1} \Phi^T\big)^{-1} R \\
\bar{\beta} &= \big(\Phi K_y^{-1} \Phi^T + B^{-1}\big)^{-1} \big(\Phi K_y^{-1} y + B^{-1} b\big) \\
R &= \Phi_* - \Phi K_y^{-1} K_*
\end{aligned}$$

2. KRIGING [5–8]

• Illustrating example: Fig. 2
• Need a model for spatial dependency to interpolate at unsampled locations

Figure 2: Finding $z(u_*)$ based on spatial interpolation.

2.1. Kriging

• Kriging is a spatial interpolation method based on a statistical model
• Common types of Kriging:
  Simple Kriging (SK)
  Ordinary Kriging (OK)
  Kriging with a trend (KT)
  Factorial Kriging (FK)
  Co-Kriging
• Kriging is based on GP modeling
• Problem description: estimate the value of a continuous attribute $Z$ at any unsampled location $u_*$ using only the available training data pairs $\mathcal{D} = \{(u_\alpha, z(u_\alpha)) \mid \alpha = 1, \ldots, n(\alpha)\}$ over the study area

2.2. Kriging Structure

• Kriging interprets the unknown value $z(u_*)$ and the data values $z(u_\alpha)$ as realizations of the random variables $Z(u_*)$ and $Z(u_\alpha)$, respectively
• Kriging estimators are based on linear regression estimation of the R.V. $Z(u_*)$:
$$Z^*(u_*) - \mu(Z(u_*)) = \sum_{\alpha=1}^{n(\alpha)} \lambda_\alpha(u)\,\big[Z(u_\alpha) - \mu(u_\alpha)\big],$$
where
  $Z(u_\alpha)$: random variable at point $u_\alpha$
  $z(u_\alpha)$: realization of $Z(u_\alpha)$ at point $u_\alpha$
  $\lambda_\alpha(u)$: Kriging weight assigned to the datum $z(u_\alpha)$ for computing $z(u_*)$
  $\mu(Z(u_\alpha))$: expected value of the R.V. $Z(u_\alpha)$
• Remark: the number of data, $n(u)$, involved in the estimation may change from one location to another. In practice, only the $n(u)$ data closest to the location $u_*$ are retained.
• Goal: minimize the estimation error, i.e., the error variance $\sigma_E^2(u)$, under the unbiasedness assumption on the estimator:
$$\min_\lambda \sigma_E^2(u_*) = \operatorname{Var}\big(Z^*(u_*) - Z(u_*)\big) \quad \text{s.t.} \quad \mathbb{E}\big\{Z^*(u_*) - Z(u_*)\big\} = 0$$
• Decomposing $Z(u)$, in the spirit of the semi-parametric GP above:
$$Z(u) := R(u) + \mu(Z(u)),$$
where $R(u)$ is the residual component and $\mu(Z(u))$ is the trend component
• Modeling the components:
  1) Residual component: modeled as a stationary GP with zero mean,
$$R(u) \sim \mathcal{GP}\big(0, C_R(h)\big),$$
so that $\mathbb{E}\{R(u)\} = 0$ and $\operatorname{Cov}\big(R(u), R(u+h)\big) = \mathbb{E}\{R(u)\,R(u+h)\} =: C_R(h)$
  2) Trend component: $\mathbb{E}\{Z(u)\} = \mu(Z(u)) := m(u)$

2.3. Common Kriging Types

• Here, we present three types of Kriging according to the trend component model:
a) Simple Kriging (SK): the mean $m(u)$ is known and constant throughout the study area,
$$m(u) = m, \ \text{known} \quad \forall u \in \{\text{under-study region}\}$$
b) Ordinary Kriging (OK): accounts for local fluctuations of the mean by limiting the domain of stationarity to a local neighborhood $W(u)$,
$$m(u') = m, \ \text{unknown} \quad \forall u' \in W(u)$$
c) Kriging with a trend model (KT): the unknown local mean $m(u')$ varies smoothly within each local neighborhood $W(u)$, and hence over the entire under-study area. The trend component is modeled as a linear combination of functions $f_k(u)$ of the coordinates:
$$m(u') = \sum_{k=0}^{K} a_k(u') f_k(u'), \qquad a_k(u') \approx a_k, \quad \forall u' \in W(u)$$
Remark: the functions $f_k(u')$ may be directly suggested by the physics of the problem.
Example: a linear trend model,
$$m(u) = m(u_x, u_y) = a_0 + a_1 u_x + a_2 u_y$$

2.4. Simple Kriging (SK) Model

• Model the trend component $m(u)$ as a known stationary mean, i.e., $\mu(Z(u)) = m(u) = m$
• In Simple Kriging, the unbiasedness constraint is already satisfied, i.e., the SK estimator is unbiased:
$$\mathbb{E}\big\{Z_{SK}^*(u_*) - Z(u_*)\big\} = m - m = 0$$
• Hence, the estimation reduces to the unconstrained minimization problem
$$\min_\lambda \sigma_E^2(u_*) = \operatorname{Var}\big(Z^*(u_*) - Z(u_*)\big)$$
• The estimator error $Z^*(u_*) - Z(u_*)$ can be viewed as a linear combination of the residual variables $R(u_\alpha)$ and the R.V. $R(u)$:
$$Z^*(u_*) - Z(u_*) = \sum_{\alpha=1}^{n(\alpha)} \lambda_\alpha(u)\big(Z(u_\alpha) - m\big) - R(u)$$
In other words, with $R(u_\alpha) := Z(u_\alpha) - m$ and
$$R^*(u) := \sum_{\alpha=1}^{n(\alpha)} \lambda_\alpha(u)\big(Z(u_\alpha) - m\big),$$
we have $Z^*(u_*) - Z(u_*) := R^*(u) - R(u)$
• Error variance:
$$\sigma_E^2(u) = \sum_{\alpha=1}^{n(u)} \sum_{\beta=1}^{n(u)} \lambda_\alpha(u) \lambda_\beta(u) C_R(u_\alpha - u_\beta) + C_R(0) - 2 \sum_{\alpha=1}^{n(u)} \lambda_\alpha(u) C_R(u_\alpha - u),$$
where $C_R(h) := \mathbb{E}\{R(u)\,R(u+h)\}$
• The error variance is a quadratic function of the Kriging weights:
$$\sigma_E^2(u) = Q\big(\lambda_\alpha(u)\big), \qquad \forall \alpha = 1, \ldots, n(u)$$
• Optimal weights minimizing the error variance are obtained by setting the partial derivatives to zero:
$$\frac{1}{2} \frac{\partial \sigma_E^2(u)}{\partial \lambda_\alpha(u)} = 0, \qquad \forall \alpha = 1, \ldots, n(u)$$
$$\sum_{\beta=1}^{n(u)} \lambda_\beta(u)\, C_R(u_\alpha - u_\beta) - C_R(u_\alpha - u) = 0, \qquad \forall \alpha = 1, \ldots, n(u)$$
• Stationarity of the mean implies
$$C_R(h) = \mathbb{E}\big\{(R(u) - \mu(R(u)))(R(u+h) - \mu(R(u+h)))\big\} = \mathbb{E}\{R(u)\,R(u+h)\}$$

Figure 3: Semi-variogram.
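The SK system derived in Section 2.4 is just a small linear system in the weights $\lambda_\alpha(u)$. The following is a minimal sketch, assuming an isotropic exponential covariance model $C_R(h)$ with unit sill; the covariance choice, function names, and example data are illustrative and not from the note.

```python
import numpy as np

def exp_covariance(h, sill=1.0, range_=1.0):
    """Isotropic exponential covariance model C_R(h) (an assumed choice)."""
    return sill * np.exp(-np.abs(h) / range_)

def simple_kriging(coords, z, u_star, m, cov=exp_covariance):
    """Simple Kriging at u_star with known mean m.
    Solves sum_beta lambda_beta C_R(u_alpha - u_beta) = C_R(u_alpha - u_star)."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    C = cov(d)                                           # C_R(u_alpha - u_beta)
    c0 = cov(np.linalg.norm(coords - u_star, axis=-1))   # C_R(u_alpha - u_star)
    lam = np.linalg.solve(C, c0)                         # SK weights lambda_alpha(u)
    z_star = m + lam @ (z - m)                           # Z*(u_star) = m + sum lam (z - m)
    var = cov(0.0) - lam @ c0                            # SK error variance
    return z_star, var

# Example: three sample points in the plane, known mean m = 0
coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
z = np.array([0.3, -0.1, 0.4])
z_star, var = simple_kriging(coords, z, np.array([0.5, 0.5]), m=0.0)
print(z_star, var)
```

Because the weights solve $\sum_\beta \lambda_\beta C_R(u_\alpha - u_\beta) = C_R(u_\alpha - u_*)$, the predictor reproduces the data exactly at sampled locations, mirroring the noise-free GP interpolator of Section 1.2.1.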