A NOTE ON KRIGING AND GAUSSIAN PROCESSES

Mohammad Shekaramiz, Todd K. Moon, and Jacob H. Gunther
Electrical and Computer Engineering Department and Information Dynamics Laboratory, Utah State University
[email protected], [email protected], [email protected]

1. INTRODUCTION TO GAUSSIAN PROCESSES [1–4]

• Training data: input-output pairs $D = \{(x_i, y_i) \mid i = 1, 2, \ldots, n\}$
• Each input is a vector $x_i$ of dimension $d$
• Each target is a real-valued scalar $y_i = f(x_i)$ for some unknown function $f$, possibly corrupted by noise
• Collect the inputs into the matrix $X \in \mathbb{R}^{d \times n}$ and the corresponding outputs into the vector $y \in \mathbb{R}^n$, giving the input-output training set $D := \{X, y\}$
• Assumption: there exists an underlying process generating the unknown function $f$
• Objective: predict the output $y_*$ at a test input $x_*$, i.e., $y_* = f(x_*)$, using the training set $D$
• Restated objective: infer a distribution over the function $f$ given the training data, i.e., $p(f \mid D = \{X, y\})$, and then use this information to make predictions $y_*$ for new (test) inputs $X_*$, i.e., compute

  $p(y_* \mid X_*, D = \{X, y\}) = \int p(y_* \mid f, X_*)\, p(f \mid D)\, df$

• Remark: this requires a direct representation of the function $f$ via $p(f \mid D)$ rather than a parametric description $p(\theta \mid D)$
• In other words, we want a way to perform Bayesian inference over "functions"
• Solution: the "Gaussian Process" (GP)
• How does it work? A GP assumes that $p(f(x_1), \ldots, f(x_n))$ is jointly Gaussian with some mean $\mu(X)$ and covariance $\Sigma(X)$ given by $\Sigma_{ij} = \kappa(x_i, x_j)$, where $\kappa$ is a positive-definite (p.d.) kernel function, i.e.,

  $p(f(x_1), \ldots, f(x_n)) \sim \mathcal{N}(\mu(X), \Sigma(X))$

• Key idea: if $x_i$ and $x_j$ are deemed similar by the kernel, we expect the outputs of the function at those points to be similar as well
• Graphical illustration of a Gaussian Process: as an example, a GP for 2 training data points and 1 test point is illustrated in Fig. 1; it is a mixture of a directed and an undirected graphical model describing

  $p(y, f \mid X) = \mathcal{N}(f \mid m(X), K(X)) \prod_i p(y_i \mid f(x_i))$

Figure 1: Example showing a graphical representation of two input-output training data and one test data [1].

1.1. Definition of Gaussian Process Modeling

• A GP is a way of defining a distribution over functions: it defines a prior over the function $f$, which Bayesian inference converts into a posterior once data are observed
• A Gaussian process is a stochastic process whose realizations consist of a collection of random variables (r.v.s) associated with every point in a range of time or space; in such a process, each r.v. has a normal distribution
• A GP is completely specified by a mean function and a positive-definite (p.d.) covariance function; this facilitates model fitting, since only the first- and second-order moments of the process need to be specified

1.2. GPs for Regression

• Let the prior on the regression function be a GP, denoted by

  $f(x) \sim \mathcal{GP}(m(x), \kappa(x, x'))$,

where $m(x) = E[f(x)]$ is the mean function and $\kappa(x, x') = E[(f(x) - m(x))(f(x') - m(x'))^T]$ is the covariance function
• For any finite set of $n$ data points, this process defines a joint Gaussian distribution

  $p(f \mid X) = \mathcal{N}(f \mid \mu, K)$,

where $K_{ij} = \kappa(x_i, x_j)$ and $\mu = [m(x_1), \ldots, m(x_n)]$

1.2.1. Prediction Using GP Modeling for Noise-Free Observations

• In this case, we expect the GP prediction of $f(x)$ at a location $x$ that has already been seen to return the observed value $f(x)$ with no uncertainty
• In other words, the GP-regression model should act as an interpolator
• The joint distribution of the training and test data has the form

  $\begin{bmatrix} f \\ f_* \end{bmatrix} \sim \mathcal{N}\!\left( \begin{bmatrix} \mu \\ \mu_* \end{bmatrix}, \begin{bmatrix} K & K_* \\ K_*^T & K_{**} \end{bmatrix} \right)$,

where $f$ contains the output training data and $f_*$ are the predictions sought at the test input points
• Inference:

  $p(f_* \mid X_*, X, f) = \mathcal{N}(f_* \mid \mu_*, \Sigma_*)$,

where

  $\mu_* = \mu(X_*) + K_*^T K^{-1} (f - \mu(X))$
  $\Sigma_* = K_{**} - K_*^T K^{-1} K_*$

• Prediction: $\mu_*$ can be used as the prediction for $f_*$

1.2.2. Prediction Using GP Modeling for Noisy Observations

• Suppose the $i$th input-output training pair is $(x_i, y_i)$, where

  $y_i = f(x_i) + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma_y^2)$, $\forall i = 1, \ldots, n$

• This model does not necessarily perform interpolation, but it must come close to the observed data
• Define

  $K_y := \mathrm{cov}(y \mid X) = K + \sigma_y^2 I_N$,

where $K$ is the covariance matrix of the training data
• Joint density of the observed and the test data (the zero-mean case is considered for simplicity):

  $\begin{bmatrix} y \\ f_* \end{bmatrix} \sim \mathcal{N}\!\left( 0, \begin{bmatrix} K_y & K_* \\ K_*^T & K_{**} \end{bmatrix} \right)$

• Posterior predictive density:

  $p(f_* \mid X_*, X, y) = \mathcal{N}(f_* \mid \mu_*, \Sigma_*)$,

where

  $\mu_* = K_*^T K_y^{-1} y$
  $\Sigma_* = K_{**} - K_*^T K_y^{-1} K_*$
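To make the prediction equations of Sections 1.2.1 and 1.2.2 concrete, the following is a minimal NumPy sketch of the zero-mean noisy-observation posterior, assuming a squared-exponential kernel for $\kappa$ (the note does not fix a particular kernel); all function and variable names here are our own illustrative choices.

```python
import numpy as np

def sq_exp_kernel(A, B, length_scale=1.0, sigma_f=1.0):
    """Squared-exponential kernel k(x, x') = sigma_f^2 exp(-||x - x'||^2 / (2 l^2)).
    A: (n, d) and B: (m, d) input arrays; returns the (n, m) Gram matrix."""
    sq_dists = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return sigma_f**2 * np.exp(-0.5 * sq_dists / length_scale**2)

def gp_predict(X, y, X_star, sigma_y=0.1, **kern_args):
    """Posterior predictive mean and covariance for the zero-mean noisy GP model:
    mu_* = K_*^T K_y^{-1} y,  Sigma_* = K_** - K_*^T K_y^{-1} K_*."""
    K = sq_exp_kernel(X, X, **kern_args)               # n x n training covariance
    K_star = sq_exp_kernel(X, X_star, **kern_args)     # n x m train-test covariance
    K_ss = sq_exp_kernel(X_star, X_star, **kern_args)  # m x m test covariance
    K_y = K + sigma_y**2 * np.eye(len(X))              # K_y = K + sigma_y^2 I
    mu_star = K_star.T @ np.linalg.solve(K_y, y)       # avoids forming K_y^{-1} explicitly
    Sigma_star = K_ss - K_star.T @ np.linalg.solve(K_y, K_star)
    return mu_star, Sigma_star

# Toy 1-D example: noisy samples of an unknown function
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(8, 1))
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(8)
X_star = np.linspace(-3, 3, 50)[:, None]
mu, Sigma = gp_predict(X, y, X_star, sigma_y=0.1)
```

Setting sigma_y = 0 recovers the noise-free interpolator of Section 1.2.1: the posterior mean passes through the training outputs and the posterior variance vanishes at the training inputs.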
1.2.3. Prediction Using Semi-Parametric GP Modeling

• A semi-parametric GP combines a parametric model with a non-parametric model
• It uses a linear model for the mean of the process,

  $f(x) = \beta^T \phi(x) + r(x)$,

where $r(x) \sim \mathcal{GP}(0, \kappa(x, x'))$ models the residuals; assuming $\beta \sim \mathcal{N}(b, B)$,

  $f(x) \sim \mathcal{GP}\big(\phi(x)^T b,\; \kappa(x, x') + \phi(x)^T B \phi(x')\big)$

• Posterior predictive density:

  $p(f_* \mid X_*, X, y) = \mathcal{N}(\mu_{f_*}, \Sigma_{f_*})$,

where

  $\mu_{f_*} = \Phi_*^T \bar{\beta} + K_*^T K_y^{-1} (y - \Phi^T \bar{\beta})$
  $\Sigma_{f_*} = K_{**} - K_*^T K_y^{-1} K_* + R^T (B^{-1} + \Phi K_y^{-1} \Phi^T)^{-1} R$
  $\bar{\beta} = (\Phi K_y^{-1} \Phi^T + B^{-1})^{-1} (\Phi K_y^{-1} y + B^{-1} b)$
  $R = \Phi_* - \Phi K_y^{-1} K_*$
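These semi-parametric equations can be sketched in the same style, reusing sq_exp_kernel from the previous sketch and assuming a linear basis $\phi(x) = [1, x]^T$ with prior $\beta \sim \mathcal{N}(b, B)$; the basis is a hypothetical choice of ours, as the note leaves $\phi$ unspecified.

```python
import numpy as np
# Assumes sq_exp_kernel from the previous sketch is in scope.

def basis(X):
    """Linear basis phi(x) = [1, x]^T, stacked as a (p, n) matrix Phi
    (a hypothetical choice; the note leaves phi unspecified)."""
    return np.vstack([np.ones(len(X)), X.ravel()])

def semiparam_gp_predict(X, y, X_star, b, B, sigma_y=0.1, **kern_args):
    """Section 1.2.3 equations for f(x) = beta^T phi(x) + r(x), beta ~ N(b, B)."""
    Phi, Phi_s = basis(X), basis(X_star)                   # (p, n) and (p, m)
    K_y = sq_exp_kernel(X, X, **kern_args) + sigma_y**2 * np.eye(len(X))
    K_star = sq_exp_kernel(X, X_star, **kern_args)
    K_ss = sq_exp_kernel(X_star, X_star, **kern_args)
    B_inv = np.linalg.inv(B)
    A = Phi @ np.linalg.solve(K_y, Phi.T) + B_inv          # Phi K_y^{-1} Phi^T + B^{-1}
    beta_bar = np.linalg.solve(A, Phi @ np.linalg.solve(K_y, y) + B_inv @ b)
    resid = y - Phi.T @ beta_bar                           # y - Phi^T beta_bar
    mu = Phi_s.T @ beta_bar + K_star.T @ np.linalg.solve(K_y, resid)
    R = Phi_s - Phi @ np.linalg.solve(K_y, K_star)         # R = Phi_* - Phi K_y^{-1} K_*
    Sigma = K_ss - K_star.T @ np.linalg.solve(K_y, K_star) + R.T @ np.linalg.solve(A, R)
    return mu, Sigma
```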
2. KRIGING [5–8]

• Illustrating example: Fig. 2
• We need a model for the spatial dependency in order to interpolate at unsampled locations

Figure 2: Finding $z(u_*)$ based on spatial interpolation.

2.1. Kriging

• A spatial interpolation method based on a statistical model
• Common types of Kriging:
  - Simple Kriging (SK)
  - Ordinary Kriging (OK)
  - Kriging with a trend (KT)
  - Factorial Kriging (FK)
  - Co-Kriging
• Kriging is based on GP modeling
• Problem description: estimate the value of a continuous attribute $Z$ at an unsampled location $u_*$ using only the available training data pairs $D = \{(u_\alpha, z(u_\alpha)) \mid \alpha = 1, \ldots, n(\alpha)\}$ over the study area

2.2. Kriging Structure

• Kriging estimators are based on linear regression estimation of the R.V. $Z(u_*)$:

  $Z^*(u_*) - \mu(Z(u_*)) = \sum_{\alpha=1}^{n(\alpha)} \lambda_\alpha(u) \, [Z(u_\alpha) - \mu(u_\alpha)]$,

where
  $Z(u_\alpha)$: random variable at point $u_\alpha$
  $z(u_\alpha)$: realization of $Z(u_\alpha)$ at point $u_\alpha$
  $\lambda_\alpha(u)$: kriging weight assigned to datum $z(u_\alpha)$ when computing $z(u_*)$
  $\mu(Z(u_\alpha))$: expected value of the R.V. $Z(u_\alpha)$
• Remark: the number of data, $n(u)$, involved in the estimation may change from one location to another; in practice, only the $n(u)$ data closest to the location $u_*$ are retained
• Kriging interprets the unknown value $z(u_*)$ and the data values $z(u_\alpha)$ as realizations of the R.V.s $Z(u_*)$ and $Z(u_\alpha)$, respectively
• Goal: minimize the estimation error, i.e., the error variance $\sigma_E^2(u)$, under the unbiasedness assumption on the estimator:

  $\min_\lambda \sigma_E^2(u_*) = \mathrm{Var}(Z^*(u_*) - Z(u_*))$
  s.t. $E\{Z^*(u_*) - Z(u_*)\} = 0$

• Decomposing $Z(u)$, in analogy with the semi-parametric GP:

  $Z(u) := R(u) + \mu(Z(u))$,

where $R(u)$ is the residual component and $\mu(Z(u))$ is the trend component
• Modeling the components:
  1) Residual component: modeled as a stationary GP with zero mean,

    $R(u) \sim \mathcal{GP}(0, C_R(h))$
    $E\{R(u)\} = 0$
    $\mathrm{Cov}(R(u), R(u+h)) = E\{R(u) R(u+h)\} =: C_R(h)$

  By stationarity of the mean, this agrees with $C_R(h) = E\{(R(u) - \mu(R(u)))(R(u+h) - \mu(R(u+h)))\}$
  2) Trend component: $E\{Z(u)\} = \mu(Z(u)) := m(u)$

2.3. Common Kriging Types

• Here, we present three types of Kriging according to the trend-component model:
  a) Simple Kriging (SK): the mean $m(u)$ is known and constant throughout the study area,

    $m(u) = m$, known $\forall u \in \{\text{study region}\}$

  b) Ordinary Kriging (OK): accounts for local fluctuations of the mean by limiting the domain of stationarity to a local neighborhood $W(u)$,

    $m(u') = m$, unknown $\forall u' \in W(u)$

  c) Kriging with a trend model (KT): the unknown local mean $m(u')$ varies smoothly within each local neighborhood $W(u)$, and hence over the entire study area; the trend component is modeled as a linear combination of functions $f_k(u)$ of the coordinates,

    $m(u') = \sum_{k=0}^{K} a_k(u') f_k(u')$, with $a_k(u') \approx a_k$, $\forall u' \in W(u)$

  Remark: the functions $f_k(u')$ may be directly suggested by the physics of the problem
  Example: a linear trend model, $m(u) = m(u_x, u_y) = a_0 + a_1 u_x + a_2 u_y$

2.4. Simple Kriging (SK) Model

• Model the trend component $m(u)$ as a known stationary mean, i.e., $\mu(Z(u)) = m(u) = m$
• In Simple Kriging, the unbiasedness constraint is already satisfied, i.e., the SK estimator is unbiased:

  $E\{Z_{SK}^*(u_*) - Z(u_*)\} = m - m = 0$

• This leaves the unconstrained minimization problem

  $\min_\lambda \sigma_E^2(u_*) = \mathrm{Var}(Z^*(u_*) - Z(u_*))$

• The estimation error $Z^*(u_*) - Z(u_*)$ can be viewed as a linear combination of the residual variables $R(u_\alpha)$ and the R.V. $R(u)$:

  $Z^*(u_*) - Z(u_*) = \sum_{\alpha=1}^{n(\alpha)} \lambda_\alpha(u) (Z(u_\alpha) - m) - R(u)$

In other words, with $R(u_\alpha) := Z(u_\alpha) - m$ and $R^*(u) := \sum_{\alpha=1}^{n(\alpha)} \lambda_\alpha(u) (Z(u_\alpha) - m)$,

  $Z^*(u_*) - Z(u_*) := R^*(u) - R(u)$

• Error variance:

  $\sigma_E^2(u) = \sum_{\alpha=1}^{n(u)} \sum_{\beta=1}^{n(u)} \lambda_\alpha(u) \lambda_\beta(u) C_R(u_\alpha - u_\beta) + C_R(0) - 2 \sum_{\alpha=1}^{n(u)} \lambda_\alpha(u) C_R(u_\alpha - u)$,

where $C_R(h) := E\{R(u) R(u+h)\}$

Figure 3: Semi-variogram.

• The error variance is a quadratic function of the kriging weights:

  $\sigma_E^2(u) = Q(\lambda_\alpha(u))$, $\forall \alpha = 1, \ldots, n(u)$

• Optimal weights minimize the error variance:

  $\frac{1}{2} \frac{\partial \sigma_E^2(u)}{\partial \lambda_\alpha(u)} = 0$, $\forall \alpha = 1, \ldots, n(u)$,

which yields the simple kriging system

  $\sum_{\beta=1}^{n(u)} \lambda_\beta(u) C_R(u_\alpha - u_\beta) - C_R(u_\alpha - u) = 0$, $\forall \alpha = 1, \ldots, n(u)$
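As a concrete instance of the system just derived, here is a minimal 1-D Simple Kriging sketch, assuming an exponential covariance model for $C_R(h)$ (a hypothetical but permissible choice; the note does not specify a covariance model, and the names below are our own).

```python
import numpy as np

def cov_model(h, sill=1.0, a=1.0):
    """Isotropic exponential covariance C_R(h) = sill * exp(-|h| / a)
    (a hypothetical choice; the note leaves C_R unspecified)."""
    return sill * np.exp(-np.abs(h) / a)

def simple_kriging(u, z, u_star, m, **cov_args):
    """Simple Kriging with known constant mean m.
    Solves sum_beta lambda_beta C_R(u_alpha - u_beta) = C_R(u_alpha - u_*),
    then returns z*(u_*) = m + sum_alpha lambda_alpha (z(u_alpha) - m)
    and the SK error variance sigma_E^2 = C_R(0) - lambda^T c."""
    u, z = np.asarray(u, dtype=float), np.asarray(z, dtype=float)
    C = cov_model(u[:, None] - u[None, :], **cov_args)  # C_R(u_alpha - u_beta)
    c = cov_model(u - u_star, **cov_args)               # C_R(u_alpha - u_*)
    lam = np.linalg.solve(C, c)                         # kriging weights lambda_alpha(u)
    z_star = m + lam @ (z - m)                          # SK estimator
    var_sk = cov_model(0.0, **cov_args) - lam @ c       # minimized error variance
    return z_star, var_sk

# Toy example: three samples around the unsampled location u_* = 1.5
z_hat, s2 = simple_kriging(u=[0.0, 1.0, 3.0], z=[2.0, 2.5, 1.8], u_star=1.5, m=2.0)
```

The variance returned by the sketch follows from substituting the kriging system into the quadratic form above: at the optimal weights, $\sigma_E^2(u_*) = C_R(0) - \sum_\alpha \lambda_\alpha(u) C_R(u_\alpha - u_*)$.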
