Introduction to kriging
Rodolphe Le Riche (1)

To cite this version:

Rodolphe Le Riche. Introduction to kriging. Doctoral course, France, 2014. cel-01081304

HAL Id: cel-01081304 https://hal.archives-ouvertes.fr/cel-01081304 Submitted on 7 Nov 2014


(1) CNRS and Ecole des Mines de St-Etienne, FR

Class given as part of the “Modeling and Numerical Methods for Uncertainty Quantification” French-German summer school, Porquerolles, Sept. 2014

Course outline

1. Introduction to kriging (R. Le Riche)
   1.1. Gaussian processes
   1.2. Covariance functions
   1.3. Conditional Gaussian processes (kriging)
        1.3.1. No trend (simple kriging)
        1.3.2. With trend (universal kriging)
   1.4. Issues, links with other methods

Kriging : introduction

Context: scalar measurements (y₁, …, yₙ) at n positions (x¹, …, xⁿ) in a d-dimensional space X ⊂ ℝᵈ.

What can be said about possible measurements at any x using probabilities? (Krige, 1951; Matheron, 1963)

Here, kriging is used for regression: kriging is a family of metamodels (surrogates) with an embedded uncertainty model. [Figure: measurements y₁, …, y₅ at positions x¹, x², …, x⁵.]

Random variable Y: a random event ω ∈ Ω (e.g., a dice throw) yields an instance y. Example:
if dice ≤ 3, y = 1;   if 3 < dice ≤ 5, y = 2;   if dice = 6, y = 3

Random process Y(x): a set of random variables indexed by x.

A random event ω ∈ Ω (e.g., the weather) yields a function y(x): x ∈ X ⊂ ℝᵈ → ℝ.

Random processes

Repeat the random event. [Figure: three trajectories y(x) over x.]

Example: three event instances give three y(x)'s. They are different, yet bear strong similarities.

Gaussian processes

Each Y(x) follows a Gaussian law, Y(x) ∼ N(μ(x), C(x, x)), and,

∀ X = (x¹, …, xⁿ) ∈ Xⁿ ⊂ ℝ^{d×n},   Y = (Y(x¹), …, Y(xⁿ))ᵀ ∼ N(μ, C),   where Cᵢⱼ = C(xⁱ, xʲ)

with probability density function (multi-Normal law),

p(y) = 1 / ( (2π)^{n/2} det^{1/2}(C) ) · exp( −½ (y−μ)ᵀ C⁻¹ (y−μ) )

Note: C is called the Gram matrix in the SVM class (J.-M. Bourinet).
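As a sanity check, the density above can be evaluated directly; a minimal R sketch (the function name and the 2-dimensional μ and C are illustrative, not from the slides):

dmvn <- function(y, mu, C) {
  # multivariate normal density p(y) from the formula above
  n <- length(y)
  quad <- t(y - mu) %*% solve(C, y - mu)
  as.numeric(exp(-0.5 * quad) / ((2 * pi)^(n / 2) * sqrt(det(C))))
}
mu <- c(0, 1)
C  <- matrix(c(1, 0.5, 0.5, 2), 2, 2)   # a symmetric positive definite covariance
dmvn(c(0.2, 0.8), mu, C)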

Gaussian processes (illustration)

[Figure: n = 3 case, with Y₁, Y₂, Y₃ at positions x¹, x², x³ and means μ₁, μ₂, μ₃.]

The Cᵢⱼ's quantify the (linear) couplings between Yᵢ and Yⱼ. Another possible illustration: the contour lines of p(y) are ellipses whose principal axes are the eigenvectors/eigenvalues of C⁻¹.

Special case : no spatial covariance

Y(x) ∼ N(μ(x), σ(x)²), i.e., GPs generalize "trend with white noise" regression: Y(x) = μ(x) + Ε(x), where Ε(x) ∼ N(0, σ(x)²).

At n observation points,   Y = (Y(x¹), …, Y(xⁿ))ᵀ ∼ N( μ , diag(σ₁², …, σₙ²) )

Example: μ(x) = x², σ(x) = 1
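A minimal R sketch of this special case with the example above, μ(x) = x² and σ(x) = 1 (the grid and sample size are arbitrary illustration choices):

n  <- 50
x  <- seq(0, 2, length.out = n)
mu <- x^2                     # trend mu(x) = x^2
y  <- mu + rnorm(n, sd = 1)   # one realization of trend + white noise
plot(x, y); lines(x, mu)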

Numerical simulation of a GP

To plot GP trajectories (in 1 and 2 dimensions), perform the eigen-analysis

C = U D² Uᵀ and then notice that Y = μ + U D Ε,   Ε ∼ N(0, I)

(Exercise: prove that E[Y] = μ and Cov(Y) = C; recall that a Gaussian vector is fully determined by its first two moments.)

In R,

Ceig <- eigen(C)
y <- mu + Ceig$vectors %*% diag(sqrt(Ceig$values)) %*% matrix(rnorm(n))

(Cf. the previous illustrations; the functions were plotted with large n's.)

More efficient implementations: cf. Carsten Proppe's class, slide "Discretization of random processes".
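The snippet above assumes that C, mu and n already exist. Below is a self-contained sketch that samples a few trajectories of a 1D GP with a Gaussian kernel; sigma, theta and the grid are illustration values, not taken from the slides:

n     <- 200
x     <- seq(0, 1, length.out = n)
sigma <- 1
theta <- 0.2
mu    <- rep(0, n)
C <- sigma^2 * exp(-outer(x, x, "-")^2 / (2 * theta^2))   # Gaussian covariance matrix
Ceig <- eigen(C, symmetric = TRUE)
vals <- pmax(Ceig$values, 0)   # guard against tiny negative eigenvalues (round-off)
Y <- sapply(1:3, function(i)   # three trajectories, cf. the earlier illustration
  as.vector(mu + Ceig$vectors %*% (sqrt(vals) * rnorm(n))))
matplot(x, Y, type = "l", lty = 1)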

Definition of covariance functions

How do we build C and μ, and account for the data points (xⁱ, yᵢ)? → Start with C.

The covariance function, a.k.a. kernel, Cov(Y(x), Y(x′)) = C(x, x′), is a function of x and x′ only: C : X × X → ℝ.

The kernel defines the covariance matrix through

Cᵢⱼ = C(xⁱ, xʲ)

Valid covariance functions

Not all functions C : ℝᵈ × ℝᵈ → ℝ are valid kernels. Kernels must yield positive semidefinite covariance matrices C: ∀ u ∈ ℝⁿ, uᵀ C u ⩾ 0

( but C(x,x') < 0 may happen )

Functional view (cf. Mercer's theorem): let φᵢ be the square-integrable eigenfunctions of the kernel and λᵢ ≥ 0 the associated eigenvalues. Then

C(x, x′) = Σᵢ₌₁ᴺ λᵢ φᵢ(x) φᵢ(x′),   with N = ∞ in general but finite for degenerate kernels.

Interpretation: kernels actually work in an N-dimensional (possibly infinite) feature space where a point is (φ₁(x), …, φ_N(x))ᵀ.

Stationary covariance functions

The covariance function depends only on τ = x − x′, not on the position in space:
Cov(Y(x), Y(x′)) = C(x, x′) = C(x − x′) = C(τ),   C(0) = σ²,   C(τ) = σ² R(τ)   (R is the correlation function)

Example : the squared exponential covariance function (Gaussian)

Cov(Y(x), Y(x′)) = C(x − x′) = σ² exp( −Σᵢ₌₁ᵈ |xᵢ − x′ᵢ|² / (2θᵢ²) ) = σ² Πᵢ₌₁ᵈ exp( −|τᵢ|² / (2θᵢ²) )

Note: θᵢ is a length scale ≡ bandwidth (SVM class) ≡ 1/γ.

Gaussian kernel, illustration

Cov(Y(x), Y(x′)) = C(x − x′) = σ² exp( −Σᵢ₌₁ᵈ |xᵢ − x′ᵢ|² / (2θᵢ²) )

[Figures: trajectories y(x), and the kernel C(x, x′) as a function of ‖x − x′‖.]

The regularity and content of the trajectories are controlled by the covariance function. The θᵢ's act as length scales.

Regularity of covariance functions

For stationary processes, the trajectories y(x) are p times differentiable (in the mean-square sense) if C(τ) is 2p times differentiable at τ = 0.

→The properties of C(τ) at τ=0 define the regularity of the process.

Example: trajectories with Gaussian kernels are infinitely differentiable (very, perhaps unrealistically, smooth).

Recycling covariance functions

The product of kernels is a kernel:

C(x, x′) = Πᵢ₌₁ᴺ Cᵢ(x, x′)

(example: kernels in dimension d > 1 built as products of 1D kernels, like the Gaussian kernel)

The sum of kernels is a kernel:

C(x, x′) = Σᵢ₌₁ᴺ Cᵢ(x, x′)

Let W(x) = a(x) Y(x), where a(x) is a deterministic function. Then Cov(W(x), W(x′)) = a(x) C(x, x′) a(x′).
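These closure properties can be checked numerically. A hedged sketch: build a Gaussian and an exponential kernel on a random 1D design and verify that their elementwise product and sum still give positive semidefinite matrices (all names and values are mine):

set.seed(1)
x <- sort(runif(30))
k_gauss <- function(a, b, theta = 0.2) exp(-(a - b)^2 / (2 * theta^2))
k_exp   <- function(a, b, theta = 0.3) exp(-abs(a - b) / theta)
C1 <- outer(x, x, k_gauss)
C2 <- outer(x, x, k_exp)
min(eigen(C1 * C2, symmetric = TRUE)$values)   # product of kernels: >= 0 up to round-off
min(eigen(C1 + C2, symmetric = TRUE)$values)   # sum of kernels: >= 0 up to round-off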

Examples of stationary kernels

(the ones implemented in the DiceKriging R package)

General form:  C(x, x′) = σ² Πᵢ₌₁ᵈ R(|xᵢ − x′ᵢ|)

Gaussian:  R(τ) = exp( −τ² / (2θ²) )   (infinitely differentiable trajectories)

Matérn ν = 5/2:  R(τ) = (1 + √5|τ|/θ + 5τ²/(3θ²)) exp( −√5|τ|/θ )   (twice differentiable trajectories)

Matérn ν = 3/2:  R(τ) = (1 + √3|τ|/θ) exp( −√3|τ|/θ )   (once differentiable trajectories)

Power-exponential:  R(τ) = exp( −(|τ|/θ)ᵖ ),  0 < p ≤ 2   (trajectories not differentiable, except for p = 2)

Matérn ν = 5/2 is the default choice. These kernels are functions of d+1 hyperparameters, σ and the θᵢ's, to be learned from data.
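As an illustration of the general (tensorized) form above, a sketch of the Matérn ν = 5/2 covariance in d dimensions, with one length scale per coordinate (the function and variable names are mine, not DiceKriging's API):

r_matern52 <- function(tau, theta) {
  # 1D Matern 5/2 correlation R(tau)
  a <- sqrt(5) * abs(tau) / theta
  (1 + a + a^2 / 3) * exp(-a)
}
cov_matern52 <- function(X1, X2, sigma2, theta) {
  # X1: n1 x d matrix, X2: n2 x d matrix, theta: length-d vector of length scales
  R <- matrix(1, nrow(X1), nrow(X2))
  for (i in 1:ncol(X1))
    R <- R * r_matern52(outer(X1[, i], X2[, i], "-"), theta[i])
  sigma2 * R   # C(x, x') = sigma^2 * prod_i R(|x_i - x'_i|)
}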

Tuning of hyperparameters

Our first use of the observations (X, Y), where X = (x¹, …, xⁿ) and Y = (Y(x¹), …, Y(xⁿ))ᵀ.

Three paths to selecting the hyperparameters (σ and the θᵢ's): maximum likelihood (discussed here), cross-validation, and Bayesian approaches.

Current model: Y(x) ∼ N(μ(x), C(x, x)), or equivalently Y(x) = μ(x) + Z(x), where μ(x) is known (deterministic) and Z(x) ∼ N(0, C(x, x)).

Maximum of likelihood estimate (1/2)

Likelihood: the probability (density) of the observations, seen as a function of the hyperparameters,

L(σ, θ) = p(y | σ, θ) = 1 / ( (2π)^{n/2} det^{1/2}(C) ) · exp( −½ (y−μ)ᵀ C⁻¹ (y−μ) )

where Cᵢⱼ = C(xⁱ, xʲ; σ, θ) = σ² R(xⁱ, xʲ; θ) and μᵢ = μ(xⁱ).

max_{σ,θ} L(σ, θ)  ⇔  min_{σ,θ} −log L(σ, θ) =: mLL(σ, θ)

Note: compare the likelihood L(σ, θ) to the regularized cost of the SVM class: L(σ, θ) = reg_terms(σ, θ) × exp(−loss_function(σ, θ)).

Maximum of likelihood estimate (2/2)

mLL(σ, θ) = (n/2) log(2π) + n log(σ) + ½ log(det(R)) + (σ⁻²/2) (y−μ)ᵀ R⁻¹ (y−μ)

∂mLL/∂σ = 0  ⇒  σ̂² = (1/n) (y−μ)ᵀ R(θ)⁻¹ (y−μ)

However, the calculations cannot be carried out analytically for θ.

So, numerically, minimize the “concentrated” likelihood,

min_{θ ∈ [θmin, θmax]} mLL(σ̂(θ), θ),   where θmin > 0

A nonlinear, multimodal optimization problem. In R / DiceKriging, it is solved by a mix of evolutionary (global) and BFGS (local) algorithms.
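A toy R sketch of this concentrated likelihood, written here with a 1D Gaussian correlation and a known constant trend mu, and minimized with R's built-in optimize instead of the evolutionary + BFGS mix (the data, bounds and jitter are illustrative assumptions):

mLL_conc <- function(theta, x, y, mu) {
  # concentrated negative log-likelihood: sigma^2 is profiled out analytically
  n <- length(y)
  R <- exp(-outer(x, x, "-")^2 / (2 * theta^2))   # Gaussian correlation matrix
  Rj <- R + 1e-10 * diag(n)                       # small jitter for stability
  Rinv_r <- solve(Rj, y - mu)
  sigma2_hat <- sum((y - mu) * Rinv_r) / n        # analytic sigma^2(theta)
  0.5 * (n * log(2 * pi) + n * log(sigma2_hat) +
         as.numeric(determinant(Rj)$modulus) + n)
}
set.seed(2)
x <- runif(20); y <- sin(6 * x) + 0.1 * rnorm(20)
opt <- optimize(mLL_conc, interval = c(0.01, 2), x = x, y = y, mu = 0)
opt$minimum   # theta_hat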

Where do we stand? Y(x) is Gaussian with known mean μ(x), and we have learned the covariance C(x − x′).

[Figure: μ(x), the bands μ(x) ± 1.95 σ̂, and one trajectory y(x) over x.]

Next, use the observations (X, Y = y) further: make the model interpolate them → the last step towards kriging.

Conditioning of a Gaussian random vector

Let U and V be jointly Gaussian random vectors,

[U; V] ∼ N( [μ_U; μ_V] , [[C_U, C_UV], [C_UVᵀ, C_V]] )

then the conditional distribution of U knowing V = v is

U | V = v ∼ N( μ_U + C_UV C_V⁻¹ (v − μ_V) ,  C_U − C_UV C_V⁻¹ C_UVᵀ )
(the first term is the conditional mean, the second the conditional covariance)

✔ the conditional distribution is still Gaussian
✔ the conditional covariance does not depend on the observations v

and this is all we need ...

Simple kriging

Apply the conditioning result to the vector [Y(x*); Y], where Y = (Y(x¹), …, Y(xⁿ))ᵀ:

[Y(x*); Y] ∼ N( [μ(x*); μ] , [[σ², C(x*, X)], [C(x*, X)ᵀ, C]] )

which directly yields the simple kriging* formulas

Y (x *)|Y = y ∼ N ( mSK (x *) , vSK ( x *))

m_SK(x*) = μ(x*) + C(x*, X) C⁻¹ (y − μ)   (identical to the LS-SVR formula?)

v_SK(x*) = σ² − C(x*, X) C⁻¹ C(x*, X)ᵀ

* simple kriging: the trend (μ(x*), μ) is assumed to be known.
For prediction at one point, C(x*, x*) = σ² R(x* − x*) = σ².
For joint prediction at many points, the same formulas hold with x* → X*, σ² → C(X*, X*), and v_SK(x*) → C_SK(X*, X*) = C_SK.
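The two formulas above translate directly into R. A hedged sketch of a simple kriging predictor with a Gaussian kernel, a known constant trend mu and hand-set hyperparameters (all names and values are mine):

simple_kriging <- function(xstar, x, y, mu, sigma2, theta) {
  k  <- function(a, b) sigma2 * exp(-outer(a, b, "-")^2 / (2 * theta^2))
  C  <- k(x, x) + 1e-10 * diag(length(x))   # jitter for numerical stability
  Cs <- k(xstar, x)                         # C(x*, X), one row per prediction point
  m  <- mu + Cs %*% solve(C, y - mu)                # m_SK(x*)
  v  <- sigma2 - rowSums(Cs * t(solve(C, t(Cs))))   # v_SK(x*)
  list(mean = as.vector(m), var = pmax(as.vector(v), 0))
}
x <- c(0.1, 0.4, 0.7, 0.9); y <- sin(6 * x)
sk <- simple_kriging(seq(0, 1, 0.01), x, y, mu = 0, sigma2 = 1, theta = 0.2)
# at the data points the mean interpolates y and the variance is ~0 (cf. the next slides)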

Simple kriging (illustration)

[Figure: m_SK(x), the bands m_SK(x) ± 1.95 √v_SK(x), and one sample trajectory y_SK(x) over x.]

Matérn 5/2 kernel, constant trend μ(x) = −0.2, θ ≈ 0.02 by MLE. Not a good statistical model for these data.

Simple kriging (other illustration)

Since Y_SK ∼ N(m_SK, C_SK), kriging trajectories can be sampled.

[Figure: conditional trajectories over x.] Matérn 5/2 kernel, constant trend μ(x) = −0.2, θ = 0.5 fixed a priori.
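A sketch of such conditional sampling, reusing the eigen-decomposition trick of the earlier slide on the simple kriging mean vector and covariance matrix (the trend and θ follow the caption above, but a Gaussian kernel stands in for the Matérn 5/2 and the data are illustrative; DiceKriging's simulate() method provides conditional simulations directly):

k  <- function(a, b, sigma2 = 1, theta = 0.5)
  sigma2 * exp(-outer(a, b, "-")^2 / (2 * theta^2))   # Gaussian kernel stand-in
x  <- c(0.1, 0.4, 0.7, 0.9); y <- sin(6 * x); mu <- -0.2
xs <- seq(0, 1, length.out = 200)
C   <- k(x, x) + 1e-10 * diag(length(x))
Cs  <- k(xs, x)
mSK <- mu + Cs %*% solve(C, y - mu)          # conditional mean vector
CSK <- k(xs, xs) - Cs %*% solve(C, t(Cs))    # conditional covariance matrix
e   <- eigen(CSK, symmetric = TRUE); vals <- pmax(e$values, 0)
Ysim <- sapply(1:5, function(i)
  as.vector(mSK + e$vectors %*% (sqrt(vals) * rnorm(length(xs)))))
matplot(xs, Ysim, type = "l", lty = 1); points(x, y, pch = 19)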

Interpolation properties

● The kriging mean and trajectories interpolate the data points

● The kriging variance is null at the data points

Proof for v_SK: C(xⁱ, X)ᵀ = Cⁱ, the i-th column of C.  C⁻¹ C = I ⇒ C⁻¹ Cⁱ = eⁱ, the i-th basis vector. Hence
v_SK(xⁱ) = σ² − Cⁱᵀ C⁻¹ Cⁱ = σ² − Cⁱᵀ eⁱ = σ² − C(xⁱ, xⁱ) = 0

The same idea gives m_SK(xⁱ) = yᵢ.

Y(xⁱ) | Y = y ∼ N(yᵢ, 0), thus the trajectories are interpolating.

Other paths towards kriging

We just used a Bayesian approach (GP conditioning) to justify kriging.

Often, kriging is introduced as a best linear unbiased estimator:
linear:    Ŷ(x) = Σᵢ₌₁ⁿ λᵢ(x) Yᵢ = λ(x)ᵀ Y
unbiased:  E Ŷ(x) = λ(x)ᵀ E Y = E Y(x) = μ(x)
best:      λ(x) = arg min_{λ ∈ ℝⁿ} E( ‖Ŷ(x) − Y(x)‖² )

This constrained optimization problem is solved in λ(x), and the kriging equations are recovered through
m_K(x) = E(Ŷ(x) | Y = y)   and   v_K(x) = E( (Ŷ(x) − Y(x))² )

But the link with the GP interpretation is typically not discussed:
E(Y(x) | Y = y) =? E(Ŷ(x) | Y = y),   E( (Y(x) − μ(x))² | Y = y ) =? E( (Ŷ(x) − Y(x))² )

RKHS view: kriging as the minimum-norm interpolator in the Reproducing Kernel Hilbert Space generated by C(·,·). Cf. Nicolas Durrande's PhD thesis.

Universal kriging (1/4)

Finally, let us learn the trend within the same framework.

The UK statistical model is
Y(x) = Σᵢ₌₁ᵖ aᵢ(x) βᵢ + Z(x) = a(x)ᵀ β + Z(x)
where Z(x) ∼ N(0, C(x, x)) and β ∼ N(b, B), i.e., a Gaussian prior is put on the trend weights.

Consider the Gaussian vector [Y*; Y], where Y* = (Y(x*¹), …, Y(x*ᵐ))ᵀ gathers the new points and Y = (Y(x¹), …, Y(xⁿ))ᵀ the observation points, and note
A = (a(x¹)ᵀ; …; a(xⁿ)ᵀ)   and   A* = (a(x*¹)ᵀ; …; a(x*ᵐ)ᵀ).

Universal kriging (2/4)

[Y*; Y] ∼ N( [A* b; A b] , [[C* + A* B A*ᵀ,  C(X*, X) + A* B Aᵀ], [C(X, X*) + A B A*ᵀ,  C + A B Aᵀ]] )

and apply the Gaussian vector conditioning formula (see the earlier slide). The kriging mean is the conditional mean,

m(X*) = A* b + (C(X*, X) + A* B Aᵀ) (C + A B Aᵀ)⁻¹ (y − A b)

and the kriging covariance is the conditional covariance,

v(X*) = C* + A* B A*ᵀ − (C(X*, X) + A* B Aᵀ) (C + A B Aᵀ)⁻¹ (C(X, X*) + A B A*ᵀ)

The universal kriging formulas are obtained by taking the limit of m(X*) and v(X*) when the prior on the weights becomes non-informative: B = λ B̄, λ ∈ ℝ,

lim_{λ→∞, b=0} m(X*) = m_UK(X*),   lim_{λ→∞, b=0} v(X*) = v_UK(X*)

Universal kriging (3/4)

Proof for m_UK(·). Two matrix inversion lemmas are used:

(A⁻¹ + B⁻¹)⁻¹ = A − A (A + B)⁻¹ A                                   (1)
(Z + U W Vᵀ)⁻¹ = Z⁻¹ − Z⁻¹ U (W⁻¹ + Vᵀ Z⁻¹ U)⁻¹ Vᵀ Z⁻¹              (2)

By (2),  (C + A B Aᵀ)⁻¹ = C⁻¹ − C⁻¹ A (B⁻¹ + Aᵀ C⁻¹ A)⁻¹ Aᵀ C⁻¹
By (1),  = C⁻¹ − C⁻¹ A [ (Aᵀ C⁻¹ A)⁻¹ − (Aᵀ C⁻¹ A)⁻¹ ((Aᵀ C⁻¹ A)⁻¹ + B)⁻¹ (Aᵀ C⁻¹ A)⁻¹ ] Aᵀ C⁻¹
⇒ A* B Aᵀ (C + A B Aᵀ)⁻¹ y = A* B ((Aᵀ C⁻¹ A)⁻¹ + B)⁻¹ (Aᵀ C⁻¹ A)⁻¹ Aᵀ C⁻¹ y  →  A* β̂  as λ → ∞   (3)

C(X*, X) (C + A B Aᵀ)⁻¹ y = C(X*, X) [ C⁻¹ − C⁻¹ A (B⁻¹ + Aᵀ C⁻¹ A)⁻¹ Aᵀ C⁻¹ ] y  →  C(X*, X) C⁻¹ (y − A β̂)  as λ → ∞   (4)

(3) + (4) yield m_UK(·). □

Universal kriging (4/4)

m_UK(X*) = A(X*) β̂  +  C(X*, X) C⁻¹ (y − A β̂)
(the first term is the linear local part, the second the correcting part)

where β̂ = (Aᵀ C⁻¹ A)⁻¹ Aᵀ C⁻¹ y

v_UK(X*) = C* − C(X*, X) C⁻¹ C(X, X*)  +  Uᵀ (Aᵀ C⁻¹ A)⁻¹ U
(the first part is v_SK, the second is an additional part due to the estimation of β)

where U = Aᵀ C⁻¹ C(X, X*) − A(X*)ᵀ

Note : Ordinary kriging = universal kriging with constant trend
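The UK formulas translate almost line by line into R. A hedged sketch with a constant trend basis a(x) = 1 (i.e., ordinary kriging, per the note above), a Gaussian kernel and hand-set hyperparameters (all names and values are mine):

universal_kriging <- function(xstar, x, y, sigma2, theta) {
  k  <- function(a, b) sigma2 * exp(-outer(a, b, "-")^2 / (2 * theta^2))
  A  <- matrix(1, length(x), 1)       # a(x)^T = (1): constant trend
  As <- matrix(1, length(xstar), 1)
  C  <- k(x, x) + 1e-10 * diag(length(x))
  Cs <- k(xstar, x)
  CinvA    <- solve(C, A)
  beta_hat <- solve(t(A) %*% CinvA, t(CinvA) %*% y)            # GLS trend estimate
  m <- As %*% beta_hat + Cs %*% solve(C, y - A %*% beta_hat)   # m_UK(X*)
  U <- t(A) %*% solve(C, t(Cs)) - t(As)                        # U = A^T C^-1 C(X, X*) - A(X*)^T
  v <- sigma2 - rowSums(Cs * t(solve(C, t(Cs)))) +
       colSums(U * solve(t(A) %*% CinvA, U))                   # v_SK + trend-estimation term
  list(mean = as.vector(m), var = pmax(as.vector(v), 0))
}
x <- c(0.1, 0.4, 0.7, 0.9); y <- sin(6 * x)
uk <- universal_kriging(seq(0, 1, 0.01), x, y, sigma2 = 1, theta = 0.2)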

Universal kriging (illustration)

[Figure: m_UK(x), the bands m_UK(x) ± 1.95 √v_UK(x), and one sample trajectory y_UK(x) over x.]

Matérn 5/2 kernel, quadratic trend a(x)ᵀ = (1, x, x²), θ ≈ 0.02 by MLE.

Illustrative comparison of krigings

Comparison of the kriging mean ± 1.95 std. dev. for simple, ordinary and universal kriging (SK, OK, UK, respectively). Matérn 5/2 kernel, linear trend a(x)ᵀ = (1, x) for UK, MLE estimation.

[Figures: the three kriging predictions over x.]
θ̂_SK = 0.5, v̂_SK = 1.2;  θ̂_OK = 0.5, v̂_OK = 1.2;  θ̂_UK = 0.01, v̂_UK = 0.02.
Note how SK and OK go back to the average, how the OK uncertainty is slightly larger than that of SK, and how the UK uncertainty is small because of the small estimated σ.

Kriging in more than 1D

Sample code with the DiceKriging R package:

library(DiceKriging)
library(lhs)
n <- 80
d <- 6
X <- optimumLHS(n, d)   # space-filling (Latin hypercube) design
X <- data.frame(X)
y <- apply(X, 1, hartman6)   # 6D Hartman test function (provided by DiceKriging)
mlog <- km(design = X, response = -log(-y))   # fit a kriging model
plot(mlog)

Rough sensitivity analysis from kriging

Estimated length scales (the θᵢ's) can serve for a rapid sensitivity analysis: the larger θᵢ, the less the function varies along xᵢ, i.e., the lower the sensitivity.

Example in 4D.

True function: f(x) = 2 + x₁ − 3x₃² + x₁x₃³ + sin²(3x₁x₄)

n <- 80
d <- 4
X <- optimumLHS(n, d)
X <- data.frame(X)
y <- apply(X, 1, truefunc)   # truefunc implements f above
model <- km(design = X, response = y, upper = c(5, 5, 5, 5))   # upper bounds on the theta's

θ̂₁ = 1.2, θ̂₂ = 5, θ̂₃ = 2.9, θ̂₄ = 1.7  ⇒  one checks that the function is the least sensitive to x₂ and the most sensitive to x₁.

Links with other methods (1/2)

SVR

LS-SVR ≈ m_K, cf. the previous comments, but the regularization control (the SVR parameter C) is embedded in the likelihood.

Bayesian linear regression

Y(x) = Σᵢ₌₁ᴺ Wᵢ φᵢ(x) + Ε = φ(x)ᵀ W + Ε

where φ(x) = (φ₁(x), …, φ_N(x))ᵀ,  W ∼ N(0, Σ_w),  Ε ∼ N(0, σₙ²).

The posterior distribution of the model follows the SK equations,

(φ(x)ᵀ W | Y = y) ∼ N( m_SK(x), v_SK(x) )

provided C(x, x′) = φ(x)ᵀ Σ_w φ(x′)   (GPML, p. 12)

⇒ kriging is equivalent to Bayesian linear regression in the N-dimensional (possibly infinite) feature space (φ₁(x), …, φ_N(x)).

Links with other methods (2/2)

Bayesian linear regression (continued): case of the Gaussian kernel

The feature space associated with the Gaussian kernel is an infinite series of Gaussians with varying centers:

C(x, x′) = σ² exp( −‖x − x′‖² / (2θ²) ) = lim_{N→∞} (σ²/N) Σ_{c=1}^N φ_c(x) φ_c(x′),
where φ_c(x) = exp( −‖x − c‖² / θ² )   (see GPML, p. 84)

Kriging issues

C not invertible

This happens when there are linear dependencies between the covariances of subsets of data points, in particular when data points are close to each other with respect to the covariance function (frequent with the very smooth Gaussian kernel). Solution: regularization, either by adding a small positive quantity to the diagonal or by replacing C⁻¹ with the pseudo-inverse C†. (cf. Mohammadi et al., 2014)

Too many data points

Inverting the covariance matrix becomes impractical beyond about 1000 data points. Solution: regional kriging, still a research issue.

Choosing the covariance function

Maximum likelihood is a multimodal optimization problem. More generally, how should the trend and covariance model be chosen?

Bibliography

O'Hagan, A., Bayesian analysis of computer code outputs: a tutorial, Reliability Engineering & System Safety, 91, pp. 1290-1300, 2006.

Mohammadi, H., Le Riche, R., Touboul, E. and Bay, X., An analytic comparison of regularization methods for Gaussian processes, submitted, 2014.

Rasmussen, C.E., and Williams, C.K.I., Gaussian Processes for Machine Learning, (GPML), the MIT Press, 2006.

Roustant, O., Ginsbourger, D. and Deville, Y., DiceKriging, DiceOptim: Two R Packages for the Analysis of Computer Experiments by Kriging-Based Metamodeling and Optimization, Journal of Statistical Software, 51(1), pp. 1-55, 2012.

Compatibility note

From now on, Bruno Sudret will write
m_K(x) → μ_Ŷ(x),   v_K(x) → σ_Ŷ²(x)