Gaussian Process Models in Spatial Data Mining

NAREN RAMAKRISHNAN¹, CHRIS BAILEY-KELLOGG²
¹ Department of Computer Science, Virginia Tech, Blacksburg, VA, USA
² Department of Computer Science, Dartmouth College, Hanover, NH, USA

Synonyms

Active data mining

Definition

Gaussian processes (GPs) are local approximation techniques that model spatial data by placing (and updating) priors on the covariance structures underlying the data. Originally developed for geo-spatial contexts, they are also applicable in general contexts that involve computing and modeling with multi-level spatial aggregates, e.g., modeling a configuration space for crystallographic design, casting folding energies as a function of a protein's contact map, and formulating vaccination policies that take into account the social dynamics of individuals. Typically, we assume a parametrized covariance structure underlying the data to be modeled. We estimate the covariance parameters conditional on the locations for which we have observed data, and use the inferred structure to make predictions at new locations. GPs have a probabilistic basis that allows us to estimate variances at unsampled locations, aiding in the design of targeted sampling strategies.

Historical Background

The underlying ideas behind GPs can be traced back to the geostatistics technique called kriging [4], named after the South African mining engineer Danie Krige. Kriging in this literature was used to model response variables (e.g., ozone concentrations) over 2D spatial fields as realizations of a stochastic process. Sacks et al. [12] described the use of kriging to model (deterministic) computer experiments. It took more than a decade from this point for the larger computer science community to investigate GPs for pattern analysis purposes. Thus, in the recent past, GPs have witnessed a revival, primarily due to work in the statistical pattern recognition community [5] and the graphical models literature [3]. Neal established the connection between Gaussian processes and neural networks with an infinite number of hidden units [8]. Such relationships allow us to take traditional learning techniques and re-express them as imposing a particular covariance structure on the joint distribution of inputs. For instance, we can take a trained neural network and mine the covariance structure implied by the weights (given mild assumptions such as a Gaussian prior over the weight space). Williams motivates the usefulness of such studies and describes common covariance functions [14]. Williams and Barber [15] describe how the Gaussian process framework can be extended to classification, in which the modeled variable is categorical. Since these publications, interest in GPs has exploded, with a rapid stream of publications in conferences such as ICML and NIPS; see also the recently published book by Rasmussen and Williams [11].

Scientific Fundamentals

A GP can be formally defined as a collection of random variables, any finite subset of which has a (multivariate) normal distribution. For simplicity, we assume 2D spatially distributed (scalar) response variables t_i, one for each location x_i = [x_{i1}, x_{i2}] where we have collected a data sample. Observe that, in the limiting case, each random variable has a Gaussian distribution (but it is not true that any collection of Gaussian random variables will induce a GP).
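To make the defining property concrete, the following minimal sketch (not part of the original article) draws joint realizations of the response variables at a small set of 2D locations; it assumes NumPy, uses the squared-exponential covariance of Eq. (1) below, and picks the parameter values purely for illustration.

```python
import numpy as np

# Illustrative sketch (not from the article): the defining property of a GP
# is that the response variables at any finite set of locations are jointly
# (multivariate) normal.  We build the covariance matrix over a handful of
# 2D locations using the squared-exponential prior of Eq. (1) below
# (alpha, a1, a2 are arbitrary illustrative values) and draw joint samples.

def cov(xi, xj, alpha=1.0, a=(1.0, 1.0)):
    """Squared-exponential covariance between two 2D locations (cf. Eq. 1)."""
    d2 = sum(ak * (xik - xjk) ** 2 for ak, xik, xjk in zip(a, xi, xj))
    return alpha * np.exp(-0.5 * d2)

# A finite collection of 2D sample locations x_i = [x_i1, x_i2].
X = np.array([[0.0, 0.0], [0.5, 0.2], [1.0, 1.0], [2.0, 0.5]])
n = len(X)

# Covariance matrix Cov_n over these locations (zero-mean prior assumed).
Cov_n = np.array([[cov(X[i], X[j]) for j in range(n)] for i in range(n)])

# Any such finite subset is multivariate normal; draw three joint realizations.
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mean=np.zeros(n), cov=Cov_n, size=3)
print(samples)  # each row is one realization of (t_1, ..., t_4)
```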
Given a dataset D = {x_i, t_i}, i = 1 ... n, and a new data point x_{n+1}, a GP can be used to model the posterior P(t_{n+1} | D, x_{n+1}) (which would also be a Gaussian). This is essentially what many Bayesian modeling techniques do (e.g., least squares approximation with normally distributed noise); however, it is the specifics of how the posterior is modeled that make GPs distinct as a class of modeling techniques.

To make a prediction of t_{n+1} at a point x_{n+1}, GPs place greater reliance on t_i's from nearby points. This reliance is specified in the form of a covariance prior for the process. One example of a covariance prior is:

$$\mathrm{Cov}(t_i, t_j) = \alpha \exp\left(-\frac{1}{2}\sum_{k=1}^{2} a_k (x_{ik} - x_{jk})^2\right). \tag{1}$$

Intuitively, this function captures the notion that response variables at nearby points must have high correlation. In Eq. 1, α is an overall scaling term, whereas a_1, a_2 define the length scales for the two dimensions. However, this prior (or even its posterior) does not directly allow us to determine t_j from t_i, since the structure only captures the covariance; predictions of a response variable for new sample locations are thus conditionally dependent on the measured response variables and their sample locations. Hence, we must first estimate the covariance parameters (a_1, a_2, and α) from D, and then use these parameters along with D to predict t_{n+1} at x_{n+1}.

Before covering the learning procedure for the covariance parameters (a_1, a_2, and α), it is helpful to develop expressions for the posterior of the response variable in terms of these parameters. Since the joint probability density function of the response variables P(t_1, t_2, ..., t_{n+1}) is modeled as Gaussian (we will assume a mean of zero), we can write:

$$P(t_1, t_2, \ldots, t_{n+1} \mid x_1, x_2, \ldots, x_{n+1}, \mathrm{Cov}_{n+1}) = \frac{1}{\lambda_1} \exp\left(-\frac{1}{2}\,[t_1, t_2, \ldots, t_{n+1}]\,\mathrm{Cov}_{n+1}^{-1}\,[t_1, t_2, \ldots, t_{n+1}]^{T}\right),$$

where we ignore λ_1 as it is simply a normalizing factor. Here, Cov_{n+1} is the covariance matrix formed from the (n + 1) data values (x_1, x_2, ..., x_{n+1}). A distribution for the unknown variable t_{n+1} can then be obtained as:

$$P(t_{n+1} \mid t_1, t_2, \ldots, t_n, x_1, x_2, \ldots, x_{n+1}, \mathrm{Cov}_{n+1}) = \frac{P(t_1, t_2, \ldots, t_{n+1} \mid x_1, x_2, \ldots, x_{n+1}, \mathrm{Cov}_{n+1})}{P(t_1, t_2, \ldots, t_n \mid x_1, x_2, \ldots, x_{n+1}, \mathrm{Cov}_{n+1})} = \frac{P(t_1, t_2, \ldots, t_{n+1} \mid x_1, x_2, \ldots, x_{n+1}, \mathrm{Cov}_{n+1})}{P(t_1, t_2, \ldots, t_n \mid x_1, x_2, \ldots, x_n, \mathrm{Cov}_n)},$$

where the last step follows by conditional independence of {t_1, t_2, ..., t_n} w.r.t. x_{n+1} and the part of Cov_{n+1} not contained in Cov_n. The denominator in the above expression is another Gaussian random variable given by:

$$P(t_1, t_2, \ldots, t_n \mid x_1, x_2, \ldots, x_n, \mathrm{Cov}_n) = \frac{1}{\lambda_2} \exp\left(-\frac{1}{2}\,[t_1, t_2, \ldots, t_n]\,\mathrm{Cov}_{n}^{-1}\,[t_1, t_2, \ldots, t_n]^{T}\right).$$

Putting it all together, we get:

$$P(t_{n+1} \mid t_1, t_2, \ldots, t_n, x_1, x_2, \ldots, x_{n+1}, \mathrm{Cov}_{n+1}) = \frac{\lambda_2}{\lambda_1} \exp\left(-\frac{1}{2}\,[t_1, t_2, \ldots, t_{n+1}]\,\mathrm{Cov}_{n+1}^{-1}\,[t_1, t_2, \ldots, t_{n+1}]^{T} + \frac{1}{2}\,[t_1, t_2, \ldots, t_n]\,\mathrm{Cov}_{n}^{-1}\,[t_1, t_2, \ldots, t_n]^{T}\right).$$

Computing the mean and variance of this Gaussian distribution, we get an estimate of t_{n+1} as:

$$\hat{t}_{n+1} = k^{T}\,\mathrm{Cov}_{n}^{-1}\,[t_1, t_2, \ldots, t_n]^{T}, \tag{2}$$

and our uncertainty in this estimate as:

$$\hat{\sigma}^{2}_{t_{n+1}} = k - k^{T}\,\mathrm{Cov}_{n}^{-1}\,k, \tag{3}$$

where k^T represents the n-vector of covariances with the new data point:

$$k^{T} = [\mathrm{Cov}(x_1, x_{n+1})\ \ \mathrm{Cov}(x_2, x_{n+1})\ \ \ldots\ \ \mathrm{Cov}(x_n, x_{n+1})],$$

and the scalar k is the (n + 1, n + 1) entry of Cov_{n+1}, i.e., Cov(x_{n+1}, x_{n+1}). Equations 2 and 3, together, give us both an approximation at any given point and an uncertainty in this approximation; they will serve as the basic building blocks for closing the loop between data modeling and higher-level mining functionality.

The above expressions can be alternatively derived by positing a linear probabilistic model and optimizing for the MSE (mean squared error) between observed and predicted response values (e.g., see [12]). In this sense, the Gaussian process model considered here is also known as the BLUE (best linear unbiased estimator), but GPs are not restricted to linear combinations of basis functions.
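The prediction equations (2) and (3) translate directly into code. The following is a minimal sketch (not from the article) assuming NumPy, the squared-exponential covariance of Eq. (1), and fixed illustrative values for (a_1, a_2, α); in practice these parameters are estimated from the data, as described next. The toy observations are synthetic.

```python
import numpy as np

# Sketch of the prediction equations (2) and (3), assuming NumPy and the
# squared-exponential covariance of Eq. (1).  The parameters (a1, a2, alpha)
# are fixed to illustrative values here; in practice they are estimated
# from the data.

def cov(xi, xj, alpha, a):
    d2 = np.sum(a * (xi - xj) ** 2)          # sum_k a_k (x_ik - x_jk)^2
    return alpha * np.exp(-0.5 * d2)

def gp_predict(X, t, x_new, alpha=1.0, a=np.array([1.0, 1.0])):
    """Return (t_hat, sigma2_hat) for a new location x_new (Eqs. 2 and 3)."""
    n = len(X)
    # Cov_n: covariances among the n observed locations.
    Cov_n = np.array([[cov(X[i], X[j], alpha, a) for j in range(n)]
                      for i in range(n)])
    # k: n-vector of covariances between the observed locations and x_new.
    k = np.array([cov(X[i], x_new, alpha, a) for i in range(n)])
    # kappa: the (n+1, n+1) entry of Cov_{n+1}, i.e. Cov(x_new, x_new).
    kappa = cov(x_new, x_new, alpha, a)
    Cov_n_inv = np.linalg.inv(Cov_n)
    t_hat = k @ Cov_n_inv @ t                 # Eq. (2)
    sigma2_hat = kappa - k @ Cov_n_inv @ k    # Eq. (3)
    return t_hat, sigma2_hat

# Toy usage with synthetic observations (not from the article).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
t = np.array([0.3, -0.1, 0.8, 0.4])
print(gp_predict(X, t, np.array([0.5, 0.5])))
```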
To apply GP modeling to a given dataset, one must first ensure that the chosen covariance structure matches the data characteristics. The above example used a stationary structure, which applies when the covariance is translation invariant. Various other functions have been studied in the literature (e.g., see [7,9,12]), all of which satisfy the required property of positive definiteness of a covariance matrix. The simplest covariance function yields a diagonal matrix, but this means that no data sample can have an influence on other locations, and the GP approach offers no particular advantages. In general, by placing a prior directly on the function space, GPs are appropriate for modeling 'smooth' functions. The terms a_1, a_2 capture how quickly the influence of a data sample decays in each direction and, thus, the length scales for smoothness.

Gaussian Process Models in Spatial Data Mining, Figure 1: Active mining with Gaussian processes. An initial sample of data points (a; shown as red circles) gives a preliminary approximation to the target function (b). Active sampling suggests new locations (c; blue diamonds) that improve the quality of approximation (d).

An important point to note is that even though the GP realization is one of a random process, we can nevertheless build a GP model for deterministic functions by choosing a covariance structure that ensures the diagonal cor- […] including a constant term (gives another parameter to be estimated) in the covariance formulation.

Learning the GP parameters θ = (a_1, a_2, α) can be undertaken in the maximum likelihood (ML) and maximum a posteriori (MAP) frameworks, or in the true Bayesian setting where we obtain a distribution over values. The log-likelihood for the parameters is given by:

$$\mathcal{L} = \log P(t_1, t_2, \ldots, t_n \mid x_1, x_2, \ldots, x_n, \theta) = c + \log P(\theta) - \frac{n}{2}\log(2\pi) - \frac{1}{2}\log|\mathrm{Cov}_n| - \frac{1}{2}\,[t_1, t_2, \ldots, t_n]\,\mathrm{Cov}_{n}^{-1}\,[t_1, t_2, \ldots, t_n]^{T}.$$
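As a rough illustration of this learning step, the sketch below (not from the article) estimates θ numerically using NumPy and SciPy. It drops the prior term log P(θ), i.e., it computes an ML rather than a MAP estimate, and it optimizes over log-parameters so that a_1, a_2, and α stay positive; the data are synthetic.

```python
import numpy as np
from scipy.optimize import minimize

# Sketch of ML estimation of theta = (a1, a2, alpha), assuming NumPy/SciPy.
# We maximize the log-likelihood above (the prior term log P(theta) is
# dropped here, i.e. this is ML rather than MAP) by minimizing its negative
# over log-parameters, which keeps the parameters positive.

def cov_matrix(X, alpha, a):
    diff = X[:, None, :] - X[None, :, :]       # pairwise coordinate differences
    d2 = np.sum(a * diff ** 2, axis=-1)        # sum_k a_k (x_ik - x_jk)^2
    return alpha * np.exp(-0.5 * d2)

def neg_log_likelihood(log_theta, X, t):
    a1, a2, alpha = np.exp(log_theta)          # enforce positivity
    n = len(X)
    Cov_n = cov_matrix(X, alpha, np.array([a1, a2]))
    Cov_n += 1e-8 * np.eye(n)                  # small jitter for numerical stability
    sign, logdet = np.linalg.slogdet(Cov_n)
    quad = t @ np.linalg.solve(Cov_n, t)       # [t] Cov_n^{-1} [t]^T
    L = -0.5 * n * np.log(2 * np.pi) - 0.5 * logdet - 0.5 * quad
    return -L

# Toy usage with synthetic observations (not from the article).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
t = np.array([0.3, -0.1, 0.8, 0.4, 0.5])
res = minimize(neg_log_likelihood, x0=np.zeros(3), args=(X, t), method="L-BFGS-B")
a1, a2, alpha = np.exp(res.x)
print("estimated a1, a2, alpha:", a1, a2, alpha)
```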