The Gaussian Radon Transform and Machine Learning
IRINA HOLMES AND AMBAR N. SENGUPTA

Abstract. There has been growing recent interest in probabilistic interpretations of kernel-based methods as well as learning in Banach spaces. The absence of a useful Lebesgue measure on an infinite-dimensional reproducing kernel Hilbert space is a serious obstacle for such stochastic models. We propose an estimation model for the ridge regression problem within the framework of abstract Wiener spaces and show how the support vector machine solution to such problems can be interpreted in terms of the Gaussian Radon transform.

arXiv:1310.4794v2 [stat.ML] 13 Mar 2014
Date: March 2014.
2010 Mathematics Subject Classification. Primary 44A12; Secondary 28C20, 62J07.
Key words and phrases. Gaussian Radon transform, ridge regression, kernel methods, abstract Wiener space, Gaussian process.
Irina Holmes is a Dissertation Year Fellow at Louisiana State University, Department of Mathematics; [email protected].
Ambar Niel Sengupta is Hubert Butts Professor of Mathematics at Louisiana State University; [email protected].

1. Introduction

A central task in machine learning is the prediction of unknown values based on `learning' from a given set of outcomes. More precisely, suppose $\mathcal{X}$ is a non-empty set, the input space, and
\[
D = \{(p_1, y_1), (p_2, y_2), \ldots, (p_n, y_n)\} \subset \mathcal{X} \times \mathbb{R} \tag{1.1}
\]
is a finite collection of input values $p_j$ together with their corresponding real outputs $y_j$. The goal is to predict the value $y$ corresponding to a yet unobserved input value $p \in \mathcal{X}$. The fundamental assumption of kernel-based methods such as support vector machines (SVM) is that the predicted value is given by $\hat{f}(p)$, where the decision or prediction function $\hat{f}$ belongs to a reproducing kernel Hilbert space (RKHS) $H$ over $\mathcal{X}$, corresponding to a positive definite function $K \colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ (see Section 2.1.1). In particular, in ridge regression, the predictor function $\hat{f}_\lambda$ is the solution to the minimization problem:
\[
\hat{f}_\lambda = \arg\min_{f \in H} \left( \sum_{j=1}^{n} (y_j - f(p_j))^2 + \lambda \|f\|^2 \right), \tag{1.2}
\]
where $\lambda > 0$ is a regularization parameter and $\|f\|$ denotes the norm of $f$ in $H$. The regularization term $\lambda \|f\|^2$ penalizes functions that `overfit' the training data. The minimization problem (1.2) has a unique solution, given by a linear combination of the functions $K_{p_j} = K(p_j, \cdot)$. Specifically:
\[
\hat{f}_\lambda = \sum_{j=1}^{n} c_j K_{p_j}, \tag{1.3}
\]
where $c \in \mathbb{R}^n$ is given by:
\[
c = (K_D + \lambda I_n)^{-1} y, \tag{1.4}
\]
where $K_D \in \mathbb{R}^{n \times n}$ is the matrix with entries $[K_D]_{ij} = K(p_i, p_j)$, $I_n$ is the identity matrix of size $n$, and $y = (y_1, \ldots, y_n) \in \mathbb{R}^n$ is the vector of outputs. We present a geometrical proof of this classic result in Theorem A.1.

Recently there has been a surge in interest in probabilistic interpretations of kernel-based learning methods; see for instance [1, 13, 18, 19, 21, 26]. As we shall see below, there is a Bayesian interpretation and a stochastic process approach to the ridge regression problem when the RKHS is finite-dimensional, but many RKHSs used in practice are infinite-dimensional. The absence of Lebesgue measure on infinite-dimensional Hilbert spaces poses a roadblock to such interpretations in this case. In this paper:

• We show that there is a valid stochastic interpretation of the ridge regression problem in the infinite-dimensional case, by working within the framework of abstract Wiener spaces and Gaussian measures instead of working with the Hilbert space alone.
Specifically, we show that the ridge regression solution may be obtained in terms of a conditional expectation and the Gaussian Radon transform.

• We show that, within the more traditional spline setting, the element of $H$ of minimal norm that satisfies $f(p_j) = y_j$ for $1 \le j \le n$ can also be expressed in terms of the Gaussian Radon transform.

• We propose a way to use the Gaussian Radon transform for a broader class of prediction problems. Specifically, this method could potentially be used not only to predict a particular output $f(p)$ of a future input $p$, but also to predict a function of future outputs; for instance, one could be interested in predicting the maximum or minimum value over an interval of future outputs.

• In this work we start with an RKHS $H$ and complete it with respect to a measurable norm to obtain a Banach space $B$. However, the space $B$ does not necessarily consist of functions. In light of recent interest in reproducing kernel Banach spaces, which do consist of functions, we propose a method to realize $B$ as a space of functions.

1.1. Probabilistic Interpretations of the Ridge Regression Problem in Finite Dimensions.

Let us first consider a Bayesian perspective: suppose that our parameter space is an RKHS $H$ with reproducing kernel $K \colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. If $H$ is finite-dimensional, we take standard Gaussian measure on $H$ as our prior distribution:
\[
\rho(f) = \frac{1}{(2\pi)^{d/2}}\, e^{-\frac{1}{2}\|f\|^2}, \quad \text{for all } f \in H,
\]
where $d$ is the dimension of $H$. Let $\tilde{f}$ denote, for every $f \in H$, the continuous linear functional $\langle f, \cdot \rangle$ on $H$. With respect to standard Gaussian measure, every $\tilde{f}$ is normally distributed with mean $0$ and variance $\|f\|^2$, and $\mathrm{Cov}(\tilde{f}, \tilde{g}) = \langle f, g \rangle$. Now recall that $H$ contains the functions $K_p = K(p, \cdot)$ on $\mathcal{X}$, and $f(p) = \langle f, K_p \rangle$ for all $f \in H$, $p \in \mathcal{X}$. Then $\{\tilde{K}_p\}_{p \in \mathcal{X}}$ is a centered Gaussian process on $H$ with covariance function $K$: $\mathrm{Cov}(\tilde{K}_p, \tilde{K}_{p'}) = K(p, p')$ for all $p, p' \in \mathcal{X}$.

If we are given the training data $D = \{(p_1, y_1), \ldots, (p_n, y_n)\} \subset \mathcal{X} \times \mathbb{R}$ and we assume the measurements contain some error that we would like to model as Gaussian noise, we choose an orthonormal set $\{e_1, \ldots, e_n\} \subset H$ such that $e_j \perp K_{p_i}$ for every $1 \le i, j \le n$. Then $\{\tilde{K}_{p_j}\}_{1 \le j \le n}$ and $\{\tilde{e}_j\}_{1 \le j \le n}$ are independent centered Gaussian processes on $H$, and for every $f \in H$ we model our data as arising from the event:
\[
\tilde{y}_j = \tilde{K}_{p_j}(f) + \sqrt{\lambda}\, \tilde{e}_j, \quad \text{for all } 1 \le j \le n,
\]
where $\lambda > 0$ is a fixed parameter. Then for every $f \in H$, the $\tilde{y}_1, \ldots, \tilde{y}_n$ are independent Gaussian random variables on $H$, with $\tilde{y}_j$ having mean $f(p_j)$ and variance $\lambda$ for every $1 \le j \le n$. This gives rise to the statistical model of probability distributions on $\mathbb{R}^n$:
\[
\rho_\lambda(x \mid f) = \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi\lambda}}\, e^{-\frac{1}{2\lambda}(x_j - f(p_j))^2},
\]
for every $x = (x_1, \ldots, x_n) \in \mathbb{R}^n$. Replacing $x$ with the vector $y = (y_1, \ldots, y_n)$ of observed values, the resulting posterior distribution is proportional to:
\[
e^{-\frac{1}{2\lambda}\left[\sum_{j=1}^{n}(y_j - f(p_j))^2 + \lambda \|f\|^2\right]}.
\]
Therefore finding the maximum a posteriori (MAP) estimator in this situation amounts exactly to finding the solution to the ridge regression problem in (1.2).

Clearly, this Bayesian approach depends on $H$ being finite-dimensional; however, reproducing kernel Hilbert spaces used in practice are often infinite-dimensional (such as those arising from Gaussian RBF kernels). The ridge regression SVM previously discussed goes through regardless of the dimensionality of the RKHS, and there is still a need for a valid stochastic interpretation in the infinite-dimensional case.
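As a concrete illustration of the finite-dimensional discussion above, the following short numerical sketch (not part of the original development) computes the closed-form ridge regression solution of (1.3) and (1.4), which, by the argument just given, is also the MAP estimator. The Gaussian RBF kernel, the synthetic data, and all function names are illustrative choices rather than anything prescribed by the paper.

```python
import numpy as np

def rbf_kernel(p, q, gamma=1.0):
    """Gaussian RBF kernel K(p, q) = exp(-gamma * |p - q|^2) (illustrative choice)."""
    return np.exp(-gamma * np.sum((p - q) ** 2))

def ridge_coefficients(points, y, lam, kernel=rbf_kernel):
    """Compute c = (K_D + lambda * I_n)^{-1} y as in (1.4)."""
    n = len(points)
    K_D = np.array([[kernel(points[i], points[j]) for j in range(n)]
                    for i in range(n)])
    return np.linalg.solve(K_D + lam * np.eye(n), y)

def predict(p, points, c, kernel=rbf_kernel):
    """Evaluate f_hat_lambda(p) = sum_j c_j K(p_j, p) as in (1.3)."""
    return sum(c_j * kernel(p_j, p) for c_j, p_j in zip(c, points))

# Hypothetical toy data: scalar inputs, noisy observations of sin.
rng = np.random.default_rng(0)
points = rng.uniform(-3, 3, size=10)
y = np.sin(points) + 0.1 * rng.standard_normal(10)

lam = 0.1
c = ridge_coefficients(points, y, lam)
print(predict(0.5, points, c))  # ridge regression prediction at a new input p = 0.5
```

Solving the linear system $(K_D + \lambda I_n)c = y$ rather than forming the inverse explicitly is the standard numerically stable way to evaluate (1.4).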
We now explore another stochastic approach to ridge regression, which is equivalent to the Bayesian one, but which can be carried over in a sense to the infinite-dimensional case, as we shall see later. Suppose again that $H$ is a finite-dimensional RKHS over $\mathcal{X}$, with reproducing kernel $K$, and equipped with standard Gaussian measure. Recall that $\{\tilde{K}_p\}_{p \in \mathcal{X}}$ is then a centered Gaussian process on $H$ with covariance function $K$. If we assume that the data arises from some unknown function in $H$, then the relationship $f(p) = \tilde{K}_p(f)$ suggests that the random variable $\tilde{K}_p$ is a good model for the outputs. Moreover, the training data $D = \{(p_1, y_1), \ldots, (p_n, y_n)\}$ provides some previous knowledge of the random variables $\tilde{K}_{p_j}$, which we can use to refine our estimation of $\tilde{K}_p$ by taking conditional expectations. In other words, our first guess would be to estimate the output of a future input $p \in \mathcal{X}$ by:
\[
E[\tilde{K}_p \mid \tilde{K}_{p_1} = y_1, \ldots, \tilde{K}_{p_n} = y_n]. \tag{1.5}
\]
But if we want to include some possible noise in the measurements, we would like to have a centered Gaussian process $\{\epsilon_1, \ldots, \epsilon_n\}$ on $H$, with covariance function $\mathrm{Cov}(\epsilon_i, \epsilon_j) = \lambda \delta_{i,j}$ for some parameter $\lambda > 0$, which is also independent of $\{\tilde{K}_{p_j}\}_{1 \le j \le n}$. Let us fix $p \in \mathcal{X}$, a `future' input whose output we would like to predict. To take measurement error into account, we choose again an orthonormal set $\{e_1, \ldots, e_n\} \subset H$ such that:
\[
\{e_1, \ldots, e_n\} \subset [\operatorname{span}\{K_{p_1}, \ldots, K_{p_n}, K_p\}]^{\perp},
\]
and $\lambda > 0$, and set:
\[
\tilde{y}_j = \tilde{K}_{p_j} + \sqrt{\lambda}\, \tilde{e}_j, \quad \text{for all } 1 \le j \le n.
\]
Then we estimate the output $\hat{y}(p)$ as the conditional expectation:
\[
\hat{y}(p) = E\big[\tilde{K}_p \,\big|\, \tilde{K}_{p_j} + \sqrt{\lambda}\, \tilde{e}_j = y_j,\ 1 \le j \le n\big].
\]
As shown in Lemma B.1 below:
\[
\hat{y}(p) = a_1 y_1 + \cdots + a_n y_n,
\]
where $a = (a_1, \ldots, a_n) \in \mathbb{R}^n$ is:
\[
a = A^{-1} \Big[ \mathrm{Cov}\big(\tilde{K}_p, \tilde{K}_{p_j} + \sqrt{\lambda}\, \tilde{e}_j\big) \Big]_{1 \le j \le n},
\]
with:
\[
[A]_{i,j} = [K_D + \lambda I_n]_{i,j},
\]
for all $1 \le i, j \le n$, with $K_D$ being the $n \times n$ matrix with entries given by $K(p_i, p_j)$. Moreover:
\[
\mathrm{Cov}\big(\tilde{K}_p, \tilde{K}_{p_j} + \sqrt{\lambda}\, \tilde{e}_j\big) = \langle K_p, K_{p_j} \rangle = K_{p_j}(p).
\]
Note that this last relationship is why we required that $\{e_1, \ldots, e_n\}$ also be orthogonal to $K_p$. This yields:
\[
\hat{y}(p) = \sum_{j=1}^{n} \big[(K_D + \lambda I_n)^{-1} y\big]_j K_{p_j}(p),
\]
showing that the prediction $\hat{y}(p)$ is precisely the ridge regression solution $\hat{f}_\lambda(p)$ in (1.3).
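As a numerical sanity check of this identity (again an illustrative sketch, reusing the hypothetical RBF kernel and toy data from the previous snippet rather than anything from the paper), one can evaluate the conditional-expectation formula of Lemma B.1 directly and compare it with the ridge regression prediction; the two agree up to floating-point error, as the derivation above predicts.

```python
import numpy as np

def rbf_kernel(p, q, gamma=1.0):
    """Gaussian RBF kernel, the same illustrative choice as in the previous sketch."""
    return np.exp(-gamma * np.sum((p - q) ** 2))

def conditional_expectation(p, points, y, lam, kernel=rbf_kernel):
    """y_hat(p) = a . y with a = (K_D + lambda I_n)^{-1} [K(p_j, p)]_j,
    the formula for E[K~_p | K~_{p_j} + sqrt(lam) e~_j = y_j] given by Lemma B.1."""
    n = len(points)
    K_D = np.array([[kernel(points[i], points[j]) for j in range(n)]
                    for i in range(n)])
    # Cov(K~_p, K~_{p_j} + sqrt(lam) e~_j) = K(p_j, p)
    k_p = np.array([kernel(p_j, p) for p_j in points])
    a = np.linalg.solve(K_D + lam * np.eye(n), k_p)
    return a @ y

# Same hypothetical toy data as before (same seed, same draws).
rng = np.random.default_rng(0)
points = rng.uniform(-3, 3, size=10)
y = np.sin(points) + 0.1 * rng.standard_normal(10)
lam = 0.1

# Compare with the ridge regression prediction (1.3)-(1.4) at p = 0.5.
K_D = np.array([[rbf_kernel(pi, pj) for pj in points] for pi in points])
c = np.linalg.solve(K_D + lam * np.eye(len(points)), y)
ridge_pred = sum(cj * rbf_kernel(pj, 0.5) for cj, pj in zip(c, points))
print(np.isclose(conditional_expectation(0.5, points, y, lam), ridge_pred))  # True
```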