Computing Functions of Random Variables Via Reproducing Kernel Hilbert Space Representations
Stat Comput (2015) 25:755–766
DOI 10.1007/s11222-015-9558-5

Bernhard Schölkopf¹ · Krikamol Muandet¹ · Kenji Fukumizu² · Stefan Harmeling³ · Jonas Peters⁴

¹ Max Planck Institute for Intelligent Systems, 72076 Tübingen, Germany
² Institute of Statistical Mathematics, 10-3 Midori-cho, Tachikawa, Tokyo
³ Institut für Informatik, Heinrich-Heine-Universität, 40225 Düsseldorf, Germany
⁴ ETH Zürich, Seminar für Statistik, 8092 Zürich, Switzerland

Accepted: 8 March 2015 / Published online: 11 June 2015
© The Author(s) 2015. This article is published with open access at Springerlink.com

Abstract  We describe a method to perform functional operations on probability distributions of random variables. The method uses reproducing kernel Hilbert space representations of probability distributions, and it is applicable to all operations which can be applied to points drawn from the respective distributions. We refer to our approach as kernel probabilistic programming. We illustrate it on synthetic data and show how it can be used for nonparametric structural equation models, with an application to causal inference.

Keywords  Kernel methods · Probabilistic programming · Causal inference

1 Introduction

Data types, derived structures, and associated operations play a crucial role for programming languages and the computations we carry out using them. Choosing a data type, such as Integer, Float, or String, determines the possible values, as well as the operations that an object permits. Operations typically return their results in the form of data types. Composite or derived data types may be constructed from simpler ones, along with specialized operations applicable to them.

The goal of the present paper is to propose a way to represent distributions over data types, and to generalize operations built originally for the data types to operations applicable to those distributions. Our approach is nonparametric and thus not concerned with what distribution models make sense on which statistical data type (e.g., binary, ordinal, categorical). It is also general, in the sense that, in principle, it applies to all data types and functional operations. The price to pay for this generality is that

• our approach will, in most cases, provide approximate results only; however, we include a statistical analysis of the convergence properties of our approximations, and
• for each data type involved (as either input or output), we require a positive definite kernel capturing a notion of similarity between two objects of that type. Some of our results require, moreover, that the kernels be characteristic in the sense that they lead to injective mappings into associated Hilbert spaces.

In a nutshell, our approach represents distributions over objects as elements of a Hilbert space generated by a kernel and describes how those elements are updated by operations available for sample points. If the kernel is trivial in the sense that each distinct object is only similar to itself, the method reduces to a Monte Carlo approach where the operations are applied to sample points which are propagated to the next step.
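To make the Monte Carlo special case just described concrete, here is a minimal sketch (an illustration only, not code from the paper; the operation `f`, the input distribution, and the sample size are arbitrary choices): a distribution is represented by sample points, and an operation on the underlying data type is applied to those points directly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample points representing the distribution of a random variable x ~ p.
x = rng.normal(loc=0.0, scale=1.0, size=1000)

# An operation originally defined on individual points of the data type.
def f(x):
    return np.exp(x) + 1.0

# With a trivial kernel (each object similar only to itself), representing the
# output distribution amounts to propagating the sample points through f.
fx = f(x)

# Statistics of the output distribution are then estimated from the
# propagated sample, e.g. its mean and variance.
print(fx.mean(), fx.var())
```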
Computationally, we represent the Hilbert space elements as finite weighted sets of objects, and all operations reduce to finite expansions in terms of kernel functions between objects.

The remainder of the present article is organized as follows. After describing the necessary preliminaries, we provide an exposition of our approach (Sect. 3). Section 4 analyzes an application to the problem of cause–effect inference using structural equation models. We conclude with a limited set of experimental results.

2 Kernel Maps

2.1 Positive definite kernels

The concept of representing probability distributions in a reproducing kernel Hilbert space (RKHS) has recently attracted attention in statistical inference and machine learning (Berlinet and Agnan 2004; Smola et al. 2007). One of the advantages of this approach is that it allows us to apply RKHS methods to probability distributions, often with strong theoretical guarantees (Sriperumbudur et al. 2008, 2010). It has been applied successfully in many domains such as graphical models (Song et al. 2010, 2011), two-sample testing (Gretton et al. 2012), domain adaptation (Huang et al. 2007; Gretton et al. 2009; Muandet et al. 2013), and supervised learning on probability distributions (Muandet et al. 2012; Szabó et al. 2014). We begin by briefly reviewing these methods, starting with some prerequisites.

We assume that our input data $\{x_1, \dots, x_m\}$ live in a nonempty set $\mathcal{X}$ and are generated i.i.d. by a random experiment with Borel probability distribution $p$. By $k$, we denote a positive definite kernel on $\mathcal{X} \times \mathcal{X}$, i.e., a symmetric function

$$k : \mathcal{X} \times \mathcal{X} \to \mathbb{R} \tag{1}$$
$$(x, x') \mapsto k(x, x') \tag{2}$$

satisfying the following nonnegativity condition: for any $m \in \mathbb{N}$ and $a_1, \dots, a_m \in \mathbb{R}$,

$$\sum_{i,j=1}^{m} a_i a_j k(x_i, x_j) \ge 0. \tag{3}$$

If equality in (3) implies that $a_1 = \dots = a_m = 0$, the kernel is called strictly positive definite.

2.2 Kernel maps for points

Kernel methods in machine learning, such as Support Vector Machines or Kernel PCA, are based on mapping the data into a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ (Boser et al. 1992; Schölkopf and Smola 2002; Shawe-Taylor and Cristianini 2004; Hofmann et al. 2008; Steinwart and Christmann 2008),

$$\Phi : \mathcal{X} \to \mathcal{H} \tag{4}$$
$$x \mapsto \Phi(x), \tag{5}$$

where the feature map (or kernel map) $\Phi$ satisfies

$$k(x, x') = \langle \Phi(x), \Phi(x') \rangle \tag{6}$$

for all $x, x' \in \mathcal{X}$. One can show that every $k$ taking the form (6) is positive definite, and every positive definite $k$ allows the construction of $\mathcal{H}$ and $\Phi$ satisfying (6). The canonical feature map, which is what by default we think of whenever we write $\Phi$, is

$$\Phi : \mathcal{X} \to \mathbb{R}^{\mathcal{X}} \tag{7}$$
$$x \mapsto k(x, \cdot), \tag{8}$$

with an inner product satisfying the reproducing kernel property

$$k(x, x') = \langle k(x, \cdot), k(x', \cdot) \rangle. \tag{9}$$

Mapping observations $x \in \mathcal{X}$ into a Hilbert space is rather convenient in some cases. If the original domain $\mathcal{X}$ has no linear structure to begin with (e.g., if the $x$ are strings or graphs), then the Hilbert space representation provides us with the possibility to construct geometric algorithms using the inner product of $\mathcal{H}$. Moreover, even if $\mathcal{X}$ is a linear space in its own right, it can be helpful to use a nonlinear feature map in order to construct algorithms that are linear in $\mathcal{H}$ while corresponding to nonlinear methods in $\mathcal{X}$.

2.3 Kernel maps for sets and distributions

One can generalize the map $\Phi$ to accept as inputs not only single points, but also sets of points or distributions. It was pointed out that the kernel map of a set of points $\mathbf{X} := \{x_1, \dots, x_m\}$,

$$\mu[\mathbf{X}] := \frac{1}{m} \sum_{i=1}^{m} \Phi(x_i), \tag{10}$$

corresponds to a kernel density estimator in the input domain (Schölkopf and Smola 2002; Schölkopf et al. 2001), provided the kernel is nonnegative and integrates to 1.
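As a concrete illustration of (10), the sketch below (an illustration under assumed choices, not the authors' code; the Gaussian kernel, its bandwidth, and the data are arbitrary) represents a sample $\mathbf{X}$ by its kernel mean and evaluates the resulting RKHS element at test points via $\langle \mu[\mathbf{X}], \Phi(t)\rangle = \frac{1}{m}\sum_i k(x_i, t)$. For a nonnegative kernel that integrates to 1 this coincides with a kernel density estimate; here it is only proportional to one, since the Gaussian normalization constant is omitted.

```python
import numpy as np

def gauss_kernel(X, Y, sigma=1.0):
    """Gaussian kernel matrix K[i, j] = exp(-||X_i - Y_j||^2 / (2 sigma^2))."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def kernel_mean(X, T, sigma=1.0):
    """Evaluate <mu[X], Phi(t)> = (1/m) sum_i k(x_i, t) at the points in T, cf. (10)."""
    return gauss_kernel(X, T, sigma).mean(axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))            # sample X = {x_1, ..., x_m}
T = np.array([[-1.0], [0.0], [2.0]])     # evaluation points
print(kernel_mean(X, T))                 # resembles a (unnormalized) kernel density estimate
```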
However, the map (10) can be applied for all positive definite kernels, including ones that take negative values or that are not normalized. Moreover, the fact that $\mu[\mathbf{X}]$ lives in an RKHS and the use of the associated inner product and norm will have a number of subsequent advantages. For these reasons, it would be misleading to think of (10) simply as a kernel density estimator.

Table 1  What information about a sample $\mathbf{X}$ does the kernel map $\mu[\mathbf{X}]$ [see (10)] contain?

$k(x, x') = \langle x, x' \rangle$ — Mean of $\mathbf{X}$
$k(x, x') = (\langle x, x' \rangle + 1)^n$ — Moments of $\mathbf{X}$ up to order $n \in \mathbb{N}$
$k(x, x')$ strictly p.d. — All of $\mathbf{X}$ (i.e., $\mu$ injective)

The kernel map of a distribution $p$ can be defined as the expectation of the feature map (Berlinet and Agnan 2004; Smola et al. 2007; Gretton et al. 2012),

$$\mu[p] := E_{x \sim p}\,\Phi(x), \tag{11}$$

where we overload the symbol $\mu$ and assume, here and below, that $p$ is a Borel probability measure, and

$$E_{x, x' \sim p}\,k(x, x') < \infty.$$

Table 2  What information about $p$ does the kernel map $\mu[p]$ [see (11)] contain? For the notions of characteristic/universal kernels, see Steinwart (2002); Fukumizu et al. (2008, 2009); an example thereof is the Gaussian kernel (26).

$k(x, x') = \langle x, x' \rangle$ — Expectation of $p$
$k(x, x') = (\langle x, x' \rangle + 1)^n$ — Moments of $p$ up to order $n \in \mathbb{N}$
$k(x, x')$ characteristic/universal — All of $p$ (i.e., $\mu$ injective)

The map (11) can be viewed as a generalization of the moment-generating function $E_{x \sim p}\,e^{\langle x, \cdot \rangle}$ of a random variable $x$ with distribution $p$, which equals (11) for $k(x, x') = e^{\langle x, x' \rangle}$, using (8).

We can use the map to test for equality of data sets,

$$\|\mu[\mathbf{X}] - \mu[\mathbf{X}']\| = 0 \iff \mathbf{X} = \mathbf{X}', \tag{15}$$

or distributions,

$$\|\mu[p] - \mu[p']\| = 0 \iff p = p'. \tag{16}$$

Two applications of this idea lead to tests for homogeneity and independence. In the latter case, we estimate $\|\mu[p_x\, p_y] - \mu[p_{xy}]\|$ (Bach and Jordan 2002; Gretton et al. 2005); in the former case, we estimate $\|\mu[p] - \mu[p']\|$ (Gretton et al. 2012).

Estimators for these applications can be constructed in terms of the empirical mean estimator (the kernel mean of the empirical distribution)

$$\mu[\hat{p}_m] = \frac{1}{m} \sum_{i=1}^{m} \Phi(x_i) = \mu[\mathbf{X}], \tag{17}$$

where $\mathbf{X} = \{x_1, \dots, x_m\}$ is an i.i.d. sample from $p$ [cf. (10)].

As an aside, note that using ideas from James–Stein estimation (James and Stein 1961), we can construct shrinkage estimators that improve upon the standard empirical estimator.
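Using (6), the empirical version of the statistics in (15) and (16) can be computed without ever constructing $\Phi$ explicitly, since $\|\mu[\mathbf{X}] - \mu[\mathbf{X}']\|^2 = \frac{1}{m^2}\sum_{i,j} k(x_i, x_j) + \frac{1}{m'^2}\sum_{i,j} k(x'_i, x'_j) - \frac{2}{m m'}\sum_{i,j} k(x_i, x'_j)$. The sketch below (an illustration under an assumed Gaussian kernel and synthetic data, not the authors' implementation) computes this distance between two empirical kernel means, cf. (17).

```python
import numpy as np

def gauss_kernel(X, Y, sigma=1.0):
    """Gaussian kernel matrix K[i, j] = exp(-||X_i - Y_j||^2 / (2 sigma^2))."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def rkhs_mean_distance(X, Xp, sigma=1.0):
    """||mu[X] - mu[X']||, computed purely from kernel evaluations; cf. (15) and (17)."""
    m, mp = len(X), len(Xp)
    sq_norm = (gauss_kernel(X, X, sigma).sum() / m ** 2
               + gauss_kernel(Xp, Xp, sigma).sum() / mp ** 2
               - 2.0 * gauss_kernel(X, Xp, sigma).sum() / (m * mp))
    return np.sqrt(max(sq_norm, 0.0))

rng = np.random.default_rng(0)
X  = rng.normal(0.0, 1.0, size=(300, 1))
Xp = rng.normal(0.5, 1.0, size=(300, 1))   # sample from a shifted distribution
print(rkhs_mean_distance(X, Xp))           # larger the more the two samples differ
```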