
Stat Comput (2015) 25:755–766. DOI 10.1007/s11222-015-9558-5

Computing functions of random variables via reproducing kernel Hilbert space representations

Bernhard Schölkopf1 · Krikamol Muandet1 · Kenji Fukumizu2 · Stefan Harmeling3 · Jonas Peters4

1 Max Planck Institute for Intelligent Systems, 72076 Tübingen, Germany
2 Institute of Statistical Mathematics, 10-3 Midori-cho, Tachikawa, Tokyo
3 Institut für Informatik, Heinrich-Heine-Universität, 40225 Düsseldorf, Germany
4 ETH Zürich, Seminar für Statistik, 8092 Zürich, Switzerland

Accepted: 8 March 2015 / Published online: 11 June 2015 © The Author(s) 2015. This article is published with open access at Springerlink.com

Abstract We describe a method to perform functional operations on probability distributions of random variables. The method uses reproducing kernel Hilbert space representations of probability distributions, and it is applicable to all operations which can be applied to points drawn from the respective distributions. We refer to our approach as kernel probabilistic programming. We illustrate it on synthetic data and show how it can be used for nonparametric structural equation models, with an application to causal inference.

Keywords Kernel methods · Probabilistic programming · Causal inference

1 Introduction

Data types, derived structures, and associated operations play a crucial role for programming languages and the computations we carry out using them. Choosing a type, such as Integer, Float, or String, determines the possible values, as well as the operations that an object permits. Operations typically return their results in the form of data types. Composite or derived data types may be constructed from simpler ones, along with specialized operations applicable to them.

The goal of the present paper is to propose a way to represent distributions over data types, and to generalize operations built originally for the data types to operations applicable to those distributions. Our approach is nonparametric and thus not concerned with what distribution models make sense on which statistical data type (e.g., binary, ordinal, categorical). It is also general, in the sense that in principle, it applies to all data types and functional operations. The price to pay for this generality is that

• our approach will, in most cases, provide approximate results only; however, we include a statistical analysis of the convergence properties of our approximations, and
• for each data type involved (as either input or output), we require a positive definite kernel capturing a notion of similarity between two objects of that type. Some of our results require, moreover, that the kernels be characteristic in the sense that they lead to injective mappings into associated Hilbert spaces.

In a nutshell, our approach represents distributions over objects as elements of a Hilbert space generated by a kernel and describes how those elements are updated by operations available for sample points. If the kernel is trivial in the sense that each distinct object is only similar to itself, the method reduces to a Monte Carlo approach where the operations are applied to sample points which are propagated to the next step. Computationally, we represent the Hilbert space elements as finite weighted sets of objects, and all operations reduce to finite expansions in terms of kernel functions between objects.

The remainder of the present article is organized as follows. After describing the necessary preliminaries, we provide an exposition of our approach (Sect. 3). Section 4 analyzes an application to the problem of cause–effect inference using structural equation models. We conclude with a limited set of experimental results.

2 Kernel maps

2.1 Positive definite kernels

The concept of representing probability distributions in a reproducing kernel Hilbert space (RKHS) has recently attracted attention in statistical inference and machine learning (Berlinet and Agnan 2004; Smola et al. 2007). One of the advantages of this approach is that it allows us to apply RKHS methods to probability distributions, often with strong theoretical guarantees (Sriperumbudur et al. 2008, 2010). It has been applied successfully in many domains such as graphical models (Song et al. 2010, 2011), two-sample testing (Gretton et al. 2012), domain adaptation (Huang et al. 2007; Gretton et al. 2009; Muandet et al. 2013), and supervised learning on probability distributions (Muandet et al. 2012; Szabó et al. 2014). We begin by briefly reviewing these methods, starting with some prerequisites.

We assume that our input data {x_1, ..., x_m} live in a nonempty set X and are generated i.i.d. by a random experiment with Borel probability distribution p. By k, we denote a positive definite kernel on X × X, i.e., a symmetric function

k : X × X → ℝ   (1)
(x, x′) ↦ k(x, x′)   (2)

satisfying the following nonnegativity condition: for any m ∈ ℕ and a_1, ..., a_m ∈ ℝ,

∑_{i,j=1}^m a_i a_j k(x_i, x_j) ≥ 0.   (3)

If equality in (3) implies that a_1 = ··· = a_m = 0, the kernel is called strictly positive definite.

2.2 Kernel maps for points

Kernel methods in machine learning, such as Support Vector Machines or Kernel PCA, are based on mapping the data into a reproducing kernel Hilbert space (RKHS) H (Boser et al. 1992; Schölkopf and Smola 2002; Shawe-Taylor and Cristianini 2004; Hofmann et al. 2008; Steinwart and Christmann 2008),

Φ : X → H   (4)
x ↦ Φ(x),   (5)

where the feature map (or kernel map) Φ satisfies

k(x, x′) = ⟨Φ(x), Φ(x′)⟩   (6)

for all x, x′ ∈ X. One can show that every k taking the form (6) is positive definite, and every positive definite k allows the construction of H and Φ satisfying (6). The canonical feature map, which is what by default we think of whenever we write Φ, is

Φ : X → ℝ^X   (7)
x ↦ k(x, ·),   (8)

with an inner product satisfying the reproducing kernel property

k(x, x′) = ⟨k(x, ·), k(x′, ·)⟩.   (9)

Mapping observations x ∈ X into a Hilbert space is rather convenient in some cases. If the original domain X has no linear structure to begin with (e.g., if the x are strings or graphs), then the Hilbert space representation provides us with the possibility to construct geometric algorithms using the inner product of H. Moreover, even if X is a linear space in its own right, it can be helpful to use a nonlinear feature map in order to construct algorithms that are linear in H while corresponding to nonlinear methods in X.

2.3 Kernel maps for sets and distributions

One can generalize the map Φ to accept as inputs not only single points, but also sets of points or distributions. It was pointed out that the kernel map of a set of points X := {x_1, ..., x_m},

μ[X] := (1/m) ∑_{i=1}^m Φ(x_i),   (10)

corresponds to a kernel density estimator in the input domain (Schölkopf and Smola 2002; Schölkopf et al. 2001), provided the kernel is nonnegative and integrates to 1. However, the map (10) can be applied for all positive definite kernels, including ones that take negative values or that are not normalized. Moreover, the fact that μ[X] lives in an RKHS and the use of the associated inner product and norm will have a number of subsequent advantages. For these reasons, it would be misleading to think of (10) simply as a kernel density estimator.
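To make (10) concrete, the following small sketch (our own illustration, not part of the original paper; all function names are ours) represents μ[X] by the sample with uniform weights and evaluates the function x ↦ ⟨μ[X], Φ(x)⟩ = (1/m) ∑_i k(x_i, x), which for a nonnegative, normalized kernel is exactly the kernel density estimate mentioned above.

```python
import numpy as np

def gauss_kernel(a, b, sigma=1.0):
    """Gaussian RBF kernel matrix k(a_i, b_j) for one-dimensional samples."""
    d = np.asarray(a, dtype=float)[:, None] - np.asarray(b, dtype=float)[None, :]
    return np.exp(-d**2 / (2.0 * sigma**2))

def kernel_mean_eval(sample, t, sigma=1.0):
    """Evaluate mu[X](t) = <mu[X], Phi(t)> = (1/m) sum_i k(x_i, t), cf. (10)."""
    return gauss_kernel(t, sample, sigma).mean(axis=1)

rng = np.random.default_rng(0)
X = rng.normal(3.0, 0.5, size=100)           # i.i.d. sample from p_X
grid = np.linspace(1.0, 5.0, 9)
print(kernel_mean_eval(X, grid, sigma=0.3))  # KDE-like values of mu[X] on a grid
```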

The kernel map of a distribution p can be defined as the expectation of the feature map (Berlinet and Agnan 2004; Smola et al. 2007; Gretton et al. 2012),

μ[p] := E_{x∼p}[Φ(x)],   (11)

where we overload the symbol μ and assume, here and below, that p is a Borel probability measure, and

E_{x,x′∼p} k(x, x′) < ∞.   (12)

A sufficient condition for this to hold is the assumption that there exists an M ∈ ℝ such that

‖k(x, ·)‖ ≤ M < ∞,   (13)

or equivalently k(x, x) ≤ M², on the support of p. Kernel maps for sets of points or distributions are sometimes referred to as kernel mean maps to distinguish them from the original kernel map. Note, however, that they include the kernel map of a point as a special case, so there is some justification in using the same term. If p = p_X is the law of a random variable X, we sometimes write μ[X] instead of μ[p].

In all cases, it is important to understand what information we retain, and what we lose, when representing an object by its kernel map. We summarize the known results (Steinwart and Christmann 2008; Fukumizu et al. 2008; Smola et al. 2007; Gretton et al. 2012; Sriperumbudur et al. 2010) in Tables 1 and 2.

Table 1 What information about a sample X does the kernel map μ[X] [see (10)] contain?

k(x, x′) = ⟨x, x′⟩ | Mean of X
k(x, x′) = (⟨x, x′⟩ + 1)^n | Moments of X up to order n ∈ ℕ
k(x, x′) strictly p.d. | All of X (i.e., μ injective)

Table 2 What information about p does the kernel map μ[p] [see (11)] contain? For the notions of characteristic/universal kernels, see Steinwart (2002); Fukumizu et al. (2008, 2009); an example thereof is the Gaussian kernel (26)

k(x, x′) = ⟨x, x′⟩ | Expectation of p
k(x, x′) = (⟨x, x′⟩ + 1)^n | Moments of p up to order n ∈ ℕ
k(x, x′) characteristic/universal | All of p (i.e., μ injective)

We conclude this section with a discussion of how to use kernel mean maps. To this end, first assume that Φ is injective, which is the case if k is strictly positive definite (see Table 1) or characteristic/universal (see Table 2). Particular cases include the moment generating function of a RV with distribution p,

M_p(·) = E_{x∼p}[e^{⟨x, ·⟩}],   (14)

which equals (11) for k(x, x′) = e^{⟨x, x′⟩}, using (8).

We can use the map to test for equality of data sets,

‖μ[X] − μ[X′]‖ = 0 ⟺ X = X′,   (15)

or distributions,

‖μ[p] − μ[p′]‖ = 0 ⟺ p = p′.   (16)

Two applications of this idea lead to tests for homogeneity and independence. In the latter case, we estimate ‖μ[p_x p_y] − μ[p_xy]‖ (Bach and Jordan 2002; Gretton et al. 2005); in the former case, we estimate ‖μ[p] − μ[p′]‖ (Gretton et al. 2012).

Estimators for these applications can be constructed in terms of the empirical mean estimator (the kernel mean of the empirical distribution)

μ[p̂_m] = (1/m) ∑_{i=1}^m Φ(x_i) = μ[X],   (17)

where X = {x_1, ..., x_m} is an i.i.d. sample from p [cf. (10)]. As an aside, note that using ideas from James–Stein estimation (James and Stein 1961), we can construct shrinkage estimators that improve upon the standard empirical estimator [see e.g., Muandet et al. (2014a, b)].

One can show that μ[p̂_m] converges at rate m^{−1/2} [cf. Smola et al. (2007), (Song 2008, Theorem 27), and Lopez-Paz et al. (2015)]:

Theorem 1 Assume that ‖f‖_∞ ≤ 1 for all f ∈ H with ‖f‖_H ≤ 1. Then with probability at least 1 − δ,

‖μ[p̂_m] − μ[p]‖_H ≤ 2 √(E_{z∼p}[k(z, z)]/m) + √(2 log(1/δ)/m).   (18)

Independent of the requirement of injectivity, μ can be used to compute expectations of arbitrary functions f living in the RKHS, using the identity

E_{x∼p}[f(x)] = ⟨μ[p], f⟩,   (19)

which follows from the fact that k represents point evaluation in the RKHS,

f(x) = ⟨k(x, ·), f⟩.   (20)

A small RKHS, such as the one spanned by the linear kernel

k(x, x′) = ⟨x, x′⟩,   (21)

may not contain the functions we are interested in. If, on the other hand, our RKHS is sufficiently rich [e.g., associated with a universal kernel (Steinwart 2002)], we can use (19) to approximate, for instance, the probability of any interval (a, b) on a bounded domain, by approximating the indicator function I_(a,b) as a kernel expansion ∑_{i=1}^n a_i k(x_i, ·), and substituting the latter into (19). See Kanagawa and Fukumizu (2014) for further discussion. Alternatively, if p has a density, we can estimate it using methods such as reproducing kernel moment matching and combinations with kernel density estimation (Song et al. 2008; Kanagawa and Fukumizu 2014).

This shows that the map is not a one-way road: we can map our objects of interest into the RKHS, perform linear algebra and geometry on them (Schölkopf and Smola 2002), and at the end answer questions of interest. In the next section, we shall take this a step further, and discuss how to implement rather general operations in the RKHS. Before doing so, we mention two additional applications of kernel maps. We can map conditional distributions and perform Bayesian updates (Fukumizu et al. 2008, 2013; Zhang et al. 2011), and we can connect kernel maps to Fourier optics, leading to a physical realization as Fraunhofer diffraction (Harmeling et al. 2013).
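The following sketch (ours, with hypothetical helper names) illustrates how (15)–(17) and (19) reduce to finite kernel computations: the squared RKHS distance between two empirical kernel means, and the expectation of an RKHS function f = ∑_j a_j k(c_j, ·), are both evaluated without ever forming Φ explicitly.

```python
import numpy as np

def gauss_kernel(a, b, sigma=1.0):
    d = np.asarray(a, dtype=float)[:, None] - np.asarray(b, dtype=float)[None, :]
    return np.exp(-d**2 / (2.0 * sigma**2))

def sq_rkhs_dist(X, Xp, sigma=1.0):
    """||mu[X] - mu[X']||^2 expanded via kernel evaluations, cf. (15) and (17)."""
    return (gauss_kernel(X, X, sigma).mean()
            - 2.0 * gauss_kernel(X, Xp, sigma).mean()
            + gauss_kernel(Xp, Xp, sigma).mean())

def expectation_in_rkhs(X, centers, coeffs, sigma=1.0):
    """E_{x~p}[f(x)] ~ <mu[p_hat], f> for f = sum_j a_j k(c_j, .), cf. (19)."""
    return gauss_kernel(X, centers, sigma).dot(np.asarray(coeffs)).mean()

rng = np.random.default_rng(1)
X, Xp = rng.normal(0.0, 1.0, 200), rng.normal(0.2, 1.0, 200)
print(sq_rkhs_dist(X, Xp))                               # small but nonzero: p != p'
print(expectation_in_rkhs(X, [0.0, 1.0], [2.0, -1.0]))   # f(x) = 2 k(0,x) - k(1,x)
```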
3 Computing functions of independent random variables

3.1 Introduction and earlier work

A random variable (RV) is a measurable function mapping possible outcomes of an underlying random experiment to a set Z (often, Z ⊂ ℝ^d, but our approach will be more general). The probability measure of the random experiment induces the distribution of the random variable. We will below not deal with the underlying probability space explicitly, and instead directly start from random variables X, Y with distributions p_X, p_Y and values in X, Y. Suppose we have access to (data from) p_X and p_Y, and we want to compute the distribution of the random variable f(X, Y), where f is a measurable function defined on X × Y.

For instance, if our operation is addition f(X, Y) = X + Y, and the distributions p_X and p_Y have densities, we can compute the density of the distribution of f(X, Y) by convolving those densities. If the distributions of X and Y belong to some parametric class, such as a class of distributions with Gaussian density functions, and if the arithmetic expression is elementary, then closed-form solutions for certain favorable combinations exist. At the other end of the spectrum, we can resort to numerical integration or sampling to approximate f(X, Y).

Arithmetic operations on random variables are abundant in science and engineering. Measurements in real-world systems are subject to uncertainty, and thus subsequent arithmetic operations on these measurements are operations on random variables. An example due to Springer (1979) is signal amplification. Consider a set of n amplifiers connected together in a serial fashion. If the amplification of the i-th amplifier is denoted by X_i, then the total amplification, denoted by Y, is Y = X_1 · X_2 ··· X_n, i.e., a product of n random variables.

A well-established framework for arithmetic operations on independent random variables (iRVs) relies on integral transform methods (Springer 1979). The above example of addition already suggests that Fourier transforms may help, and indeed, people have used transforms such as the ones due to Fourier and Mellin to derive the distribution function of either the sum, difference, product, or quotient of iRVs (Epstein 1948; Springer and Thompson 1966; Prasad 1970; Springer 1979). Williamson (1989) proposes an approximation using Laguerre polynomials and a notion of envelopes bounding the cumulative distribution function. This framework also allows for the treatment of dependent random variables, but the bounds can become very loose after repeated operations. Milios (2009) approximates the probability distributions of the input random variables as mixture models (using uniform and Gaussian distributions), and applies the computations to all mixture components. Jaroszewicz and Korzen (2012) consider a numerical approach to implement arithmetic operations on iRVs, representing the distributions using piecewise Chebyshev approximations. This lends itself well to the use of approximation methods that perform well as long as the functions are well-behaved. Finally, Monte Carlo approaches can be used as well and are popular in scientific applications [see e.g., Ferson (1996)].

The goal of the present paper is to develop a derived data type representing a distribution over another data type and to generalize the available computational operations to this data type, at least approximately. This would allow us to conveniently handle error propagation as in the example discussed earlier. It would also help us perform inference involving conditional distributions of such variables given observed data. The latter is the main topic of a subject area that has recently begun to attract attention, probabilistic programming (Gordon et al. 2014). A variety of probabilistic programming languages has been proposed (Wood et al. 2014; Paige and Wood 2014; Cassel 2014). To emphasize the central role that kernel maps play in our approach, we refer to it as kernel probabilistic programming (KPP).

3.2 Computing functions of independent random variables using kernel maps

The key idea of KPP is to provide a consistent estimator of the kernel map of an expression involving operations on random variables. This is done by applying the expression to the sample points and showing that the resulting kernel expansion has the desired property. Operations involving more than one RV will increase the size of the expansion, but we can resort to existing RKHS approximation methods to keep the complexity of the resulting expansion limited, which is advisable in particular if we want to use it as a basis for further operations. The benefits of KPP are threefold. First, we do not make parametric assumptions on the distributions associated with the random variables. Second, our approach applies not only to real-valued random variables, but also to multivariate random variables, structured data, functional data, and other domains, as long as positive definite kernels can be defined on the data. Finally, it does not require explicit density estimation as an intermediate step, which is difficult in high dimensions.

We begin by describing the basic idea. Let f be a function of two independent RVs X, Y taking values in the sets X, Y, and suppose we are given i.i.d. m-samples x_1, ..., x_m and y_1, ..., y_m. We are interested in the distribution of f(X, Y), and seek to estimate its representation μ[f(X, Y)] := E[Φ(f(X, Y))] in the RKHS as

(1/m²) ∑_{i,j=1}^m Φ(f(x_i, y_j)).   (22)

Although x_1, ..., x_m ∼ p_X and y_1, ..., y_m ∼ p_Y are i.i.d. observations, this does not imply that the f(x_i, y_j), i, j = 1, ..., m, form an i.i.d. m²-sample from f(X, Y), since—loosely speaking—each x_i (and each y_j) leaves a footprint in m of the observations, leading to a (possibly weak) dependency. Therefore, Theorem 1 does not imply that (22) is consistent. We need to do some additional work:

Theorem 2 Given two independent random variables X, Y with values in X, Y, mutually independent i.i.d. samples x_1, ..., x_m and y_1, ..., y_n, a measurable function f : X × Y → Z, and a positive definite kernel on Z × Z with RKHS map Φ, then

(1/(mn)) ∑_{i=1}^m ∑_{j=1}^n Φ(f(x_i, y_j))   (23)

is an unbiased and consistent estimator of μ[f(X, Y)]. Moreover, we have convergence in probability

‖(1/(mn)) ∑_{i=1}^m ∑_{j=1}^n Φ(f(x_i, y_j)) − E[Φ(f(X, Y))]‖ = O_p(1/√m + 1/√n),   (m, n → ∞).   (24)

As an aside, note that (23) is an RKHS-valued two-sample U-statistic.

Proof For any i, j, we have E[Φ(f(x_i, y_j))] = E[Φ(f(X, Y))]; hence, (23) is unbiased. The convergence (24) can be obtained as a corollary to Theorem 3, and the proof is omitted here. □
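As a minimal sketch of the estimator (23) (our own code; the function names are not from the paper), one applies the operation f to all pairs of sample points and keeps the results as an equally weighted expansion representing μ[f(X, Y)]:

```python
import numpy as np

def kpp_expansion(f, xs, ys):
    """Expansion points z_ij = f(x_i, y_j) with uniform weights 1/(mn), cf. (23)."""
    xs, ys = np.asarray(xs, dtype=float), np.asarray(ys, dtype=float)
    Z = f(xs[:, None], ys[None, :]).ravel()   # all m*n expansion points
    w = np.full(Z.size, 1.0 / Z.size)         # weights sum to one
    return Z, w

rng = np.random.default_rng(2)
xs = rng.normal(3.0, 0.5, 30)
ys = rng.normal(4.0, 0.5, 30)
Z, w = kpp_expansion(np.multiply, xs, ys)     # represents mu[X * Y]
print(Z.size, w.sum())                        # 900 expansion points, total weight 1.0
```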
3.3 Approximating expansions

To keep computational cost limited, we need to use approximations when performing multi-step operations. If, for instance, the outcome of the first step takes the form (23), then we already have m × n terms, and subsequent steps would further increase the number of terms, thus quickly becoming computationally prohibitive.

We can do so by using the methods described in Chap. 18 of Schölkopf and Smola (2002). They fall in two categories. In reduced set selection methods, we provide a set of expansion points [e.g., all points f(x_i, y_j) in (23)], and the approximation method sparsifies the vector of expansion coefficients. This can for instance be done by solving eigenvalue problems or linear programs. Reduced set construction methods, on the other hand, construct new expansion points. In the simplest case, they proceed by sequentially finding approximate pre-images of RKHS elements. They tend to be computationally more demanding and suffer from local minima; however, they can lead to sparser expansions.

Either way, we will end up with an approximation

∑_{k=1}^p γ_k Φ(z_k)   (25)

of (23), where usually p ≪ m × n. Here, the z_k are either a subset of the f(x_i, y_j), or other points from Z.

It is instructive to consider some special cases. For simplicity, assume that Z = ℝ^d. If we use a Gaussian kernel

k(x, x′) = exp(−‖x − x′‖²/(2σ²))   (26)

whose bandwidth σ is much smaller than the closest pair of sample points, then the points mapped into the RKHS will be almost orthogonal, and there is no way to sparsify a kernel expansion such as (23) without incurring a large RKHS error. In this case, we can identify the estimator with the sample itself, and KPP reduces to a Monte Carlo method. If, on the other hand, we use a linear kernel k(z, z′) = ⟨z, z′⟩ on Z = ℝ^d, then Φ is the identity map and the expansion (23) collapses to one real number, i.e., we would effectively represent f(X, Y) by its mean for any further processing. By choosing kernels that lie 'in between' these two extremes, we retain a varying amount of information which we can thus tune to our wishes, see Table 1.
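The following sketch (ours) illustrates the reduced set idea behind (25) under a simplifying assumption: the retained expansion points are a random subset of the original ones (as in the simple reduced set selection used later in Sect. 5.1), and the coefficients γ are refit by minimizing the RKHS distance to the full expansion, which amounts to solving a small linear system.

```python
import numpy as np

def gauss_kernel(a, b, sigma=1.0):
    d = np.asarray(a, dtype=float)[:, None] - np.asarray(b, dtype=float)[None, :]
    return np.exp(-d**2 / (2.0 * sigma**2))

def reduced_set(Z, w, n_keep, sigma=1.0, seed=0):
    """Approximate sum_i w_i Phi(z_i) by sum_k gamma_k Phi(s_k), s_k a subset of the z_i, cf. (25)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(Z.size, size=n_keep, replace=False)
    S = Z[idx]
    K_ss = gauss_kernel(S, S, sigma) + 1e-8 * np.eye(n_keep)  # jitter for numerical stability
    K_sz = gauss_kernel(S, Z, sigma)
    gamma = np.linalg.solve(K_ss, K_sz @ w)   # normal equations of the RKHS least-squares fit
    return S, gamma

def sq_rkhs_err(Z, w, S, g, sigma=1.0):
    """||sum_i w_i Phi(z_i) - sum_k g_k Phi(s_k)||^2 via kernel evaluations."""
    return (w @ gauss_kernel(Z, Z, sigma) @ w
            - 2.0 * w @ gauss_kernel(Z, S, sigma) @ g
            + g @ gauss_kernel(S, S, sigma) @ g)

rng = np.random.default_rng(3)
Z = rng.normal(12.0, 2.5, 900)                # e.g. the 900 points f(x_i, y_j) from above
w = np.full(Z.size, 1.0 / Z.size)
S, g = reduced_set(Z, w, n_keep=50, sigma=1.0)
print(sq_rkhs_err(Z, w, S, g, sigma=1.0))     # small RKHS approximation error
```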

3.4 Computing functions of RKHS approximations

More generally, consider approximations of the kernel means μ[X] and μ[Y],

μ̂[X] := ∑_{i=1}^m α_i Φ_x(x_i),   μ̂[Y] := ∑_{j=1}^n β_j Φ_y(y_j).   (27)

In our case, we think of (27) as RKHS-norm approximations of the outcome of previous operations performed on random variables. Such approximations typically have coefficients α ∈ ℝ^m and β ∈ ℝ^n that are not uniform, that may not sum to one, and that may take negative values (Schölkopf and Smola 2002), e.g., for conditional mean maps (Song et al. 2009; Fukumizu et al. 2013).

We propose to approximate the kernel mean μ[f(X, Y)] by the estimator

μ̂[f(X, Y)] := (1/(∑_{i=1}^m α_i ∑_{j=1}^n β_j)) ∑_{i=1}^m ∑_{j=1}^n α_i β_j Φ_z(f(x_i, y_j)),   (28)

where the feature map Φ_z defined on Z, the range of f, may be different from both Φ_x and Φ_y. The expansion has m × n terms, which we can subsequently approximate more compactly in the form (25), ready for the next operation. Note that (28) contains (23) as a special case.

One of the advantages of our approach is that (23) and (28) apply for general data types. In other words, X, Y, Z need not be vector spaces—they may be arbitrary nonempty sets, as long as positive definite kernels can be defined on them.

3.4.1 Convergence analysis in an idealized setting

We analyze the convergence of (28) under the assumption that the expansion points are actually samples x_1, ..., x_m from X and y_1, ..., y_n from Y, which is for instance the case if the expansions (27) are the result of reduced set selection methods (cf. Sect. 3.3). Moreover, we assume that the expansion coefficients α_1, ..., α_m and β_1, ..., β_n are constants, i.e., independent of the samples.

The following proposition gives a sufficient condition for the approximations in (27) to converge. Note that below, the coefficients α_1, ..., α_m depend on the sample size m, but for simplicity we refrain from writing them as α_{1,m}, ..., α_{m,m}; and likewise for β_1, ..., β_n. We make this remark to ensure that readers are not puzzled by the below statement that ∑_{i=1}^m α_i² → 0 as m → ∞.

Proposition 1 Let x_1, ..., x_m be an i.i.d. sample and (α_i)_{i=1}^m be constants with ∑_{i=1}^m α_i = 1. Assume E[k(X, X)] > E[k(X, X̃)], where X and X̃ are independent copies of x_i. Then, the convergence

E‖∑_{i=1}^m α_i Φ(x_i) − μ[X]‖² → 0   (m → ∞)

holds true if and only if ∑_{i=1}^m α_i² → 0 as m → ∞.

Proof From the expansion

E‖∑_{i=1}^m α_i Φ(x_i) − μ[X]‖²
= ∑_{i,s=1}^m α_i α_s E[k(x_i, x_s)] − 2 ∑_{i=1}^m α_i E[k(x_i, X̃)] + E[k(X, X̃)]
= (1 − ∑_i α_i²) E[k(X, X̃)] + ∑_i α_i² E[k(X, X)] − E[k(X, X̃)],

the assertion is straightforward. □

The next result shows that if our approximations (27) converge in the sense of Proposition 1, then the estimator (28) (with expansion coefficients summing to 1) is consistent.

Theorem 3 Let x_1, ..., x_m and y_1, ..., y_n be mutually independent i.i.d. samples, and let the constants (α_i)_{i=1}^m, (β_j)_{j=1}^n satisfy ∑_{i=1}^m α_i = ∑_{j=1}^n β_j = 1. Assume ∑_{i=1}^m α_i² and ∑_{j=1}^n β_j² converge to zero as m, n → ∞. Then

‖∑_{i=1}^m ∑_{j=1}^n α_i β_j Φ(f(x_i, y_j)) − μ[f(X, Y)]‖ = O_p(√(∑_i α_i² + ∑_j β_j²))

as m, n → ∞.

Proof By expanding and taking expectations, one can see that

E‖∑_{i=1}^m ∑_{j=1}^n α_i β_j Φ(f(x_i, y_j)) − E[Φ(f(X, Y))]‖²

equals

∑_{i,j} α_i² β_j² E[k(f(X, Y), f(X, Y))]
+ ∑_{s≠i} ∑_j α_i α_s β_j² E[k(f(X, Y), f(X̃, Y))]
+ ∑_i ∑_{t≠j} α_i² β_j β_t E[k(f(X, Y), f(X, Ỹ))]
+ ∑_{s≠i} ∑_{t≠j} α_i α_s β_j β_t E[k(f(X, Y), f(X̃, Ỹ))]
− 2 ∑_{i,j} α_i β_j E[k(f(X, Y), f(X̃, Ỹ))]
+ E[k(f(X, Y), f(X̃, Ỹ))].

Using ∑_i α_i = ∑_j β_j = 1, this can be regrouped as

(∑_i α_i²)(∑_j β_j²) E[k(f(X, Y), f(X, Y))]
+ (1 − ∑_i α_i²)(∑_j β_j²) E[k(f(X, Y), f(X̃, Y))]
+ (∑_i α_i²)(1 − ∑_j β_j²) E[k(f(X, Y), f(X, Ỹ))]
+ ((∑_i α_i²)(∑_j β_j²) − ∑_i α_i² − ∑_j β_j²) E[k(f(X, Y), f(X̃, Ỹ))],

which implies that the norm in the assertion of the theorem converges to zero at O_p(√(∑_i α_i² + ∑_j β_j²)) under the assumptions on α_i and β_j. Here, (X̃, Ỹ) is an independent copy of (X, Y). This concludes the proof. □

Note that in the simplest case, where α_i = 1/m and β_j = 1/n, we have ∑_i α_i² = 1/m and ∑_j β_j² = 1/n, which proves Theorem 2. It is also easy to see from the proof that we do not strictly need ∑_i α_i = ∑_j β_j = 1—for the estimator to be consistent, it suffices if the sums converge to 1. For a sufficient condition for this convergence, see Kanagawa and Fukumizu (2014).
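A sketch of (28) in the same spirit (our own helper names): two weighted expansions approximating μ[X] and μ[Y] are combined by applying f to all pairs of expansion points and multiplying the corresponding weights; the weights are renormalized to sum to one, which by the remark above does not affect consistency. The resulting expansion can then be compressed again as in Sect. 3.3 before the next operation.

```python
import numpy as np

def kpp_apply(f, pts_x, w_x, pts_y, w_y):
    """Estimator (28): points f(x_i, y_j) with weights proportional to alpha_i * beta_j."""
    px, py = np.asarray(pts_x, dtype=float), np.asarray(pts_y, dtype=float)
    Z = f(px[:, None], py[None, :]).ravel()
    W = np.outer(w_x, w_y).ravel()
    return Z, W / W.sum()                     # normalize so the coefficients sum to one

# two (hypothetical) previously computed approximations of mu[X] and mu[Y]
pts_x, w_x = np.array([2.5, 3.0, 3.5]), np.array([0.2, 0.5, 0.3])
pts_y, w_y = np.array([3.8, 4.2]), np.array([0.6, 0.4])
Z, W = kpp_apply(np.add, pts_x, w_x, pts_y, w_y)   # represents mu[X + Y]
print(list(zip(Z.round(2), W.round(3))))
```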

3.4.2 More general expansion sets

To conclude our discussion of the estimator (28), we turn to the case where the expansions (27) are computed by reduced set construction, i.e., they are not necessarily expressed in terms of samples from X and Y. This is more difficult, and we do not provide a formal result, but just a qualitative discussion.

To this end, suppose the approximations (27) satisfy

∑_{i=1}^m α_i = 1 and for all i, α_i > 0,   (29)
∑_{j=1}^n β_j = 1 and for all j, β_j > 0,   (30)

and we approximate μ[f(X, Y)] by the quantity (28).

We assume that (27) are good approximations of the kernel means of two unknown random variables X and Y; we also assume that f and the kernel mean map along with its inverse are continuous. We have no samples from X and Y, but we can turn (27) into sample estimates based on artificial samples X and Y, for which we can then appeal to our estimator from Theorem 2.

To this end, denote by X′ = (x′_1, ..., x′_m) and Y′ = (y′_1, ..., y′_n) the expansion points in (27). We construct a sample X = (x_1, x_2, ...) whose kernel mean is close to ∑_{i=1}^m α_i Φ_x(x′_i) as follows: for each i, the point x′_i appears in X with multiplicity ⌊m̄ · α_i⌋, i.e., the largest integer not exceeding m̄ · α_i. This leads to a sample of size at most m̄. Note, moreover, that the multiplicity of x′_i, divided by m̄, differs from α_i by at most 1/m̄, so effectively we have quantized the α_i coefficients to this accuracy.

Since m is constant, this implies that for any ε > 0, we can choose m̄ ∈ ℕ large enough to ensure that

‖(1/m̄) ∑_{l=1}^{m̄} Φ_x(x_l) − ∑_{i=1}^m α_i Φ_x(x′_i)‖ < ε.   (31)

We may thus work with (1/m̄) ∑_{l=1}^{m̄} Φ_x(x_l), which for strictly positive definite kernels corresponds uniquely to the sample X = (x_1, ..., x_{m̄}). By the same argument, we obtain a sample Y = (y_1, ..., y_{n̄}) approximating the second expansion. Substituting both samples into the estimator from Theorem 2 leads to

μ̂[f(X, Y)] = (1/(∑_{i=1}^m α̂_i ∑_{j=1}^n β̂_j)) ∑_{i=1}^m ∑_{j=1}^n α̂_i β̂_j Φ_z(f(x′_i, y′_j)),   (32)

where α̂_i = ⌊m̄ · α_i⌋/m̄ and β̂_j = ⌊n̄ · β_j⌋/n̄. By choosing sufficiently large m̄, n̄, this becomes an arbitrarily good approximation (in the RKHS norm) of the proposed estimator (28). Note, however, that we cannot claim based on this argument that this estimator is consistent, not the least since Theorem 2 in the stated form requires i.i.d. samples.

3.4.3 Larger sets of random variables

Without analysis, we include the estimator for the case of more than two variables: Let g be a measurable function of jointly independent RVs U_j (j = 1, ..., p). Given i.i.d. observations u^j_1, ..., u^j_m ∼ U_j, we have

(1/m^p) ∑_{m_1, ..., m_p = 1}^m Φ(g(u^1_{m_1}, ..., u^p_{m_p})) → μ[g(U_1, ..., U_p)]   (33)

in probability as m → ∞. Here, in order to keep notation simple, we have assumed that the sample sizes for each RV are identical.

As above, we note that (i) g need not be real-valued, it can take values in some set Z for which we have a (possibly characteristic) positive definite kernel; (ii) we can extend this to general kernel expansions like (28); and (iii) if we use Gaussian kernels with width tending to 0, we can think of the above as a sampling method.

4 Dependent RVs and structural equation models

For dependent RVs, the proposed estimators are not applicable. One way to handle dependent RVs is to appeal to the fact that any joint distribution of random variables can be written as a structural equation model with independent noises. This leads to an interesting application of our method to the field of causal inference.

We consider a model X_i = f_i(PA_i, U_i), for i = 1, ..., p, with jointly independent noise terms U_1, ..., U_p. Such models arise for instance in causal inference (Pearl 2009). Each random variable X_i is computed as a function f_i of its noise term U_i and its parents PA_i in an underlying directed acyclic graph (DAG). Every probability distribution w.r.t. a DAG can be expressed as such a structural equation model with suitable functions and noise terms [e.g., Peters et al. (2014)].

If we recursively substitute the parent equations, we can express each X_i as a function of only the independent noise terms U_1, ..., U_p,

X_i = g_i(U_1, ..., U_p).   (34)

Since we know how to compute functions of independent RVs, we can try to test such a model (assuming knowledge of all involved quantities) by estimating the distance between RKHS images,

Δ = ‖μ[X_i] − μ[g_i(U_1, ..., U_p)]‖²,   (35)

using the estimator described in (33) (we discuss the bivariate case in Theorem 4). It may be unrealistic to assume that we have access to all quantities. However, there is a special case where this is conceivable, which we will presently discuss. This is the case of additive noise models (Peters et al. 2014),

Y = f(X) + U, with X ⊥⊥ U.   (36)

Such models are of interest for cause–effect inference since it is known (Peters et al. 2014) that in the generic case, a model (36) can only be fit in one direction, i.e., if (36) holds true, then we cannot simultaneously have

X = g(Y) + V, with Y ⊥⊥ V.   (37)

To measure how well (36) fits the data, we define an estimator

Δ_emp := ‖(1/m) ∑_{i=1}^m Φ(y_i) − (1/m²) ∑_{i,j=1}^m Φ(f(x_i) + u_j)‖².   (38)

Analogously, we define the estimator in the backward direction

Δ_emp^bw := ‖(1/m) ∑_{i=1}^m Φ(x_i) − (1/m²) ∑_{i,j=1}^m Φ(g(y_i) + v_j)‖².   (39)

Here, we assume that we are given the conditional mean functions f : x ↦ E[Y | X = x] and g : y ↦ E[X | Y = y].

In practice, we would apply the following procedure: we are given a sample (x_1, y_1), ..., (x_m, y_m). We estimate the function f as well as the residual noise terms u_1, ..., u_m by regression, and likewise for the backward function g (Peters et al. 2014). Strictly speaking, we need to use separate subsamples to estimate function and noise terms, respectively, see Kpotufe et al. (2014).

Below, we show that Δ_emp converges to 0 for additive noise models (36). For the incorrect model (37), however, Δ_emp^bw will in the generic case not converge to zero. We can thus use the comparison of both values for deciding causal direction.

Theorem 4 Suppose x_1, ..., x_m and u_1, ..., u_m are mutually independent i.i.d. samples, and y_i = f(x_i) + u_i. Assume further that the direction of the additive noise model is identifiable (Peters et al. 2014) and the kernel for x is characteristic. We then have

Δ_emp → 0   (40)

and

Δ_emp^bw ↛ 0   (41)

in probability as m → ∞.

Proof Equation (40) follows from Theorem 2 since ‖(1/m) ∑_{i=1}^m Φ(y_i) − μ[Y]‖ → 0 and ‖(1/m²) ∑_{i,j=1}^m Φ(f(x_i) + u_j) − μ[Y]‖ → 0 (all convergences in this proof are in probability).

To prove (41), we assume that Δ_emp^bw → 0, which implies

‖(1/m²) ∑_{i,j=1}^m Φ(g(y_i) + v_j) − μ[X]‖ → 0.   (42)

The key idea is to introduce a random variable Ṽ that has the same distribution as V but is independent of Y, and to consider the following decomposition of the sum appearing in (42):

(1/m²) ∑_{i,j=1}^m Φ(g(y_i) + v_j)
= (1/m²) ∑_{i=1}^m Φ(g(y_i) + v_i) + (1/m²) ∑_{i=1}^m ∑_{k=1}^{m−1} Φ(g(y_i) + v_{i+k})
= (1/m) · (1/m) ∑_{i=1}^m Φ(x_i) + ((m−1)/m) · (1/(m(m−1))) ∑_{i=1}^m ∑_{k=1}^{m−1} Φ(g(y_i) + v_{i+k})
=: A_m + B_m,

where the index for v is interpreted modulo m, for instance, v_{m+3} := v_3. Since v_{i+k} = x_{i+k} − g(y_{i+k}) is independent of y_i, it further follows from Theorem 2 that A_m − (1/m) μ[X] → 0 and B_m − ((m−1)/m) μ[g(Y) + Ṽ] → 0. Therefore,

A_m + B_m − (1/m) μ[X] − ((m−1)/m) μ[g(Y) + Ṽ] → 0.

Together with (42), this implies

μ[X] − (1/m) μ[X] − ((m−1)/m) μ[g(Y) + Ṽ] → 0

and therefore

μ[g(Y) + Ṽ] = μ[X].

Since the kernel is characteristic, this implies

g(Y) + Ṽ = X (in distribution)

with Y ⊥⊥ Ṽ, which contradicts the identifiability of the additive noise model. □

As an aside, note that Theorem 4 would not hold if in (39) we were to estimate μ[g(Y) + V] by (1/m) ∑_{i=1}^m Φ(g(y_i) + v_i) instead of (1/m²) ∑_{i,j=1}^m Φ(g(y_i) + v_j).
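A sketch of the statistics (38) and (39) (ours; in the paper the regression function and residuals are estimated from data, whereas here a known toy model is plugged in for illustration): each Δ is a squared RKHS distance between two expansions and is computed from kernel evaluations only.

```python
import numpy as np

def gauss_kernel(a, b, sigma=1.0):
    d = np.asarray(a, dtype=float)[:, None] - np.asarray(b, dtype=float)[None, :]
    return np.exp(-d**2 / (2.0 * sigma**2))

def delta_emp(effect, cause, f, residuals, sigma=1.0):
    """||(1/m) sum_i Phi(y_i) - (1/m^2) sum_{i,j} Phi(f(x_i)+u_j)||^2, cf. (38) and (39)."""
    model = (f(cause)[:, None] + residuals[None, :]).ravel()   # the m^2 points f(x_i) + u_j
    return (gauss_kernel(effect, effect, sigma).mean()
            - 2.0 * gauss_kernel(effect, model, sigma).mean()
            + gauss_kernel(model, model, sigma).mean())

# toy additive noise model Y = X^3 + U, with the forward regression assumed known here
rng = np.random.default_rng(4)
x = rng.uniform(-1.0, 1.0, 40)
u = 0.1 * rng.normal(size=40)
y = x**3 + u
f = lambda t: t**3
print(delta_emp(y, x, f, y - f(x)))                 # forward direction
print(delta_emp(x, y, np.cbrt, x - np.cbrt(y)))     # backward direction with a crude plug-in g
```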

5 Experiments

5.1 Synthetic data

We consider basic arithmetic expressions that involve multiplication X × Y, division X/Y, and exponentiation X^Y on two independent scalar RVs X and Y. Letting p_X = N(3, 0.5) and p_Y = N(4, 0.5), we draw i.i.d. samples X = {x_1, ..., x_m} and Y = {y_1, ..., y_m} from p_X and p_Y.

In the experiment, we are interested in the convergence (in RKHS norm) of our estimators to μ[f(X, Y)]. Since we do not have access to the latter, we use an independent sample to construct a proxy μ̂[f(X, Y)] = (1/ℓ²) ∑_{i,j=1}^ℓ Φ_z(f(x_i, y_j)). We found that ℓ = 100 led to a sufficiently good approximation.

Next, we compare three estimators, referred to as μ̂_1, μ̂_2, and μ̂_3 below, for sample sizes m ranging from 10 to 50:

1. The sample-based estimator (23).
2. The estimator (28) based on approximations of the kernel means, taking the form μ̂[X] := ∑_{i=1}^{m′} α_i Φ_x(x_i) and μ̂[Y] := ∑_{j=1}^{m′} β_j Φ_y(y_j) of μ[X] and μ[Y], respectively. We used the simplest possible reduced set selection method: we randomly subsampled subsets of size m′ ≈ 0.4 · m from X and Y, and optimized the coefficients {α_1, ..., α_{m′}} and {β_1, ..., β_{m′}} to best approximate the original kernel means (based on X and Y) in the RKHS norm (Schölkopf and Smola 2002, Sect. 18.3).
3. Analogously to the case of one variable (17), we may also look at the estimator μ̂_3[X, Y] := (1/m) ∑_{i=1}^m Φ_z(f(x_i, y_i)), which sums only over m mutually independent terms, i.e., a small fraction of all terms of (23).

For i = 1, 2, 3, we evaluate the estimates μ̂_i[f(X, Y)] using the error measure

L = ‖μ̂_i[f(X, Y)] − μ̂[f(X, Y)]‖².   (43)

We use (6) to evaluate L in terms of kernels. In all cases, we employ a Gaussian RBF kernel (26) whose bandwidth parameter is chosen using the median heuristic, setting σ to the median of the pairwise distances of distinct data points (Gretton et al. 2005).

Figure 1 depicts the error (43) as a function of sample size m. For all operations, the error decreases as sample size increases. Note that Φ_z is different across the three operations, resulting in different scales of the average error in Fig. 1.
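A condensed sketch of this experiment (ours; the sample sizes, proxy size, and single operation shown are illustrative rather than the paper's full protocol with 30 repetitions): the error (43) is evaluated via (6) for the full estimator μ̂_1 and the 'diagonal' estimator μ̂_3.

```python
import numpy as np

def gauss_kernel(a, b, sigma=1.0):
    d = np.asarray(a, dtype=float)[:, None] - np.asarray(b, dtype=float)[None, :]
    return np.exp(-d**2 / (2.0 * sigma**2))

def sq_dist(p, wp, q, wq, sigma=1.0):
    """||sum_i wp_i Phi(p_i) - sum_j wq_j Phi(q_j)||^2 via (6), used for the error (43)."""
    return (wp @ gauss_kernel(p, p, sigma) @ wp
            - 2.0 * wp @ gauss_kernel(p, q, sigma) @ wq
            + wq @ gauss_kernel(q, q, sigma) @ wq)

rng = np.random.default_rng(5)
m, m_proxy = 30, 40
x, y = rng.normal(3, 0.5, m), rng.normal(4, 0.5, m)                # training samples
xp, yp = rng.normal(3, 0.5, m_proxy), rng.normal(4, 0.5, m_proxy)  # independent proxy sample
f = np.multiply                                                    # the operation X * Y

proxy = f(xp[:, None], yp[None, :]).ravel()
w_proxy = np.full(proxy.size, 1.0 / proxy.size)
mu1 = f(x[:, None], y[None, :]).ravel()                            # estimator (23)
w1 = np.full(mu1.size, 1.0 / mu1.size)
mu3, w3 = f(x, y), np.full(m, 1.0 / m)                             # 'diagonal' estimator

sigma = np.median(np.abs(proxy[:, None] - proxy[None, :]))         # median heuristic
print(sq_dist(mu1, w1, proxy, w_proxy, sigma))                     # error of mu_1
print(sq_dist(mu3, w3, proxy, w_proxy, sigma))                     # error of mu_3
```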

Fig. 1 Error of the proposed estimators for three arithmetic operations—multiplication X × Y, division X/Y, and exponentiation X^Y—as a function of sample size m. The error reported is an average of 30 repetitions of the simulations. The expensive estimator μ̂_1 [see (23)] performs best. The approximation μ̂_2 [see (28)] works well as sample sizes increase.

Fig. 2 a Scatter plot of the data of causal pair 78 in the CauseEffectPairs benchmarks, along with the forward and backward function fits, y = f(x) and x = g(y). b Accuracy of cause–effect decisions on all the 81 pairs in the CauseEffectPairs benchmarks.



5.2 Causal discovery via functions of kernel means

We also apply our KPP approach to the bivariate causal inference problem (cf. Sect. 4). That is, given a pair of real-valued random variables X and Y with joint distribution p_XY, we are interested in identifying whether X causes Y (denote as X → Y) or Y causes X (denote as Y → X) based on observational data. We assume an additive noise model E = f(C) + U with C ⊥⊥ U, where C, E, and U denote cause, effect, and residual (or "unexplained") variable, respectively. Below we present a preliminary result on the CauseEffectPairs benchmark data set (Peters et al. 2014).

For each causal pair (X, Y) = {(x_1, y_1), ..., (x_m, y_m)}, we estimate functions y ≈ f(x) and x ≈ g(y) as least-square fits using degree 4 polynomials. We illustrate one example in Fig. 2a. Next, we compute the residuals in both directions as u_i = y_i − f(x_i) and v_j = x_j − g(y_j).¹ Finally, we compute scores Δ_{X→Y} and Δ_{Y→X} by

Δ_{X→Y} := ‖(1/m) ∑_{i=1}^m Φ(y_i) − (1/m²) ∑_{i,j=1}^m Φ(f(x_i) + u_j)‖²,
Δ_{Y→X} := ‖(1/m) ∑_{i=1}^m Φ(x_i) − (1/m²) ∑_{i,j=1}^m Φ(g(y_i) + v_j)‖².

¹ For simplicity, this was done using the same data; but cf. our discussion following (38).
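A sketch of the resulting decision rule (ours; data loading and the random-Fourier-feature speed-up mentioned below are omitted, and all names are hypothetical): fit degree-4 polynomials in both directions, form the residuals, compute Δ_{X→Y} and Δ_{Y→X} with a Gaussian RBF kernel and the median heuristic, and prefer the direction with the smaller score.

```python
import numpy as np

def gauss_kernel(a, b, sigma):
    d = np.asarray(a, dtype=float)[:, None] - np.asarray(b, dtype=float)[None, :]
    return np.exp(-d**2 / (2.0 * sigma**2))

def delta_score(effect, cause, fit, sigma):
    """Score (38)/(39): squared RKHS distance between mu[effect] and mu[fit(cause) + residual]."""
    res = effect - fit(cause)
    model = (fit(cause)[:, None] + res[None, :]).ravel()
    return (gauss_kernel(effect, effect, sigma).mean()
            - 2.0 * gauss_kernel(effect, model, sigma).mean()
            + gauss_kernel(model, model, sigma).mean())

def decide_direction(x, y):
    """Return 'X->Y' or 'Y->X' by comparing Delta_{X->Y} and Delta_{Y->X}."""
    f = np.poly1d(np.polyfit(x, y, deg=4))    # least-squares fit y ~ f(x)
    g = np.poly1d(np.polyfit(y, x, deg=4))    # least-squares fit x ~ g(y)
    z = np.concatenate([x, y])
    sigma = np.median(np.abs(z[:, None] - z[None, :]))   # median heuristic
    return 'X->Y' if delta_score(y, x, f, sigma) < delta_score(x, y, g, sigma) else 'Y->X'

rng = np.random.default_rng(6)
x = rng.uniform(-2.0, 2.0, 40)
y = x**3 + 0.2 * rng.normal(size=40)          # toy pair in which X causes Y
print(decide_direction(x, y))
```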

Following Theorem 4, we can use the comparison between Δ_{X→Y} and Δ_{Y→X} to infer the causal direction. Specifically, we decide that X → Y if Δ_{X→Y} < Δ_{Y→X}, and that Y → X otherwise. In this experiment, we also use a Gaussian RBF kernel whose bandwidth parameter is chosen using the median heuristic. To speed up the computation of Δ_{X→Y} and Δ_{Y→X}, we adopted a finite approximation of the feature map using 100 random Fourier features [see Rahimi and Recht (2007) for details]. We allow the method to abstain whenever the two values are closer than δ > 0. By increasing δ, we can compute the method's accuracy as a function of a decision rate (i.e., the fraction of decisions that our method is forced to make) ranging from 100 % to 0 %.

Figure 2b depicts the accuracy versus the decision rate for the 81 pairs in the CauseEffectPairs benchmark collection. The method achieves an accuracy of 80 %, which is significantly better than random guessing, when forced to infer the causal direction of all 81 pairs.

6 Conclusions

We have developed a kernel-based approach to compute functional operations on random variables taking values in arbitrary domains. We have proposed estimators for RKHS representations of those quantities, evaluated the approach on synthetic data, and showed how it can be used for cause–effect inference. While the results are encouraging, the material presented in this article only describes the main ideas, and much remains to be done. We believe that there is significant potential for a unified perspective on probabilistic programming based on the described methods, and hope that some of the open problems will be addressed in future work.
Acknowledgments Thanks to Dominik Janzing, Le Song, and Ilya Tolstikhin for discussions and comments.

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

References

Bach, F.R., Jordan, M.I.: Kernel independent component analysis. J. Mach. Learn. Res. 3, 1–48 (2002)
Berlinet, A., Agnan, T.C.: Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic Publishers, Boston (2004)
Boser, B.E., Guyon, I.M., Vapnik, V.N.: A training algorithm for optimal margin classifiers. In: Proceedings of the Fifth Annual Workshop on Computational Learning Theory, COLT '92, pp. 144–152. ACM, New York (1992)
Cassel, J.: Probabilistic programming with stochastic memoization: implementing non-parametric Bayesian inference. Math. J. 16 (2014). doi:10.3888/tmj.16-1
Epstein, B.: Some applications of the Mellin transform in statistics. Ann. Math. Stat. 19(3), 370–379 (1948)
Ferson, S.: What Monte Carlo methods cannot do. Hum. Ecol. Risk Assess.: Int. J. 2(4), 990–1007 (1996). doi:10.1080/10807039609383659
Fukumizu, K., Bach, F., Jordan, M.I.: Kernel dimension reduction in regression. Ann. Stat. 37(4), 1871–1905 (2009)
Fukumizu, K., Gretton, A., Sun, X., Schölkopf, B.: Kernel measures of conditional dependence. In: Platt, J.C., Koller, D., Singer, Y., Roweis, S. (eds.) Advances in Neural Information Processing Systems, vol. 20, pp. 489–496. Curran, Red Hook (2008)
Fukumizu, K., Song, L., Gretton, A.: Kernel Bayes' rule: Bayesian inference with positive definite kernels. J. Mach. Learn. Res. 14, 3753–3783 (2013)
Gordon, A.D., Henzinger, T.A., Nori, A.V., Rajamani, S.K.: Probabilistic programming. In: International Conference on Software Engineering (ICSE, FOSE track) (2014)
Gretton, A., Borgwardt, K.M., Rasch, M.J., Schölkopf, B., Smola, A.: A kernel two-sample test. J. Mach. Learn. Res. 13, 723–773 (2012)
Gretton, A., Bousquet, O., Smola, A.J., Schölkopf, B.: Measuring statistical dependence with Hilbert–Schmidt norms. In: Jain, S., Simon, H.U., Tomita, E. (eds.) Algorithmic Learning Theory: 16th International Conference, pp. 63–78. Springer, Berlin (2005)
Gretton, A., Smola, A.J., Huang, J., Schmittfull, M., Borgwardt, K., Schölkopf, B.: Covariate shift by kernel mean matching. In: Candela, J.Q., Sugiyama, M., Schwaighofer, A., Lawrence, N.D. (eds.) Dataset Shift in Machine Learning, pp. 131–160. MIT Press, Cambridge (2009)
Harmeling, S., Hirsch, M., Schölkopf, B.: On a link between kernel mean maps and Fraunhofer diffraction, with an application to super-resolution beyond the diffraction limit. In: CVPR, pp. 1083–1090. IEEE (2013)
Hofmann, T., Schölkopf, B., Smola, A.J.: Kernel methods in machine learning. Ann. Stat. 36(3), 1171–1220 (2008)
Huang, J., Smola, A., Gretton, A., Borgwardt, K., Schölkopf, B.: Correcting sample selection bias by unlabeled data. In: Schölkopf, B., Platt, J., Hoffman, T. (eds.) Advances in Neural Information Processing Systems (NIPS), pp. 601–608. MIT Press, Cambridge (2007)
James, W., Stein, C.: Estimation with quadratic loss. In: Proc. Fourth Berkeley Symp. Math. Statist. Prob., pp. 361–379 (1961)
Jaroszewicz, S., Korzen, M.: Arithmetic operations on independent random variables: a numerical approach. SIAM J. Sci. Comput. 34(3), A1241–A1265 (2012)
Kanagawa, M., Fukumizu, K.: Recovering distributions from Gaussian RKHS embeddings. In: JMLR W&CP 33 (Proc. AISTATS 2014), pp. 457–465 (2014)
Kpotufe, S., Sgouritsa, E., Janzing, D., Schölkopf, B.: Consistency of causal inference under the additive noise model. In: ICML (2014)
Lopez-Paz, D., Muandet, K., Schölkopf, B., Tolstikhin, I.: Towards a learning theory of cause-effect inference. In: Proceedings of the 32nd International Conference on Machine Learning, JMLR: W&CP, Lille, France, in press (2015)
Milios, D.: Probability Distributions as Program Variables. Master's thesis, School of Informatics, University of Edinburgh (2009)
Mooij, J.M., Peters, J., Janzing, D., Zscheischler, J., Schölkopf, B.: Distinguishing cause from effect using observational data: methods and benchmarks. arXiv:1412.3773 (2014). Accessed 18 Jan 2015
Muandet, K., Balduzzi, D., Schölkopf, B.: Domain generalization via invariant feature representation. In: ICML, pp. 10–18 (2013)
Muandet, K., Fukumizu, K., Dinuzzo, F., Schölkopf, B.: Learning from distributions via support measure machines. In: Bartlett, P., Pereira, F., Burges, C., Bottou, L., Weinberger, K. (eds.) Advances in Neural Information Processing Systems 25, pp. 10–18. Curran Associates, Inc., Red Hook (2012)

Muandet, K., Fukumizu, K., Sriperumbudur, B., Gretton, A., Schölkopf, B.: Kernel mean estimation and Stein effect. In: Xing, E.P., Jebara, T. (eds.) Proceedings of the 31st International Conference on Machine Learning, JMLR W&CP, vol. 32 (2014a)
Muandet, K., Sriperumbudur, B., Schölkopf, B.: Kernel mean estimation via spectral filtering. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., Weinberger, K. (eds.) Advances in Neural Information Processing Systems, vol. 27. Curran Associates, Inc. (2014b)
Paige, B., Wood, F.: A compilation target for probabilistic programming languages. In: Journal of Machine Learning Research; ICML 2014, pp. 1935–1943 (2014)
Pearl, J.: Causality: Models, Reasoning, and Inference, 2nd edn. Cambridge University Press, New York (2009)
Peters, J., Mooij, J., Janzing, D., Schölkopf, B.: Causal discovery with continuous additive noise models. J. Mach. Learn. Res. 15, 2009–2053 (2014)
Prasad, R.D.: Probability distributions of algebraic functions of independent random variables. SIAM J. Appl. Math. 18(3), 614–626 (1970)
Rahimi, A., Recht, B.: Random features for large-scale kernel machines. In: Platt, J., Koller, D., Singer, Y., Roweis, S. (eds.) Advances in Neural Information Processing Systems, vol. 20, pp. 1177–1184. Curran Associates, Inc. (2007)
Schölkopf, B., Platt, J.C., Shawe-Taylor, J.C., Smola, A.J., Williamson, R.C.: Estimating the support of a high-dimensional distribution. Neural Comput. 13(7), 1443–1471 (2001)
Schölkopf, B., Smola, A.J.: Learning with Kernels. MIT Press, Cambridge, MA, USA (2002)
Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge (2004)
Smola, A., Gretton, A., Song, L., Schölkopf, B.: A Hilbert space embedding for distributions. In: Proc. Algorithmic Learning Theory, pp. 13–31. Springer-Verlag (2007)
Song, L.: Learning via Hilbert space embedding of distributions. Ph.D. thesis, The School of Information Technologies, The University of Sydney (2008)
Song, L., Boots, B., Siddiqi, S.M., Gordon, G., Smola, A.J.: Hilbert space embeddings of hidden Markov models. In: Proceedings of the 27th International Conference on Machine Learning (ICML) (2010)
Song, L., Gretton, A., Bickson, D., Low, Y., Guestrin, C.: Kernel belief propagation. In: Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS) (2011)
Song, L., Huang, J., Smola, A., Fukumizu, K.: Hilbert space embeddings of conditional distributions with applications to dynamical systems. In: ICML (2009)
Song, L., Zhang, X., Smola, A.J., Gretton, A., Schölkopf, B.: Tailoring density estimation via reproducing kernel moment matching. In: Cohen, W.W., McCallum, A., Roweis, S. (eds.) Proceedings of the 25th International Conference on Machine Learning, pp. 992–999. ACM Press, New York (2008)
Springer, M.: The Algebra of Random Variables. Wiley Series in Probability and Mathematical Statistics. Wiley, Hoboken (1979)
Springer, M.D., Thompson, W.E.: The distribution of products of independent random variables. SIAM J. Appl. Math. 14(3), 511–526 (1966)
Sriperumbudur, B.K., Gretton, A., Fukumizu, K., Lanckriet, G., Schölkopf, B.: Injective Hilbert space embeddings of probability measures. In: The 21st Annual Conference on Learning Theory (COLT) (2008)
Sriperumbudur, B.K., Gretton, A., Fukumizu, K., Schölkopf, B., Lanckriet, G.: Hilbert space embeddings and metrics on probability measures. J. Mach. Learn. Res. 99, 1517–1561 (2010)
Steinwart, I.: On the influence of the kernel on the consistency of support vector machines. J. Mach. Learn. Res. 2, 67–93 (2002)
Steinwart, I., Christmann, A.: Support Vector Machines. Springer, New York (2008)
Szabó, Z., Gretton, A., Póczos, B., Sriperumbudur, B.: Two-stage sampled learning theory on distributions. arXiv:1402.1754 (2014). Accessed 18 Jan 2015
Williamson, R.C.: Probabilistic arithmetic. Ph.D. thesis, Department of Electrical Engineering, University of Queensland, St. Lucia, Queensland, Australia (1989)
Wood, F., van de Meent, J.W., Mansinghka, V.: A new approach to probabilistic programming inference. In: Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS), JMLR: W&CP, vol. 33. Reykjavik, Iceland (2014)
Zhang, K., Peters, J., Janzing, D., Schölkopf, B.: Kernel-based conditional independence test and application in causal discovery. In: Cozman, F., Pfeffer, A. (eds.) 27th Conference on Uncertainty in Artificial Intelligence (UAI 2011), pp. 804–813. AUAI Press, Corvallis, OR, USA (2011)
