
Kernel Conditional Exponential Family

Michael Arbel and Arthur Gretton
Gatsby Computational Neuroscience Unit, University College London
{michael.n.arbel, arthur.gretton}@gmail.com

Abstract

A nonparametric family of conditional distributions is introduced, which generalizes conditional exponential families using functional parameters in a suitable RKHS. An algorithm is provided for learning the generalized natural parameter, and consistency of the estimator is established in the well specified case. In experiments, the new method generally outperforms a competing approach with consistency guarantees, and is competitive with a deep conditional density model on datasets that exhibit abrupt transitions and heteroscedasticity.

1 Introduction

Distribution estimation is one of the most general problems in machine learning. Once an estimator for a distribution is learned, it in principle allows one to solve a variety of problems such as classification, regression, matrix completion and other prediction tasks. With the increasing diversity and complexity of machine learning problems, regressing the conditional mean of y knowing x may not be sufficiently informative when the conditional density p(y|x) is multimodal. In such cases, one would like to estimate the conditional distribution itself to get a richer characterization of the dependence between the two variables y and x. In this paper, we address the problem of estimating conditional densities when x and y are continuous and multi-dimensional.

Our conditional density model builds on a generalisation of the exponential family to infinite dimensions (Barron et al., 1991; Canu et al., 2006; Fukumizu, 2009; Gu et al., 1993; Pistone et al., 1995), where the natural parameter is a function in a reproducing kernel Hilbert space (RKHS): in this sense, like the Gaussian and Dirichlet processes, the kernel exponential family (KEF) is an infinite dimensional analogue of the finite dimensional case, allowing a much richer class of densities to be fit. While the maximum likelihood solution is ill-posed in infinite dimensions, Sriperumbudur et al. (2017) have demonstrated that it is possible to fit the KEF via score matching (Hyvärinen, 2005), which entails solving a linear system of size nd × nd, where n is the number of samples and d is the problem dimension. It is straightforward to draw samples from such models using Hamiltonian Monte Carlo (Neal, 2010), since they directly return the required potential energy (Rasmussen, 2003; Strathmann et al., 2015). In high dimensions, fitting a KEF model to samples becomes challenging, however: the computational cost rises as d^3, and complex interactions between dimensions can be difficult to model.

The complexity of the modelling task can be significantly reduced if a directed graphical model can be constructed over the variables (Jordan, 1999; Pearl, 2001), where each variable depends on a subset of parent variables (ideally much smaller than the total, as in e.g. a Markov chain). In the present study, we extend the non-parametric family of Sriperumbudur et al. (2017) to fit conditional distributions. The natural parameter of the conditional infinite exponential family is now an operator mapping the conditioning variable to a function space of features of the conditioned variable: for this reason, the score matching framework must be generalised to the vector-valued setting of Micchelli et al. (2005). We establish consistency in the well specified case by generalising the arguments of Sriperumbudur et al. (2017) to the vector-valued RKHS. While our proof allows for general vector-valued RKHSs, we provide a practical implementation for a specific case, which takes the form of a linear system of size nd × nd. The code can be found at: https://github.com/MichaelArbel/KCEF

A number of alternative approaches have been proposed to the problem of conditional density estimation.

Sugiyama et al. (2010) introduced the Least-Squares Conditional Density Estimation (LS-CDE) method, which provides an estimate of a conditional density function p(y|x) as a non-negative linear combination of basis functions. The method is proven to be consistent, and works well on reasonably complicated learning problems, although the optimal choice of basis functions for the method is an open question (in their paper, the authors use Gaussians centered on the samples). Earlier non-parametric methods such as variants of Kernel Density Estimation (KDE) may also be used in conditional density estimation (Fan et al., 1996; Hall et al., 1999). These approaches also have consistency guarantees; however, their performance degrades in high-dimensional settings (Nagler et al., 2016). Sugiyama et al. (2010) found that kernel density approaches performed less well in practice than LS-CDE.

Kernel Quantile Regression (KQR), introduced by Takeuchi et al. (2006) and Zhang et al. (2016), allows one to predict a percentile of the conditional distribution when y is one-dimensional. KQR is formulated as a convex optimization problem with a unique global solution, and the entire solution path with respect to the percentile parameter can be computed efficiently (Takeuchi et al., 2009). However, KQR applies only to one-dimensional outputs and, according to Sugiyama et al. (2010), the solution path tracking algorithm tends to be numerically unstable in practice.

It is possible to represent and learn conditional probabilities without specifying probability densities, via conditional mean embeddings (Grunewalder et al., 2012; Song et al., 2010). These are conditional expectations of (potentially infinitely many) features in an RKHS, which can be used in obtaining conditional expectations of functions in this RKHS. Such expected features are complementary to the infinite dimensional exponential family, as they can be thought of as conditional expectations of an infinite dimensional sufficient statistic. This statistic can completely characterise the conditional distribution if the feature space is sufficiently rich (Sriperumbudur et al., 2010), and has consistency guarantees under appropriate smoothness assumptions. Drawing samples given a conditional mean embedding can be challenging: this is possible via the Herding procedure (Bach et al., 2012; Chen et al., 2010), as shown in Kanagawa et al. (2016), but requires a non-convex optimisation procedure to be solved for each sample.

A powerful and recent deep learning approach to modelling conditional densities is the Neural Autoregressive Network (Uria et al., 2013; Raiko et al., 2014; Uria et al., 2016). These networks can be thought of as a generalization of the Mixture Density Network introduced by Bishop (2006). In brief, each variable is represented by a mixture of Gaussians, with means and variances depending on the parent variables through a deep neural network. The network is trained on observed data using stochastic gradient descent. Neural autoregressive networks have shown their effectiveness for a variety of practical cases and learning problems. Unlike the earlier methods cited, however, consistency is not guaranteed, and these methods require non-convex optimization, meaning that locally optimal solutions are found in practice.

We begin our presentation in Section 2, where we briefly present the kernel exponential family. We generalise this model to the conditional case, in our first major contribution: this requires the introduction of vector-valued RKHSs and associated concepts. We then show that a generalisation of score matching may be used to fit the conditional density models for general vector-valued RKHSs, subject to appropriate conditions. We call this model the kernel conditional exponential family (KCEF).

Our second contribution, in Section 3, is an empirical estimator for the natural parameter of the KCEF (Theorem 1), with convergence guarantees in the well-specified case (Theorem 2). In our experiments (Section 4), we empirically validate the consistency of the estimator and compare it to other methods of conditional density estimation. Our method generally outperforms the leading alternative with consistency guarantees (LS-CDE). Compared with the deep approach (RNADE), which lacks consistency guarantees, our method has a clear advantage at small training sample sizes while being competitive at larger training sizes.

2 Kernel Exponential Families

In this section we first present the kernel exponential family, which we then extend to a class of conditional exponential families. Finally, we provide a methodology for unnormalized density estimation within this class.

2.1 Kernel Conditional Exponential Family

We consider the task of estimating the density p(y) of a random variable Y with support Y ⊆ R^d from i.i.d. samples (Y_i)_{i=1}^n. We propose to use a family of densities parametrized by functions belonging to a Reproducing Kernel Hilbert Space (RKHS) H_Y with positive semi-definite kernel k (Canu et al., 2006; Fukumizu, 2009; Sriperumbudur et al., 2017). This exponential family of density functions takes the form

p(y) := q_0(y) exp(⟨f, k(y, ·)⟩_{H_Y}) / Z(f),    f ∈ F,    (1)

where q_0 is a base density function on Y and F is the set of functions in the RKHS such that Z(f) := ∫_Y exp(⟨f, k(y, ·)⟩_{H_Y}) q_0(y) dy < ∞. In what follows, we call this family the kernel exponential family (KEF), by analogy with the classical exponential family. f plays the role of the natural parameter, while k(y, ·) is the sufficient statistic. Note that with an appropriate choice of the base distribution q_0 and a finite dimensional RKHS H_Y, one can recover any finite dimensional exponential family. When H_Y is infinite-dimensional, however, the family can approximate a much broader class of densities on R^d: under mild conditions, it is shown in Sriperumbudur et al. (2017) that the KEF approximates all densities of the form {q_0(y) exp(f(y) − A) | f ∈ C_0(Y)}, with A the normalizing constant and C_0(Y) the set of continuous functions with vanishing tails.

Given two subsets Y and X of R^d and R^p respectively, we now propose to extend the KEF to a family of conditional densities p(y|x). We modify equation (1) by making the function f depend on the conditioning variable x. The parameter f is a function of two variables x and y, f : X × Y → R, such that y ↦ f(x, y) belongs to the RKHS H_Y for all x in X. In all that follows, we will denote by T the mapping

T : X → H_Y,    x ↦ T_x,

such that T_x(y) = f(x, y) for all y in Y.

We next consider how to enforce a smoothness requirement on T to make the conditional density estimation problem well-posed. To achieve this, we will require that the mapping T belong to a vector-valued RKHS H; we now briefly review the associated theory, following Micchelli et al. (2005). A Hilbert space (H, ⟨·, ·⟩_H) of functions T : X → H_Y taking values in a vector space H_Y is said to be a vector-valued RKHS if for all x ∈ X and h ∈ H_Y, the linear functional T ↦ ⟨h, T_x⟩_{H_Y} is continuous. The reproducing property for vector-valued RKHSs follows from this definition. By the Riesz representation theorem, for each x ∈ X and h ∈ H_Y, there exists a linear operator Γ_x from H_Y to H such that for all T ∈ H,

⟨h, T_x⟩_{H_Y} = ⟨T, Γ_x h⟩_H.    (2)

Considering the dual operator Γ*_x from H to H_Y, we also get Γ*_x T = T_x. We can define a vector-valued reproducing kernel by composing the operator Γ_x with its dual,

Γ(x, x') = Γ*_x Γ_{x'},

where for all x and x', Γ(x, x') is a bounded linear operator from H_Y to H_Y, i.e., Γ(x, x') ∈ L(H_Y). The space H is said to be generated by an operator-valued reproducing kernel Γ. One practical choice for Γ is to define it as

Γ(x, x') = k_X(x, x') I_{H_Y},    ∀ x, x' ∈ X,    (3)

where I_{H_Y} is the identity operator on H_Y and k_X is now a real-valued kernel which generates a real-valued RKHS H_X on X (as in the conditional mean embedding; see Grunewalder et al., 2012). A simplified form of the estimator of T will be presented in Section 3 for this particular choice of Γ, and will be used in the experimental setup of Section 4.

We will now express T_x(y) in a convenient form that will allow us to extend the KEF. For a given x, recalling that T_x belongs to H_Y, one can write T_x(y) = ⟨T_x, k(y, ·)⟩_{H_Y} for all y in Y. Using the reproducing property in (2), one further gets T_x(y) = ⟨T, Γ_x k(y, ·)⟩_H. By considering the subset 𝒯 of H such that for all x in X the integral

Z(T_x) := ∫_Y q_0(y) exp(⟨T, Γ_x k(y, ·)⟩_H) dy

is finite, we define the kernel conditional exponential family (KCEF) as the set of conditional densities

p_T(y|x) := q_0(y) exp(⟨T, Γ_x k(y, ·)⟩_H) / Z(T_x),    T ∈ 𝒯.    (4)

Here T plays the role of the natural parameter, while Γ_x k(y, ·) is the sufficient statistic. When T is restricted to be constant with respect to x, we recover the kernel exponential family (KEF). The KCEF is therefore an extension of the KEF introduced in Sriperumbudur et al. (2017). It is also a special case of the family introduced in Canu et al. (2006); in the latter, the inner product is given by ⟨T, φ(x, y)⟩_H, where φ is a general feature of x and y. In the present work, φ has the particular form φ(x, y) = Γ_x k(y, ·). This allows us to further express p_T(y|x), for a given x, as an element of a KEF with sufficient statistic k(y, ·), using the identity ⟨T, Γ_x k(y, ·)⟩_H = ⟨T_x, k(y, ·)⟩_{H_Y}. This is desirable since p_T(y|x) remains in the same KEF family as x varies, and only the natural parameter T_x changes.
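To make the definition concrete, the following minimal sketch (Python/NumPy) evaluates the unnormalized KCEF density q_0(y) exp(T_x(y)) for the choice Γ(x, x') = k_X(x, x') I_{H_Y} in (3), with T represented by a finite expansion T(x, y) = Σ_j c_j k_X(z_j, x) k(w_j, y) over chosen anchor pairs (z_j, w_j), which is an element of the corresponding vector-valued RKHS. The function names, the Gaussian kernels and the standard normal base density q_0 are illustrative assumptions, not the interface of the released KCEF code.

    import numpy as np

    def k_gauss(a, b, sigma):
        # Gaussian kernel exp(-||a - b||^2 / (2 sigma^2)); a, b are scalars or 1-d arrays.
        diff = np.atleast_1d(a) - np.atleast_1d(b)
        return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

    def T_eval(x, y, anchors_x, anchors_y, coeffs, sigma_x, sigma_y):
        # Natural parameter T_x(y) = sum_j c_j k_X(z_j, x) k(w_j, y).
        return sum(c * k_gauss(zx, x, sigma_x) * k_gauss(wy, y, sigma_y)
                   for c, zx, wy in zip(coeffs, anchors_x, anchors_y))

    def log_unnormalized_kcef(x, y, anchors_x, anchors_y, coeffs, sigma_x=1.0, sigma_y=1.0):
        # log q_0(y) + T_x(y), with q_0 taken to be a standard normal base density.
        y = np.atleast_1d(y)
        log_q0 = -0.5 * np.dot(y, y) - 0.5 * y.size * np.log(2.0 * np.pi)
        return log_q0 + T_eval(x, y, anchors_x, anchors_y, coeffs, sigma_x, sigma_y)

For a fixed x, exponentiating this quantity and normalizing numerically over a grid of y (feasible in low dimensions) recovers p_T(y|x); Section 3 gives the estimator actually used to fit such a parameter from data.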

Γ∗ T = T . x x Given i.i.d samples (X ,Y )n in following a joint i i i=1 X ×Y distribution π(x)p (y x), where π defines a marginal 0 | We can define a vector-valued reproducing kernel by distribution over and p (y x) is a conditional den- X 0 | composing the operator Γx with its dual, sity function, we are interested in estimating p0 from n the samples (Xi,Yi)i=1. Our goal is to find the op- Γ(x, x0) = Γ∗ Γ , x x0 timal conditional density pT in the KCEF that best Kernel Conditional Exponential Family

The intractability of the normalizing constant Z(T_x) makes maximum likelihood estimation difficult. Sriperumbudur et al. (2017) used a score-matching approach (see Hyvärinen, 2005) to avoid this normalizing constant; in the case of the KCEF, however, the score function between π(x) p_0(y|x) and π(x) p_T(y|x) contains additional terms that involve the derivatives of the log-partition function log Z(T_x) with respect to x. Instead, we now propose a different approach, with a modified version of the score-matching objective.

We define the expected conditional score between two conditional densities p_0(y|x) and q(y|x) under a marginal density π on x to be

J(p_0 ∥ q) := ∫_X π(x) J(p_0(·|x) ∥ q(·|x)) dx,

where

J(p_0(·|x) ∥ q(·|x)) = (1/2) ∫_Y p_0(y|x) ∥∇_y log (p_0(y|x) / q(y|x))∥² dy.

For a fixed value x in X, J(p_0(·|x) ∥ q(·|x)) is the score-matching function between p_0(·|x) and q(·|x) as defined by Hyvärinen (2005). We further take the expectation over x to define a divergence over conditional densities. The normalizing constant of q(y|x), which is a function of x, is never involved in this formulation, as we take the gradient of the log-densities over y only. For a conditional density p_0(y|x) that is supported on the whole domain Y for all x in X, the expected conditional score is well behaved, in the sense that J(p_0 ∥ q) is always non-negative and reaches 0 if and only if the two conditional distributions p_0(y|x) and q(y|x) are equal for π-almost all x. More discussion can be found in Appendix D for the case when this condition fails to hold.

The goal is then to find a conditional distribution p_T in the KCEF, for a given T ∈ 𝒯, that minimizes this score over the whole family.

Under mild regularity conditions on the densities (see Hyvärinen, 2005; Sriperumbudur et al., 2017, and below), the score can be rewritten

J(p_0 ∥ p_T) = E[ Σ_{i=1}^d ( ∂_i² T_x(y) + (1/2) (∂_i T_x(y))² ) ] + E[ Σ_{i=1}^d ∂_i T_x(y) ∂_i log q_0(y) ] + J(p_0 ∥ q_0),

where J(p_0 ∥ q_0) is a constant term for the optimization problem and the expectation is taken over π(x) p_0(y|x). All derivatives are with respect to y, and we use the notation ∂_i f(y) = ∂f(y)/∂y_i. In the case of the KCEF, the conditions required to obtain this expression are satisfied under the assumptions in Appendix A.4, as proved in Theorem 3 of Appendix B.1. The expression is further simplified using the reproducing property for derivatives of functions in an RKHS (Lemma 3 of Appendix C),

∂_i T_x(y) = ⟨T, Γ_x ∂_i k(y, ·)⟩_H,    ∂_i² T_x(y) = ⟨T, Γ_x ∂_i² k(y, ·)⟩_H,

which leads to

J(T) = E[ Σ_{i=1}^d ( (1/2) ⟨T, Γ_x ∂_i k(y, ·)⟩²_H + ⟨T, ξ_i(x, y)⟩_H ) ],

with

ξ_i(x, y) = Γ_x ( ∂_i² k(y, ·) + ∂_i log q_0(y) ∂_i k(y, ·) ).    (5)

We introduced the notation J(T) := J(p_0 ∥ p_T) − J(p_0 ∥ q_0) for convenience.
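Since the rewritten score involves only derivatives of T_x(y) and of log q_0 with respect to y, it can be estimated by a sample average for any model with computable derivatives. The sketch below (Python/NumPy, one-dimensional y; the function names and the toy parametric family at the end are illustrative assumptions, not the paper's implementation) computes this Monte Carlo estimate of J(T), i.e. of J(p_0 ∥ p_T) up to the constant J(p_0 ∥ q_0).

    import numpy as np

    def empirical_conditional_score(X, Y, dT, d2T, dlog_q0):
        # Sample average of  d2T/dy2 + 0.5 * (dT/dy)^2 + (dT/dy) * dlog q_0/dy
        # over the pairs (X_b, Y_b); all derivatives are with respect to y.
        total = 0.0
        for x, y in zip(X, Y):
            g = dT(x, y)
            total += d2T(x, y) + 0.5 * g ** 2 + g * dlog_q0(y)
        return total / len(X)

    # Toy usage: T_x(y) = a * x * y and a standard normal base density q_0,
    # so that dlog q_0(y)/dy = -y; the data are drawn as in the Section 2.2 sketch.
    rng = np.random.default_rng(0)
    X = rng.uniform(-2.0, 2.0, 500)
    Y = np.sin(np.pi * X) + 0.3 * rng.standard_normal(500)
    a = 0.5
    J_hat = empirical_conditional_score(
        X, Y,
        dT=lambda x, y: a * x,
        d2T=lambda x, y: 0.0,
        dlog_q0=lambda y: -y,
    )

For T in the vector-valued RKHS, this sample average coincides, via the reproducing identities above, with the quadratic form Ĵ(T) given next; Section 3 minimizes a regularized version of that quantity in closed form.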

This formulation depends on p_0(y|x) only through an expectation, therefore a Monte Carlo estimator of the score can be derived as a quadratic functional of T in the RKHS H,

Ĵ(T) = (1/n) Σ_{b∈[n]; i∈[d]} ( (1/2) ⟨T, Γ_{X_b} ∂_i k(Y_b, ·)⟩²_H + ⟨T, ξ_i(X_b, Y_b)⟩_H ).

Note that the objective functions J(T) and Ĵ(T) can be defined over the whole space H, whereas J(p_0 ∥ p_T) is meaningful only if T belongs to 𝒯.

3 Empirical KCEF and consistency

In this section, we will first estimate the optimal T* = argmin_{T∈H} J(T) over the whole space H by minimizing a regularized version of the quadratic form Ĵ(T); we will then state conditions under which the obtained solutions belong to 𝒯, and therefore define conditional densities in the KCEF.

Following Sriperumbudur et al. (2017), we define the kernel ridge estimator to be T_{n,λ} = argmin_{T∈H} Ĵ(T) + (λ/2) ∥T∥²_H, where ∥T∥_H is the RKHS norm of T. T_{n,λ} is then obtained by solving a linear system of nd variables, as shown in the next theorem.

Theorem 1. Under the assumptions listed in Appendix A.4, and in particular if Γ(x, x) is uniformly bounded on X in operator norm, the minimizer T_{n,λ} exists, is unique, and is given by

T_{n,λ} = −(1/λ) ξ̂ + Σ_{b∈[n]; i∈[d]} β_(b,i) Γ_{X_b} ∂_i k(Y_b, ·),

where

ξ̂ = (1/n) Σ_{b∈[n]; i∈[d]} ξ_i(X_b, Y_b),

and the ξ_i are given by (5). β_(b,i) denotes the ((b−1)d + i)-th entry of a vector β in R^{nd}, obtained by solving the linear system

(G + nλI) β = h/λ,

where G is an nd by nd Gram matrix and h is a vector in R^{nd}, with

(G)_{(a,i),(b,j)} = ⟨Γ_{X_a} ∂_i k(Y_a, ·), Γ_{X_b} ∂_j k(Y_b, ·)⟩_H,
(h)_{(b,i)} = ⟨ξ̂, Γ_{X_b} ∂_i k(Y_b, ·)⟩_H.

The result is proved in Theorem 4 of Appendix B.2. For the particular choice of Γ in (3), the estimator takes the simplified form

T_{n,λ}(x, y) = −(1/λ) ξ̂(x, y) + Σ_{b∈[n]; i∈[d]} β_(b,i) k_X(X_b, x) ∂_i k(Y_b, y),

with

ξ̂(x, y) = (1/n) Σ_{b∈[n]; i∈[d]} ( k_X(X_b, x) ∂_i² k(Y_b, y) + k_X(X_b, x) ∂_i log q_0(Y_b) ∂_i k(Y_b, y) ).

The coefficients β are obtained by solving the same system (G + nλI) β = h/λ, where G and h reduce to

(G)_{(a,i),(b,j)} = k_X(X_a, X_b) ∂_i ∂_{j+d} k(Y_a, Y_b),
(h)_{(b,i)} = ∂_i ξ̂(X_b, Y_b),

and all derivatives are taken with respect to y.

The above estimator generalizes the estimator in Sriperumbudur et al. (2017) to conditional densities: if one chooses the kernel k_X to be a constant kernel, then one exactly recovers the setting of the KEF.

This linear system has a complexity of O(n³d³) in time and O(n²d²) in memory, which can be problematic for higher dimensions d as n grows. However, in practice, if the goal is to estimate a density of the form p(x_1, ..., x_d), one can use the general chain rule for distributions, p(x_1) p(x_2|x_1) ··· p(x_d|x_1, ..., x_{d−1}), and estimate each conditional density p(x_i|x_1, ..., x_{i−1}) using the KCEF in (4). This reduces the complexity of the algorithm to O(n³d). A reduction of the cubic complexity in the number of data points n could be managed via a Nyström-like approximation (Sutherland et al., 2017).
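As an illustration of how compact this computation is in the simplified case, the following sketch (Python/NumPy) assembles G, h and ξ̂ and solves (G + nλI)β = h/λ for d = 1. The Gaussian kernels for k_X and k, the standard normal base density q_0, the function names and the hyperparameter values are assumptions made for the example; the released KCEF package may differ.

    import numpy as np

    def k_x(a, b, sx):
        # Real-valued kernel on the conditioning variable, k_X(a, b) (Gaussian).
        return np.exp(-(a - b) ** 2 / (2.0 * sx ** 2))

    def k_y_derivs(u, v, sy):
        # Gaussian kernel k(u, v) on y and the derivatives needed below:
        # dk/du, d2k/du2, d2k/dudv, d3k/du2dv (u is the first argument, v the second).
        r = u - v
        k = np.exp(-r ** 2 / (2.0 * sy ** 2))
        dk_u = -r / sy ** 2 * k
        d2k_uu = (r ** 2 / sy ** 4 - 1.0 / sy ** 2) * k
        d2k_uv = (1.0 / sy ** 2 - r ** 2 / sy ** 4) * k
        d3k_uuv = (r ** 3 / sy ** 6 - 3.0 * r / sy ** 4) * k
        return k, dk_u, d2k_uu, d2k_uv, d3k_uuv

    def fit_kcef_1d(X, Y, lam=1e-3, sx=0.5, sy=0.5):
        # Solve (G + n lam I) beta = h / lam  (Theorem 1, simplified form, d = 1).
        n = len(X)
        KX = k_x(X[:, None], X[None, :], sx)                      # k_X(X_a, X_b)
        _, _, _, d2k_uv, d3k_uuv = k_y_derivs(Y[:, None], Y[None, :], sy)
        G = KX * d2k_uv
        dlogq0 = -Y                                               # q_0 standard normal
        # h_b = d/dy xi_hat(X_b, y) at y = Y_b, averaging over the sample index.
        h = np.mean(KX * (d3k_uuv + dlogq0[:, None] * d2k_uv), axis=0)
        beta = np.linalg.solve(G + n * lam * np.eye(n), h / lam)
        return beta

    def kcef_log_unnorm(x, y, X, Y, beta, lam=1e-3, sx=0.5, sy=0.5):
        # log q_0(y) + T_{n,lam}(x, y), with
        # T_{n,lam} = -(1/lam) xi_hat + sum_b beta_b k_X(X_b, .) d_1 k(Y_b, .).
        wx = k_x(X, x, sx)
        _, dk_u, d2k_uu, _, _ = k_y_derivs(Y, y, sy)
        xi_hat = np.mean(wx * (d2k_uu + (-Y) * dk_u))
        T = -xi_hat / lam + np.sum(beta * wx * dk_u)
        return -0.5 * y ** 2 - 0.5 * np.log(2.0 * np.pi) + T

Fitting on the toy sample from the Section 2.2 sketch amounts to beta = fit_kcef_1d(X_train, Y_train); for a fixed x, exponentiating kcef_log_unnorm over a grid of y and normalizing numerically gives the estimated conditional density.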
In the well-specified case, where the true conditional density p_0(y|x) is assumed to be in (4) (i.e. p_0(y|x) = p_{T_0}(y|x)), we analyze the convergence of the parameter estimate T_{n,λ} to T_0 and the convergence of the corresponding density p_{T_{n,λ}}(y|x) to the true density p_0(y|x). First, we consider the covariance operator C of the joint feature Γ_x k(y, ·) under the joint distribution of x and y, as introduced in Theorem 3 of Appendix B.1, and we denote by R(C^γ) the range space of the operator C^γ. We then have the following consistency result.

Theorem 2. Let γ > 0 be a positive parameter and define α = max(1/(2(γ+1)), 1/4) ∈ [1/4, 1/2). Under the conditions in Appendix A.4, for λ = n^{−α}, and if T_0 ∈ R(C^γ), then

∥T_{n,λ} − T_0∥_H = O_{p_0}(n^{−1/2 + α}).

Furthermore, if sup_{y∈Y} k(y, y) < ∞, then

KL(p_0 ∥ p_{T_{n,λ}}) = O_{p_0}(n^{−1 + 2α}).

These asymptotic rates match those obtained for the unconditional density estimator in Sriperumbudur et al. (2017). The smoother the parameter T_0, the closer α gets to 1/4, which in turn leads to convergence in KL divergence at a rate of order 1/√n. The worst case happens when the range-space parameter γ gets close to 0, in which case convergence in KL divergence happens at a rate close to 1/n^γ. A more technical formulation of this theorem, along with a proof, is presented in Appendix B.3 (see Theorems 5 and 6).

The regularity of the conditional density p(y|x) with respect to x is captured by the boundedness assumption on the operator-valued kernel Γ, i.e., ∥Γ(x, x)∥_op ≤ κ for all x in X (Assumption (E)). This assumption allows us to control the variations of the conditional distribution p(y|x) as x changes. Roughly speaking, we may estimate the conditional density p(y|x_0) at a given point x_0 from samples (Y_i, X_i) whenever there are X_i sufficiently close to x_0. The uniformly bounded kernel Γ allows the objective function J(T) to be expressed as a quadratic form J(T) = (1/2) ⟨T, CT⟩_H + ⟨T, ξ⟩_H + c_0 for a constant c_0, where C is the covariance operator introduced in Theorem 3. Furthermore, this boundedness assumption ensures that C is a "well-behaved" operator, namely a positive semi-definite trace-class operator. The population solution of the regularized score objective is then given by T_λ = (C + λI)^{−1} C T_0, while the estimator is given by T̂_{λ,n} = −(Ĉ + λI)^{−1} ξ̂, where Ĉ and ξ̂ are empirical estimators of C and ξ.

The proof of consistency makes use of ideas from Caponnetto et al. (2007) and Sriperumbudur et al. (2017), exploiting the properties of trace-class operators. The main idea is to first control the error ∥T_0 − T̂_{λ,n}∥_H by introducing the population solution T_λ,

∥T_0 − T̂_{λ,n}∥_H ≤ ∥T_0 − T_λ∥_H + ∥T_λ − T̂_{λ,n}∥_H.

The first term ∥T_0 − T_λ∥_H represents the regularization error, which is introduced by adding a regularization term λI to the operator C. This term doesn't depend [...]

[...]sive model with 100 units per layer. The model consists of a product of conditional densities of the form Π_{i=1}^d p(X_{o_i} | X_{o_<i}, θ, o), where o is a permutation of the [...]

[...]M) performed slightly better than the other methods in terms of speed of convergence as the sample size increases. The variants that exploit the Markov structure of the data (M) lead to the best performance for both KCEF and LSCDE, as expected. The NADE method has comparable performance for large sample sizes, but its performance drops significantly for small sample sizes. This behaviour will also be observed in subsequent experiments on real data. The figure on the right shows the evolution of the log-likelihood per dimension as the dimension increases. In the F case, our approach is comparable to LSCDE, with an advantage in small dimensions. The F approaches both use an anisotropic RBF kernel with a tuned per-dimension bandwidth, which ends up performing a kind of automatic relevance determination. This helps in getting performance comparable to the M methods. A drastic drop in performance can happen when an isotropic kernel is used instead, as confirmed by Figure 5 of Appendix E. Finally, NADE, which is also agnostic to the Markov structure of the data, seems to achieve comparable performance to the F methods, with a slight disadvantage in higher dimensions.

Real data: We applied the proposed and existing methods to the R package benchmark datasets (Team, 2008) (see Table 1) as well as three UCI datasets previously used to study the performance of other density estimators (see Silva et al., 2011; Tang et al., 2012; Uria et al., 2013). In all cases the data are centered and normalized.

First, the R benchmark datasets are low dimensional with few samples, but with a relatively complex conditional dependence between the variables. This setting allows us to compare the methods in terms of data efficiency and overfitting. Each dataset was randomly split into a training and a test set of equal size. The models are trained to estimate the conditional density of a one-dimensional variable y knowing x, using samples (x_i, y_i)_{i=1}^n from the training set. The accuracy is measured by the negative log-likelihood on the test samples, averaged over 20 random splits of the data. We compared the proposed method with NADE and LSCDE on 14 datasets. For NADE we used cross-validation over the number of units per layer {2, 10, 100} and the number of mixture components {1, 2, 5, 10} for a 2-layer network. We also used cross-validation to choose the hyper-parameters for LSCDE and the proposed method on a 20 × 20 grid (for λ and σ).

The experimental results are summarized in Table 1. LSCDE worked well in general, as claimed in the original paper; however, the proposed method substantially improves the results. On the other hand, NADE performed rather poorly due to the small sample size of the training set, despite our attempts to improve its performance by reducing the number of parameters to train and by introducing early stopping.

Table 1: Mean and standard deviation of the negative log-likelihood on benchmark data over 20 runs, with different random splits. In all cases d_y = 1. Best method in boldface (two-sided paired t-test at 5%).

    dataset          KCEF            NADE             LSCDE
    caution          0.99 ± 0.01     4.12 ± 0.02      1.19 ± 0.02
    ftcollinssnow    1.46 ± 0.0      3.09 ± 0.02      1.56 ± 0.01
    highway          1.17 ± 0.01     11.02 ± 1.05     1.98 ± 0.04
    heights          1.27 ± 0.0      2.71 ± 0.0       1.3 ± 0.0
    sniffer          0.33 ± 0.01     1.51 ± 0.04      0.48 ± 0.01
    snowgeese        0.72 ± 0.02     2.9 ± 0.15       1.39 ± 0.05
    GAGurine         0.46 ± 0.0      1.66 ± 0.02      0.7 ± 0.01
    geyser           1.21 ± 0.04     1.43 ± 0.07      0.7 ± 0.01
    topo             0.67 ± 0.01     4.26 ± 0.02      0.83 ± 0.0
    BostonHousing    0.3 ± 0.0       3.46 ± 0.1       1.13 ± 0.01
    CobarOre         3.42 ± 0.03     4.7 ± 0.02       1.61 ± 0.02
    engel            0.18 ± 0.0      1.46 ± 0.02      0.76 ± 0.01
    mcycle           0.56 ± 0.01     2.24 ± 0.01      0.93 ± 0.01
    BigMac2003       0.59 ± 0.01     13.8 ± 0.13      1.63 ± 0.03

The UCI datasets (Red Wine, White Wine and Parkinsons) are challenging datasets with non-linear dependencies and abrupt transitions between high and low density regions. This makes the densities difficult to model using standard tools such as mixtures of Gaussians or factor analysis. They also contain enough training sample points to allow stronger performance by NADE. All discrete-valued variables were eliminated, as well as one variable from every pair of variables that are highly correlated (Pearson correlation greater than 0.98). Following Uria et al. (2013), 90% of the data were used for training while 10% were held out for testing. Two different graph factorizations (F, M) were used for the proposed method and for LSCDE.

In Table 2, we report the performance of the different models. Our method was among the statistically significant group of best models on the Parkinsons dataset according to the two-sided paired t-test at a significance level of 5%. On the remaining datasets, it achieved the second best performance after NADE.

Table 2: UCI results: average and standard deviation of the negative log-likelihood over 5 runs with different random splits. Best method in boldface (two-sided paired t-test at 5%).

    method     white wine       parkinsons       red wine
    KCEF-F     13.05 ± 0.36     2.86 ± 0.77      11.8 ± 0.93
    KCEF-M     14.36 ± 0.37     5.53 ± 0.79      13.31 ± 0.88
    LSCDE-F    13.59 ± 0.6      15.89 ± 1.48     14.43 ± 1.5
    LSCDE-M    14.42 ± 0.66     10.22 ± 1.45     14.06 ± 1.36
    NADE       10.55 ± 0.0      3.63 ± 0.0       9.98 ± 0.0

Sampling: We compare samples generated from the approximate distributions obtained using different methods (KEF, KCEF, NADE).

Figure 1: Experimental comparison of the proposed method KCEF and other methods (LSCDE and NADE) on the synthetic grid dataset. LEFT: log-likelihood vs. training sample size (d = 3). RIGHT: log-likelihood per dimension vs. dimension, N = 2000. Curves are shown for the F and M variants of KCEF and LSCDE, and for NADE. The log-likelihood is evaluated on a separate test set of size 2000.

To get samples (X_1, ..., X_d) from the joint distribution of the KCEF, we [...]

We also performed a test of relative similarity between the generated samples and the ground truth data, following the methodology and code of Bounliphone et al. (2015). Given samples X_m from the data and generated samples Y_n and Z_r from two different methods, we test the hypothesis that P_x is closer to P_z than to P_y according to the MMD metric. The null hypothesis [...]

Figure 2: Scatter plots of 2-d slices of the red wine and parkinsons data sets. Black points are real data, red are samples from the KCEF.

Table 3: p-values for the relative similarity test. Columns represent the p-values for testing whether samples from the KEF (resp. KCEF) model are closer to the data than samples from the KCEF (resp. NADE).
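Drawing the samples used in this comparison requires sampling from the fitted unnormalized conditionals; as noted in the introduction, the paper does this with Hamiltonian Monte Carlo (Neal, 2010; Strathmann et al., 2015). The sketch below is only a simplified stand-in: it uses a random-walk Metropolis chain instead of HMC on a generic log q_0(y) + T(x, y), and chains it over the factorization p(x_1) p(x_2|x_1) ··· to produce joint samples one coordinate at a time. Function names, step sizes and chain lengths are illustrative assumptions.

    import numpy as np

    def metropolis_conditional_sample(log_unnorm, x, n_steps=2000, step=0.5, init=0.0, seed=None):
        # One approximate draw of a scalar y ~ p(. | x) from an unnormalized log-density
        # log_unnorm(x, y) = log q_0(y) + T(x, y), using random-walk Metropolis
        # (a simpler stand-in for the HMC sampler used in the paper).
        rng = np.random.default_rng(seed)
        y = init
        logp = log_unnorm(x, y)
        for _ in range(n_steps):
            y_new = y + step * rng.standard_normal()
            logp_new = log_unnorm(x, y_new)
            if np.log(rng.uniform()) < logp_new - logp:
                y, logp = y_new, logp_new
        return y

    def ancestral_sample(log_unnorm_factors, seed=None):
        # Sample (x_1, ..., x_d) via the chain rule p(x_1) p(x_2 | x_1) ...,
        # where log_unnorm_factors[i](x_prev, y) scores candidate x_{i+1} = y
        # given the already-sampled coordinates x_prev = (x_1, ..., x_i).
        rng = np.random.default_rng(seed)
        sample = []
        for log_unnorm in log_unnorm_factors:
            x_prev = np.array(sample)
            sample.append(metropolis_conditional_sample(log_unnorm, x_prev, seed=rng))
        return np.array(sample)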

References

Bach, F., S. Lacoste-Julien, and G. Obozinski (2012). "On the equivalence between herding and conditional gradient algorithms." In: ICML.

Barron, A. and C.-H. Sheu (1991). "Approximation of density functions by sequences of exponential families." In: Annals of Statistics 19.3, pp. 1347–1369.

Bishop, Christopher M. (2006). Pattern Recognition and Machine Learning (Information Science and Statistics). Secaucus, NJ, USA: Springer-Verlag New York, Inc.

Bounliphone, Wacha, Eugene Belilovsky, Matthew B. Blaschko, Ioannis Antonoglou, and Arthur Gretton (2015). "A Test of Relative Similarity For Model Selection in Generative Models." In: CoRR. eprint: 1511.04581. url: https://arxiv.org/abs/1511.04581.

Canu, Stephane and Alex J. Smola (2006). "Kernel methods and the exponential family." In: Neurocomputing 69.7, pp. 714–720.

Caponnetto, A. and E. De Vito (2007). "Optimal Rates for the Regularized Least-Squares Algorithm." In: Foundations of Computational Mathematics 7.3, pp. 331–368. url: http://dx.doi.org/10.1007/s10208-006-0196-8.

Chen, Y., M. Welling, and A. Smola (2010). "Super-samples from Kernel-Herding." In: Proceedings of the Conference on Uncertainty in Artificial Intelligence.

Fan, Jianqing, Qiwei Yao, and Howell Tong (1996). "Estimation of conditional densities and sensitivity measures in nonlinear dynamical systems." In: Biometrika. url: https://ideas.repec.org/p/ehl/lserod/6704.html.

Fukumizu, Kenji (2009). "Exponential manifold by reproducing kernel Hilbert spaces." In: Algebraic and Geometric Methods in Statistics. Cambridge University Press, pp. 291–306.

Grunewalder, S., G. Lever, L. Baldassarre, S. Patterson, A. Gretton, and M. Pontil (2012). "Conditional mean embeddings as regressors." In: International Conference on Machine Learning (ICML).

Gu, C. and C. Qiu (1993). "Smoothing spline density estimation: Theory." In: Annals of Statistics 21.1, pp. 217–234.

Hall, Peter, Rodney C. L. Wolff, and Qiwei Yao (1999). "Methods for Estimating a Conditional Distribution Function." In: Journal of the American Statistical Association 94.445, pp. 154–163. url: http://www.jstor.org/stable/2669691.

Hyvärinen, Aapo (2005). "Estimation of non-normalized statistical models by score matching." In: Journal of Machine Learning Research 6, pp. 695–709.

Jordan, Michael I., ed. (1999). Learning in Graphical Models. Cambridge, MA, USA: MIT Press.

Kanagawa, M., Y. Nishiyama, A. Gretton, and K. Fukumizu (2016). "Filtering with State-Observation Examples via Kernel Monte Carlo Filter." In: Neural Computation 28.2, pp. 382–444.

Micchelli, Charles A. and Massimiliano A. Pontil (2005). "On Learning Vector-Valued Functions." In: Neural Computation 17.1, pp. 177–204. url: http://dx.doi.org/10.1162/0899766052530802.

Nagler, Thomas and Claudia Czado (2016). "Evading the Curse of Dimensionality in Nonparametric Density Estimation with Simplified Vine Copulas." In: Journal of Multivariate Analysis 151.C, pp. 69–89. url: https://doi.org/10.1016/j.jmva.2016.07.003.

Neal, Radford M. (2010). "MCMC using Hamiltonian dynamics." In: Handbook of Markov Chain Monte Carlo. eprint: 1206.1901. url: https://arxiv.org/abs/1206.1901.

Pearl, J. (2001). Causality: Models, Reasoning and Inference. Cambridge University Press.

Pistone, G. and C. Sempi (1995). "An infinite-dimensional geometric structure on the space of all the probability measures equivalent to a given one." In: Annals of Statistics 23.5, pp. 1543–1561.

Raiko, Tapani, Li Yao, Kyunghyun Cho, and Yoshua Bengio (2014). "Iterative Neural Autoregressive Distribution Estimator (NADE-k)." In: NIPS. eprint: 1406.1485. url: https://arxiv.org/abs/1406.1485.

Rasmussen, C. E. (2003). "Gaussian Processes to Speed up Hybrid Monte Carlo for Expensive Bayesian Integrals." In: Bayesian Statistics 7, pp. 651–659.

Silva, Ricardo, Charles Blundell, and Yee Whye Teh (2011). "Mixed Cumulative Distribution Networks." eprint: 1008.5386. url: https://arxiv.org/abs/1008.5386.

Song, L., B. Boots, S. Siddiqi, G. Gordon, and A. Smola (2010). "Hilbert Space Embeddings of Hidden Markov Models." In: International Conference on Machine Learning (ICML).

Sriperumbudur, Bharath, Kenji Fukumizu, Arthur Gretton, Aapo Hyvärinen, and Revant Kumar (2017). "Density Estimation in Infinite Dimensional Exponential Families." In: Journal of Machine Learning Research 18.57, pp. 1–59. url: http://jmlr.org/papers/v18/16-011.html.

Sriperumbudur, Bharath, Arthur Gretton, Kenji Fukumizu, Bernhard Schoelkopf, and Gert Lanckriet (2010). "Hilbert Space Embeddings and Metrics on Probability Measures." In: Journal of Machine Learning Research 11, pp. 1517–1561.

Strathmann, Heiko, Dino Sejdinovic, Samuel Livingstone, Zoltan Szabo, and Arthur Gretton (2015). "Gradient-free Hamiltonian Monte Carlo with Efficient Kernel Exponential Families." In: Advances in Neural Information Processing Systems 28. Ed. by C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett. Curran Associates, Inc., pp. 955–963. url: http://papers.nips.cc/paper/5890-gradient-free-hamiltonian-monte-carlo-with-efficient-kernel-exponential-families.pdf.

Sugiyama, Masashi, Ichiro Takeuchi, Taiji Suzuki, Takafumi Kanamori, Hirotaka Hachiya, and Daisuke Okanohara (2010). "Least-Squares Conditional Density Estimation." In: IEICE Transactions 93-D.3, pp. 583–594. url: http://search.ieice.org/bin/summary.php?id=e93-d_3_583.

Sutherland, Dougal J., Heiko Strathmann, Michael Arbel, and Arthur Gretton (2017). "Efficient and principled score estimation." eprint: 1705.08360. url: https://arxiv.org/abs/1705.08360.

Takeuchi, Ichiro, Quoc V. Le, Timothy D. Sears, and Alexander J. Smola (2006). "Nonparametric Quantile Estimation." In: Journal of Machine Learning Research 7, pp. 1231–1264. url: http://dl.acm.org/citation.cfm?id=1248547.1248592.

Takeuchi, Ichiro, Kaname Nomura, and Takafumi Kanamori (2009). "Nonparametric Conditional Density Estimation Using Piecewise-linear Solution Path of Kernel Quantile Regression." In: Neural Computation 21.2, pp. 533–559. url: http://dx.doi.org/10.1162/neco.2008.10-07-628.

Tang, Yichuan, Ruslan Salakhutdinov, and Geoffrey Hinton (2012). "Deep Mixtures of Factor Analysers." In: Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland. url: https://arxiv.org/abs/1206.4635.

Team, R Development Core (2008). The R Manuals.

Uria, Benigno, Marc-Alexandre Côté, Karol Gregor, Iain Murray, and Hugo Larochelle (2016). "Neural Autoregressive Distribution Estimation." In: Journal of Machine Learning Research 17. eprint: 1605.02226. url: https://arxiv.org/abs/1605.02226.

Uria, Benigno, Iain Murray, and Hugo Larochelle (2013). "RNADE: The real-valued neural autoregressive density-estimator." eprint: 1306.0186. url: https://arxiv.org/abs/1306.0186.

Zhang, Chong, Yufeng Liu, and Yichao Wu (2016). "On Quantile Regression in Reproducing Kernel Hilbert Spaces with the Data Sparsity Constraint." In: Journal of Machine Learning Research 17.40, pp. 1–45. url: http://jmlr.org/papers/v17/zhang16a.html.