Kernel Conditional Exponential Family

Kernel Conditional Exponential Family Michael Arbel Arthur Gretton Gatsby Computational Neuroscience Unit, University College London {michael.n.arbel, arthur.gretton}@gmail.com Abstract Hilbert space (RKHS): in this sense, like the Gaussian and Dirichlet processes, the kernel exponential family A nonparametric family of conditional dis- (KEF) is an infinite dimensional analogue of the finite tributions is introduced, which generalizes dimensional case, allowing to fit a much richer class of conditional exponential families using func- densities. While the maximum likelihood solution is tional parameters in a suitable RKHS. An ill-posed in infinite dimensions, Sriperumbudur et al. algorithm is provided for learning the gener- (2017) have demonstrated that it is possible to fit the alized natural parameter, and consistency of KEF via score matching (Hyvärinen, 2005), which en- n d n the estimator is established in the well spec- tails solving a linear system of size , where is d × ified case. In experiments, the new method the number of samples and is the problem dimension. generally outperforms a competing approach It is trivial to draw samples from such models using with consistency guarantees, and is competi- Hamiltonian Monte Carlo (Neal, 2010), since they di- tive with a deep conditional density model on rectly return the required potential energy (Rasmussen, datasets that exhibit abrupt transitions and 2003; Strathmann et al., 2015). In high dimensions, heteroscedasticity. fitting a KEF model to samples becomes challenging, however: the computational cost rises as d3, and com- plex interactions between dimensions can be difficult 1 Introduction to model. The complexity of the modelling task can be signifi- Distribution estimation is one of the most general prob- cantly reduced if a directed graphical model can be lems in machine learning. Once an estimator for a constructed over the variables, (Jordan, 1999; Pearl, distribution is learned, in principle, it allows to solve 2001), where each variable depends on a subset of par- a variety of problems such as classification, regression, ent variables (ideally much smaller than the total, as matrix completion and other prediction tasks. With in e.g. a Markov chain). In the present study, we the increasing diversity and complexity of machine extend the non-parametric family of Sriperumbudur learning problems, regressing the conditional mean of y et al., 2017 to fit conditional distributions. The nat- knowing x may not be sufficiently informative when the ural parameter of the conditional infinite exponential conditional density p(y x) is multimodal. In such cases, family is now an operator mapping the conditioning | one would like to estimate the conditional distribution variable to a function space of features of the condi- itself to get a richer characterization of the dependence tioned variable: for this reason, the score matching between the two variables y and x. In this paper, we framework must be generalised to the vector-valued address the problem of estimating conditional densities kernel regression setting of Micchelli et al., 2005. We when x and y are continuous and multi-dimensional. establish consistency in the well specified case by gen- eralising the arguments of Sriperumbudur et al., 2017 Our conditional density model builds on a generalisa- to the vector-valued RKHS. While our proof allows for tion of the exponential family to infinite dimensions general vector-valued RKHSs, we provide a practical (Barron et al., 1991; Canu et al., 2006; Fukumizu, 2009; implementation for a specific case, which takes the form Gu et al., 1993; Pistone et al., 1995), where the nat- of a linear system of size n d 1. ural parameter is a function in a reproducing kernel × A number of alternative approaches have been pro- posed to the problem of conditional density estimation. Proceedings of the 21st International Conference on Artifi- cial Intelligence and Statistics (AISTATS) 2018, Lanzarote, 1The code can be found at: Spain. PMLR: Volume 84. Copyright 2018 by the author(s). https://github.com/MichaelArbel/KCEF Kernel Conditional Exponential Family Sugiyama et al., 2010 introduced the Least-Square Con- depending on the parent variables through a deep neu- ditional Density Estimation (LS-CDE) method, which ral network. The network is trained on observed data provides an estimate of a conditional density function using stochastic gradient descent. Neural autoregres- p(y x) as a non-negative linear combination of basis sive networks have shown their effectiveness for a va- | functions. The method is proven to be consistent, and riety of practical cases and learning problems. Unlike works well on reasonably complicated learning prob- the earlier methods cited, however, consistency is not lems, although the optimal choice of basis functions guaranteed, and these methods require non-convex op- for the method is an open question (in their paper, timization, meaning that locally optimal solutions are the authors use Gaussians centered on the samples). found in practice. Earlier non-parametric methods such as variants of We begin our presentation in Section2, where we briefly Kernel Density Estimation (KDE) may also be used in present the Kernel Exponential Family. We generalise conditional density estimation (Fan et al., 1996; Hall this model to the conditional case, in our first major et al., 1999). These approaches also have consistency contribution: this requires the introduction of vector- guarantees, however their performance degrades in high- valued RKHSs and associated concepts. We then show dimensional settings (Nagler et al., 2016). Sugiyama that a generalisation of score matching may be used et al., 2010 found that kernel density approaches per- to fit the conditional density models for general vector formed less well in practice than LS-CDE. valued RKHS, subject to appropriate conditions. We Kernel Quantile Regression (KQR), introduced by call this model the kernel conditional exponential family (Takeuchi et al., 2006; Zhang et al., 2016), allows to pre- (KCEF). dict a percentile of the conditional distribution when y Our second contribution, in Section3, is an empiri- is one-dimensional. KQR is formulated as a convex op- cal estimator for the natural parameter of the KCEF timization problem with a unique global solution, and (Theorem1), with convergence guarantees in the well- the entire solution path with respect to the percentile specified case (Theorem2). In our experiments (Sec- parameter can be computed efficiently (Takeuchi et al., tion4), we empirically validate the consistency of the 2009). However, KQR applies only to one-dimensional estimator and compare it to other methods of con- outputs and, according to Sugiyama et al., 2010, the so- ditional density estimation. Our method generally lution path tracking algorithm tends to be numerically outperforms the leading alternative with consistency unstable in practice. guarantees (LS-CDE). Compared with the deep ap- It is possible to represent and learn conditional prob- proach (RNADE) which lacks consistency guarantees, abilities without specifying probability densities, via our method has a clear advantage at small training conditional mean embeddings (Grunewalder et al., 2012; sample sizes while being competitive at larger training Song et al., 2010). These are conditional expectations sizes. of (potentially infinitely many) features in an RKHS, which can be used in obtaining conditional expectations of functions in this RKHS. Such expected features are 2 Kernel Exponential Families complementary to the infinite dimensional exponential family, as they can be thought of as conditional ex- In this section we first present the kernel exponential pectations of an infinite dimensional sufficient statistic. family, which we then extend to a class of conditional This statistic can completely characterise the condi- exponential families. Finally, we provide a methodology tional distribution if the feature space is sufficiently for unnormalized density estimation within this class. rich (Sriperumbudur et al., 2010), and has consistency guarantees under appropriate smoothness assumptions. 2.1 Kernel Conditional Exponential Family Drawing samples given a conditional mean embedding can be challenging: this is possible via the Herding pro- We consider the task of estimating the density p(y) of cedure (Bach et al., 2012; Chen et al., 2010), as shown a random variable Y with support Rd from i.i.d in (Kanagawa et al., 2016), but requires a non-convex Y ⊆ samples (Y )n . We propose to use a family of densities optimisation procedure to be solved for each sample. i i=1 parametrized by functions belonging to a Reproducing A powerful and recent deep learning approach to mod- Kernel Hilbert Space (RKHS) with positive semi- HY elling conditional densities is the Neural Autoregressive definite kernel k (Canu et al., 2006; Fukumizu, 2009; Network (Uria et al., 2013, Raiko et al., 2014 and Uria Sriperumbudur et al., 2017). This exponential family et al., 2016). These networks can be thought of as a gen- of density functions takes the form eralization of the Mixture Density Network introduced by Bishop, 2006. In brief, each variable is represented exp f, k(y, .) p(y) := q (y) h iHY f , (1) by a mixture of Gaussians, with means and variances 0 Z(f) ∈ F Michael Arbel, Arthur Gretton where q is a base density function on and is where for all x and x0 , Γ(x, x0 ) is a bounded linear 0 Y F the set of functions in the RKHS space such that operator from to , i.e., Γ(x, x0 ) ( ). The HY HY HY ∈ L HY Z(f) := exp f, k(y, .) q0(y)dx < . In what fol- space is said to be generated by an operator valued h iHY ∞ H lows, we callY this family the kernel exponential family reproducing kernel Γ. One practical choice for Γ is to R (KEF) by analogy to classical exponential family. f define it as: plays the role of the natural parameter while k(y, .) is the sufficient statistic. Note that with an appro- Γ(x, x0 ) = k (x, x0 )I x, x0 , (3) X HY ∀ ∈ X priate choice of the base distribution q0 and a finite dimensional RKHS , one can recover any finite where I the identity operator on and k is HY HY HY X dimensional exponential family.

Kernel Conditional Exponential Family

Deep Kernel Learning

Kernel Methods As Before We Assume a Space X of Objects and a Feature Map Φ : X → RD

3.2 the Periodogram

Lecture 2: Density Estimation 2.1 Histogram

Density Estimation 36-708

Consistent Kernel Mean Estimation for Functions of Random Variables

Density and Distribution Estimation

Recurrent Kernel Networks

Nonparametric Density Estimation Histogram

APPLIED SMOOTHING TECHNIQUES Part 1: Kernel Density Estimation

Regression Discontinuity Design

6. Density Estimation 1. Cross Validation 2. Histogram 3. Kernel