
Learning Deep Kernels for Exponential Family Densities

Li K. Wenliang *1, Danica J. Sutherland *1, Heiko Strathmann 1, Arthur Gretton 1

*Equal contribution. 1Gatsby Computational Neuroscience Unit, University College London, London, U.K. Correspondence to: Li K. Wenliang, Danica J. Sutherland.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Abstract

The kernel exponential family is a rich class of distributions, which can be fit efficiently and with statistical guarantees by score matching. Being required to choose a priori a simple kernel such as the Gaussian, however, limits its practical applicability. We provide a scheme for learning a kernel parameterized by a deep network, which can find complex location-dependent features of the local data geometry. This gives a very rich class of density models, capable of fitting complex structures on moderate-dimensional problems. Compared to deep density models fit via maximum likelihood, our approach provides a complementary set of strengths and tradeoffs: in empirical studies, deep maximum-likelihood models can yield higher likelihoods, while our approach gives better estimates of the gradient of the log density, the score, which describes the distribution's shape.

1. Introduction

Density estimation is a foundational problem in statistics and machine learning (Devroye & Györfi, 1985; Wasserman, 2006), lying at the core of both supervised and unsupervised machine learning problems. Classical techniques such as kernel density estimation, however, struggle to exploit the structure inherent to complex problems, and thus can require unreasonably large sample sizes for adequate fits. For instance, assuming only twice-differentiable densities, the L2 risk of density estimation with N samples in D dimensions scales as O(N^(-4/(4+D))) (Wasserman, 2006, Section 6.5).

One promising approach for incorporating this necessary structure is the kernel exponential family (Canu & Smola, 2006; Fukumizu, 2009; Sriperumbudur et al., 2017). This model allows for any log-density which is suitably smooth under a given kernel, i.e. any function in the corresponding reproducing kernel Hilbert space. Choosing a finite-dimensional kernel recovers any classical exponential family, but when the kernel is sufficiently powerful the class becomes very rich: dense in the family of continuous probability densities on compact domains in KL, TV, Hellinger, and L^r distances (Sriperumbudur et al., 2017, Corollary 2). The normalization constant is not available in closed form, making fitting by maximum likelihood difficult, but the alternative technique of score matching (Hyvärinen, 2005) allows for practical usage with theoretical convergence guarantees (Sriperumbudur et al., 2017).

The choice of kernel directly corresponds to a smoothness assumption on the model, allowing one to design a kernel corresponding to prior knowledge about the target density. Yet explicitly deciding upon a kernel to incorporate that knowledge can be complicated. Indeed, previous applications of the kernel exponential family model have exclusively employed simple kernels, such as the Gaussian, with a small number of parameters (e.g. the length scale) chosen by heuristics or cross-validation (e.g. Sasaki et al., 2014; Strathmann et al., 2015). These kernels are typically spatially invariant, corresponding to a uniform smoothness assumption across the domain. Although such kernels are sufficient for consistency in the infinite-sample limit, the induced models can fail in practice on finite datasets, especially if the data takes differently-scaled shapes in different parts of the space. Figure 1 (left) illustrates this problem when fitting a simple mixture of Gaussians. Here there are two "correct" bandwidths, one for the broad mode and one for the narrow mode. A translation-invariant kernel must pick a single one, e.g. an average between the two, and any choice will yield a poor fit on at least part of the density.

[Figure 1: two panels plotting p(x) and k(·, x) against x for a two-component Gaussian mixture; left panel KEF-G, right panel DKEF, each showing the true pdf and slices of the kernel.]

Figure 1: Fitting few samples from a Gaussian mixture, using kernel exponential families. Black dotted lines show k(−1, x), k(0, x), and k(1, x). (Left) Using a location-invariant Gaussian kernel, the sharper component gets too much weight. (Right) A kernel parameterized by a neural network learns length scales that adapt to the density, giving a much better fit.

In this work, we propose to learn the kernel of an exponential family directly from data. We can then achieve far more than simply tuning a length scale, instead learning location-dependent kernels that adapt to the underlying shape and smoothness of the target density. We use kernels of the form

    k(x, y) = κ(φ(x), φ(y)),    (1)

where the deep network φ extracts features of the input and κ is a simple kernel (e.g. a Gaussian) on those features. These types of kernels have seen success in supervised learning (Wilson et al., 2016; Jean et al., 2018) and critic functions for implicit generative models (Li et al., 2017; Bellemare et al., 2017; Bińkowski et al., 2018; Arbel et al., 2018), among other settings. We call the resulting model a deep kernel exponential family (DKEF).

We can train both kernel parameters (including all the weights of the deep network) and, unusually, even regularization parameters directly on the data, in a form of meta-learning. Normally, directly optimizing regularization parameters would always yield 0, since their beneficial effect in preventing overfitting is by definition not seen on the training set. Here, though, we can exploit the closed-form fit of the kernel exponential family to optimize a "held-out" score (Section 3). Figure 1 (right) demonstrates the success of this model on the same mixture of Gaussians; here the learned, location-dependent kernel gives a much better fit.

We compare the results of our new model to recent general-purpose deep density estimators, primarily autoregressive models (Uria et al., 2013; Germain et al., 2015; van den Oord et al., 2016) and normalizing flows (Jimenez Rezende & Mohamed, 2015; Dinh et al., 2017; Papamakarios et al., 2017). These models learn deep networks with structures designed to compute normalized densities, and are fit via maximum likelihood. We explore the strengths and limitations of both deep likelihood models and deep kernel exponential families on a variety of datasets, including artificial data designed to illustrate scenarios where certain surprising problems arise, as well as benchmark datasets used previously in the literature. The models fit by maximum likelihood typically give somewhat higher likelihoods, whereas the deep kernel exponential family generally better fits the shape of the distribution.

2. Background

Score matching  Suppose we observe D = {x_n}_{n=1}^N, a set of independent samples x_n ∈ R^D from an unknown density p0(x). We posit a class of possible models {pθ}, parameterized by θ; our goal is to use the data to select some θ̂ such that p_θ̂ ≈ p0. The standard approach for selecting θ̂ is maximum likelihood: θ̂ = arg max_θ ∏_{n=1}^N pθ(x_n).

Many interesting model classes, however, are defined as pθ(x) = p̃θ(x)/Zθ, where the normalization constant Zθ = ∫ p̃θ(x) dx cannot be easily computed. In this setting, an optimization algorithm to estimate θ by maximum likelihood requires estimating (the derivative of) Zθ for each candidate θ considered during optimization. Moreover, the maximum likelihood solution may not even be well-defined when θ is infinite-dimensional (Barron & Sheu, 1991; Fukumizu, 2009). The intractability of maximum likelihood led Hyvärinen (2005) to propose an alternative objective, called score matching. Rather than maximizing the likelihood, one minimizes the Fisher divergence J(pθ ‖ p0):

    (1/2) ∫ p0(x) ‖∇_x log pθ(x) − ∇_x log p0(x)‖²₂ dx.    (2)

Under mild regularity conditions, this is equal to

    ∫ p0(x) Σ_{d=1}^D [ ∂²_d log pθ(x) + ½ (∂_d log pθ(x))² ] dx,    (3)

up to an additive constant depending only on p0, which can be ignored during training. Here ∂ⁿ_d f(x) denotes ∂ⁿ/(∂y_d)ⁿ f(y) evaluated at y = x. We can estimate (3) with Ĵ(pθ, D):

    (1/N) Σ_{n=1}^N Σ_{d=1}^D [ ∂²_d log pθ(x_n) + ½ (∂_d log pθ(x_n))² ].    (4)

Notably, (4) does not depend on Zθ, and so we can minimize it to find an unnormalized model p̃_θ̂ for p0. Score matching is consistent in the well-specified setting (Hyvärinen, 2005, Theorem 2), and was related to maximum likelihood by Lyu (2009), who argues it finds a fit robust to infinitesimal noise.
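To make (4) concrete, here is a minimal NumPy sketch (ours, not from the paper) of the empirical score-matching objective for a model whose per-dimension log-density derivatives are available in closed form; the function names and the toy unnormalized Gaussian are purely illustrative.

```python
import numpy as np

def score_matching_loss(X, grad_log_p, hess_diag_log_p):
    """Empirical score-matching objective (4): average over samples of
    sum_d [ d^2/dx_d^2 log p(x) + 0.5 * (d/dx_d log p(x))^2 ]."""
    g = grad_log_p(X)        # (N, D) first derivatives
    h = hess_diag_log_p(X)   # (N, D) diagonal second derivatives
    return np.mean(np.sum(h + 0.5 * g ** 2, axis=1))

# Toy unnormalized Gaussian: log p~(x) = -0.5 * ||x - mu||^2 / sigma^2,
# whose derivatives are known analytically; no normalizer is ever needed.
mu, sigma = 1.0, 2.0
grad = lambda X: -(X - mu) / sigma ** 2
hess = lambda X: np.full_like(X, -1.0 / sigma ** 2)

X = np.random.default_rng(0).normal(loc=1.0, scale=2.0, size=(500, 3))
print(score_matching_loss(X, grad, hess))  # approx -D / (2 sigma^2) when model matches data
```

Minimizing this quantity in the model parameters never requires evaluating Zθ.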

Unnormalized models p̃ are sufficient for many tasks (LeCun et al., 2006), including finding modes, approximating Hamiltonian Monte Carlo on targets without gradients (Strathmann et al., 2015), and learning discriminative features (Janzamin et al., 2014). If we require a normalized model, however, we can estimate the normalizing constant once, after estimating θ; this will be far more computationally efficient than estimating it at each step of an iterative maximum likelihood optimization algorithm.

Kernel exponential families  The kernel exponential family (Canu & Smola, 2006; Sriperumbudur et al., 2017) is the class of all densities satisfying a smoothness constraint: log p̃(x) = f(x) + log q0(x), where q0 is some fixed function and f is any function in the reproducing kernel Hilbert space H with kernel k. This class is an exponential family with natural parameter f and sufficient statistic k(x, ·), due to the reproducing property f(x) = ⟨f, k(x, ·)⟩_H:

    p̃_f(x) = exp(f(x)) q0(x) = exp(⟨f, k(x, ·)⟩_H) q0(x).

Using a simple finite-dimensional H, we can recover any standard exponential family, e.g. normal, gamma, or Poisson; if H is sufficiently rich, the family can approximate any continuous distribution with tails like q0 arbitrarily well (Sriperumbudur et al., 2017, Example 1 and Corollary 2). These models do not in general have a closed-form normalizer. For some f and q0, p̃_f may not even be normalizable, but if q0 is e.g. a Gaussian density, typical choices of φ and κ in (1) guarantee a normalizer exists (Appendix A).

Sriperumbudur et al. (2017) proved good statistical properties for choosing f ∈ H by minimizing a regularized form of (4), f̂ = arg min_{f∈H} Ĵ(p̃_f, D) + λ‖f‖²_H, but their algorithm has an impractical computational cost of O(N³D³). This can be alleviated with the Nyström-type "lite" approximation (Strathmann et al., 2015; Sutherland et al., 2018): select M inducing points z_m ∈ R^D, and select f ∈ H as

    f^k_{α,z}(x) = Σ_{m=1}^M α_m k(x, z_m),    p̃^k_{α,z} = p̃_{f^k_{α,z}}.    (5)

As the span of {k(·, z)}_{z∈R^D} is dense in H, this is a natural approximation, similar to classical RBF networks (Broomhead & Lowe, 1988). The "lite" model often yields excellent empirical results at much lower computational cost than the full estimator. We can regularize (4) in several ways and still find a closed-form solution for α. In this work, our loss Ĵ(f^k_{α,z}, λ, D) will be

    Ĵ(p̃^k_{α,z}, D) + (λα/2) ‖α‖² + (λC/(2N)) Σ_{n=1}^N Σ_{d=1}^D ( ∂²_d log p̃^k_{α,z}(x_n) )².

Sutherland et al. (2018) used a small λα for numerical stability but primarily regularized with λ_H ‖f^k_{α,z}‖²_H. As we change k, however, ‖f‖_H changes meaning, and we found empirically that this regularizer tends to harm the fit. The λC term was recommended by Kingma & LeCun (2010), encouraging the learned log-density to be smooth without much extra computation; it provides some empirical benefit in our context. Given k, z, and λ, Proposition 3 (Appendix B) shows we can find the optimal α by solving an M × M linear system in O(M²ND + M³) time: the α which minimizes Ĵ(f^k_{α,z}, λ, D) is

    α(λ, k, z, D) = −(G + λα I + λC U)⁻¹ b    (6)

    G_{m,m′} = (1/N) Σ_{n=1}^N Σ_{d=1}^D ∂_d k(x_n, z_m) ∂_d k(x_n, z_{m′})

    U_{m,m′} = (1/N) Σ_{n=1}^N Σ_{d=1}^D ∂²_d k(x_n, z_m) ∂²_d k(x_n, z_{m′})

    b_m = (1/N) Σ_{n=1}^N Σ_{d=1}^D [ ∂²_d k(x_n, z_m) + ∂_d log q0(x_n) ∂_d k(x_n, z_m) + λC ∂²_d log q0(x_n) ∂²_d k(x_n, z_m) ].
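The closed-form fit (6) is straightforward to implement. The following NumPy sketch is ours, not the paper's code: it assumes a plain Gaussian kernel, derivatives taken with respect to the data argument, a standard-normal q0, and illustrative parameter values.

```python
import numpy as np

def gauss_kernel_derivs(X, Z, sigma):
    """Per-dimension first and second derivatives (w.r.t. x) of
    k(x, z) = exp(-||x - z||^2 / (2 sigma^2)); both returned with shape (N, M, D)."""
    diff = X[:, None, :] - Z[None, :, :]                        # (N, M, D)
    k = np.exp(-0.5 * np.sum(diff ** 2, axis=2) / sigma ** 2)   # (N, M)
    dk = -diff / sigma ** 2 * k[:, :, None]
    d2k = (diff ** 2 / sigma ** 4 - 1.0 / sigma ** 2) * k[:, :, None]
    return dk, d2k

def fit_alpha(X, Z, sigma, lam_alpha, lam_C):
    """alpha(lambda, k, z, D) = -(G + lam_alpha I + lam_C U)^{-1} b, as in (6),
    with a standard-normal q0 so that d log q0 = -x and d^2 log q0 = -1."""
    N, D = X.shape
    M = Z.shape[0]
    dk, d2k = gauss_kernel_derivs(X, Z, sigma)
    dlogq0, d2logq0 = -X, -np.ones_like(X)

    G = np.einsum('nmd,npd->mp', dk, dk) / N     # sums over samples and dimensions
    U = np.einsum('nmd,npd->mp', d2k, d2k) / N
    b = (d2k.sum(axis=(0, 2))
         + np.einsum('nd,nmd->m', dlogq0, dk)
         + lam_C * np.einsum('nd,nmd->m', d2logq0, d2k)) / N

    return -np.linalg.solve(G + lam_alpha * np.eye(M) + lam_C * U, b)

X = np.random.default_rng(0).normal(size=(200, 2))
alpha = fit_alpha(X, Z=X[:10], sigma=1.0, lam_alpha=1e-3, lam_C=1e-3)
```

Because every step above is differentiable in the kernel and regularization parameters, the same construction can sit inside an automatic differentiation framework, which is what enables the meta-learning scheme of Section 3.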

3. Fitting Deep Kernels

All previous applications of score matching in the kernel exponential family of which we are aware (e.g. Sriperumbudur et al., 2017; Strathmann et al., 2015; Sun et al., 2015; Sutherland et al., 2018) have used kernels of the form k(x, y) = exp(−‖x − y‖²/(2σ²)) + r xᵀy + c, with kernel parameters and regularization weights either fixed a priori or selected via cross-validation. This simple form allows the various kernel derivatives required in (6) to be easily computed by hand, and the small number of parameters makes grid search adequate for model selection. But, as discussed in Section 1, these simple kernels are insufficient for complex datasets. Thus we wish to use a richer class of kernels {k_w}, with a large number of parameters w – in particular, kernels defined by a neural network. This prohibits model selection via simple grid search.

One could attempt to directly minimize Ĵ(f^{k_w}_{α,z}, λ, D) jointly in the kernel parameters w, the model parameters α, and perhaps the inducing points z. Consider, however, the case where we simply use a Gaussian kernel and {z_m} = D. Then we can achieve arbitrarily good values of (3) by taking σ → 0, drastically overfitting to the training set D.

We can avoid this problem – and additionally find the best values for the regularization weights λ – with a form of meta-learning. We find choices for the kernel and regularization which will give us a good value of Ĵ on a "validation set" D_v when fit to a fresh "training set" D_t. Specifically, we take stochastic gradient steps following ∇_{λ,w,z} Ĵ(p̃^{k_w}_{α(λ,k_w,z,D_t),z}, D_v). We can easily do this because we have a differentiable closed-form expression (6) for the fit to D_t, rather than having to e.g. back-propagate through an unrolled iterative optimization procedure. As we used small minibatches in this procedure, for the final fit we use the whole dataset: we first freeze w and z and find the optimal λ for the whole training data, then finally fit α with the new λ. This process is summarized in Algorithm 1.

Algorithm 1: Full training procedure
  input: dataset D; initial inducing points z, kernel parameters w, regularization λ = (λα, λC)
  Split D into D_1 and D_2;
  Optimize w, λ, z, and maybe q0 params:
    while Ĵ(p̃^{k_w}_{α(λ,k_w,z,D_1),z}, D_2) still improving do
      Sample disjoint data subsets D_t, D_v ⊂ D_1;
      f(·) = Σ_{m=1}^M α_m(λ, k_w, z, D_t) k_w(z_m, ·);
      Ĵ = (1/|D_v|) Σ_{n=1}^{|D_v|} Σ_{d=1}^D [ ∂²_d f(x_n) + ½ (∂_d f(x_n))² ];
      Take SGD step in Ĵ for w, λ, z, maybe q0 params;
    end
  Optimize λ for fitting on larger batches:
    while Ĵ(p̃^{k_w}_{α(λ,k_w,z,D_1),z}, D_2) still improving do
      f(·) = Σ_{m=1}^M α_m(λ, k_w, z, D_1) k_w(·, z_m);
      Sample subset D_v ⊂ D_2;
      Ĵ = (1/|D_v|) Σ_{n=1}^{|D_v|} Σ_{d=1}^D [ ∂²_d f(x_n) + ½ (∂_d f(x_n))² ];
      Take SGD steps in Ĵ for λ only;
    end
  Finalize α on D_1:
    Find α = α(λ, k_w, z, D_1);
  return: log p̃(·) = Σ_{m=1}^M α_m k_w(·, z_m) + log q0(·);

Computing kernel derivatives  Solving for α and computing the loss (4) require matrices of kernel second derivatives, but current deep learning-oriented automatic differentiation systems are not optimized for evaluating tensor-valued higher-order derivatives at once. We therefore implement backpropagation to compute G, U, and b of (6) as TensorFlow operations (Abadi et al., 2015) to obtain the scalar loss Ĵ, and used TensorFlow's automatic differentiation only to optimize w, z, λ, and q0 parameters.

Backpropagation to find these second derivatives requires explicitly computing the Hessians of intermediate layers of the network, which becomes quite expensive as the model grows; this limits the size of kernels that our model can use. A more efficient implementation based on Hessian-vector products might be possible in an automatic differentiation system with better support for matrix derivatives.

Kernel architecture  We will choose our kernel k_w(x, y) as a mixture of R Gaussian kernels with length scales σ_r, taking in features of the data extracted by a network φ_{w_r}(·):

    k_w(x, y) = Σ_{r=1}^R ρ_r exp( −(1/(2σ_r²)) ‖φ_{w_r}(x) − φ_{w_r}(y)‖² ).    (7)

Combining R components makes it easier to account for both short-range and long-range dependencies. We constrain ρ_r ≥ 0 to ensure a valid kernel, and Σ_{r=1}^R ρ_r = 1 for simplicity. The networks φ_w are made of L fully connected layers of width W. For L > 1, we found that adding a skip connection from data directly to the top layer speeds up learning. A softplus nonlinearity, log(1 + exp(x)), ensures that the model is twice-differentiable so (3) is well-defined.
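A minimal NumPy sketch of the kernel in (7) follows; it is ours, not the paper's implementation, and assumes a two-layer softplus network with a skip connection, with layer sizes, weights, ρ_r, and σ_r chosen purely for illustration.

```python
import numpy as np

def softplus(a):
    return np.logaddexp(0.0, a)   # log(1 + exp(a)), computed stably

def phi(X, params):
    """Softplus feature extractor with a skip connection from the input
    directly to the top layer."""
    W1, W2, W_skip = params
    return softplus(softplus(X @ W1) @ W2 + X @ W_skip)

def deep_kernel(X, Y, net_params, rho, sigmas):
    """k_w(x, y) = sum_r rho_r exp(-||phi_r(x) - phi_r(y)||^2 / (2 sigma_r^2)), as in (7)."""
    K = np.zeros((X.shape[0], Y.shape[0]))
    for params, r, s in zip(net_params, rho, sigmas):
        FX, FY = phi(X, params), phi(Y, params)
        sq = (FX ** 2).sum(1)[:, None] - 2 * FX @ FY.T + (FY ** 2).sum(1)[None, :]
        K += r * np.exp(-0.5 * sq / s ** 2)
    return K

rng = np.random.default_rng(0)
D, W, R = 2, 15, 3
nets = [(rng.normal(size=(D, W)), rng.normal(size=(W, W)), rng.normal(size=(D, W)))
        for _ in range(R)]
rho = np.array([0.5, 0.3, 0.2])        # rho_r >= 0 and sum to 1
sigmas = np.array([0.5, 1.0, 2.0])
K = deep_kernel(rng.normal(size=(5, D)), rng.normal(size=(4, D)), nets, rho, sigmas)
```

Every entry of K is a smooth, twice-differentiable function of its inputs, so the derivatives needed in (6) remain well-defined.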
Depending on the 2σ λα − 2 ˆ D J, and used TensorFlow’s automatic differentiation only to value of 2 , this ratio will often either be quite extreme, 2σ λα optimize w, z, λ, and q0 parameters. or nearly 1. It is unclear, however, how well this result Backpropagation to find these second derivatives requires generalizes to other settings. explicitly computing the Hessians of intermediate layers of A heuristic workaround when disjoint components are sus- the network, which becomes quite expensive as the model pected is as follows: run a clustering algorithm to identify grows; this limits the size of kernels that our model can use. disjoint components, separately fit a model to each clus- A more efficient implementation based on Hessian-vector ter, then weight each model according to its sample count. products might be possible in an automatic differentiation When the components are well-separated, this clustering is system with better support for matrix derivatives. straightforward, but it may be difficult in high-dimensional cases when samples are sparse but not fully separated. Kernel architecture We will choose our kernel kw(x, y) as a mixture of R Gaussian kernels with length scales σr, 3.2. Model Evaluation taking in features of the data extracted by a network φwr ( ): · In addition to qualitatively evaluating fits, we will evalu- R   X 1 2 ate our models with three quantitative criteria. The first ρr exp φw (x) φw (y) . (7) 2σ2 r r is the finite-set Stein discrepancy (FSSD; Jitkrittum et al., r=1 − r k − k Learning Deep Kernels for Exponential Family Densities

3.2. Model Evaluation

In addition to qualitatively evaluating fits, we will evaluate our models with three quantitative criteria. The first is the finite-set Stein discrepancy (FSSD; Jitkrittum et al., 2017), a measure of model fit which does not depend on the normalizer Zθ. It examines the fit of the model at B test locations V = {v_b}_{b=1}^B using a kernel l(·, ·), as (1/(DB)) Σ_{b=1}^B ‖ E_{x∼p0}[ l(x, v_b) ∇_x log p(x) + ∇_x l(x, v_b) ] ‖². With randomly selected V and some mild assumptions, it is zero if and only if p = p0. We use a Gaussian kernel with bandwidth equal to the median distance between test points,¹ and choose V by adding small Gaussian noise to data points. Jitkrittum et al. (2018) construct a hypothesis test to test which of p and p′ is closer to p0 in the FSSD. We will report a score – the p-value of this test – which is near 0 when model p is better, near 1 when model p′ is better, and around ½ when the two models are equivalent. We emphasize that we are using this as a model comparison score on an interpretable scale, but not following a hypothesis testing framework. Another similar performance measure is the kernel Stein discrepancy (KSD) (Chwialkowski et al., 2016), where the model's goodness-of-fit is evaluated at all test data points rather than at random test locations. We omit the results as they are essentially identical to that of the FSSD, even across a wide range of kernel bandwidths.

¹ This reasonable choice avoids tuning any parameters. We do not optimize the kernel or the test locations to avoid a situation in which model p is better than p0 in some respects but p0 better than p in others; instead, we use a simple default mode of comparison.

As all our models are twice-differentiable, we also compare the score-matching loss (4) on held-out test data. A lower score-matching loss implies a smaller Fisher divergence between model and data distributions.

Finally, we compare test log-likelihoods, using importance sampling estimates of the normalizing constant Zθ:

    Ẑθ = (1/U) Σ_{u=1}^U r_u,    where r_u := p̃θ(y_u)/q0(y_u),  y_u ∼ q0,

and so E Ẑθ = ∫ [p̃θ(y_u)/q0(y_u)] q0(y_u) dy_u = Zθ. Our log-likelihood estimate is log p̂θ(x) = log p̃θ(x) − log Ẑθ. This estimator is consistent, but Jensen's inequality tells us that E log p̂θ(x) > log pθ(x), so our evaluation will be over-optimistic. Worse, the variance of log Ẑθ can be misleadingly small when the bias is still quite large; we observed this in our experiments. We can, though, bound the bias:

Proposition 1. Suppose that a, s ∈ R are such that Pr(r_u ≥ a) = 1 and Pr(r_u ≤ s) ≤ ρ < ½. Define t := (s + a)/2, ψ(q, Zθ) := log(Zθ/q) + q/Zθ − 1, and let P := max(ψ(a, Zθ), ψ(t, Zθ)). Then

    | log Zθ − E log Ẑθ | ≤ ψ(t, Zθ) Var[r_u] / ((Zθ − t)² U) + P (4ρ(1 − ρ))^U.

(Proof in Appendix D.) We can find a, s because we propose from q0, and thus we can effectively estimate the bound (Appendix D.1). This estimate of the upper bound is itself biased upwards (Proposition 6), so it is likely, though not guaranteed, that the estimate overstates the amount of bias.
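A simplified sketch of this evaluation (ours, not the paper's code): estimate Ẑθ by importance sampling from q0 and subtract its log from the unnormalized log-density. The function names, the one-dimensional Gaussian check, and the sample count are illustrative only; the experiments in the paper use far more samples.

```python
import numpy as np

def importance_log_likelihood(log_p_tilde, sample_q0, log_q0, X, num_samples=10**5, seed=0):
    """log p_hat(x) = log p~(x) - log Z_hat, with Z_hat = mean of r_u = p~(y_u)/q0(y_u)."""
    rng = np.random.default_rng(seed)
    Y = sample_q0(rng, num_samples)        # y_u ~ q0
    log_r = log_p_tilde(Y) - log_q0(Y)     # log r_u
    m = log_r.max()
    log_Z_hat = m + np.log(np.exp(log_r - m).mean())   # stable log of the sample mean
    return log_p_tilde(X) - log_Z_hat

# Toy check: p~(y) = exp(-0.5 y^2) in 1-D, so Z = sqrt(2 pi) and log p(0) = -0.919.
log_p_tilde = lambda Y: -0.5 * Y[:, 0] ** 2
sample_q0 = lambda rng, n: rng.normal(scale=2.0, size=(n, 1))
log_q0 = lambda Y: -0.5 * Y[:, 0] ** 2 / 4.0 - 0.5 * np.log(2 * np.pi * 4.0)
print(importance_log_likelihood(log_p_tilde, sample_q0, log_q0, np.zeros((1, 1))))
```

As discussed above, the resulting log-likelihood estimate is biased upwards, which is why the bound of Proposition 1 is reported alongside it.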

3.3. Previous Attempts at Deep Score Matching

Kingma & LeCun (2010) used score matching to train a (one-layer) network to output an unnormalized log-density. This approach is essentially a special case of ours: use the kernel k_w(x, y) = φ_w(x) φ_w(y), where φ_w : R^D → R. Then the function f^{k_w}_{α,z}(x) from (5) is

    Σ_{m=1}^M α_m φ_w(z_m) φ_w(x) = [ Σ_{m=1}^M α_m φ_w(z_m) ] φ_w(x).

The scalar in brackets is fit analytically, so log p is determined almost entirely by the network φ_w plus log q0(x).

Saremi et al. (2018) recently also attempted parameterizing the unnormalized log-density as a deep network, using an approximation called Parzen score matching. This approximation requires a global constant bandwidth to define the Parzen window size for the loss function, fit to the dataset before learning the model. This is likely only sensible on datasets for which simple fixed-bandwidth kernel density estimation is appropriate; on more complex datasets, the loss may be poorly motivated. It also leads to substantial over-smoothing visible in their results. As they did not provide code for their method, we do not compare to it empirically.

4. Experimental Results

In our experiments, we compare to several alternative methods. The first group are all fit by maximum likelihood, and broadly fall into (at least) one of two categories: autoregressive models decompose p(x_1, ..., x_D) = ∏_{d=1}^D p(x_d | x_{<d}) and learn a parametric density model for each of these conditionals. Normalizing flows instead apply a series of invertible transformations to some simple initial density, say standard normal, and then compute the density of the overall model via the Jacobian of the transformation. We use implementations² of the following several models in these categories by Papamakarios et al. (2017):

MADE (Germain et al., 2015) masks the connections of an autoencoder so it is autoregressive. We use two hidden layers, and each conditional a Gaussian. MADE-MOG is the same but with each conditional a mixture of 10 Gaussians.

Real NVP (Dinh et al., 2017) is a normalizing flow; we use a general-purpose form for non-image datasets.

MAF (Papamakarios et al., 2017). A combination of a normalizing flow and MADE, where the base density is modeled by MADE, with 5 autoregressive layers. MAF-MOG instead models the base density by MADE-MOG.

² github.com/gpapamak/maf

For the models above, we use layers of width 30 for experiments on synthetic data, and 100 for benchmark datasets. Larger values did not improve performance.

KCEF (Arbel & Gretton, 2018). Inspired by autoregressive models, the density is modeled by a cascade of kernel conditional exponential family distributions, fit by score matching with Gaussian kernels.³

³ github.com/MichaelArbel/KCEF

DKEF. On synthetic datasets, we consider four variants of our model with one kernel component, R = 1. KEF-G refers to the model using a Gaussian kernel with a learned bandwidth. DKEF-G-15 has the kernel (7), with L = 3 layers of width W = 15. DKEF-G-50 is the same with W = 50. To investigate whether the top Gaussian kernel helps performance, we also train DKEF-L-50, whose kernel is k_θ(x, y) = φ_w(x) · φ_w(y), where φ_w has W = 50. To compare with the architecture of Kingma & LeCun (2010), DKEF-L-50-1 has the same architecture as DKEF-L-50 except that we add an extra layer with a single neuron, and use M = 1. In all experiments, q0(x) = ∏_{d=1}^D exp( −|x_d − µ_d|^{β_d} / (2σ_d²) ), with β_d > 1. On benchmark datasets, we use DKEF-G-50 and KEF-G with three kernel components, R = 3. Code for DKEF is at github.com/kevin-w-li/deep-kexpfam.

4.1. Behavior on Synthetic Datasets

We first demonstrate the behavior of the models on several two-dimensional synthetic datasets: Funnel, Banana, Ring, Square, Cosine, Mixture of Gaussians (MoG), and Mixture of Rings (MoR). Together, they cover a range of geometric complexities and multimodality.

We visualize the fit of various methods by showing the log density function in Figure 2. For each model fit on each distribution, we report the normalized log likelihood (left) and Fisher divergence (right). In general, the kernel score matching methods find cleaner boundaries of the distributions, and our main model DKEF-G-15 produces the lowest Fisher divergence on many of the synthetic datasets while maintaining high likelihoods.

Among the kernel exponential families, DKEF-G outperformed previous versions where ordinary Gaussian kernels are used for either joint (KEF-G) or autoregressive (KCEF) modeling. DKEF-G-50 does not substantially improve over DKEF-G-15; we omit it for space. We can gain additional insights into the model by looking at the shape of the learned kernel, shown by the colored lines; the kernels do indeed adapt to the local geometry.

DKEF-L-50 and DKEF-L-50-1 show good performance when the target density has simple geometries, but had trouble in more complex cases, even with much larger networks than used by DKEF-G-15. It seems that a Gaussian kernel with inducing points provides much stronger representational features than a linear kernel and/or a single inducing point. A large enough network φ_w would likely be able to perform the task well, but, using currently available software, the second derivatives in the score matching loss limit our ability to use very large networks. A similar phenomenon was observed by Bińkowski et al. (2018) in the context of GAN critics, where combining some analytical RKHS optimization with deep networks allowed much smaller networks to work well.

As expected, models fit by DKEFs generally have smaller Fisher divergences than likelihood-based methods. For Funnel and Banana, the true densities are simple transformations of Gaussians, and the normalizing flow models perform relatively well. But on Ring, Square, and Cosine, the shape of the learned distribution by likelihood-based methods exhibits noticeable artifacts, especially at the boundaries. These artifacts, particularly the "breaks" in Ring, may be caused by a normalizing flow's need to be invertible and smooth. The shape learned by DKEF-G-15 is much cleaner.

On multimodal distributions with disjoint components, likelihood-based and score matching-based methods show interesting failure modes. The likelihood-based models often connect separated modes with a "bridge", even for MADE-MOG and MAF-MOG which use explicit mixtures. On the other hand, DKEF is able to find the shapes of the components, but the weighting between modes is unstable. As suggested in Section 3.1, we also fit mixtures of all models (except KCEF) on a partition of MoR found by spectral clustering (Shi & Malik, 2000); DKEF-G-15 produced an excellent fit.

Another challenge we observed in our experiments is that the estimator of the objective function, Ĵ, tends to be more noisy as the model fit improves. This happens particularly on datasets where there are "sharp" features, such as Square (see Figure 4 in Appendix E.1), where the model's curvature becomes extreme at some points. This can cause higher variance in the gradients of the parameters, and more difficulty in optimization.

4.2. Results on Benchmark Datasets

Following recent work on density estimation (Uria et al., 2013; Arbel & Gretton, 2018; Papamakarios et al., 2017; Germain et al., 2015), we trained DKEF and the likelihood-based models on five UCI datasets (Dheeru & Karra Taniskidou, 2017); in particular, we used RedWine, WhiteWine, Parkinson, HepMass, and MiniBoone. All performances were measured on held-out test sets. We did not run KCEF due to its computational expense. Appendix E.2 gives further details.

Figure 2: Log densities learned by different models. Our model is DKEF-G-15 at the bottom row. Columns are different synthetic datasets. The rightmost column shows a mixture of each model (except KCEF) on the same clustering of MoR. We subtracted the maximum from each log density, and clipped the minimum value at −9. Above each panel are shown the average log-likelihoods (left) and Fisher divergence (right) on held-out data points. Bold indicates the best fit. For DKEF-G models, faint colored lines correspond to contours at 0.9 of the kernel evaluated at different locations.

[Figure 3: five columns of panels for RedWine (D=11), WhiteWine (D=11), Parkinsons (D=15), HepMass (D=22), and MiniBoone (D=43); rows show the FSSD² statistic, p-value, score matching loss, and log likelihood for DKEF, KEF-G, MADE, MADE-MOG, MAF, MAF-MOG, and NVP.]
Figure 3: Results on the real datasets; bars show medians, points show each of 15 individual runs, excluding invalid values. (1st row) The estimate of the squared FSSD, a measure of model goodness of fit based on derivatives of the log density; lower is better. (2nd row) The p-value of a test that each model is no better than DKEF in terms of the FSSD; values near 0 indicate that DKEF fits the data significantly better than the other model. (3rd row) Value of the loss (4); lower is better. (4th row) Log-likelihoods; higher is better. DKEF estimates are based on 10^10 samples for Ẑθ, with vertical lines showing the upper bound on the bias from Proposition 1 (which is often too small to be easily visible).

Figure 3 shows results. In gradient matching as measured by the FSSD, DKEF tends to have the best values. Test set sizes are too small to yield a confident p-value on the Wine datasets, but the model comparison test confidently favors DKEF on datasets with large-enough test sets. The FSSD results agree with KSD, which is omitted. In the score matching loss⁴ (3), DKEF is the best on Wine datasets and most runs on Parkinson, but worse on HepMass and MiniBoone. FSSD is a somewhat more "global" measure of shape, and is perhaps more weighted towards the bulk of the distribution rather than the tails.⁵ In likelihoods, DKEF is comparable to other methods except MADE on Wines but worse on the other, larger, datasets. Note that we trained DKEF while adding Gaussian noise with standard deviation 0.05 to the (whitened) dataset; training without noise improves the score matching loss but harms likelihood, while producing similar results for FSSD.

Results for models with a fixed Gaussian q0 were similar (Figure 6, in Appendix E.2).

⁴ The implementation of Papamakarios et al. (2017) sometimes produced NaN values for the required second derivatives, especially for MADE-MOG. We discarded those runs for these plots.
⁵ With a kernel approaching a Dirac delta, the FSSD² is similar to KSD² ≈ ∫ p0(x)² ‖∇ log p(x) − ∇ log p̃θ(x)‖² dx; compare to J = ½ ∫ p0(x) ‖∇ log p(x) − ∇ log p̃θ(x)‖² dx.

5. Discussion

Learning deep kernels helps make the kernel exponential family practical for large, complex datasets of moderate dimension. We can exploit the closed-form fit of the α vector to optimize kernel and even regularization parameters using a "held-out" loss, in a particularly convenient instance of meta-learning. We are thus able to find smoothness assumptions that fit our particular data, rather than arbitrarily choosing them a priori.

Computational expense makes score matching with deep kernels difficult to scale to models with large kernel networks, limiting the dimensionality of possible targets. Combining with the kernel conditional exponential family might help alleviate that problem by splitting the model up into several separate but marginally complex components. The kernel exponential family, and score matching in general, also struggles to correctly balance datasets with nearly-disjoint components, but it seems to generally learn density shapes better than maximum likelihood-based deep approaches.

Acknowledgments

This work was supported by the Gatsby Charitable Foundation. We thank Heishiro Kanagawa for helpful discussions.

References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

Arbel, M. and Gretton, A. Kernel conditional exponential family. In AISTATS, 2018.

Arbel, M., Sutherland, D. J., Bińkowski, M., and Gretton, A. On gradient regularizers for MMD GANs. In NeurIPS, 2018.

Arratia, R. and Gordon, L. Tutorial on large deviations for the binomial distribution. Bulletin of Mathematical Biology, 51:125–131, 1989.

Barron, A. and Sheu, C.-H. Approximation of density functions by sequences of exponential families. Annals of Statistics, 19(3):1347–1369, 1991.

Bellemare, M. G., Danihelka, I., Dabney, W., Mohamed, S., Lakshminarayanan, B., Hoyer, S., and Munos, R. The Cramer distance as a solution to biased Wasserstein gradients, 2017.

Bińkowski, M., Sutherland, D. J., Arbel, M., and Gretton, A. Demystifying MMD GANs. In ICLR, 2018.

Broomhead, D. S. and Lowe, D. Radial basis functions, multi-variable functional interpolation and adaptive networks. Technical report, Royal Signals and Radar Establishment Malvern (United Kingdom), 1988.

Canu, S. and Smola, A. J. Kernel methods and the exponential family. Neurocomputing, 69:714–720, 2006.

Chwialkowski, K., Strathmann, H., and Gretton, A. A kernel test of goodness of fit. In ICML, 2016.

Devroye, L. and Györfi, L. Nonparametric Density Estimation: The L1 View. John Wiley and Sons, 1985.

Dheeru, D. and Karra Taniskidou, E. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.

Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using Real NVP. In ICLR, 2017.

Fukumizu, K. Exponential manifold by reproducing kernel Hilbert spaces. In Gibilisco, P., Riccomagno, E., Rogantin, M.-P., and Winn, H. (eds.), Algebraic and Geometric Methods in Statistics, pp. 291–306. Cambridge University Press, 2009.

Germain, M., Gregor, K., Murray, I., and Larochelle, H. Masked autoencoder for distribution estimation. In ICML, 2015.

Hyvärinen, A. Estimation of non-normalized statistical models by score matching. JMLR, 6(Apr):695–709, 2005.

Janzamin, M., Sedghi, H., and Anandkumar, A. Score function features for discriminative learning: Matrix and tensor framework, 2014.

Jean, N., Xie, S., and Ermon, S. Semi-supervised deep kernel learning: Regression with unlabeled data by minimizing predictive variance. In NeurIPS, 2018.

Jimenez Rezende, D. and Mohamed, S. Variational inference with normalizing flows. In ICML, 2015.

Jitkrittum, W., Xu, W., Szabó, Z., Fukumizu, K., and Gretton, A. A linear-time kernel goodness-of-fit test. In NIPS, 2017.

Jitkrittum, W., Kanagawa, H., Szabó, Z., Sangkloy, P., Hays, J., Schölkopf, B., and Gretton, A. Informative features for model comparison. In NeurIPS, 2018.

Kingma, D. P. and Ba, J. L. Adam: A method for stochastic optimization. In ICLR, 2015.

Kingma, D. P. and LeCun, Y. Regularized estimation of image statistics by score matching. In NIPS, 2010.

LeCun, Y., Chopra, S., Hadsell, R., Ranzato, M., and Huang, F. A tutorial on energy-based learning. In Predicting Structured Data. MIT Press, 2006.

Li, C.-L., Chang, W.-C., Cheng, Y., Yang, Y., and Póczos, B. MMD GAN: Towards deeper understanding of moment matching network. In NIPS, 2017.

Liao, J. G. and Berg, A. Sharpening Jensen's inequality. The American Statistician, 2018.

Lyu, S. Interpretation and generalization of score matching. In Uncertainty in Artificial Intelligence, UAI '09, 2009.

Papamakarios, G., Pavlakou, T., and Murray, I. Masked autoregressive flow for density estimation. In NIPS, 2017.

Saremi, S., Mehrjou, A., Schölkopf, B., and Hyvärinen, A. Deep energy estimator networks, 2018.

Sasaki, H., Hyvärinen, A., and Sugiyama, M. Clustering via mode seeking by direct estimation of the gradient of a log-density. In ECML/PKDD, Lecture Notes in Computer Science Part III, vol. 8726, pp. 19–34, 2014.

Shi, J. and Malik, J. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.

Sriperumbudur, B., Fukumizu, K., Gretton, A., Hyvärinen, A., and Kumar, R. Density estimation in infinite dimensional exponential families. JMLR, 18(1):1830–1888, 2017.

Strathmann, H., Sejdinovic, D., Livingstone, S., Szabó, Z., and Gretton, A. Gradient-free Hamiltonian Monte Carlo with efficient kernel exponential families. In NIPS, 2015.

Sun, S., Kolar, M., and Xu, J. Learning structured densities via infinite dimensional exponential families. In NIPS, 2015.

Sutherland, D. J., Strathmann, H., Arbel, M., and Gretton, A. Efficient and principled score estimation with Nyström kernel exponential families. In AISTATS, 2018.

Uria, B., Murray, I., and Larochelle, H. RNADE: The real-valued neural autoregressive density-estimator. In NIPS, 2013.

van den Oord, A., Kalchbrenner, N., and Kavukcuoglu, K. Pixel recurrent neural networks. In ICML, 2016.

Wasserman, L. All of Nonparametric Statistics. Springer, 2006.

Wilson, A. G., Hu, Z., Salakhutdinov, R., and Xing, E. P. Deep kernel learning. In AISTATS, 2016.