Kernel Exponential Family Estimation via Doubly Dual Embedding

*Bo Dai (1), *Hanjun Dai (2), Arthur Gretton (3), Le Song (2), Dale Schuurmans (1), Niao He (4)
(1) Google Brain, (2) Georgia Tech, (3) UCL, (4) UIUC. * indicates equal contribution.

Abstract

We investigate penalized maximum log-likelihood estimation for exponential family distributions whose natural parameter resides in a reproducing kernel Hilbert space. Key to our approach is a novel technique, doubly dual embedding, that avoids computation of the partition function. This technique also allows the development of a flexible sampling strategy that amortizes the cost of Monte-Carlo sampling in the inference stage. The resulting estimator can be easily generalized to kernel conditional exponential families. We establish a connection between kernel exponential family estimation and MMD-GANs, revealing a new perspective for understanding GANs. Compared to score matching based estimators, the proposed method improves both memory and time efficiency while enjoying stronger statistical properties, such as fully capturing smoothness in its statistical convergence rate while the score matching estimator appears to saturate. Finally, we show that the proposed estimator empirically outperforms state-of-the-art methods in both kernel exponential family estimation and its conditional extension.

1 Introduction

The exponential family is one of the most important classes of distributions in statistics and machine learning. The exponential family possesses a number of useful properties (Brown, 1986) and includes many commonly used distributions with finite-dimensional natural parameters, such as the Gaussian, Poisson and multinomial distributions, to name a few. It is natural to consider generalizing the richness and flexibility of the exponential family to an infinite-dimensional parameterization via reproducing kernel Hilbert spaces (RKHS) (Canu and Smola, 2006).

Maximum log-likelihood estimation (MLE) has already been well-studied in the case of finite-dimensional exponential families, where desirable statistical properties such as asymptotic unbiasedness, consistency and asymptotic normality have been established. However, it is difficult to extend MLE to the infinite-dimensional case. Beyond the intractability of evaluating the partition function for a general exponential family, the necessary conditions for maximizing the log-likelihood might not be feasible; that is, there may exist no solution to the KKT conditions in the infinite-dimensional case (Pistone and Rogantin, 1999; Fukumizu, 2009). To address this issue, Barron and Sheu (1991); Gu and Qiu (1993); Fukumizu (2009) considered several ways to regularize the function space by constructing (a series of) finite-dimensional spaces that approximate the original RKHS, yielding a tractable estimator in the restricted finite-dimensional space. However, as Sriperumbudur et al. (2017) note, even with the finite-dimensional approximation, these algorithms are still expensive, as every update requires Monte-Carlo sampling from the current model to compute the partition function.

An alternative score matching based estimator has recently been introduced by Sriperumbudur et al. (2017). This approach replaces the Kullback-Leibler (KL) divergence with the Fisher divergence, defined by the expected squared distance between the score of the model (i.e., the derivative of the log-density) and the target distribution score (Hyvärinen, 2005). By minimizing the Tikhonov-regularized Fisher divergence, Sriperumbudur et al. (2017) develop a computable estimator for the infinite-dimensional exponential family that also obtains a consistency guarantee.
Recently, this method has been generalized to conditional infinite-dimensional exponential family estimation (Arbel and Gretton, 2017). Although score matching avoids computing the generally intractable integral, it requires computing and saving the first- and second-order derivatives of the reproducing kernel for each dimension on each sample. For n samples with d features, this results in O(n^2 d^2) memory and O(n^3 d^3) time cost respectively, which becomes prohibitive for datasets with moderate sizes of n and d. To alleviate this cost, Sutherland et al. (2017) utilize a rank-m Nyström approximation to the kernel matrix, reducing the memory and time complexity to O(nmd^2) and O(m^3 d^3) respectively. Although this reduces the cost dependence on the sample size, the dependence on d is unaffected, hence the estimator remains unsuitable for high-dimensional data. Estimating a general exponential family model by either MLE or score matching also generally requires some form of Monte-Carlo sampling, such as MCMC or HMC, to perform inference, which significantly adds to the computation cost.

In summary, both the MLE and score matching based estimators for the infinite-dimensional exponential family incur significant computational overhead, particularly in high-dimensional applications.

In this paper, we revisit penalized MLE for the kernel exponential family and propose a new estimation strategy. Instead of solving the log-likelihood equation directly, as in existing MLE methods, we exploit a doubly dual embedding technique that leads to a novel saddle-point reformulation of the MLE (along with its conditional distribution generalization) in Section 3. We then propose a stochastic algorithm for the new view of penalized MLE in Section 4. Since the proposed estimator is based on penalized MLE, it does not require the first- and second-order derivatives of the kernel, as in score matching, greatly reducing the memory and time cost. Moreover, since the saddle-point reformulation of penalized MLE avoids the intractable integral, the need for a Monte-Carlo step is also bypassed, thus accelerating the learning procedure. The approach also learns a flexibly parameterized sampler simultaneously, therefore further reducing the inference cost. We present the consistency and convergence rate of the new estimation strategy in the well-specified case, i.e., when the true density belongs to the kernel exponential family, together with an algorithm convergence guarantee, in Section 5. We demonstrate the empirical advantages of the proposed algorithm in Section 6, comparing to state-of-the-art estimators for both the kernel exponential family and its conditional extension.

2 Preliminaries

We first provide a preliminary introduction to the exponential family and Fenchel duality, which play vital roles in the derivation of the new estimator.

2.1 Exponential family

The natural form of the exponential family over \Omega with sufficient statistics f \in \mathcal{F} is defined as

    p_f(x) = p_0(x) \exp(f(x) - A(f)),    (1)

where A(f) := \log \int_\Omega \exp(f(x)) p_0(x) \, dx, x \in \Omega \subset \mathbb{R}^d, f \in \mathcal{F}, and \mathcal{F} := \{ f \in \mathcal{H} : \exp(A(f)) < \infty \}. In this paper, we mainly focus on the case where \mathcal{H} is a reproducing kernel Hilbert space (RKHS) with kernel k(x, x'), such that f(x) = \langle f, k(x, \cdot) \rangle_{\mathcal{H}}. However, we emphasize that the proposed algorithm in Section 4 can be easily applied to arbitrary differentiable function parametrizations, such as deep neural networks.

Given samples D = [x_i]_{i=1}^N, a model with a finite-dimensional parameterization can be learned via maximum log-likelihood estimation (MLE),

    \max_{f \in \mathcal{F}} \frac{1}{N} \sum_{i=1}^N \log p_f(x_i) = \hat{E}_D[f(x) + \log p_0(x)] - A(f),

where \hat{E}_D[\cdot] denotes the empirical expectation over D. This MLE is well-studied and has nice properties. However, the MLE is known to be "ill-posed" in the infinite-dimensional case, since the optimal solution might not be exactly achievable in the representable space. Therefore, penalized MLE has been introduced for the finite-dimensional (Dudík et al., 2007) and infinite-dimensional (Gu and Qiu, 1993; Altun and Smola, 2006) exponential families respectively. Such regularization essentially relaxes the moment matching constraints in terms of some norm, as shown later, guaranteeing the existence of a solution. In this paper, we also focus on MLE with RKHS norm regularization.

One useful theoretical property of MLE for the exponential family is convexity w.r.t. f.
With convex regularization, e.g., \|f\|_{\mathcal{H}}^2, one can use stochastic gradient descent to recover a unique global optimum. Let L(f) denote the RKHS-norm penalized log-likelihood, i.e.,

    L(f) := \frac{1}{N} \sum_{i=1}^N \log p_f(x_i) - \frac{\eta}{2} \|f\|_{\mathcal{H}}^2,    (2)

where \eta > 0 denotes the regularization parameter. The gradient of L(f) w.r.t. f can be computed as

    \nabla_f L = \hat{E}_D[\nabla_f f(x)] - \nabla_f A(f) - \eta f.    (3)

To calculate \nabla_f A(f) in (3), we denote Z(f) = \int_\Omega \exp(f(x)) p_0(x) \, dx and expand the partition function by definition,

    \nabla_f A(f) = \frac{1}{Z(f)} \nabla_f \int_\Omega \exp(f(x)) p_0(x) \, dx
                  = \int_\Omega \frac{p_0(x) \exp(f(x))}{Z(f)} \nabla_f f(x) \, dx
                  = E_{p_f(x)}[\nabla_f f(x)].    (4)

One can approximate E_{p_f(x)}[\nabla_f f(x)] by AIS/MCMC samples (Vembu et al., 2009), which leads to the Contrastive Divergence (CD) algorithm (Hinton, 2002).
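To make the cost in (3)-(4) concrete, the following minimal sketch (our own illustration, not code from the paper) evaluates a Gaussian-RBF kernel exponential family on the real line with a finite expansion f(x) = \sum_i \alpha_i k(z_i, x), and forms a self-normalized importance-sampling estimate of \nabla A(f) = E_{p_f}[k(\cdot, x)]. The inducing points z, the proposal p_0 = N(0, 2^2), and all constants are arbitrary placeholder choices.

```python
# Minimal numeric sketch (not from the paper): every MLE gradient step needs
# samples from the current model p_f to estimate grad A(f) = E_{p_f}[k(z_i, x)].
import numpy as np

rng = np.random.default_rng(0)

def rbf(x, z, sigma=1.0):
    # k(x, z) for 1-d inputs; x: (n,), z: (m,) -> (n, m)
    return np.exp(-(x[:, None] - z[None, :]) ** 2 / (2 * sigma ** 2))

z = np.linspace(-3, 3, 10)              # centers carrying a finite expansion of f
alpha = rng.normal(scale=0.1, size=10)

def f(x):
    return rbf(x, z) @ alpha            # f(x) = sum_i alpha_i k(z_i, x)

# p_f(x) is proportional to p_0(x) exp(f(x)), so estimate E_{p_f}[k(z_i, x)]
# by self-normalized importance sampling with proposal p_0 = N(0, 2^2).
x0 = rng.normal(scale=2.0, size=5000)
w = np.exp(f(x0))
w /= w.sum()
grad_A = w @ rbf(x0, z)                 # (10,) Monte-Carlo estimate of grad_alpha A(f)

# one stochastic gradient of the penalized log-likelihood (3) on toy data;
# the (eta/2)||f||_H^2 penalty has gradient eta * K * alpha in the coefficients.
data = rng.normal(loc=1.0, size=500)
eta = 0.1
grad_alpha = rbf(data, z).mean(axis=0) - grad_A - eta * (rbf(z, z) @ alpha)
print(grad_alpha)
```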

To avoid costly MCMC sampling in estimating the gradient, Sriperumbudur et al. (2017) construct an estimator based on score matching instead of MLE, which minimizes the penalized Fisher divergence. Plugging the kernel exponential family into the empirical Fisher divergence, the optimization reduces to

    J(f) := \frac{1}{2} \langle f, \hat{C} f \rangle_{\mathcal{H}} + \langle f, \hat{\xi} \rangle_{\mathcal{H}} + \frac{\eta}{2} \|f\|_{\mathcal{H}}^2,    (5)

where

    \hat{C} := \frac{1}{n} \sum_{i=1}^n \sum_{j=1}^d \partial_j k(x_i, \cdot) \otimes \partial_j k(x_i, \cdot),
    \hat{\xi} := \frac{1}{n} \sum_{i=1}^n \sum_{j=1}^d \left[ \partial_j k(x_i, \cdot) \, \partial_j \log p_0(x_i) + \partial_j^2 k(x_i, \cdot) \right].

As we can see, the score matching objective (5) is convex and does not involve the intractable integral in A(f). However, such an estimator requires the computation of first- and second-order derivatives of the kernel for each dimension on each data point, leading to memory and time costs of O(n^2 d^2) and O(n^3 d^3) respectively. This quickly becomes prohibitive for even modest n and d.

The same difficulty also appears in the score matching based estimator of Arbel and Gretton (2017) for the conditional exponential family, which is defined as

    p(y | x) = p_0(y) \exp(f(x, y) - A_x(f)), \quad f \in \mathcal{F},    (6)

where y \in \Omega_y \subset \mathbb{R}^p, x \in \Omega_x \subset \mathbb{R}^d, A_x(f) := \log \int_{\Omega_y} p_0(y) \exp(f(x, y)) \, dy, and \mathcal{F} is defined as \{ f \in \mathcal{H} : \exp(A_x(f)) < \infty, \ \forall x \}. We consider f : \Omega_x \times \Omega_y \to \mathbb{R} such that f(x, \cdot) lies in an RKHS \mathcal{H}_y for all x \in \Omega_x. Denoting T \in \mathcal{H}_{\Omega_x} : \Omega_x \to \mathcal{H}_y such that T_x(y) = f(x, y), we can derive its kernel function following Micchelli and Pontil (2005); Arbel and Gretton (2017). By the Riesz representation theorem, for all x \in \Omega_x and h \in \mathcal{H}_y there exists a linear operator \Gamma_x : \mathcal{H}_y \to \mathcal{H}_{\Omega_x} such that \langle h, T_x \rangle_{\mathcal{H}_y} = \langle T, \Gamma_x h \rangle_{\mathcal{H}_{\Omega_x}} for all T \in \mathcal{H}_{\Omega_x}. Then, the kernel can be defined by composing \Gamma_x with its dual, i.e., k(x, x') = \Gamma_x^* \Gamma_{x'}, and the function f(x, y) = T_x(y) = \langle T, \Gamma_x k(y, \cdot) \rangle. We follow these assumptions for the kernel conditional exponential family.

2.2 Convex conjugate and Fenchel duality

Denote by h(\cdot) a function \mathbb{R}^d \to \mathbb{R}; its convex conjugate function is defined as

    h^*(u) = \sup_{v \in \mathbb{R}^d} \{ u^\top v - h(v) \}.

If h(\cdot) is proper, convex and lower semicontinuous, the conjugate function h^*(\cdot) is also proper, convex and lower semicontinuous. Moreover, h and h^* are dual to each other, i.e., (h^*)^* = h. Such a relationship is known as Fenchel duality (Rockafellar, 1970; Hiriart-Urruty and Lemaréchal, 2012). Via the conjugate function, we can represent h as

    h(v) = \sup_{u \in \mathbb{R}^d} \{ v^\top u - h^*(u) \}.

The supremum is achieved if v \in \partial h^*(u), or equivalently u \in \partial h(v).
3 A Saddle-Point Formulation of Penalized MLE

As discussed in Section 2, the penalized MLE for the exponential family involves computing the log-partition functions, A(f) and A_x(f), which are intractable in general. In this section, we first introduce a saddle-point reformulation of the penalized MLE of the exponential family, using Fenchel duality to bypass the computation of the intractable log-partition function. This approach can also be generalized to the conditional exponential family. First, observe that we can rewrite the log-partition function A(f) via Fenchel duality as follows.

Theorem 1 (Fenchel dual of log-partition)

    A(f) = \max_{q \in \mathcal{P}} \langle q(x), f(x) \rangle_2 - KL(q \| p_0),    (7)
    p_f(x) = \arg\max_{q \in \mathcal{P}} \langle q(x), f(x) \rangle_2 - KL(q \| p_0),    (8)

where \langle f, g \rangle_2 := \int_\Omega f(x) g(x) \, dx, \mathcal{P} denotes the space of distributions, and KL(q \| p_0) := \int_\Omega q(x) \log \frac{q(x)}{p_0(x)} \, dx.

Proof. Denote l(q) := \langle q(x), f(x) \rangle_2 - KL(q \| p_0), which is strongly concave w.r.t. q; the optimal q^* is obtained by setting the derivative to zero, giving \log q^*(x) \propto f(x) + \log p_0(x). Since q^* \in \mathcal{P}, we have q^*(x) = p_0(x) \exp(f(x) - A(f)) = p_f(x), which leads to (8). Plugging q^* into l(q), we obtain the maximum \log \int_\Omega \exp(f(x)) p_0(x) \, dx, which is exactly A(f), leading to (7).

Therefore, invoking the Fenchel dual of A(f) in the penalized MLE, we arrive at a saddle-point optimization, i.e.,

    \max_{f \in \mathcal{F}} L(f) \ \propto \ \max_{f \in \mathcal{F}} \min_{q \in \mathcal{P}} \ \ell(f, q) := \hat{E}_D[f(x)] - E_{q(x)}[f(x)] - \frac{\eta}{2} \|f\|_{\mathcal{H}}^2 + \frac{1}{\lambda} KL(q \| p_0),    (9)

where \lambda > 0 weights the dual regularization. The saddle-point reformulation of the penalized MLE in (9) resembles the optimization in MMD GAN (Li et al., 2017; Bińkowski et al., 2018). In fact, the dual problem of the penalized MLE of the exponential family is a KL-regularized MMD GAN with a special design of the kernel family. Alternatively, if f is a Wasserstein-1 function, the optimization resembles the Wasserstein GAN (Arjovsky et al., 2017).
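Theorem 1 can be checked directly in the discrete case, where A(f) is a log-sum-exp. The short numeric sketch below (ours, using an arbitrary 6-state example) verifies that the dual objective \langle q, f \rangle - KL(q \| p_0) attains A(f) exactly at q = p_f, as stated in (7)-(8).

```python
# Quick numeric check (not from the paper) of Theorem 1 on a finite state space:
# A(f) = log sum_x p0(x) exp(f(x)) equals max_q <q, f> - KL(q||p0), attained at p_f.
import numpy as np

rng = np.random.default_rng(1)
p0 = rng.dirichlet(np.ones(6))          # base measure on 6 states
f = rng.normal(size=6)                  # potential values

A = np.log(np.sum(p0 * np.exp(f)))      # log-partition function
p_f = p0 * np.exp(f - A)                # candidate maximizer from (8)

def dual_value(q):
    return np.dot(q, f) - np.sum(q * np.log(q / p0))   # <q, f> - KL(q || p0)

assert np.isclose(dual_value(p_f), A)   # the dual objective attains A(f) at q = p_f
q_rand = rng.dirichlet(np.ones(6))      # any other distribution gives a smaller value
assert dual_value(q_rand) < A
print(A, dual_value(p_f), dual_value(q_rand))
```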

Next, we consider the duality properties.

Theorem 2 (weak and strong duality) Weak duality holds in general, i.e.,

    \max_{f \in \mathcal{F}} \min_{q \in \mathcal{P}} \ell(f, q) \le \min_{q \in \mathcal{P}} \max_{f \in \mathcal{F}} \ell(f, q).

Strong duality holds when \mathcal{F} is a closed RKHS and \mathcal{P} is the set of distributions with bounded L_2 norm, i.e.,

    \max_{f \in \mathcal{F}} \min_{q \in \mathcal{P}} \ell(f, q) = \min_{q \in \mathcal{P}} \max_{f \in \mathcal{F}} \ell(f, q).    (10)

Theorem 2 can be obtained by directly applying the minimax theorem (Ekeland and Temam, 1999)[Proposition 2.1]. We refer to the max-min problem in (10) as the primal problem, and to the min-max form as its dual.

Remark (connections to MMD GAN): Consider the dual problem with kernel learning, i.e.,

    \min_{q \in \mathcal{P}} \max_{f \in \mathcal{F}_\sigma, \sigma} \hat{E}_D[f(x)] - E_{q(x)}[f(x)] - \frac{\eta}{2} \|f\|_{\mathcal{H}}^2 + \frac{1}{\lambda} KL(q \| p_0),    (11)

where we involve the parameters \sigma of the kernel to be learned in the optimization. Setting the gradient \nabla_f \ell(f, q) = 0 for fixed q, we obtain the optimal witness function in the RKHS, f_q = \frac{1}{\eta} \left( \hat{E}_D[k(x, \cdot)] - E_q[k(x, \cdot)] \right), which leads to

    \min_{q \in \mathcal{P}} \max_{\sigma} \ \frac{1}{2\eta} \underbrace{\left( \hat{E}_D[k(x, x')] - 2 \hat{E}_D E_q[k(x, x')] + E_q[k(x, x')] \right)}_{MMD_\sigma^2(D, q)} + \frac{1}{\lambda} KL(q \| p_0).    (12)

This can be regarded as a KL-divergence regularized MMD GAN. Thus, with KL-divergence regularization, the MMD GAN learns an infinite-dimensional exponential family in an adaptive RKHS. Such a novel perspective bridges GANs and exponential family estimation, which appears to be of independent interest and potentially brings a new connection to the GAN literature for further theoretical development.

Remark (connections to Maximum Entropy Moment Matching): Altun and Smola (2006); Dudík et al. (2007) discuss the maximum entropy moment matching method for distribution estimation,

    \min_{q \in \mathcal{P}} KL(q \| p_0) \quad \text{s.t.} \quad \left\| E_q[k(x, \cdot)] - \hat{E}_D[k(x, \cdot)] \right\|_{\mathcal{H}} \le \eta',    (13)

whose dual problem reduces to the penalized MLE (2) with a proper choice of \eta' (Altun and Smola, 2006)[Lemma 6]. Interestingly, the proposed saddle-point formulation (9) shares the solution of (13). From the maximum entropy view, the penalty \|f\|_{\mathcal{H}} relaxes the moment matching constraints. However, the algorithms provided in Altun and Smola (2006); Dudík et al. (2007) simply ignore the difficulty of computing the expectation in A(f), which is not practical, especially when f is infinite-dimensional.
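The following sketch (ours, not the released implementation) spells out the quantities in the MMD GAN remark, equations (11)-(12): the closed-form RKHS witness f_q and the MMD^2/(2\eta) term, computed from samples of D and of a candidate dual distribution q. The bandwidth, \eta, and the toy Gaussian samples are placeholders, and the KL term is left as a placeholder precisely because a transport-mapped q provides no density value.

```python
# Minimal sketch of the witness f_q and the objective in (12), not the paper's code.
import numpy as np

def rbf_gram(x, y, sigma=1.0):
    d2 = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2(x_data, x_q, sigma=1.0):
    # biased V-statistic estimate of MMD^2 between the data and the dual samples
    kxx = rbf_gram(x_data, x_data, sigma).mean()
    kyy = rbf_gram(x_q, x_q, sigma).mean()
    kxy = rbf_gram(x_data, x_q, sigma).mean()
    return kxx - 2 * kxy + kyy

def witness(x_eval, x_data, x_q, eta=0.1, sigma=1.0):
    # f_q(x) = (1/eta) * (mean_i k(x_i, x) - mean_j k(x_q_j, x))
    return (rbf_gram(x_eval, x_data, sigma).mean(axis=1)
            - rbf_gram(x_eval, x_q, sigma).mean(axis=1)) / eta

rng = np.random.default_rng(2)
x_data = rng.normal(loc=1.0, size=(200, 2))
x_q = rng.normal(loc=0.0, size=(200, 2))   # samples from a candidate dual q
eta = 0.1
kl_term = 0.0                              # placeholder: KL(q||p0) needs q's density
objective = mmd2(x_data, x_q) / (2 * eta) + kl_term
print(objective, witness(x_data[:3], x_data, x_q, eta))
```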
3.1 Doubly Dual Embedding for MLE

Similar to Theorem 1, we can also represent A_x(f) by its Fenchel dual,

    A_x(f) = \max_{q(\cdot|x) \in \mathcal{P}} \langle q(y|x), f(x, y) \rangle_2 - KL(q \| p_0),    (14)
    p_f(y|x) = \arg\max_{q(\cdot|x) \in \mathcal{P}} \langle q(y|x), f(x, y) \rangle_2 - KL(q \| p_0).    (15)

Then, we can recover the penalized MLE for the conditional exponential family as

    \max_{f \in \mathcal{F}} \min_{q(\cdot|x) \in \mathcal{P}} \ \hat{E}_D[f(x, y)] - E_{q(y|x)}[f(x, y)] - \frac{\eta}{2} \|f\|_{\mathcal{H}}^2 + \frac{1}{\lambda} KL(q \| p_0).    (16)

The saddle-point reformulations of the penalized MLE in (9) and (16) bypass the difficulty in the partition function. Therefore, it is natural to consider learning the exponential family by solving the saddle-point problem (9) with an appropriately parametrized dual distribution q(x). This approach is referred to as the "dual embedding" technique in Dai et al. (2016), which requires:
i) the parametrization family should be flexible enough to reduce the extra approximation error;
ii) the parametrized representation should be able to provide density values.
As we will see in Section 5, the flexibility of the parametrization of the dual distribution has a significant effect on the consistency of the estimator.

One can of course use the kernel density estimator (KDE) as the dual distribution parametrization, which preserves convex-concavity. However, KDE easily fails when approximating high-dimensional data. Applying the reparametrization trick with a suitable class of probability distributions (Kingma and Welling, 2013; Rezende et al., 2014) is another alternative for parametrizing the dual distribution. However, the class of such parametrized distributions is typically restricted to simple known distributions, which might not be able to approximate the true solution, potentially leading to a large approximation bias. At the other end of the spectrum, the distribution family generated by a transport mapping is sufficiently flexible to model smooth distributions (Goodfellow et al., 2014; Arjovsky et al., 2017); however, the density value of such distributions, i.e., q(x), is not available for the KL-divergence computation, and is thus not applicable for parameterizing the dual distribution. Recently, flow-based parametrized density functions (Rezende and Mohamed, 2015; Kingma et al., 2016; Dinh et al., 2016) have been proposed to trade off flexibility against tractability. However, the expressive power of existing flow-based models remains restrictive even in our synthetic examples, and they cannot be directly applied to conditional models.

In short, transport mapping is very flexible for generating smooth distributions, but a major difficulty is that it lacks the ability to provide the density value q(x), making the computation of the KL-divergence impossible. To retain the flexibility of transport mapping while avoiding the calculation of q(x), we introduce doubly dual embedding, which achieves a delicate balance between flexibility and tractability.

First, noting that the KL-divergence is itself a convex function, we consider the Fenchel dual of the KL-divergence (Nguyen et al., 2008), i.e.,

    KL(q \| p_0) = \max_{\nu} E_q[\nu(x)] - E_{p_0}[\exp(\nu(x))] + 1,    (17)
    \log \frac{q(x)}{p_0(x)} = \arg\max_{\nu} E_q[\nu(x)] - E_{p_0}[\exp(\nu(x))] + 1,    (18)

with \nu(\cdot) : \Omega \to \mathbb{R}. One can see in the dual representation of the KL-divergence that introducing the auxiliary optimization variable \nu eliminates the explicit appearance of q(x) in (17), which makes the transport mapping parametrization for q(x) applicable. Since there is no extra restriction on \nu, we can use an arbitrary smooth function approximation for \nu(\cdot), such as kernel functions or neural networks.

Applying the dual representation of the KL-divergence to the dual problem of the saddle-point reformulation (9), we obtain the ultimate saddle-point optimization for the estimation of the kernel exponential family,

    \min_{q \in \mathcal{P}} \max_{f \in \mathcal{F}, \nu} \ \tilde{\ell}(f, \nu, q) := \hat{E}_D[f] - E_q[f] - \frac{\eta}{2} \|f\|_{\mathcal{H}}^2 + \frac{1}{\lambda} \left( E_q[\nu] - E_{p_0}[\exp(\nu)] \right).    (19)

Note that several other works have also applied Fenchel duality to the KL-divergence (Nguyen et al., 2008; Nowozin et al., 2016), but the approach taken here is different as it employs the dual of KL(q \| p_0), rather than KL(q \| D).

Remark (Extension for kernel conditional exponential family): Similarly, we can apply the doubly dual embedding technique to the penalized MLE of the kernel conditional exponential family, which leads to

    \min_{q(\cdot|x) \in \mathcal{P}} \max_{f \in \mathcal{F}, \nu} \ \hat{E}_D[f] - E_{q(y|x), x \sim D}[f] - \frac{\eta}{2} \|f\|_{\mathcal{H}}^2 + \frac{1}{\lambda} \left( E_{q(y|x), x \sim D}[\nu] - E_{p_0(y)}[\exp(\nu)] \right).

In summary, with the doubly dual embedding technique, we derive a saddle-point reformulation of the penalized MLE that bypasses the difficulty of handling the intractable partition function while allowing great flexibility in parameterizing the dual distribution.
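A minimal sketch of the key step, under toy assumptions that are ours (a one-dimensional affine transport map and a Gaussian pair whose log-ratio is known in closed form): the dual form (17) evaluates KL(q \| p_0) using only samples of q and p_0, and is tight at \nu^* = \log(q/p_0) as in (18).

```python
# Toy illustration (not the paper's code) of the Fenchel dual (17)-(18) of the KL:
# the density of q never appears explicitly, only samples drawn through a transport map.
import numpy as np

rng = np.random.default_rng(3)

def g_w(xi, w):
    # toy transport map pushing standard Gaussian noise to q = N(w[0], w[1]^2)
    return w[0] + w[1] * xi

def dual_value(nu, x_q, x_p0):
    # E_q[nu(x)] - E_p0[exp(nu(x))] + 1: a lower bound on KL(q||p0) for any nu,
    # tight at nu*(x) = log q(x)/p0(x) as in (18)
    return nu(x_q).mean() - np.exp(nu(x_p0)).mean() + 1.0

w = np.array([1.0, 0.5])
x_q = g_w(rng.normal(size=200000), w)
x_p0 = rng.normal(size=200000)                 # p0 = N(0, 1)

nu_star = lambda x: -1.5 * x**2 + 4.0 * x - 1.3069   # log q/p0 for this Gaussian pair
nu_bad = lambda x: 0.5 * x                            # an arbitrary sub-optimal critic

true_kl = np.log(1.0 / 0.5) + (0.5**2 + 1.0**2 - 1.0) / 2.0
print(true_kl)                                 # ~0.818, closed-form KL(N(1,0.25)||N(0,1))
print(dual_value(nu_star, x_q, x_p0))          # ~0.818: the dual attains the KL at nu*
print(dual_value(nu_bad, x_q, x_p0))           # smaller: any other nu only lower-bounds it
```

In the algorithm below, \nu is of course not known in closed form; it is optimized jointly with f in the inner loop.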
4 Practical Algorithm

Algorithm 1 Doubly Dual Embedding-SGD for the saddle-point reformulation of the MLE (19)
1: for l = 1, ..., L do
2:   Compute (f, \nu) by Algorithm 2.
3:   Sample \{\xi_b \sim p(\xi)\}_{b=1}^B.
4:   Generate x_b = g_{w_g}(\xi_b) for b = 1, ..., B.
5:   Compute the stochastic approximation of \nabla_{w_g} L(w_g) by (23).
6:   Update w_g^{l+1} = w_g^l - \rho_l \nabla_{w_g} L(w_g).
7: end for
8: Output w_g and f.

Algorithm 2 Stochastic Functional Gradients for f and \nu
1: for k = 1, ..., K do
2:   Sample \xi \sim p(\xi).
3:   Generate x = g_{w_g}(\xi).
4:   Sample x' \sim p_0(x).
5:   Compute the stochastic functional gradients w.r.t. f and \nu with (20) and (21).
6:   Update f_k and \nu_k with (22).
7: end for
8: Output (f_K, \nu_K).

In this section, we introduce the transport mapping parametrization for the dual distribution q(x), and apply stochastic gradient descent to solve the optimization problem (19) and its conditional counterpart. For simplicity of exposition, we only illustrate the algorithm for (19); it can easily be applied to the conditional case.

Denote the parameters in the transport mapping for q(x) by w_g, such that \xi \sim p(\xi) and x = g_{w_g}(\xi). We illustrate the algorithm with a kernel-parametrized \nu(\cdot). There are many alternative choices for parametrizing \nu, e.g., neural networks—the proposed algorithm remains applicable as long as the parametrization is differentiable. We abuse notation somewhat by writing \tilde{\ell}(f, \nu, w_g) for \tilde{\ell}(f, \nu, q) in (19). With such a parametrization, (19) is in general an upper bound of the penalized MLE by Theorem 2.

With the kernel-parametrized (f, \nu), the inner maximization over f and \nu is a standard concave optimization, and we can solve it with existing algorithms to reach the global optimum. Because of the expectations involved, we use stochastic functional gradient descent for scalability. Given (f, \nu) \in \mathcal{H}, following the definition of functional gradients (Kivinen et al., 2004; Dai et al., 2014), we have

    \zeta_f(\cdot) := \nabla_f \tilde{\ell}(f, \nu, w_g) = \hat{E}_D[k(x, \cdot)] - E_\xi[k(g_{w_g}(\xi), \cdot)] - \eta f(\cdot),    (20)
    \zeta_\nu(\cdot) := \nabla_\nu \tilde{\ell}(f, \nu, w_g) = \frac{1}{\lambda} \left( E_\xi[k(g_{w_g}(\xi), \cdot)] - E_{p_0}[\exp(\nu(x)) k(x, \cdot)] \right).    (21)

In the k-th iteration, given samples x \sim D, \xi \sim p(\xi), and x' \sim p_0(x), the update rules for f and \nu are

    f_{k+1}(\cdot) = (1 - \eta \tau_k) f_k(\cdot) + \tau_k \left( k(x, \cdot) - k(g_{w_g}(\xi), \cdot) \right),
    \nu_{k+1}(\cdot) = \nu_k(\cdot) + \frac{\tau_k}{\lambda} \left( k(g_{w_g}(\xi), \cdot) - \exp(\nu_k(x')) k(x', \cdot) \right),    (22)

where \tau_k denotes the step-size.
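The sketch below is our own simplified rendering of the updates in (22): f and \nu are maintained as growing kernel expansions, and each iteration consumes one data sample, one sample from the transport map, and one sample from p_0. The transport map is frozen here for clarity; in Algorithm 1 its parameters w_g are updated in the outer loop using (23).

```python
# Minimal sketch of Algorithm 2's functional updates (22); not the released code.
import numpy as np

rng = np.random.default_rng(4)
sigma, eta, lam = 1.0, 0.1, 1.0

def k(x, centers):
    return np.exp(-(x - np.asarray(centers)) ** 2 / (2 * sigma ** 2))

class KernelFunc:
    """f(x) = sum_t beta_t k(c_t, x), stored by its centers and coefficients."""
    def __init__(self):
        self.centers, self.coefs = [], []
    def __call__(self, x):
        if not self.centers:
            return 0.0
        return float(np.dot(self.coefs, k(x, self.centers)))
    def scale(self, s):
        self.coefs = [s * b for b in self.coefs]
    def add(self, center, coef):
        self.centers.append(center)
        self.coefs.append(coef)

f, nu = KernelFunc(), KernelFunc()
data = rng.normal(loc=1.0, size=1000)
g_w = lambda xi: 1.0 + 0.5 * xi            # transport map, frozen for this sketch

for t in range(1, 201):
    tau = 1.0 / t                          # diminishing step-size tau_k
    x = rng.choice(data)                   # x ~ D
    x_gen = g_w(rng.normal())              # x = g_w(xi), xi ~ p(xi)
    x0 = rng.normal()                      # x' ~ p0
    nu_x0 = nu(x0)                         # evaluate nu_k before updating it
    # f_{k+1} = (1 - eta*tau) f_k + tau (k(x, .) - k(g_w(xi), .))
    f.scale(1.0 - eta * tau)
    f.add(x, tau)
    f.add(x_gen, -tau)
    # nu_{k+1} = nu_k + (tau/lam) (k(g_w(xi), .) - exp(nu_k(x')) k(x', .))
    nu.add(x_gen, tau / lam)
    nu.add(x0, -(tau / lam) * np.exp(nu_x0))

print(f(1.0), f(-2.0), nu(1.0))
```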

Next, we consider the update rule for the parameters of the dual transport mapping embedding.

Theorem 3 (Dual gradient) Denoting (f^*_{w_g}, \nu^*_{w_g}) = \arg\max_{(f, \nu) \in \mathcal{H}} \tilde{\ell}(f, \nu, w_g) and L(w_g) = \tilde{\ell}(f^*_{w_g}, \nu^*_{w_g}, w_g), we have

    \nabla_{w_g} L(w_g) = - E_\xi\left[ \nabla_{w_g} f^*_{w_g}(g_{w_g}(\xi)) \right] + \frac{1}{\lambda} E_\xi\left[ \nabla_{w_g} \nu^*_{w_g}(g_{w_g}(\xi)) \right].    (23)

Proof details are given in Appendix A.1. With this gradient estimator, we can apply stochastic gradient descent to update w_g iteratively. We summarize the updates for (f, \nu) and w_g in Algorithms 1 and 2.

Compared to score matching based estimators (Sriperumbudur et al., 2017; Sutherland et al., 2017), although convexity no longer holds for the saddle-point estimator, the doubly dual embedding estimator avoids representing f with derivatives of the kernel, thus avoiding the memory cost dependence on the square of the dimension. In particular, the proposed estimator for f based on (19) reduces the memory cost from O(n^2 d^2) to O(K^2), where K denotes the number of iterations in Algorithm 2. In terms of time cost, we exploit stochastic updates, which are naturally suited to large-scale datasets and avoid the matrix inversion in the score matching estimator, whose cost is O(n^3 d^3). We also learn the dual distribution simultaneously, which can be used directly to generate samples from the exponential family at inference time, thus saving the cost of Monte-Carlo sampling in the inference stage. For a detailed discussion of the computation cost, please refer to Appendix D.

Remark (random feature extension): Memory cost is a well-known bottleneck for applying kernel methods to large-scale problems. When we set K = n, the memory cost becomes O(n^2), which is prohibitive for millions of data points. Random feature approximation (Rahimi and Recht, 2008; Dai et al., 2014; Bach, 2015) can be utilized for scaling up kernel methods. The proposed Algorithm 1 and Algorithm 2 are also compatible with random feature approximation, and hence applicable to large-scale problems in the same way. With r random features, we can further reduce the memory cost of storing f to O(rd). However, even with a random feature approximation, the score matching based estimator still requires O(rd^2) memory. One can also learn the random features by backpropagation, which leads to the neural network extension. Due to space limitations, details of this variant of Algorithm 1 are given in Appendix B.
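As a concrete instance of this remark, the sketch below uses standard random Fourier features (Rahimi and Recht, 2008) to approximate the Gaussian kernel, so that f(x) = \theta^\top \phi(x) is stored in O(rd) memory regardless of the number of functional updates. This is only the generic construction, not the specific variant described in Appendix B.

```python
# Random Fourier features approximating k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)).
import numpy as np

rng = np.random.default_rng(5)
d, r, sigma = 2, 256, 1.0

W = rng.normal(scale=1.0 / sigma, size=(r, d))   # spectral frequencies of the RBF kernel
b = rng.uniform(0.0, 2 * np.pi, size=r)

def phi(x):
    # x: (n, d) -> (n, r) random feature map with k(x, x') ~ phi(x)^T phi(x')
    return np.sqrt(2.0 / r) * np.cos(x @ W.T + b)

x = rng.normal(size=(5, d))
y = rng.normal(size=(7, d))
approx = phi(x) @ phi(y).T
exact = np.exp(-np.sum((x[:, None] - y[None]) ** 2, axis=-1) / (2 * sigma ** 2))
print(np.max(np.abs(approx - exact)))            # shrinks as r grows

theta = np.zeros(r)                              # f(x) = theta @ phi(x): O(r d) storage
f_vals = phi(x) @ theta
```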
5 Theoretical Analysis

In this section, we first provide an analysis of the consistency and sample complexity of the proposed estimator based on the saddle-point reformulation of the penalized MLE in the well-specified case, where the true density is assumed to lie in the kernel (conditional) exponential family, following Sutherland et al. (2017); Arbel and Gretton (2017). Then, we consider the convergence property of the proposed algorithm. We mainly focus on the analysis for p(x); the results can be easily extended to the kernel conditional exponential family p(y|x).

5.1 Statistical Consistency

We explicitly account for the approximation error from the dual embedding in the consistency of the proposed estimator. We first establish some notation that will be used in the analysis. For simplicity, we set \lambda = 1 and (improperly) p_0(x) = 1 in the exponential family expression (1). We denote by p^*(x) = \exp(f^*(x) - A(f^*)) the ground-truth distribution with true potential function f^*, and by (\tilde{f}, \tilde{q}, \tilde{\nu}) the optimal primal and dual solutions of the saddle-point reformulation of the penalized MLE (9). The parametrized dual space is denoted \mathcal{P}_w. We write p_{\tilde{f}} := \exp(\tilde{f} - A(\tilde{f})) for the exponential family distribution generated by \tilde{f}. We have the following consistency result.

Theorem 4 Assume the spectrum of the kernel k(\cdot, \cdot) decays sufficiently homogeneously at rate l^{-r}. Under some further mild assumptions listed in Appendix A.2, as \eta \to 0 and n \eta^{1/r} \to \infty we have

    KL(p^* \| p_{\tilde{f}}) + KL(p_{\tilde{f}} \| p^*) = O_{p^*}\left( n^{-1} \eta^{-1/r} + \eta + \epsilon_{approx}^2 \right),

where \epsilon_{approx} := \sup_{f \in \mathcal{F}} \inf_{q \in \mathcal{P}_w} \| p_f - q \|_{p^*}. Therefore, setting \eta = O(n^{-r/(1+r)}), p_{\tilde{f}} converges to p^* in terms of Jensen-Shannon divergence at rate O_{p^*}(n^{-r/(1+r)} + \epsilon_{approx}^2).

For the details of the assumptions and the proof, please refer to Appendix A.2. Recalling the connection between the proposed model and MMD GAN discussed in Section 3, Theorem 4 also provides a learnability guarantee for a class of GAN models as a byproduct.

The most significant difference of the bound provided above, compared to Gu and Qiu (1993); Altun and Smola (2006), is the explicit consideration of the bias from the dual parametrization. Moreover, instead of the Rademacher complexity used in the sample complexity results of Altun and Smola (2006), our result exploits the spectral decay of the kernel, which is more directly connected to properties of the RKHS.

From Theorem 4 we can clearly see the effect of the parametrization of the dual distribution: if the parametric family of the dual distribution is simple, the optimization of the saddle-point problem may become easy, but \epsilon_{approx} will dominate the error. The other extreme is to also use the kernel exponential family to parametrize the dual distribution; then \epsilon_{approx} reduces to 0, but the optimization becomes difficult to handle. The saddle-point reformulation thus provides the opportunity to balance the difficulty of the optimization against the approximation error.

The statistical consistency rates of our estimator and of the score matching estimator (Sriperumbudur et al., 2017) are derived under different assumptions, so they are not directly comparable. However, since smoothness is not fully captured by the score matching based estimator of Sriperumbudur et al. (2017), it only achieves O(n^{-2/3}) even if f is infinitely smooth. In contrast, when the dual distribution parametrization is relatively flexible, i.e., \epsilon_{approx} is negligible, and the spectral decay rate of the kernel satisfies r \to \infty, the proposed estimator converges at rate O(n^{-1}), which is significantly more efficient than the score matching method.

5.2 Algorithm Convergence

It is well-known that stochastic gradient descent converges for saddle-point problems with the convex-concave property (Nemirovski et al., 2009). However, to obtain a better dual parametrization that reduces \epsilon_{approx} in Theorem 4, we parameterize the dual distribution with a nonlinear transport mapping, which breaks convexity. In fact, by Theorem 3 we obtain an unbiased gradient w.r.t. w_g. Therefore, the proposed Algorithm 1 can be understood as applying stochastic gradient descent to the non-convex dual minimization problem, i.e., \min_{w_g} L(w_g) := \tilde{\ell}(f^*_{w_g}, \nu^*_{w_g}, w_g). From this view, we can prove a sublinear convergence rate to a stationary point with a diminishing stepsize, following Ghadimi and Lan (2013); Dai et al. (2017). We list the result below for completeness.
Theorem 5 Assume that the parametrized objective L(w_g) is C-Lipschitz and the variance of its stochastic gradient is bounded by \sigma^2. Let the algorithm run for L iterations with stepsize \rho_l = \min\{1/L, D'/\sqrt{L}\} for some D' > 0 and output w_g^1, ..., w_g^L. Set the candidate solution \bar{w}_g to be randomly chosen from w_g^1, ..., w_g^L such that P(\bar{w}_g = w_g^j) = (2\rho_j - C\rho_j^2) / \sum_{j=1}^L (2\rho_j - C\rho_j^2). Then it holds that

    E\left[ \| \nabla L(\bar{w}_g) \|^2 \right] \le \frac{C D^2}{L} + \left( D' + \frac{D^2}{D'} \right) \frac{\sigma}{\sqrt{L}},

where D := \sqrt{2 (L(w_g^1) - \min_{w_g} L(w_g)) / L} represents the distance of the initial solution to the optimal solution.

The above result implies that, under the chosen parametrization of f, \nu and g, the proposed Algorithm 1 converges sublinearly to a stationary point, at a rate that depends on the smoothing parameter.

6 Experiments

In this section, we compare the proposed doubly dual embedding (DDE) with the current state-of-the-art score matching estimators for the kernel exponential family (KEF) (Sutherland et al., 2017) [1] and its conditional extension (KCEF) (Arbel and Gretton, 2017) [2], respectively, as well as several competitors. We test the proposed estimator empirically following their settings. We use the Gaussian RBF kernel for both the exponential family and its conditional extension, k(x, x') = \exp(-\|x - x'\|_2^2 / \sigma^2), with the bandwidth \sigma set by the median-trick (Dai et al., 2014). For a fair comparison, we follow Sutherland et al. (2017) and Arbel and Gretton (2017) to set p_0(x) for the kernel exponential family and its conditional extension, respectively. The dual variables are parametrized by an MLP with 5 layers. More implementation details can be found in Appendix E and in the code repository, which is available at https://github.com/Hanjun-Dai/dde.

[1] https://github.com/karlnapf/nystrom-kexpfam
[2] https://github.com/MichaelArbel/KCEF
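For completeness, a minimal sketch of the median-trick bandwidth selection mentioned above (a standard heuristic; the exact preprocessing in the released code may differ):

```python
# Median heuristic: set sigma to the median pairwise distance of (a subsample of)
# the training points, then use k(x, x') = exp(-||x - x'||^2 / sigma^2) as in the text.
import numpy as np

def median_bandwidth(x, max_points=1000, seed=0):
    rng = np.random.default_rng(seed)
    if x.shape[0] > max_points:                       # subsample to keep cost quadratic in max_points
        x = x[rng.choice(x.shape[0], max_points, replace=False)]
    d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    return np.median(np.sqrt(d2[np.triu_indices_from(d2, k=1)]))

x_train = np.random.default_rng(6).normal(size=(500, 2))
sigma = median_bandwidth(x_train)
k = lambda a, b: np.exp(-np.sum((a - b) ** 2) / sigma ** 2)
print(sigma, k(x_train[0], x_train[1]))
```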

Table 2: Negative log-likelihood comparison on benchmark datasets. Mean and std are computed over 20 runs with different train/test splits. KCEF is numerically unstable on two datasets, marked N/A.

Dataset        | DDE          | KCEF        | ε-KDE       | LSCDE
geyser         | 0.55 ± 0.07  | 1.21 ± 0.04 | 1.11 ± 0.02 | 0.70 ± 0.01
caution        | 0.95 ± 0.19  | 0.99 ± 0.01 | 1.25 ± 0.19 | 1.19 ± 0.02
ftcollinssnow  | 1.49 ± 0.14  | 1.46 ± 0.00 | 1.53 ± 0.05 | 1.56 ± 0.01
highway        | 1.18 ± 0.30  | 1.17 ± 0.01 | 2.24 ± 0.64 | 1.98 ± 0.04
snowgeese      | 0.42 ± 0.31  | 0.72 ± 0.02 | 1.35 ± 0.17 | 1.39 ± 0.05
GAGurine       | 0.43 ± 0.15  | 0.46 ± 0.00 | 0.92 ± 0.05 | 0.70 ± 0.01
topo           | 1.02 ± 0.31  | 0.67 ± 0.01 | 1.18 ± 0.09 | 0.83 ± 0.00
CobarOre       | 1.45 ± 0.23  | 3.42 ± 0.03 | 1.65 ± 0.09 | 1.61 ± 0.02
mcycle         | 0.60 ± 0.15  | 0.56 ± 0.01 | 1.25 ± 0.23 | 0.93 ± 0.01
BigMac2003     | 0.47 ± 0.36  | 0.59 ± 0.01 | 1.29 ± 0.14 | 1.63 ± 0.03
cpus           | -0.63 ± 0.77 | N/A         | 1.01 ± 0.10 | 1.04 ± 0.07
crabs          | -0.60 ± 0.26 | N/A         | 0.99 ± 0.09 | -0.07 ± 0.11
birthwt        | 1.22 ± 0.15  | 1.18 ± 0.13 | 1.48 ± 0.01 | 1.43 ± 0.01
gilgais        | 0.61 ± 0.10  | 0.65 ± 0.08 | 1.35 ± 0.03 | 0.73 ± 0.05
UN3            | 1.03 ± 0.09  | 1.15 ± 0.21 | 1.78 ± 0.14 | 1.42 ± 0.12
ufc            | 1.03 ± 0.10  | 0.96 ± 0.14 | 1.40 ± 0.02 | 1.03 ± 0.01

We evaluate the DDE on synthetic datasets, including ring, grid and two moons, where the first two are used in Sutherland et al. (2017) and the last one is from Rezende and Mohamed (2015). The ring dataset contains points uniformly sampled along three circles with radii (1, 3, 5) in R^2, with N(0, 0.1^2) noise in the radial direction and in the extra dimensions. The d-dimensional grid dataset contains samples from a mixture of d Gaussians, with each center lying on one dimension of the d-dimensional hypercube. The two moons dataset is sampled from the exponential family with potential function

    f(x) = -\frac{1}{2} \left( \frac{\|x\|_2 - 2}{0.4} \right)^2 + \log\left( \exp\left( -\frac{1}{2} \left( \frac{x_1 - 2}{0.6} \right)^2 \right) + \exp\left( -\frac{1}{2} \left( \frac{x_1 + 2}{0.6} \right)^2 \right) \right).

We use 500 samples for training, and for testing 1500 (grid) or 5000 (ring, two moons) samples, following Sutherland et al. (2017).

We visualize the samples generated by the learned sampler and compare them with the training datasets in the first row of Figure 1. The learned f is also plotted in the second row of Figure 1. The DDE-learned models generate samples that cover the training data, showing the ability of the DDE to estimate the kernel exponential family on complicated data.

[Figure 1: The blue points in the first row show the samples generated by the learned models, and the red points are the training samples. The learned f is illustrated in the second row.]

Then, we demonstrate the convergence of the DDE on the rings dataset in Figure 2. We initialize with a random dual distribution. As the algorithm iterates, the distribution converges to the target true distribution, justifying the convergence guarantees. More results on the 2-dimensional grid and two moons datasets can be found in Figure 3 in Appendix C; the DDE algorithm behaves similarly on these two datasets.

[Figure 2: The DDE estimator on the rings dataset over iterations: (a) initialization, (b) 500th, (c) 2500th, (d) 5000th, (e) 10000th, (f) 20000th. The blue points are sampled from the learned model. As the algorithm proceeds, the learned distribution converges to the ground-truth.]

Finally, we compare the MMD between the samples generated from the learned model and the training data against the current state-of-the-art method for KEF (Sutherland et al., 2017), which performs better than KDE and other alternatives (Strathmann et al., 2015). We use HMC to generate samples from the KEF-learned model, while with the DDE we can bypass the HMC step by using the learned sampler. The computation time for inference is listed in Table 1; the DDE improves inference efficiency by orders of magnitude. The MMD comparison is also listed in Table 1: the proposed DDE estimator performs significantly better than the best results of KEF in terms of MMD.

Table 1: Quantitative evaluation on synthetic data. MMD (x10^-3) on the test set and inference time are reported.

Dataset    | MMD: DDE | MMD: KEF | time (s): DDE-sample | time (s): KEF+HMC
two moons  | 0.11     | 1.32     | 0.07                 | 13.04
ring       | 1.53     | 3.19     | 0.07                 | 14.12
grid       | 1.32     | 40.39    | 0.04                 | 4.43
The computa- We also establish the statistical consistency and al- tion time for inference is listed in Table 1. It shows the gorithm convergence guarantee for the proposed algo- DDE improves the inference eciency in orders. The rithm. Although the transport mapping parametriza- MMD comparison is listed in Table 1. The proposed tion is flexible enough, it requires extra optimization DDE estimator performs significantly better than the for the KL-divergence estimation. For the future best performances by KEF in terms of MMD. work, we will exploit dynamic-based sampling meth- ods to design new parametrization, which shares both Conditional model extension. In this part of the flexibility and density tractability. experiment, models are trained to estimate the condi- Acknowledgements tional distribution p (y x) on the benchmark datasets We thank Michael Arbel and the anonymous reviewers for studying these methods| (Arbel and Gretton, 2017; for their insightful comments and suggestions. NH is Sugiyama et al., 2010). We centered and normalized supported in part by NSF-CRII-1755829, NSF-CMMI- the data and randomly split the datasets into a train- 1761699, and NCSA Faculty Fellowship. 1 2 3 2 1 4 ⇤Bo Dai , ⇤Hanjun Dai ,ArthurGretton,LeSong,DaleSchuurmans,NiaoHe

References

Yasemin Altun and Alex Smola. Unifying divergence minimization and statistical inference via convex duality. In COLT, pages 139–153, 2006.

Michael Arbel and Arthur Gretton. Kernel conditional exponential family. In AISTATS, pages 1337–1346, 2017.

Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. In ICML, 2017.

Francis R. Bach. On the equivalence between quadrature rules and random features. Journal of Machine Learning Research, 18:714–751, 2017.

Andrew R. Barron and Chyong-Hwa Sheu. Approximation of density functions by sequences of exponential families. The Annals of Statistics, pages 1347–1369, 1991.

Mikołaj Bińkowski, Dougal J. Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. In ICLR, 2018.

Lawrence D. Brown. Fundamentals of Statistical Exponential Families, volume 9 of Lecture Notes–Monograph Series. Institute of Mathematical Statistics, Hayward, Calif, 1986.

Stephane Canu and Alex Smola. Kernel methods and the exponential family. Neurocomputing, 69(7-9):714–720, 2006.

Bo Dai, Bo Xie, Niao He, Yingyu Liang, Anant Raj, Maria-Florina Balcan, and Le Song. Scalable kernel methods via doubly stochastic gradients. In NeurIPS, 2014.

Bo Dai, Niao He, Yunpeng Pan, Byron Boots, and Le Song. Learning from conditional distributions via dual embeddings. In AISTATS, pages 1458–1467, 2017.

Bo Dai, Albert Shaw, Lihong Li, Lin Xiao, Niao He, Zhen Liu, Jianshu Chen, and Le Song. SBEED: Convergent reinforcement learning with nonlinear function approximation. In ICML, pages 1133–1142, 2018.

A. Devinatz. Integral representation of pd functions. Trans. AMS, 74(1):56–77, 1953.

Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real NVP. In ICLR, 2017.

Miroslav Dudík, Steven J. Phillips, and Robert E. Schapire. Maximum entropy density estimation with generalized regularization and an application to species distribution modeling. Journal of Machine Learning Research, 8:1217–1260, 2007.

Ivar Ekeland and Roger Temam. Convex analysis and variational problems, volume 28. SIAM, 1999.

Kenji Fukumizu. Exponential manifold by reproducing kernel Hilbert spaces, pages 291–306. Cambridge University Press, 2009.

Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, pages 2672–2680, 2014.

Chong Gu and Chunfu Qiu. Smoothing spline density estimation: Theory. The Annals of Statistics, pages 217–234, 1993.

Matthias Hein and Olivier Bousquet. Kernels, associated structures, and generalizations. Technical Report 127, Max Planck Institute for Biological Cybernetics, 2004.

Geoffrey E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.

Jean-Baptiste Hiriart-Urruty and Claude Lemaréchal. Fundamentals of convex analysis. Springer Science & Business Media, 2012.

Aapo Hyvärinen. Estimation of non-normalized statistical models using score matching. Journal of Machine Learning Research, 6:695–709, 2005.

Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR, 2014.

Diederik P. Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In NeurIPS, pages 4743–4751, 2016.

Jyrki Kivinen, Alex Smola, and Robert C. Williamson. Online learning with kernels. IEEE Transactions on Signal Processing, 52(8), Aug 2004.

Hermann König. Eigenvalue Distribution of Compact Operators. Birkhäuser, Basel, 1986.

Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming Yang, and Barnabas Poczos. MMD GAN: Towards deeper understanding of moment matching network. In NeurIPS, pages 2203–2213, 2017.

Charles Micchelli and Massimiliano Pontil. On learning vector-valued functions. Neural Computation, 17(1):177–204, 2005.

Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, January 2009.

XuanLong Nguyen, Martin Wainwright, and Michael I. Jordan. Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization. In NeurIPS, pages 1089–1096, 2008.

Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In NeurIPS, 2016.

Giovanni Pistone and Maria Piera Rogantin. The exponential statistical manifold: mean parameters, orthogonality and space transformations. Bernoulli, 5(4):721–760, 1999.

Ali Rahimi and Ben Recht. Random features for large-scale kernel machines. In NeurIPS, 2008.

Ali Rahimi and Ben Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In NeurIPS, 2009.

Danilo J. Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, pages 1278–1286, 2014.

Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. In ICML, 2015.

R. T. Rockafellar. Convex Analysis, volume 28 of Princeton Mathematics Series. Princeton University Press, Princeton, NJ, 1970.

Maurice Sion. On general minimax theorems. Pacific Journal of Mathematics, 8(1):171–176, 1958.

Bharath Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Aapo Hyvärinen, and Revant Kumar. Density estimation in infinite dimensional exponential families. The Journal of Machine Learning Research, 18(1):1830–1888, 2017.

Heiko Strathmann, Dino Sejdinovic, Samuel Livingstone, Zoltan Szabo, and Arthur Gretton. Gradient-free Hamiltonian Monte Carlo with efficient kernel exponential families. In NeurIPS, pages 955–963, 2015.

Masashi Sugiyama, Ichiro Takeuchi, Taiji Suzuki, Takafumi Kanamori, Hirotaka Hachiya, and Daisuke Okanohara. Conditional density estimation via least-squares density ratio estimation. In AISTATS, pages 781–788, 2010.

Dougal J. Sutherland, Heiko Strathmann, Michael Arbel, and Arthur Gretton. Efficient and principled score estimation with Nyström kernel exponential families. In AISTATS, 2018.

Shankar Vembu, Thomas Gärtner, and Mario Boley. Probabilistic structured predictors. In UAI, pages 557–564, 2009.